Chapter Introduction
Welcome to Chapter 9 of our journey to build a robust Mermaid analyzer and fixer in Rust! In the previous chapters, we laid the foundational blocks of our parser, successfully tokenizing and building an Abstract Syntax Tree (AST) for basic Mermaid structures like simple nodes and edges. However, real-world Mermaid diagrams often involve more intricate constructs, such as nested subgraphs, multiline labels, escaped characters, and inline comments. These elements introduce significant complexity to our parsing logic and demand careful consideration to ensure strict adherence to the Mermaid specification.
This chapter will guide you through extending our existing lexer and parser to gracefully handle these advanced syntax features and critical edge cases. We will refine our AST to accurately represent nested structures and implement robust parsing rules that account for the nuances of Mermaid’s grammar. Furthermore, we will begin to enhance our parser’s error recovery mechanisms, ensuring that even with incomplete or malformed input, our tool can provide meaningful diagnostics rather than simply crashing.
By the end of this chapter, our Mermaid analyzer will be capable of processing a much broader range of Mermaid diagrams, accurately capturing their structure, and laying the groundwork for more sophisticated validation and fixing in subsequent chapters. You will have a deeper understanding of compiler design principles, particularly in handling recursive grammars and building resilient parsing systems.
Planning & Design
Handling advanced parsing features like nested subgraphs and multiline labels requires careful extensions to our AST and the parser’s logic. Our goal is to maintain the strictness and determinism established in previous chapters, ensuring that every parsed element accurately reflects the Mermaid specification.
Component Architecture for Advanced Parsing
We will primarily be extending the lexer, parser, and ast modules.
lexerModule Enhancements:- Recognize
subgraphandendkeywords. - Handle
%%for inline comments. - Improve string literal parsing to correctly handle escaped quotes (
\") and potentially multiline content (though Mermaid typically uses<br/>for explicit newlines within labels).
- Recognize
astModule Extensions:- Introduce a recursive
Subgraphstructure within our AST. A subgraph is essentially a mini-diagram, containing its own nodes, edges, and potentially other subgraphs. - Modify
NodeLabelandEdgeLabelto explicitly support multiline content (represented by\nor<br/>in the source). - Add an
AstCommentvariant to store inline comments.
- Introduce a recursive
parserModule Refinements:- Implement a recursive parsing function for subgraphs. When
subgraphis encountered, the parser will descend into a new parsing context untilendis found. - Update label parsing logic to correctly extract content, including escaped characters.
- Integrate comment parsing into the main parsing loop, associating comments with their logical position in the AST if possible, or collecting them as top-level elements.
- Enhance error recovery: Instead of immediate failure, attempt to synchronize the parser’s state after an error to continue parsing and collect more diagnostics.
- Implement a recursive parsing function for subgraphs. When
The following Mermaid diagram illustrates the conceptual flow and the recursive nature of subgraph parsing:
File Structure
We’ll primarily be modifying existing files and potentially adding a new helper struct or two within our ast module.
src/
├── main.rs
├── cli.rs
├── lexer/
│ ├── mod.rs
│ └── token.rs
├── parser/
│ ├── mod.rs
│ └── error.rs
├── ast/
│ ├── mod.rs <- Will be heavily modified
│ └── types.rs
├── diagnostics/
│ ├── mod.rs
│ └── emitter.rs
├── validator/
│ ├── mod.rs
│ └── rules.rs
└── utils/
└── span.rs
Step-by-Step Implementation
We’ll tackle these complexities incrementally, starting with the AST changes, then lexer, and finally the parser.
1. AST Enhancements for Nested Structures and Comments
First, let’s update our ast/mod.rs and ast/types.rs to support subgraphs and comments.
src/ast/mod.rs
(No significant changes to the mod.rs itself, but ensure it re-exports types.rs contents.)
src/ast/types.rs
We need to redefine how a GraphItem can be composed. A Subgraph will contain its own Vec<GraphItem>, making it recursive. We also need a way to store comments.
// src/ast/types.rs
use crate::utils::span::Span;
use std::collections::HashMap;
/// Represents the overall structure of a Mermaid diagram.
#[derive(Debug, PartialEq, Clone)]
pub struct Ast {
pub diagram_type: DiagramType,
pub items: Vec<GraphItem>,
pub span: Span,
}
/// Represents the type of Mermaid diagram.
#[derive(Debug, PartialEq, Clone)]
pub enum DiagramType {
Flowchart(FlowchartDirection),
Sequence,
Class,
// Add other diagram types as supported
Unknown, // For diagrams not yet fully parsed or invalid
}
/// Represents the direction of a flowchart.
#[derive(Debug, PartialEq, Clone)]
pub enum FlowchartDirection {
TopToBottom, // TD
BottomToTop, // BT
RightToLeft, // RL
LeftToRight, // LR
}
/// Represents a node in a diagram.
#[derive(Debug, PartialEq, Clone)]
pub struct AstNode {
pub id: String,
pub label: Option<NodeLabel>,
pub shape: NodeShape,
pub span: Span,
}
/// Represents a label for a node, which can be multiline.
#[derive(Debug, PartialEq, Clone)]
pub struct NodeLabel {
pub text: String, // Can contain '\n' or '<br/>' for multiline
pub span: Span,
}
/// Represents the shape of a node.
#[derive(Debug, PartialEq, Clone)]
pub enum NodeShape {
Rectangle, // A[Label]
RoundedRectangle, // A(Label)
Stadium, // A([Label])
Subroutine, // A[[Label]]
Cylinder, // A[(Label)]
Circle, // A((Label))
Asymmetric, // A>Label]
Rhombus, // A{Label}
Hexagon, // A{{Label}}
Parallelogram, // A[/Label/]
ParallelogramAlt, // A[\Label\]
Trapezoid, // A[/Label\]
TrapezoidAlt, // A[\Label/]
Database, // A{{Label}} (often rendered differently, but same syntax as hexagon)
Unknown, // For malformed or unsupported shapes
}
/// Represents an edge (connection) between two nodes.
#[derive(Debug, PartialEq, Clone)]
pub struct AstEdge {
pub source: String,
pub target: String,
pub label: Option<EdgeLabel>,
pub arrow_type: ArrowType,
pub line_type: LineType,
pub span: Span,
}
/// Represents a label for an edge, which can be multiline.
#[derive(Debug, PartialEq, Clone)]
pub struct EdgeLabel {
pub text: String, // Can contain '\n' or '<br/>' for multiline
pub span: Span,
}
/// Represents the type of arrow for an edge.
#[derive(Debug, PartialEq, Clone)]
pub enum ArrowType {
None, // ---
Normal, // -->
Open, // --o
Cross, // --x
BidirectionalNormal, // <-->
BidirectionalOpen, // o--o
BidirectionalCross, // x--x
Thick, // ===>
Dot, // -.->
DotThick, // =.=>
Unknown,
}
/// Represents the type of line for an edge.
#[derive(Debug, PartialEq, Clone)]
pub enum LineType {
Solid,
Dashed,
Thick,
Invisible, // For implicit relationships, e.g., class diagrams
Unknown,
}
/// Represents a subgraph, which can contain its own items.
#[derive(Debug, PartialEq, Clone)]
pub struct AstSubgraph {
pub id: String,
pub label: Option<NodeLabel>, // Subgraph can have an optional label
pub items: Vec<GraphItem>, // Recursive definition!
pub span: Span,
}
/// Represents a comment in the diagram.
#[derive(Debug, PartialEq, Clone)]
pub struct AstComment {
pub text: String,
pub span: Span,
}
/// An enum to hold any top-level item in the graph.
/// This allows for nodes, edges, subgraphs, and comments to coexist.
#[derive(Debug, PartialEq, Clone)]
pub enum GraphItem {
Node(AstNode),
Edge(AstEdge),
Subgraph(AstSubgraph),
Comment(AstComment),
// Future: Directive, Style, etc.
}
impl AstNode {
pub fn new(id: String, label: Option<NodeLabel>, shape: NodeShape, span: Span) -> Self {
AstNode { id, label, shape, span }
}
}
impl AstEdge {
pub fn new(source: String, target: String, label: Option<EdgeLabel>, arrow_type: ArrowType, line_type: LineType, span: Span) -> Self {
AstEdge { source, target, label, arrow_type, line_type, span }
}
}
impl AstSubgraph {
pub fn new(id: String, label: Option<NodeLabel>, items: Vec<GraphItem>, span: Span) -> Self {
AstSubgraph { id, label, items, span }
}
}
impl AstComment {
pub fn new(text: String, span: Span) -> Self {
AstComment { text, span }
}
}
impl NodeLabel {
pub fn new(text: String, span: Span) -> Self {
NodeLabel { text, span }
}
}
impl EdgeLabel {
pub fn new(text: String, span: Span) -> Self {
EdgeLabel { text, span }
}
}
Explanation:
- We introduced
AstSubgraphwhich contains aVec<GraphItem>, making it a recursive structure. This is the core change for nested subgraphs. NodeLabelandEdgeLabelnow explicitly store aStringtextwhich can contain\nor<br/>for multiline content. The lexer will be responsible for correctly identifying and passing this text.AstCommentis added to represent inline comments.- The
GraphItemenum now includesSubgraph(AstSubgraph)andComment(AstComment)variants, allowing them to be top-level items in our AST.
2. Lexer Enhancements for subgraph, end, and %% Comments
Now, we need to update our lexer to recognize these new keywords and comment syntax.
src/lexer/token.rs
Add new token types:
// src/lexer/token.rs
use crate::utils::span::Span;
/// Represents a token produced by the lexer.
#[derive(Debug, PartialEq, Clone)]
pub struct Token {
pub kind: TokenKind,
pub lexeme: String,
pub span: Span,
}
/// Enum representing the different kinds of tokens.
#[derive(Debug, PartialEq, Clone)]
pub enum TokenKind {
// Keywords
Graph,
Flowchart,
Sequence,
Class,
Subgraph, // NEW
End, // NEW
// Directions
TD,
BT,
RL,
LR,
// Identifiers and Literals
Identifier,
StringLiteral, // Quoted string
NumberLiteral, // For future use, e.g., sequence numbers
// Punctuation and Operators
OpenParen, // (
CloseParen, // )
OpenBracket, // [
CloseBracket, // ]
OpenBrace, // {
CloseBrace, // }
OpenAngle, // <
CloseAngle, // >
Colon, // :
Equals, // =
Comma, // ,
Semicolon, // ;
Newline, // \n
Whitespace, // Space, tab (usually ignored, but sometimes relevant for formatting)
// Arrows and Lines
ArrowNormal, // -->
ArrowOpen, // --o
ArrowCross, // --x
ArrowBidirectionalNormal, // <-->
ArrowBidirectionalOpen, // o--o
ArrowBidirectionalCross, // x--x
ArrowThick, // ===>
ArrowDot, // -.->
ArrowDotThick, // =.=>
LineSolid, // ---
LineDashed, // --- (when no arrow)
LineThick, // ===
LineDot, // -.-
// Special
Comment, // NEW: %% comment text
Unknown, // For unrecognised characters/sequences
Eof, // End of file
}
impl Token {
pub fn new(kind: TokenKind, lexeme: String, span: Span) -> Self {
Token { kind, lexeme, span }
}
}
src/lexer/mod.rs
We need to modify the Lexer::next_token method to recognize these new patterns.
// src/lexer/mod.rs
use crate::lexer::token::{Token, TokenKind};
use crate::diagnostics::emitter::DiagnosticEmitter;
use crate::utils::span::Span;
pub struct Lexer<'a> {
source: &'a str,
chars: std::iter::Peekable<std::str::Chars<'a>>,
start: usize,
current: usize,
line: usize,
col: usize,
emitter: &'a mut DiagnosticEmitter,
}
impl<'a> Lexer<'a> {
pub fn new(source: &'a str, emitter: &'a mut DiagnosticEmitter) -> Self {
Lexer {
source,
chars: source.chars().peekable(),
start: 0,
current: 0,
line: 1,
col: 1,
emitter,
}
}
pub fn next_token(&mut self) -> Token {
self.skip_whitespace_and_comments(); // Add comment skipping here
self.start = self.current;
if self.is_at_end() {
return self.make_token(TokenKind::Eof);
}
let c = self.advance().unwrap(); // We know it's not at end
match c {
'(' => self.make_token(TokenKind::OpenParen),
')' => self.make_token(TokenKind::CloseParen),
'[' => self.make_token(TokenKind::OpenBracket),
']' => self.make_token(TokenKind::CloseBracket),
'{' => self.make_token(TokenKind::OpenBrace),
'}' => self.make_token(TokenKind::CloseBrace),
'<' => {
if self.peek() == Some('-') {
self.advance();
if self.peek() == Some('-') {
self.advance();
if self.peek() == Some('>') {
self.advance();
return self.make_token(TokenKind::ArrowBidirectionalNormal); // <-->
} else if self.peek() == Some('o') {
self.advance();
return self.make_token(TokenKind::ArrowBidirectionalOpen); // <--o (not official, but common variant)
} else if self.peek() == Some('x') {
self.advance();
return self.make_token(TokenKind::ArrowBidirectionalCross); // <--x (not official, but common variant)
}
}
}
self.make_token(TokenKind::OpenAngle) // Single <
}
'>' => self.make_token(TokenKind::CloseAngle),
':' => self.make_token(TokenKind::Colon),
'=' => {
if self.peek() == Some('=') {
self.advance();
if self.peek() == Some('=') {
self.advance();
if self.peek() == Some('>') {
self.advance();
return self.make_token(TokenKind::ArrowThick); // ===>
}
return self.make_token(TokenKind::LineThick); // ===
} else if self.peek() == Some('.') {
self.advance();
if self.peek() == Some('>') {
self.advance();
return self.make_token(TokenKind::ArrowDotThick); // =.=>
}
}
}
self.make_token(TokenKind::Equals) // Single =
}
',' => self.make_token(TokenKind::Comma),
';' => self.make_token(TokenKind::Semicolon),
'/' => {
if self.peek() == Some('/') {
// This is a line comment in some languages, but Mermaid uses `%%`
// For now, treat `//` as an identifier start if not part of a specific pattern.
// Or we could define it as an Unknown token if it's truly invalid.
// For now, let's assume it might be part of a node ID if it's not a known symbol.
self.current -= 1; // Backtrack
self.start = self.current;
return self.identifier_or_keyword();
}
self.make_token(TokenKind::Unknown) // Single / might be part of shape, handle later
}
'%' => { // NEW: Handle comments
if self.peek() == Some('%') {
self.advance(); // Consume the second '%'
self.skip_to_end_of_line();
let comment_text = &self.source[self.start + 2..self.current].trim_end(); // Exclude '%%' and trim
return self.make_token_with_lexeme(TokenKind::Comment, comment_text.to_string());
}
self.make_token(TokenKind::Unknown) // Single %
}
'-' => {
if self.peek() == Some('.') {
self.advance();
if self.peek() == Some('>') {
self.advance();
return self.make_token(TokenKind::ArrowDot); // -.->
}
}
if self.peek() == Some('-') {
self.advance();
if self.peek() == Some('>') {
self.advance();
return self.make_token(TokenKind::ArrowNormal); // -->
} else if self.peek() == Some('o') {
self.advance();
return self.make_token(TokenKind::ArrowOpen); // --o
} else if self.peek() == Some('x') {
self.advance();
return self.make_token(TokenKind::ArrowCross); // --x
}
return self.make_token(TokenKind::LineSolid); // ---
}
self.make_token(TokenKind::Unknown) // Single -
}
'"' => self.string_literal(),
'\n' => {
self.line += 1;
self.col = 1;
self.make_token(TokenKind::Newline)
}
c if c.is_ascii_digit() => self.number_literal(),
c if c.is_ascii_alphabetic() || c == '_' => self.identifier_or_keyword(),
_ => {
self.emitter.error(
format!("Unexpected character: '{}'", c),
self.current - 1,
self.current,
"E0001",
"Remove or correct the unexpected character.",
);
self.make_token(TokenKind::Unknown)
}
}
}
fn skip_whitespace_and_comments(&mut self) {
loop {
match self.peek() {
Some(' ') | Some('\r') | Some('\t') => {
self.advance();
self.start = self.current; // Reset start to ignore skipped whitespace
}
Some('\n') => {
// Newlines are significant for parsing, so we don't skip them here.
// They are tokenized as `Newline`.
return;
}
Some('%') if self.peek_next() == Some('%') => {
self.advance(); // Consume first %
self.advance(); // Consume second %
self.skip_to_end_of_line();
self.start = self.current; // Reset start to ignore skipped comment
}
_ => return,
}
}
}
fn skip_to_end_of_line(&mut self) {
while let Some(c) = self.peek() {
if c == '\n' {
break;
}
self.advance();
}
}
fn string_literal(&mut self) -> Token {
let mut content = String::new();
let mut escaped = false;
while let Some(c) = self.peek() {
if c == '"' && !escaped {
break; // End of string
}
if c == '\n' {
// Mermaid allows multiline labels, but usually with <br/> or explicit newlines.
// If a raw newline is encountered within a quoted string, it's part of the string.
content.push(c);
self.advance();
self.line += 1;
self.col = 1;
continue;
}
if c == '\\' {
self.advance(); // Consume the backslash
match self.peek() {
Some('"') => {
content.push('"'); // Add the escaped quote
escaped = false; // Reset escaped state after consuming
}
Some('\\') => {
content.push('\\'); // Add the escaped backslash
escaped = false;
}
_ => {
// Invalid escape sequence, just add the backslash and the next char
self.emitter.warning(
format!("Invalid escape sequence '\\{}'", self.peek().unwrap_or(' ')),
self.current - 1,
self.current + 1,
"W0001",
"Consider removing the backslash or escaping correctly.",
);
content.push('\\');
if let Some(next_c) = self.peek() {
content.push(next_c);
}
}
}
} else {
content.push(c);
}
self.advance();
escaped = false; // Reset for next character
}
if self.is_at_end() || self.peek() != Some('"') {
self.emitter.error(
"Unterminated string literal.",
self.start,
self.current,
"E0002",
"Add a closing '\"' to terminate the string.",
);
return self.make_token_with_lexeme(TokenKind::StringLiteral, content); // Return partial string
}
self.advance(); // Consume the closing '"'
self.make_token_with_lexeme(TokenKind::StringLiteral, content)
}
fn number_literal(&mut self) -> Token {
while self.peek().map_or(false, |c| c.is_ascii_digit()) {
self.advance();
}
// Mermaid mostly uses numbers for sequence diagram line numbers, etc.
// For flowcharts, node IDs are alphanumeric.
self.make_token(TokenKind::NumberLiteral)
}
fn identifier_or_keyword(&mut self) -> Token {
while self.peek().map_or(false, |c| c.is_ascii_alphanumeric() || c == '_') {
self.advance();
}
let text = &self.source[self.start..self.current];
let kind = match text {
"graph" => TokenKind::Graph,
"flowchart" => TokenKind::Flowchart,
"sequence" => TokenKind::Sequence,
"classDiagram" => TokenKind::Class, // Assuming this as a keyword
"TD" => TokenKind::TD,
"BT" => TokenKind::BT,
"RL" => TokenKind::RL,
"LR" => TokenKind::LR,
"subgraph" => TokenKind::Subgraph, // NEW
"end" => TokenKind::End, // NEW
_ => TokenKind::Identifier,
};
self.make_token(kind)
}
fn advance(&mut self) -> Option<char> {
self.current += 1;
self.col += 1;
self.chars.next()
}
fn peek(&self) -> Option<char> {
self.chars.peek().copied()
}
fn peek_next(&self) -> Option<char> {
let mut temp_chars = self.chars.clone();
temp_chars.next(); // Consume current
temp_chars.next().copied() // Peek next
}
fn is_at_end(&self) -> bool {
self.current >= self.source.len()
}
fn make_token(&self, kind: TokenKind) -> Token {
let lexeme = self.source[self.start..self.current].to_string();
let span = Span::new(self.start, self.current, self.line, self.col - lexeme.len());
Token::new(kind, lexeme, span)
}
fn make_token_with_lexeme(&self, kind: TokenKind, lexeme: String) -> Token {
let span = Span::new(self.start, self.current, self.line, self.col - lexeme.len());
Token::new(kind, lexeme, span)
}
}
Explanation of Lexer Changes:
TokenKind::SubgraphandTokenKind::End: Added these totoken.rs.TokenKind::Comment: Added for%%comments.identifier_or_keyword: Updated to recognize"subgraph"and"end"as keywords.skip_whitespace_and_comments: This new helper function is crucial. It’s called at the beginning ofnext_tokento consume any leading whitespace or full-line comments (%% ...). It’s important that full-line comments are skipped here, so they don’t block subsequent tokenization.next_tokenforComment: When%%is encountered, we now advance past it and then consume characters until a newline or EOF. The text after%%(and before newline/EOF) becomes the lexeme for theCommenttoken.string_literalfor Escaped Characters: Thestring_literalfunction is enhanced to correctly handle\"escape sequences, treating the escaped quote as part of the string content rather than a terminator. It also explicitly handles newlines within quoted strings.
3. Parser Enhancements for Nested Structures and Comments
This is where the main logic for handling recursion and new item types comes in.
src/parser/mod.rs
We will introduce a recursive parsing method for subgraphs and modify the main parsing loop to handle comments and subgraphs.
// src/parser/mod.rs
use crate::lexer::Lexer;
use crate::lexer::token::{Token, TokenKind};
use crate::ast::types::{Ast, DiagramType, FlowchartDirection, AstNode, NodeLabel, NodeShape, AstEdge, EdgeLabel, ArrowType, LineType, GraphItem, AstSubgraph, AstComment};
use crate::diagnostics::emitter::DiagnosticEmitter;
use crate::utils::span::Span;
pub struct Parser<'a> {
lexer: Lexer<'a>,
current_token: Token,
previous_token: Token,
emitter: &'a mut DiagnosticEmitter,
errors_encountered: u32,
// Stack to keep track of open subgraphs or other scopes
scope_stack: Vec<String>,
}
impl<'a> Parser<'a> {
pub fn new(lexer: Lexer<'a>, emitter: &'a mut DiagnosticEmitter) -> Self {
let dummy_token = Token::new(TokenKind::Eof, "".to_string(), Span::new(0, 0, 0, 0));
let mut parser = Parser {
lexer,
current_token: dummy_token.clone(),
previous_token: dummy_token,
emitter,
errors_encountered: 0,
scope_stack: Vec::new(),
};
parser.advance(); // Initialize current_token
parser
}
pub fn parse(&mut self) -> Option<Ast> {
let start_span = self.current_token.span;
// 1. Parse diagram type
let diagram_type = self.parse_diagram_type()?;
// 2. Parse graph items (nodes, edges, subgraphs, comments)
let mut items = Vec::new();
while !self.is_at_end() {
if let Some(item) = self.parse_graph_item() {
items.push(item);
} else {
// If parse_graph_item returns None, it means an error occurred
// and we should try to recover or stop.
// For now, let's just break to avoid infinite loops on unrecoverable errors.
// A more robust recovery would try to synchronize to a known token (e.g., 'graph', 'subgraph', 'node_id')
break;
}
}
if self.errors_encountered > 0 {
None // Don't return AST if there were parsing errors
} else {
Some(Ast {
diagram_type,
items,
span: start_span.merge(&self.previous_token.span),
})
}
}
fn parse_diagram_type(&mut self) -> Option<DiagramType> {
let diagram_type_span_start = self.current_token.span.start;
let diagram_type_lexeme = self.current_token.lexeme.clone();
let kind = self.current_token.kind.clone();
match kind {
TokenKind::Graph | TokenKind::Flowchart => {
self.expect_token_kind_and_advance(&[TokenKind::Graph, TokenKind::Flowchart], "Expected 'graph' or 'flowchart' declaration.")?;
// Parse direction if present
if let Some(direction) = self.parse_flowchart_direction() {
return Some(DiagramType::Flowchart(direction));
}
// Default to TD if no direction specified
Some(DiagramType::Flowchart(FlowchartDirection::TopToBottom))
}
TokenKind::Sequence => {
self.expect_token_kind_and_advance(&[TokenKind::Sequence], "Expected 'sequence' declaration.")?;
Some(DiagramType::Sequence)
}
TokenKind::Class => { // Assuming 'classDiagram' is tokenized as Class
self.expect_token_kind_and_advance(&[TokenKind::Class], "Expected 'classDiagram' declaration.")?;
Some(DiagramType::Class)
}
_ => {
self.emit_error(
format!("Expected a diagram type declaration (e.g., 'graph', 'flowchart', 'sequence', 'classDiagram'), found '{}'.", diagram_type_lexeme),
diagram_type_span_start,
self.current_token.span.end,
"E0101",
"Start your Mermaid diagram with a valid declaration.",
);
None
}
}
}
fn parse_flowchart_direction(&mut self) -> Option<FlowchartDirection> {
match self.current_token.kind {
TokenKind::TD => { self.advance(); Some(FlowchartDirection::TopToBottom) }
TokenKind::BT => { self.advance(); Some(FlowchartDirection::BottomToTop) }
TokenKind::RL => { self.advance(); Some(FlowchartDirection::RightToLeft) }
TokenKind::LR => { self.advance(); Some(FlowchartDirection::LeftToRight) }
_ => None, // No direction specified, or it's another token
}
}
// NEW: Centralized parsing of graph items
fn parse_graph_item(&mut self) -> Option<GraphItem> {
match self.current_token.kind {
TokenKind::Comment => { // NEW: Handle comments
let comment_text = self.current_token.lexeme.clone();
let span = self.current_token.span;
self.advance(); // Consume the comment token
Some(GraphItem::Comment(AstComment::new(comment_text, span)))
}
TokenKind::Subgraph => { // NEW: Handle subgraphs
self.parse_subgraph().map(GraphItem::Subgraph)
}
TokenKind::Identifier => {
// Could be a node declaration or the start of an edge
// We need to peek to decide.
let next_token_kind = self.lexer.peek_kind(); // Assuming lexer has peek_kind
if next_token_kind.map_or(false, |k| k.is_arrow_or_line()) {
self.parse_edge().map(GraphItem::Edge)
} else {
self.parse_node().map(GraphItem::Node)
}
}
TokenKind::Newline => {
self.advance(); // Consume newline, it's usually just separating items
None // Newlines themselves are not graph items
}
TokenKind::Eof => None, // Reached end, no more items
_ => {
// Attempt error recovery: skip tokens until a known item start or EOF
self.emit_error(
format!("Unexpected token '{}'. Expected node, edge, subgraph, or comment.", self.current_token.lexeme),
self.current_token.span.start,
self.current_token.span.end,
"E0102",
"Remove this token or correct the syntax to form a valid diagram element.",
);
self.synchronize();
None
}
}
}
// NEW: Recursive subgraph parsing
fn parse_subgraph(&mut self) -> Option<AstSubgraph> {
let start_span = self.current_token.span;
self.expect_token_kind_and_advance(&[TokenKind::Subgraph], "Expected 'subgraph' keyword.")?;
let id_token = self.expect_token_kind_and_advance(&[TokenKind::Identifier], "Expected subgraph ID after 'subgraph' keyword.")?;
let id = id_token.lexeme.clone();
let label = if self.current_token.kind == TokenKind::OpenBracket {
self.parse_node_label() // Subgraph labels use node label syntax
} else {
None
};
// Push subgraph ID onto scope stack
self.scope_stack.push(id.clone());
let mut items = Vec::new();
// Parse items within the subgraph until 'end' or EOF
while self.current_token.kind != TokenKind::End && !self.is_at_end() {
if let Some(item) = self.parse_graph_item() {
items.push(item);
} else {
// Error in subgraph item, try to recover
if self.current_token.kind != TokenKind::End && !self.is_at_end() {
self.synchronize(); // Try to skip to a valid state
}
}
}
let end_span = self.current_token.span;
self.expect_token_kind_and_advance(&[TokenKind::End], &format!("Expected 'end' to close subgraph '{}'.", id))?;
// Pop subgraph ID from scope stack
self.scope_stack.pop();
Some(AstSubgraph::new(id, label, items, start_span.merge(&end_span)))
}
fn parse_node(&mut self) -> Option<AstNode> {
let start_span = self.current_token.span;
let id_token = self.expect_token_kind_and_advance(&[TokenKind::Identifier], "Expected node ID.")?;
let id = id_token.lexeme.clone();
let mut label: Option<NodeLabel> = None;
let mut shape: NodeShape = NodeShape::Rectangle; // Default shape
// Check for node label and shape
match self.current_token.kind {
TokenKind::OpenBracket => { // [Label] or [[Label]] or [/Label\] etc.
label = self.parse_node_label(); // This handles different bracket types
shape = self.determine_node_shape(&id_token.lexeme, &self.previous_token.kind, &self.current_token.kind);
}
TokenKind::OpenParen => { // (Label) or ((Label)) etc.
label = self.parse_node_label();
shape = self.determine_node_shape(&id_token.lexeme, &self.previous_token.kind, &self.current_token.kind);
}
TokenKind::OpenBrace => { // {Label} or {{Label}}
label = self.parse_node_label();
shape = self.determine_node_shape(&id_token.lexeme, &self.previous_token.kind, &self.current_token.kind);
}
TokenKind::OpenAngle => { // >Label]
label = self.parse_node_label();
shape = self.determine_node_shape(&id_token.lexeme, &self.previous_token.kind, &self.current_token.kind);
}
_ => { /* No label or shape, just a simple node ID */ }
}
Some(AstNode::new(id, label, shape, start_span.merge(&self.previous_token.span)))
}
fn parse_node_label(&mut self) -> Option<NodeLabel> {
let start_span = self.current_token.span;
let open_kind = self.current_token.kind.clone();
let close_kind = match open_kind {
TokenKind::OpenParen => TokenKind::CloseParen,
TokenKind::OpenBracket => TokenKind::CloseBracket,
TokenKind::OpenBrace => TokenKind::CloseBrace,
TokenKind::OpenAngle => TokenKind::CloseBracket, // For asymmetric >Label]
_ => {
self.emit_error(
format!("Expected opening bracket for node label, found '{}'.", self.current_token.lexeme),
start_span.start,
start_span.end,
"E0103",
"Use '(', '[', '{', or '>' to open a node label.",
);
return None;
}
};
self.advance(); // Consume opening bracket
let label_text_token = self.expect_token_kind_and_advance(&[TokenKind::Identifier, TokenKind::StringLiteral], "Expected label text or quoted string.")?;
let label_text = label_text_token.lexeme.clone();
self.expect_token_kind_and_advance(&[close_kind], &format!("Expected closing bracket '{}' for node label.", self.token_kind_to_char(close_kind)))?;
Some(NodeLabel::new(label_text, start_span.merge(&self.previous_token.span)))
}
fn parse_edge(&mut self) -> Option<AstEdge> {
let start_span = self.current_token.span;
let source_token = self.expect_token_kind_and_advance(&[TokenKind::Identifier], "Expected source node ID for an edge.")?;
let source = source_token.lexeme.clone();
let (arrow_type, line_type) = self.parse_arrow_and_line_type()?;
let edge_label = self.parse_edge_label();
let target_token = self.expect_token_kind_and_advance(&[TokenKind::Identifier], "Expected target node ID after arrow.")?;
let target = target_token.lexeme.clone();
Some(AstEdge::new(source, target, edge_label, arrow_type, line_type, start_span.merge(&target_token.span)))
}
fn parse_arrow_and_line_type(&mut self) -> Option<(ArrowType, LineType)> {
let current_kind = self.current_token.kind.clone();
let span = self.current_token.span;
let (arrow_type, line_type) = match current_kind {
TokenKind::ArrowNormal => (ArrowType::Normal, LineType::Solid),
TokenKind::ArrowOpen => (ArrowType::Open, LineType::Solid),
TokenKind::ArrowCross => (ArrowType::Cross, LineType::Solid),
TokenKind::ArrowBidirectionalNormal => (ArrowType::BidirectionalNormal, LineType::Solid),
TokenKind::ArrowBidirectionalOpen => (ArrowType::BidirectionalOpen, LineType::Solid),
TokenKind::ArrowBidirectionalCross => (ArrowType::BidirectionalCross, LineType::Solid),
TokenKind::ArrowThick => (ArrowType::Thick, LineType::Thick),
TokenKind::ArrowDot => (ArrowType::Dot, LineType::Dashed),
TokenKind::ArrowDotThick => (ArrowType::DotThick, LineType::Thick),
TokenKind::LineSolid => (ArrowType::None, LineType::Solid),
TokenKind::LineDashed => (ArrowType::None, LineType::Dashed), // Mermaid often uses --- for both solid line and dashed
TokenKind::LineThick => (ArrowType::None, LineType::Thick),
_ => {
self.emit_error(
format!("Expected an arrow or line type, found '{}'.", self.current_token.lexeme),
span.start,
span.end,
"E0104",
"Use a valid Mermaid arrow or line syntax (e.g., '-->', '---', '===>').",
);
return None;
}
};
self.advance(); // Consume the arrow/line token
Some((arrow_type, line_type))
}
fn parse_edge_label(&mut self) -> Option<EdgeLabel> {
if self.current_token.kind == TokenKind::Pipe { // |Label| syntax for edges
let start_span = self.current_token.span;
self.advance(); // Consume '|'
let label_text_token = self.expect_token_kind_and_advance(&[TokenKind::Identifier, TokenKind::StringLiteral], "Expected label text for edge.")?;
let label_text = label_text_token.lexeme.clone();
self.expect_token_kind_and_advance(&[TokenKind::Pipe], "Expected closing '|' for edge label.")?;
Some(EdgeLabel::new(label_text, start_span.merge(&self.previous_token.span)))
} else {
None
}
}
// Helper to determine node shape based on surrounding tokens
fn determine_node_shape(&self, _id: &str, open_bracket_kind: &TokenKind, close_bracket_kind: &TokenKind) -> NodeShape {
match (open_bracket_kind, close_bracket_kind) {
(TokenKind::OpenBracket, TokenKind::CloseBracket) => {
// Peek to see if it's [[ ]] or [ ]
if self.peek_prev_kind() == Some(TokenKind::OpenBracket) && self.peek_next_kind() == Some(TokenKind::CloseBracket) {
NodeShape::Subroutine // [[Label]]
} else {
NodeShape::Rectangle // [Label]
}
},
(TokenKind::OpenParen, TokenKind::CloseParen) => {
// Peek to see if it's (( )) or ( )
if self.peek_prev_kind() == Some(TokenKind::OpenParen) && self.peek_next_kind() == Some(TokenKind::CloseParen) {
NodeShape::Circle // ((Label))
} else if self.peek_prev_kind() == Some(TokenKind::OpenBracket) && self.peek_next_kind() == Some(TokenKind::CloseParen) {
NodeShape::Cylinder // [(Label)]
} else {
NodeShape::RoundedRectangle // (Label)
}
},
(TokenKind::OpenBrace, TokenKind::CloseBrace) => {
// Peek to see if it's {{ }} or { }
if self.peek_prev_kind() == Some(TokenKind::OpenBrace) && self.peek_next_kind() == Some(TokenKind::CloseBrace) {
NodeShape::Hexagon // {{Label}} or Database
} else {
NodeShape::Rhombus // {Label}
}
},
(TokenKind::OpenAngle, TokenKind::CloseBracket) => NodeShape::Asymmetric, // >Label]
// Add other complex shapes like parallelogram, trapezoid, etc.
// This requires more sophisticated lookahead or token patterns.
// For now, these are basic.
_ => NodeShape::Unknown, // Fallback for unhandled combinations
}
}
// --- Utility Methods ---
fn advance(&mut self) {
self.previous_token = self.current_token.clone();
self.current_token = self.lexer.next_token();
// Skip `Newline` tokens unless they are explicitly part of the grammar we're parsing
// For general item parsing, newlines often just separate statements.
// We'll keep them in the lexer for accurate span reporting, but skip them here
// if they're not syntactically meaningful at the current point.
// This loop ensures we always land on a non-newline token.
while self.current_token.kind == TokenKind::Newline {
self.previous_token = self.current_token.clone();
self.current_token = self.lexer.next_token();
}
}
fn is_at_end(&self) -> bool {
self.current_token.kind == TokenKind::Eof
}
fn expect_token_kind_and_advance(&mut self, expected_kinds: &[TokenKind], message: &str) -> Option<Token> {
if expected_kinds.contains(&self.current_token.kind) {
let token = self.current_token.clone();
self.advance();
Some(token)
} else {
self.emit_error(
format!("{} Found '{}'.", message, self.current_token.lexeme),
self.current_token.span.start,
self.current_token.span.end,
"E0105",
&format!("Ensure the correct syntax is used. Expected one of: {}.", expected_kinds.iter().map(|k| format!("{:?}", k)).collect::<Vec<_>>().join(", ")),
);
self.errors_encountered += 1;
self.synchronize(); // Attempt error recovery
None
}
}
fn emit_error(&mut self, message: String, start: usize, end: usize, code: &str, help: &str) {
self.emitter.error(message, start, end, code, help);
self.errors_encountered += 1;
}
/// Error recovery mechanism.
/// Skips tokens until it finds a token that likely starts a new statement or graph item,
/// or a token that might help synchronize the parser.
fn synchronize(&mut self) {
// Skip until we find a token that typically starts a new statement or block.
// This prevents cascading errors and allows parsing to continue.
while !self.is_at_end() {
match self.current_token.kind {
TokenKind::Graph | TokenKind::Flowchart | TokenKind::Sequence | TokenKind::Class |
TokenKind::Subgraph | TokenKind::End | // Subgraph boundaries are good sync points
TokenKind::Identifier | // Node IDs often start new statements
TokenKind::Newline => { // Newlines can also indicate a new statement
return;
}
_ => {
self.advance();
}
}
}
}
// Helper to get the kind of the token that was just processed before current_token.
// Useful for look-behind in shape determination.
fn peek_prev_kind(&self) -> Option<TokenKind> {
// This is a bit tricky with `advance` skipping newlines.
// For simple cases like `((Label))`, it would be `previous_token`'s kind.
// For more robust look-behind, the lexer might need to expose more.
// For now, let's assume `previous_token` is sufficient for immediate look-behind.
Some(self.previous_token.kind.clone())
}
// Helper to get the kind of the token *after* the current_token.
// This requires the lexer to support peeking beyond the current token.
fn peek_next_kind(&self) -> Option<TokenKind> {
self.lexer.peek_kind() // Assuming lexer has this method
}
fn token_kind_to_char(&self, kind: TokenKind) -> char {
match kind {
TokenKind::OpenParen => '(',
TokenKind::CloseParen => ')',
TokenKind::OpenBracket => '[',
TokenKind::CloseBracket => ']',
TokenKind::OpenBrace => '{',
TokenKind::CloseBrace => '}',
TokenKind::OpenAngle => '<',
TokenKind::CloseAngle => '>',
TokenKind::Pipe => '|',
_ => '?', // Fallback
}
}
}
// Add a method to Lexer to peek the kind of the *next* token without consuming.
// This is essential for the parser's lookahead logic.
impl<'a> Lexer<'a> {
pub fn peek_kind(&mut self) -> Option<TokenKind> {
let current_state = (self.start, self.current, self.line, self.col, self.chars.clone());
let next_token = self.next_token(); // This advances the lexer state
let kind = next_token.kind.clone();
// Restore lexer state
(self.start, self.current, self.line, self.col, self.chars) = current_state;
if kind == TokenKind::Eof { None } else { Some(kind) }
}
}
// Extend TokenKind to easily check if it's an arrow or line
impl TokenKind {
pub fn is_arrow_or_line(&self) -> bool {
matches!(self,
TokenKind::ArrowNormal | TokenKind::ArrowOpen | TokenKind::ArrowCross |
TokenKind::ArrowBidirectionalNormal | TokenKind::ArrowBidirectionalOpen |
TokenKind::ArrowBidirectionalCross | TokenKind::ArrowThick |
TokenKind::ArrowDot | TokenKind::ArrowDotThick |
TokenKind::LineSolid | TokenKind::LineDashed | TokenKind::LineThick
)
}
}
Explanation of Parser Changes:
Parser::new: Initializesscope_stackfor tracking nested structures.parseMethod: The mainparseloop now callsparse_graph_itemto handle any top-level item.parse_graph_item(NEW): This is the central dispatcher.- If
TokenKind::Commentis encountered, it consumes the token and creates anAstComment, wrapping it inGraphItem::Comment. - If
TokenKind::Subgraphis encountered, it callsparse_subgraph(recursive parsing). - If
TokenKind::Identifier, it performs a lookahead usingself.lexer.peek_kind()to differentiate between a node declaration (no arrow/line following) and an edge declaration (arrow/line following). - If
TokenKind::Newline, it’s consumed and ignored as aGraphItemitself, but important for source mapping. - Includes a robust
_(catch-all) branch for unexpected tokens, emitting an error and callingsynchronize()for error recovery.
- If
parse_subgraph(NEW):- Expects
subgraphand itsid. - Optionally parses a label (using
parse_node_labelas subgraphs use similar label syntax). - Crucially: Pushes the subgraph’s ID onto
scope_stackto track nesting. - Enters a loop to
parse_graph_itemrecursively, collecting all inner items untilendorEofis met. - Expects
endto close the subgraph. - Pops the subgraph’s ID from
scope_stack. - Constructs and returns
AstSubgraph.
- Expects
parse_node/parse_node_label:parse_node_labelnow specifically handlesTokenKind::StringLiteralwhich can contain multiline (\n) or escaped characters.determine_node_shapeis extended to usepeek_prev_kindandpeek_next_kindfor more complex shapes like[[Label]]or((Label)). (Note:peek_prev_kindis simple;peek_next_kindrequires a newlexer.peek_kind()method).
- Error Recovery (
synchronize): A basic but effective error recovery strategy. When an error is encountered, it skips tokens until it finds a token that’s likely to start a new, valid statement (likegraph,subgraph,end,identifier,newline). This helps the parser continue and collect more errors instead of stopping at the first one. Lexer::peek_kind()(NEW): A new method in the Lexer to allow the parser to look ahead at the next token’s kind without consuming it. This is vital for disambiguating grammar rules (e.g., node vs. edge). It works by saving the lexer’s state, performingnext_token, getting the kind, then restoring the state.
4. Production Considerations
- Error Handling for Nested Structures: Missing
endfor asubgraphis a common error. Ourparse_subgraphnow explicitly checks forendand emits a diagnostic if it’s missing. Thesynchronizemethod helps recover from such structural errors. - Performance Optimization: Recursive parsing can be deep for heavily nested diagrams. Rust’s call stack is generally efficient, but extremely deep recursion could lead to stack overflow on constrained systems. For Mermaid, typical nesting isn’t deep enough for this to be a major concern, but it’s good to be aware. Our current approach uses
Vecforitems, which is heap-allocated, preventing stack overflow from large ASTs. - Security: Parsing user-provided input always has security implications. Ensuring our lexer and parser are robust against malformed input, infinite loops (e.g., in string parsing without termination), or excessive resource consumption is key. Our current string parsing handles unterminated strings by emitting an error and returning a partial string, preventing infinite loops.
skip_to_end_of_linealso prevents runaway comment parsing. - Logging and Monitoring: Diagnostics emitted by the
DiagnosticEmittershould be logged with appropriate severity (errors, warnings). For complex parsing issues, including thescope_stackcontent in debug logs could provide valuable context.
5. Code Review Checkpoint
At this point, we have significantly upgraded our parser:
Files Modified:
src/ast/types.rs: AddedAstSubgraph,AstComment, and updatedGraphItemenum.src/lexer/token.rs: AddedTokenKind::Subgraph,TokenKind::End,TokenKind::Comment.src/lexer/mod.rs:- Updated
next_tokento recognizesubgraph,end,%%comments. - Enhanced
string_literalfor escaped quotes and multiline content. - Added
skip_whitespace_and_commentsandskip_to_end_of_line. - Added
peek_kindmethod for parser lookahead.
- Updated
src/parser/mod.rs:- Introduced
scope_stackfor context. - Implemented
parse_graph_itemas a dispatcher. - Implemented
parse_subgraphfor recursive parsing. - Enhanced
parse_nodeandparse_node_labelfor more complex shapes and multiline content. - Added
determine_node_shapefor richer shape recognition. - Implemented
synchronizefor basic error recovery. - Updated
advanceto consistently skipNewlinetokens unless specifically needed.
- Introduced
Integration: These changes are highly integrated. The
lexerfeeds theparserwith new token types, and theparserbuilds a more complexASTusing the new structures. TheDiagnosticEmitteris used throughout the parser to report errors.
The parser can now build an AST for diagrams containing nested subgraphs, multiline node/edge labels, escaped characters within strings, and correctly ignores or stores inline comments.
6. Common Issues & Solutions
Issue: Parser gets stuck in an infinite loop or crashes when encountering malformed input, especially with missing closing
endforsubgraph.- Debugging: Use a debugger to step through
parse_subgraphandparse_graph_item. Check thecurrent_token.kindandis_at_end()conditions. - Solution: Ensure
synchronize()is robust. For subgraphs, ifendis missing, thewhileloop condition!self.is_at_end()will eventually terminate, but an error should be emitted. Thesynchronize()call after anexpect_token_kind_and_advancefailure is critical for recovery. - Prevention: Always have a default or
Eofcheck in loops, and designexpect_token_kind_and_advanceto always advance or recover, never leaving the parser stuck.
- Debugging: Use a debugger to step through
Issue: Multiline labels or labels with escaped quotes are incorrectly tokenized or parsed, leading to syntax errors or incorrect AST content.
- Debugging: Inspect the
Tokenproduced by the lexer for such labels. Check thelexemeandspan. If the token is wrong, the lexer is at fault. If the token is correct but the ASTNodeLabelorEdgeLabelis wrong, the parser is at fault. - Solution: Double-check
string_literallogic in the lexer for handling\",\\, and\ncharacters. Ensure the parser’sparse_node_labelandparse_edge_labelcorrectly extract thetextfrom theStringLiteraltoken. Mermaid typically uses<br/>for explicit newlines in labels, which would be part of the string literal content. Raw\nin a quoted string might be rendered as a newline by some Mermaid renderers, but<br/>is canonical. Our lexer should preserve\nif present in a quoted string. - Prevention: Write specific golden tests for labels with
\",\\, and multiline content to validate lexer and parser behavior.
- Debugging: Inspect the
Issue: Node shapes for complex patterns like
[[Label]]or((Label))are misidentified.- Debugging: Trace the execution of
determine_node_shape. Check the values ofopen_bracket_kind,close_bracket_kind,peek_prev_kind(), andpeek_next_kind(). The lookahead logic might be flawed or thepeek_kindimplementation in the lexer might not be providing the correct lookahead. - Solution: Refine the
determine_node_shapelogic. EnsureLexer::peek_kind()correctly restores the lexer state to avoid unintended token consumption. A robustpeek_kindneeds to be purely non-consuming. - Prevention: Create dedicated tests for each complex node shape, verifying the
NodeShapein the resulting AST.
- Debugging: Trace the execution of
7. Testing & Verification
We need to add new tests to our tests directory to cover the advanced parsing features.
tests/parser_advanced.rs (Create this new file)
// tests/parser_advanced.rs
use mermaid_parser::lexer::Lexer;
use mermaid_parser::parser::Parser;
use mermaid_parser::diagnostics::emitter::DiagnosticEmitter;
use mermaid_parser::ast::types::{Ast, DiagramType, FlowchartDirection, GraphItem, AstNode, NodeLabel, NodeShape, AstEdge, ArrowType, LineType, AstSubgraph, AstComment};
use mermaid_parser::utils::span::Span;
#[test]
fn test_parse_nested_subgraph() {
let source = r#"
graph TD
A --> B
subgraph Sub1["My Subgraph 1"]
B --> C
subgraph Sub2[Inner Subgraph]
C --> D
end
D --> E
end
E --> F
"#;
let mut emitter = DiagnosticEmitter::new(source);
let lexer = Lexer::new(source, &mut emitter);
let mut parser = Parser::new(lexer, &mut emitter);
let ast = parser.parse().expect("Failed to parse nested subgraph");
assert_eq!(ast.diagram_type, DiagramType::Flowchart(FlowchartDirection::TopToBottom));
assert_eq!(ast.items.len(), 3); // A-->B, Sub1, E-->F
// Verify A --> B
if let GraphItem::Edge(edge) = &ast.items[0] {
assert_eq!(edge.source, "A");
assert_eq!(edge.target, "B");
} else {
panic!("Expected an edge at index 0");
}
// Verify Sub1
if let GraphItem::Subgraph(sub1) = &ast.items[1] {
assert_eq!(sub1.id, "Sub1");
assert_eq!(sub1.label, Some(NodeLabel::new("My Subgraph 1".to_string(), Span::new(60, 75, 4, 25)))); // Check span carefully
assert_eq!(sub1.items.len(), 3); // B-->C, Sub2, D-->E
// Verify B --> C
if let GraphItem::Edge(edge) = &sub1.items[0] {
assert_eq!(edge.source, "B");
assert_eq!(edge.target, "C");
} else {
panic!("Expected an edge in Sub1 at index 0");
}
// Verify Sub2
if let GraphItem::Subgraph(sub2) = &sub1.items[1] {
assert_eq!(sub2.id, "Sub2");
assert_eq!(sub2.label, Some(NodeLabel::new("Inner Subgraph".to_string(), Span::new(140, 155, 6, 30)))); // Check span carefully
assert_eq!(sub2.items.len(), 1); // C-->D
// Verify C --> D
if let GraphItem::Edge(edge) = &sub2.items[0] {
assert_eq!(edge.source, "C");
assert_eq!(edge.target, "D");
} else {
panic!("Expected an edge in Sub2 at index 0");
}
} else {
panic!("Expected Subgraph 2 in Sub1 at index 1");
}
// Verify D --> E
if let GraphItem::Edge(edge) = &sub1.items[2] {
assert_eq!(edge.source, "D");
assert_eq!(edge.target, "E");
} else {
panic!("Expected an edge in Sub1 at index 2");
}
} else {
panic!("Expected Subgraph 1 at index 1");
}
// Verify E --> F
if let GraphItem::Edge(edge) = &ast.items[2] {
assert_eq!(edge.source, "E");
assert_eq!(edge.target, "F");
} else {
panic!("Expected an edge at index 2");
}
assert_eq!(emitter.errors().len(), 0);
}
#[test]
fn test_parse_multiline_label_and_escaped_chars() {
let source = r#"
graph TD
A["This is a\nMultiline\nLabel with \"quotes\""] --> B("Another <br/> Line")
"#;
let mut emitter = DiagnosticEmitter::new(source);
let lexer = Lexer::new(source, &mut emitter);
let mut parser = Parser::new(lexer, &mut emitter);
let ast = parser.parse().expect("Failed to parse multiline label");
assert_eq!(ast.items.len(), 1);
if let GraphItem::Edge(edge) = &ast.items[0] {
assert_eq!(edge.source, "A");
assert_eq!(edge.target, "B");
// Check source node label
// The lexer should have processed the escaped quotes and newlines
// The parser's `parse_node` and `parse_node_label` will use this raw string.
let node_a_label = &ast.items[0].get_node_by_id("A").unwrap().label.as_ref().unwrap().text;
assert_eq!(node_a_label, "This is a\nMultiline\nLabel with \"quotes\"");
// Check target node label (with <br/>)
let node_b_label = &ast.items[0].get_node_by_id("B").unwrap().label.as_ref().unwrap().text;
assert_eq!(node_b_label, "Another <br/> Line");
} else {
panic!("Expected an edge item");
}
assert_eq!(emitter.errors().len(), 0);
}
// Helper to find a node by ID in the AST items. This is a simplification for testing.
// In a real validator, you'd traverse the AST properly.
impl GraphItem {
fn get_node_by_id(&self, id: &str) -> Option<&AstNode> {
match self {
GraphItem::Node(node) if node.id == id => Some(node),
GraphItem::Subgraph(sub) => {
for item in &sub.items {
if let Some(node) = item.get_node_by_id(id) {
return Some(node);
}
}
None
}
GraphItem::Edge(edge) => {
// For edges, we need to find the actual node declarations.
// This is a limitation of this simple helper; a real AST traversal
// would link nodes and edges to their declarations.
// For this test, we assume the nodes are implicitly declared by being part of an edge.
// More robust test would require parsing *all* nodes explicitly.
None
}
_ => None,
}
}
}
#[test]
fn test_parse_inline_comments() {
let source = r#"
graph TD
%% This is a top-level comment
A --> B %% Inline comment on edge
C[Node C] %% Comment on node
%% Another comment block
D --> E
"#;
let mut emitter = DiagnosticEmitter::new(source);
let lexer = Lexer::new(source, &mut emitter);
let mut parser = Parser::new(lexer, &mut emitter);
let ast = parser.parse().expect("Failed to parse comments");
assert_eq!(ast.items.len(), 5); // Comment, Edge, Node, Comment, Edge
if let GraphItem::Comment(comment) = &ast.items[0] {
assert_eq!(comment.text, "This is a top-level comment");
} else { panic!("Expected comment 1"); }
if let GraphItem::Edge(edge) = &ast.items[1] {
assert_eq!(edge.source, "A");
assert_eq!(edge.target, "B");
} else { panic!("Expected edge A-->B"); }
if let GraphItem::Node(node) = &ast.items[2] {
assert_eq!(node.id, "C");
assert_eq!(node.label.as_ref().unwrap().text, "Node C");
} else { panic!("Expected node C"); }
if let GraphItem::Comment(comment) = &ast.items[3] {
assert_eq!(comment.text, "Another comment block");
} else { panic!("Expected comment 2"); }
if let GraphItem::Edge(edge) = &ast.items[4] {
assert_eq!(edge.source, "D");
assert_eq!(edge.target, "E");
} else { panic!("Expected edge D-->E"); }
assert_eq!(emitter.errors().len(), 0);
}
#[test]
fn test_parse_missing_subgraph_end_error_recovery() {
let source = r#"
graph TD
A --> B
subgraph Sub1[My Subgraph 1]
B --> C
D --> E
end-missing-keyword
F --> G
"#;
let mut emitter = DiagnosticEmitter::new(source);
let lexer = Lexer::new(source, &mut emitter);
let mut parser = Parser::new(lexer, &mut emitter);
let ast = parser.parse(); // Expecting some errors, so it might return None
assert!(ast.is_some()); // Should still produce an AST due to recovery
assert!(emitter.errors().len() > 0);
assert!(emitter.errors().iter().any(|e| e.code == "E0105" && e.message.contains("Expected 'end' to close subgraph")));
let ast_unwrapped = ast.unwrap();
assert_eq!(ast_unwrapped.items.len(), 3); // A-->B, Sub1, F-->G (assuming F-->G is parsed after recovery)
// Verify A --> B
if let GraphItem::Edge(edge) = &ast_unwrapped.items[0] {
assert_eq!(edge.source, "A");
assert_eq!(edge.target, "B");
} else {
panic!("Expected an edge at index 0");
}
// Verify Sub1 (should be parsed, but with an error reported)
if let GraphItem::Subgraph(sub1) = &ast_unwrapped.items[1] {
assert_eq!(sub1.id, "Sub1");
assert_eq!(sub1.items.len(), 2); // B-->C, D-->E
} else {
panic!("Expected Subgraph 1 at index 1");
}
// Verify F --> G (should be parsed after recovery)
if let GraphItem::Edge(edge) = &ast_unwrapped.items[2] {
assert_eq!(edge.source, "F");
assert_eq!(edge.target, "G");
} else {
panic!("Expected an edge at index 2");
}
}
// Add this to your `src/main.rs` or `src/lib.rs` to include the test module
// #[cfg(test)]
// mod tests {
// mod parser_advanced;
// // ... other test modules
// }
To run these tests:
- Make sure you have
cargo testconfigured. - Create
tests/parser_advanced.rs. - Add
mod parser_advanced;totests/mod.rs(if you have one) or directly insrc/lib.rsif#[cfg(test)]is used there. - Run
cargo test --test parser_advancedorcargo test.
What should work now:
- Parsing of basic
graph TDdiagrams with nodes and edges. - Correct recognition and AST representation of
subgraphblocks, including arbitrary nesting. - Accurate parsing of node and edge labels containing newlines (
\n,<br/>) and escaped quotes (\"). - Lexing and AST representation of
%%inline comments. - The parser should be more resilient to certain malformed inputs (like missing
endfor a subgraph) by reporting an error and attempting to continue parsing.
8. Summary & Next Steps
In this comprehensive chapter, we significantly advanced the capabilities of our Mermaid parser. We extended our AST to natively support nested subgraphs and comments, enhancing its ability to represent complex diagram structures. Our lexer gained the intelligence to recognize subgraph, end, and %% comment tokens, along with improved handling of escaped characters and multiline content within string literals. The parser was then meticulously refactored to incorporate recursive parsing for subgraphs, a dispatcher for various graph items, and a foundational error recovery mechanism, ensuring robustness against malformed input.
This work is crucial as it allows our tool to process a much wider array of real-world Mermaid diagrams, moving us closer to a production-ready analyzer. The ability to correctly interpret nested structures is fundamental for accurate validation and formatting.
In the next chapter, Chapter 10: Semantic Analysis and Advanced Validation, we will leverage this enriched AST to implement a powerful validation layer. We will focus on detecting semantic errors that go beyond basic syntax, such as duplicate node definitions, undefined nodes in edge declarations, and inconsistent diagram structures. This will involve traversing our AST and applying rules to ensure the logical correctness and integrity of the Mermaid diagram, providing precise, Rust-style diagnostics for any issues found.