Chapter Introduction

Welcome to Chapter 9 of our journey to build a robust Mermaid analyzer and fixer in Rust! In the previous chapters, we laid the foundational blocks of our parser, successfully tokenizing and building an Abstract Syntax Tree (AST) for basic Mermaid structures like simple nodes and edges. However, real-world Mermaid diagrams often involve more intricate constructs, such as nested subgraphs, multiline labels, escaped characters, and inline comments. These elements introduce significant complexity to our parsing logic and demand careful consideration to ensure strict adherence to the Mermaid specification.

This chapter will guide you through extending our existing lexer and parser to gracefully handle these advanced syntax features and critical edge cases. We will refine our AST to accurately represent nested structures and implement robust parsing rules that account for the nuances of Mermaid’s grammar. Furthermore, we will begin to enhance our parser’s error recovery mechanisms, ensuring that even with incomplete or malformed input, our tool can provide meaningful diagnostics rather than simply crashing.

By the end of this chapter, our Mermaid analyzer will be capable of processing a much broader range of Mermaid diagrams, accurately capturing their structure, and laying the groundwork for more sophisticated validation and fixing in subsequent chapters. You will have a deeper understanding of compiler design principles, particularly in handling recursive grammars and building resilient parsing systems.

Planning & Design

Handling advanced parsing features like nested subgraphs and multiline labels requires careful extensions to our AST and the parser’s logic. Our goal is to maintain the strictness and determinism established in previous chapters, ensuring that every parsed element accurately reflects the Mermaid specification.

Component Architecture for Advanced Parsing

We will primarily be extending the lexer, parser, and ast modules.

  1. lexer Module Enhancements:

    • Recognize subgraph and end keywords.
    • Handle %% for inline comments.
    • Improve string literal parsing to correctly handle escaped quotes (\") and potentially multiline content (though Mermaid typically uses <br/> for explicit newlines within labels).
  2. ast Module Extensions:

    • Introduce a recursive Subgraph structure within our AST. A subgraph is essentially a mini-diagram, containing its own nodes, edges, and potentially other subgraphs.
    • Modify NodeLabel and EdgeLabel to explicitly support multiline content (represented by \n or <br/> in the source).
    • Add an AstComment variant to store inline comments.
  3. parser Module Refinements:

    • Implement a recursive parsing function for subgraphs. When subgraph is encountered, the parser will descend into a new parsing context until end is found.
    • Update label parsing logic to correctly extract content, including escaped characters.
    • Integrate comment parsing into the main parsing loop, associating comments with their logical position in the AST if possible, or collecting them as top-level elements.
    • Enhance error recovery: Instead of immediate failure, attempt to synchronize the parser’s state after an error to continue parsing and collect more diagnostics.
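The recovery step in the last bullet is typically a `synchronize` routine that discards tokens until a safe resumption point. Here is a minimal standalone sketch of the idea, using a trimmed-down `TokenKind` for illustration only (our real parser operates on full tokens with spans and reports diagnostics along the way):

```rust
// Trimmed-down token kinds, for illustration only.
#[derive(Debug, PartialEq, Clone, Copy)]
enum TokenKind {
    Subgraph,
    End,
    Newline,
    Identifier,
    Unknown,
    Eof,
}

/// Skip tokens until a likely statement boundary: a newline, a structural
/// keyword, or end of input. Returns the index of the synchronization point.
fn synchronize(tokens: &[TokenKind], mut pos: usize) -> usize {
    while pos < tokens.len() {
        match tokens[pos] {
            // Newlines and structural keywords are safe places to resume.
            TokenKind::Newline | TokenKind::Subgraph | TokenKind::End | TokenKind::Eof => {
                return pos;
            }
            _ => pos += 1,
        }
    }
    tokens.len()
}

fn main() {
    // `A ?? ?? \n B`: after an error at index 1, resume at the newline.
    let tokens = [
        TokenKind::Identifier,
        TokenKind::Unknown,
        TokenKind::Unknown,
        TokenKind::Newline,
        TokenKind::Identifier,
        TokenKind::Eof,
    ];
    assert_eq!(synchronize(&tokens, 1), 3);
}
```

The key design choice is picking the synchronization set: too small and one error cascades into many spurious diagnostics; too large and we skip recoverable content.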

The following Mermaid diagram illustrates the conceptual flow and the recursive nature of subgraph parsing:

flowchart TD
    Start[Start Parsing] --> ParseGraph[Parse Graph Declaration]
    ParseGraph --> ConsumeTokens[Consume Tokens]
    ConsumeTokens --> IsSubgraph{Is 'subgraph' Token?}
    IsSubgraph -- No --> ParseNodeOrEdge[Parse Node or Edge]
    ParseNodeOrEdge --> ConsumeTokens
    IsSubgraph -- Yes --> HandleSubgraph[Handle 'subgraph' Block]
    HandleSubgraph --> RecurseSubgraph[Recursively Parse Subgraph Content]
    RecurseSubgraph --> FindSubgraphEnd{Find 'end' Token?}
    FindSubgraphEnd -- Yes --> ReturnFromSubgraph[Return from Subgraph Parsing]
    FindSubgraphEnd -- No --> EmitMissingEndError[Emit Missing 'end' Error]
    ReturnFromSubgraph --> ConsumeTokens
    ConsumeTokens --> IsEOF{Is End of File?}
    IsEOF -- No --> ConsumeTokens
    IsEOF -- Yes --> End[End Parsing]

File Structure

We’ll primarily be modifying existing files and potentially adding a new helper struct or two within our ast module.

src/
├── main.rs
├── cli.rs
├── lexer/
│   ├── mod.rs
│   └── token.rs
├── parser/
│   ├── mod.rs
│   └── error.rs
├── ast/
│   ├── mod.rs  <- Will be heavily modified
│   └── types.rs
├── diagnostics/
│   ├── mod.rs
│   └── emitter.rs
├── validator/
│   ├── mod.rs
│   └── rules.rs
└── utils/
    └── span.rs

Step-by-Step Implementation

We’ll tackle these complexities incrementally, starting with the AST changes, then lexer, and finally the parser.

1. AST Enhancements for Nested Structures and Comments

First, let’s update our ast/mod.rs and ast/types.rs to support subgraphs and comments.

src/ast/mod.rs (No significant changes to the mod.rs itself, but ensure it re-exports types.rs contents.)
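If your `ast/mod.rs` is not already re-exporting, the pattern looks like this. The nested `mod` and the stand-in `AstComment` exist only so the snippet compiles on its own; the real file needs just the `pub mod types;` declaration plus the `pub use` line:

```rust
// Inline stand-in for src/ast/mod.rs: declare the submodule and re-export
// its contents so callers can write `ast::AstComment` rather than
// `ast::types::AstComment`.
mod ast {
    pub mod types {
        // Stand-in for the real types defined below in this chapter.
        #[derive(Debug, PartialEq)]
        pub struct AstComment {
            pub text: String,
        }
    }
    pub use self::types::*; // the re-export line the real mod.rs needs
}

fn main() {
    // The re-export makes the shorter path resolve.
    let c = ast::AstComment { text: "layout note".to_string() };
    assert_eq!(c.text, "layout note");
}
```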

src/ast/types.rs

We need to redefine how a GraphItem can be composed. A Subgraph will contain its own Vec<GraphItem>, making it recursive. We also need a way to store comments.

// src/ast/types.rs

use crate::utils::span::Span;
use std::collections::HashMap;

/// Represents the overall structure of a Mermaid diagram.
#[derive(Debug, PartialEq, Clone)]
pub struct Ast {
    pub diagram_type: DiagramType,
    pub items: Vec<GraphItem>,
    pub span: Span,
}

/// Represents the type of Mermaid diagram.
#[derive(Debug, PartialEq, Clone)]
pub enum DiagramType {
    Flowchart(FlowchartDirection),
    Sequence,
    Class,
    // Add other diagram types as supported
    Unknown, // For diagrams not yet fully parsed or invalid
}

/// Represents the direction of a flowchart.
#[derive(Debug, PartialEq, Clone)]
pub enum FlowchartDirection {
    TopToBottom, // TD
    BottomToTop, // BT
    RightToLeft, // RL
    LeftToRight, // LR
}

/// Represents a node in a diagram.
#[derive(Debug, PartialEq, Clone)]
pub struct AstNode {
    pub id: String,
    pub label: Option<NodeLabel>,
    pub shape: NodeShape,
    pub span: Span,
}

/// Represents a label for a node, which can be multiline.
#[derive(Debug, PartialEq, Clone)]
pub struct NodeLabel {
    pub text: String, // Can contain '\n' or '<br/>' for multiline
    pub span: Span,
}

/// Represents the shape of a node.
#[derive(Debug, PartialEq, Clone)]
pub enum NodeShape {
    Rectangle,        // A[Label]
    RoundedRectangle, // A(Label)
    Stadium,          // A([Label])
    Subroutine,       // A[[Label]]
    Cylinder,         // A[(Label)] (the database/cylinder shape)
    Circle,           // A((Label))
    Asymmetric,       // A>Label]
    Rhombus,          // A{Label}
    Hexagon,          // A{{Label}}
    Parallelogram,    // A[/Label/]
    ParallelogramAlt, // A[\Label\]
    Trapezoid,        // A[/Label\]
    TrapezoidAlt,     // A[\Label/]
    Unknown,          // For malformed or unsupported shapes
}

/// Represents an edge (connection) between two nodes.
#[derive(Debug, PartialEq, Clone)]
pub struct AstEdge {
    pub source: String,
    pub target: String,
    pub label: Option<EdgeLabel>,
    pub arrow_type: ArrowType,
    pub line_type: LineType,
    pub span: Span,
}

/// Represents a label for an edge, which can be multiline.
#[derive(Debug, PartialEq, Clone)]
pub struct EdgeLabel {
    pub text: String, // Can contain '\n' or '<br/>' for multiline
    pub span: Span,
}

/// Represents the type of arrow for an edge.
#[derive(Debug, PartialEq, Clone)]
pub enum ArrowType {
    None,    // ---
    Normal,  // -->
    Open,    // --o
    Cross,   // --x
    BidirectionalNormal, // <-->
    BidirectionalOpen,   // o--o
    BidirectionalCross,  // x--x
    Thick,   // ===>
    Dot,     // -.->
    DotThick, // =.=>
    Unknown,
}

/// Represents the type of line for an edge.
#[derive(Debug, PartialEq, Clone)]
pub enum LineType {
    Solid,
    Dashed,
    Thick,
    Invisible, // For implicit relationships, e.g., class diagrams
    Unknown,
}

/// Represents a subgraph, which can contain its own items.
#[derive(Debug, PartialEq, Clone)]
pub struct AstSubgraph {
    pub id: String,
    pub label: Option<NodeLabel>, // Subgraph can have an optional label
    pub items: Vec<GraphItem>, // Recursive definition!
    pub span: Span,
}

/// Represents a comment in the diagram.
#[derive(Debug, PartialEq, Clone)]
pub struct AstComment {
    pub text: String,
    pub span: Span,
}

/// An enum to hold any top-level item in the graph.
/// This allows for nodes, edges, subgraphs, and comments to coexist.
#[derive(Debug, PartialEq, Clone)]
pub enum GraphItem {
    Node(AstNode),
    Edge(AstEdge),
    Subgraph(AstSubgraph),
    Comment(AstComment),
    // Future: Directive, Style, etc.
}

impl AstNode {
    pub fn new(id: String, label: Option<NodeLabel>, shape: NodeShape, span: Span) -> Self {
        AstNode { id, label, shape, span }
    }
}

impl AstEdge {
    pub fn new(source: String, target: String, label: Option<EdgeLabel>, arrow_type: ArrowType, line_type: LineType, span: Span) -> Self {
        AstEdge { source, target, label, arrow_type, line_type, span }
    }
}

impl AstSubgraph {
    pub fn new(id: String, label: Option<NodeLabel>, items: Vec<GraphItem>, span: Span) -> Self {
        AstSubgraph { id, label, items, span }
    }
}

impl AstComment {
    pub fn new(text: String, span: Span) -> Self {
        AstComment { text, span }
    }
}

impl NodeLabel {
    pub fn new(text: String, span: Span) -> Self {
        NodeLabel { text, span }
    }
}

impl EdgeLabel {
    pub fn new(text: String, span: Span) -> Self {
        EdgeLabel { text, span }
    }
}

Explanation:

  • We introduced AstSubgraph which contains a Vec<GraphItem>, making it a recursive structure. This is the core change for nested subgraphs.
  • NodeLabel and EdgeLabel now explicitly store a String text which can contain \n or <br/> for multiline content. The lexer will be responsible for correctly identifying and passing this text.
  • AstComment is added to represent inline comments.
  • The GraphItem enum now includes Subgraph(AstSubgraph) and Comment(AstComment) variants, allowing them to be top-level items in our AST.
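To make the recursion concrete, here is a standalone sketch that builds the AST shape for `subgraph outer` containing `subgraph inner` containing node `A`, then walks it to compute nesting depth. The types are trimmed to the recursive core (no spans, labels, or shapes):

```rust
// Trimmed recursive core of the AST: a subgraph owns more GraphItems.
#[derive(Debug)]
enum GraphItem {
    Node(String),                     // node id only, for brevity
    Subgraph(String, Vec<GraphItem>), // id + nested items
}

/// Maximum subgraph nesting depth across a list of items.
fn depth(items: &[GraphItem]) -> usize {
    items
        .iter()
        .map(|item| match item {
            GraphItem::Node(_) => 0,
            GraphItem::Subgraph(_, inner) => 1 + depth(inner),
        })
        .max()
        .unwrap_or(0)
}

fn main() {
    // subgraph outer ... subgraph inner ... A ... end ... end
    let ast = vec![GraphItem::Subgraph(
        "outer".to_string(),
        vec![GraphItem::Subgraph(
            "inner".to_string(),
            vec![GraphItem::Node("A".to_string())],
        )],
    )];
    assert_eq!(depth(&ast), 2);
}
```

Because ownership is strictly tree-shaped (`Vec<GraphItem>` inside each subgraph), no `Rc` or lifetimes are needed, and recursive traversals like `depth` fall out naturally.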

2. Lexer Enhancements for subgraph, end, and %% Comments

Now, we need to update our lexer to recognize these new keywords and comment syntax.

src/lexer/token.rs

Add new token types:

// src/lexer/token.rs

use crate::utils::span::Span;

/// Represents a token produced by the lexer.
#[derive(Debug, PartialEq, Clone)]
pub struct Token {
    pub kind: TokenKind,
    pub lexeme: String,
    pub span: Span,
}

/// Enum representing the different kinds of tokens.
#[derive(Debug, PartialEq, Clone)]
pub enum TokenKind {
    // Keywords
    Graph,
    Flowchart,
    Sequence,
    Class,
    Subgraph, // NEW
    End,      // NEW

    // Directions
    TD,
    BT,
    RL,
    LR,

    // Identifiers and Literals
    Identifier,
    StringLiteral, // Quoted string
    NumberLiteral, // For future use, e.g., sequence numbers

    // Punctuation and Operators
    OpenParen,    // (
    CloseParen,   // )
    OpenBracket,  // [
    CloseBracket, // ]
    OpenBrace,    // {
    CloseBrace,   // }
    OpenAngle,    // <
    CloseAngle,   // >
    Colon,        // :
    Equals,       // =
    Comma,        // ,
    Semicolon,    // ;
    Newline,      // \n
    Whitespace,   // Space, tab (usually ignored, but sometimes relevant for formatting)

    // Arrows and Lines
    ArrowNormal,          // -->
    ArrowOpen,            // --o
    ArrowCross,           // --x
    ArrowBidirectionalNormal, // <-->
    ArrowBidirectionalOpen,   // o--o
    ArrowBidirectionalCross,  // x--x
    ArrowThick,           // ===>
    ArrowDot,             // -.->
    ArrowDotThick,        // =.=>

    LineSolid,            // ---
    LineThick,            // ===
    LineDot,              // -.-

    // Special
    Comment,      // NEW: %% comment text
    Unknown,      // For unrecognised characters/sequences
    Eof,          // End of file
}

impl Token {
    pub fn new(kind: TokenKind, lexeme: String, span: Span) -> Self {
        Token { kind, lexeme, span }
    }
}
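The parser later in this chapter decides between a node declaration and an edge by peeking at the token that follows an identifier, which calls for a predicate grouping the arrow and line kinds. A sketch of such a helper, shown against a trimmed-down `TokenKind` so it compiles standalone (in the real code this would be a method in `token.rs` on the full enum):

```rust
// Trimmed TokenKind for illustration; the real enum is defined above.
#[derive(Debug, PartialEq, Clone, Copy)]
enum TokenKind {
    Identifier,
    ArrowNormal,
    ArrowOpen,
    ArrowCross,
    LineSolid,
    LineThick,
    LineDot,
    OpenBracket,
}

impl TokenKind {
    /// True for any token that can connect two nodes, i.e. the start of
    /// an edge rather than a node declaration.
    fn is_arrow_or_line(self) -> bool {
        matches!(
            self,
            TokenKind::ArrowNormal
                | TokenKind::ArrowOpen
                | TokenKind::ArrowCross
                | TokenKind::LineSolid
                | TokenKind::LineThick
                | TokenKind::LineDot
        )
    }
}

fn main() {
    assert!(TokenKind::ArrowNormal.is_arrow_or_line());
    assert!(!TokenKind::OpenBracket.is_arrow_or_line());
}
```

Keeping the predicate next to the enum means adding a new arrow variant forces you to revisit this one place, rather than hunting through parser match arms.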

src/lexer/mod.rs

We need to modify the Lexer::next_token method to recognize these new patterns.

// src/lexer/mod.rs

use crate::lexer::token::{Token, TokenKind};
use crate::diagnostics::emitter::DiagnosticEmitter;
use crate::utils::span::Span;

pub struct Lexer<'a> {
    source: &'a str,
    chars: std::iter::Peekable<std::str::Chars<'a>>,
    start: usize,
    current: usize,
    line: usize,
    col: usize,
    emitter: &'a mut DiagnosticEmitter,
}

impl<'a> Lexer<'a> {
    pub fn new(source: &'a str, emitter: &'a mut DiagnosticEmitter) -> Self {
        Lexer {
            source,
            chars: source.chars().peekable(),
            start: 0,
            current: 0,
            line: 1,
            col: 1,
            emitter,
        }
    }

    pub fn next_token(&mut self) -> Token {
        self.skip_whitespace_and_comments();

        self.start = self.current;

        if self.is_at_end() {
            return self.make_token(TokenKind::Eof);
        }

        let c = self.advance().unwrap(); // We know it's not at end

        match c {
            '(' => self.make_token(TokenKind::OpenParen),
            ')' => self.make_token(TokenKind::CloseParen),
            '[' => self.make_token(TokenKind::OpenBracket),
            ']' => self.make_token(TokenKind::CloseBracket),
            '{' => self.make_token(TokenKind::OpenBrace),
            '}' => self.make_token(TokenKind::CloseBrace),
            '<' => {
                if self.peek() == Some('-') {
                    self.advance();
                    if self.peek() == Some('-') {
                        self.advance();
                        if self.peek() == Some('>') {
                            self.advance();
                            return self.make_token(TokenKind::ArrowBidirectionalNormal); // <-->
                        } else if self.peek() == Some('o') {
                            self.advance();
                            return self.make_token(TokenKind::ArrowBidirectionalOpen); // <--o (not official, but common variant)
                        } else if self.peek() == Some('x') {
                            self.advance();
                            return self.make_token(TokenKind::ArrowBidirectionalCross); // <--x (not official, but common variant)
                        }
                    }
                }
                self.make_token(TokenKind::OpenAngle) // Single <
            }
            '>' => self.make_token(TokenKind::CloseAngle),
            ':' => self.make_token(TokenKind::Colon),
            '=' => {
                if self.peek() == Some('=') {
                    self.advance();
                    if self.peek() == Some('=') {
                        self.advance();
                        if self.peek() == Some('>') {
                            self.advance();
                            return self.make_token(TokenKind::ArrowThick); // ===>
                        }
                        return self.make_token(TokenKind::LineThick); // ===
                    } else if self.peek() == Some('.') {
                        self.advance();
                        if self.peek() == Some('>') {
                            self.advance();
                            return self.make_token(TokenKind::ArrowDotThick); // =.=>
                        }
                    }
                }
                self.make_token(TokenKind::Equals) // Single =
            }
            ',' => self.make_token(TokenKind::Comma),
            ';' => self.make_token(TokenKind::Semicolon),
            '/' => {
                // Mermaid has no `//` comment syntax (comments use `%%`). A
                // lone `/` only appears inside shape delimiters such as
                // `[/Label/]`, which are handled at the label-parsing level,
                // so tokenize it as Unknown here.
                self.make_token(TokenKind::Unknown)
            }
            '%' => { // NEW: Handle comments
                if self.peek() == Some('%') {
                    self.advance(); // Consume the second '%'
                    self.skip_to_end_of_line();
                    // Exclude the leading '%%' and trim surrounding whitespace.
                    let comment_text = self.source[self.start + 2..self.current].trim().to_string();
                    return self.make_token_with_lexeme(TokenKind::Comment, comment_text);
                }
                self.make_token(TokenKind::Unknown) // Single %
            }
            '-' => {
                if self.peek() == Some('.') {
                    self.advance();
                    if self.peek() == Some('>') {
                        self.advance();
                        return self.make_token(TokenKind::ArrowDot); // -.->
                    }
                }
                if self.peek() == Some('-') {
                    self.advance();
                    if self.peek() == Some('>') {
                        self.advance();
                        return self.make_token(TokenKind::ArrowNormal); // -->
                    } else if self.peek() == Some('o') {
                        self.advance();
                        return self.make_token(TokenKind::ArrowOpen); // --o
                    } else if self.peek() == Some('x') {
                        self.advance();
                        return self.make_token(TokenKind::ArrowCross); // --x
                    }
                    return self.make_token(TokenKind::LineSolid); // ---
                }
                self.make_token(TokenKind::Unknown) // Single -
            }
            '"' => self.string_literal(),
            '\n' => {
                self.line += 1;
                self.col = 1;
                self.make_token(TokenKind::Newline)
            }
            c if c.is_ascii_digit() => self.number_literal(),
            c if c.is_ascii_alphabetic() || c == '_' => self.identifier_or_keyword(),
            _ => {
                self.emitter.error(
                    format!("Unexpected character: '{}'", c),
                    self.current - 1,
                    self.current,
                    "E0001",
                    "Remove or correct the unexpected character.",
                );
                self.make_token(TokenKind::Unknown)
            }
        }
    }

    fn skip_whitespace_and_comments(&mut self) {
        // Only spaces, tabs, and carriage returns are skipped here. Newlines
        // are significant (they separate statements) and are tokenized as
        // `Newline`. Full-line `%%` comments are also left in place so that
        // `next_token` can emit `Comment` tokens, which the parser attaches
        // to the AST instead of silently discarding them.
        loop {
            match self.peek() {
                Some(' ') | Some('\r') | Some('\t') => {
                    self.advance();
                    self.start = self.current; // Reset start to ignore skipped whitespace
                }
                _ => return,
            }
        }
    }

    fn skip_to_end_of_line(&mut self) {
        while let Some(c) = self.peek() {
            if c == '\n' {
                break;
            }
            self.advance();
        }
    }

    fn string_literal(&mut self) -> Token {
        let mut content = String::new();

        while let Some(c) = self.peek() {
            if c == '"' {
                break; // End of string (escaped quotes are consumed below)
            }
            if c == '\n' {
                // Mermaid labels usually use <br/> for explicit line breaks,
                // but a raw newline inside a quoted string is kept as content.
                content.push(c);
                self.advance();
                self.line += 1;
                self.col = 1;
                continue;
            }

            if c == '\\' {
                self.advance(); // Consume the backslash
                let next = self.peek();
                match next {
                    Some('"') => content.push('"'),   // Escaped quote
                    Some('\\') => content.push('\\'), // Escaped backslash
                    _ => {
                        // Invalid escape sequence: keep the backslash and the
                        // following character verbatim, but warn about it.
                        self.emitter.warning(
                            format!("Invalid escape sequence '\\{}'", next.unwrap_or(' ')),
                            self.current - 1,
                            self.current + 1,
                            "W0001",
                            "Consider removing the backslash or escaping correctly.",
                        );
                        content.push('\\');
                        if let Some(next_c) = next {
                            content.push(next_c);
                        }
                    }
                }
            } else {
                content.push(c);
            }
            self.advance(); // Consume the character (or the escaped character)
        }

        if self.peek() != Some('"') {
            // Covers both EOF and any other unexpected state.
            self.emitter.error(
                "Unterminated string literal.",
                self.start,
                self.current,
                "E0002",
                "Add a closing '\"' to terminate the string.",
            );
            return self.make_token_with_lexeme(TokenKind::StringLiteral, content); // Return partial string
        }

        self.advance(); // Consume the closing '"'
        self.make_token_with_lexeme(TokenKind::StringLiteral, content)
    }

    fn number_literal(&mut self) -> Token {
        while self.peek().map_or(false, |c| c.is_ascii_digit()) {
            self.advance();
        }
        // Mermaid mostly uses numbers for sequence diagram line numbers, etc.
        // For flowcharts, node IDs are alphanumeric.
        self.make_token(TokenKind::NumberLiteral)
    }

    fn identifier_or_keyword(&mut self) -> Token {
        while self.peek().map_or(false, |c| c.is_ascii_alphanumeric() || c == '_') {
            self.advance();
        }

        let text = &self.source[self.start..self.current];
        let kind = match text {
            "graph" => TokenKind::Graph,
            "flowchart" => TokenKind::Flowchart,
            "sequenceDiagram" => TokenKind::Sequence, // Mermaid's actual keyword
            "classDiagram" => TokenKind::Class, // Assuming this as a keyword
            "TD" => TokenKind::TD,
            "BT" => TokenKind::BT,
            "RL" => TokenKind::RL,
            "LR" => TokenKind::LR,
            "subgraph" => TokenKind::Subgraph, // NEW
            "end" => TokenKind::End,           // NEW
            _ => TokenKind::Identifier,
        };
        self.make_token(kind)
    }

    fn advance(&mut self) -> Option<char> {
        // Positions are tracked as char counts, which assumes ASCII input;
        // full Unicode support would track byte offsets via `char_indices`.
        self.current += 1;
        self.col += 1;
        self.chars.next()
    }

    fn peek(&self) -> Option<char> {
        // `Peekable::peek` requires `&mut self`, so clone the cheap
        // underlying iterator and look at its next char instead.
        self.chars.clone().next()
    }

    fn peek_next(&self) -> Option<char> {
        let mut temp_chars = self.chars.clone();
        temp_chars.next(); // Skip the char `peek` would return
        temp_chars.next() // `Chars::next` already yields `Option<char>`
    }

    fn is_at_end(&self) -> bool {
        self.peek().is_none()
    }

    fn make_token(&self, kind: TokenKind) -> Token {
        let lexeme = self.source[self.start..self.current].to_string();
        let span = Span::new(self.start, self.current, self.line, self.col - (self.current - self.start));
        Token::new(kind, lexeme, span)
    }

    fn make_token_with_lexeme(&self, kind: TokenKind, lexeme: String) -> Token {
        // Use the consumed source length for the column, not the processed
        // lexeme length: the two differ for strings with escape sequences.
        let span = Span::new(self.start, self.current, self.line, self.col - (self.current - self.start));
        Token::new(kind, lexeme, span)
    }
}

Explanation of Lexer Changes:

  • TokenKind::Subgraph and TokenKind::End: Added these to token.rs.
  • TokenKind::Comment: Added for %% comments.
  • identifier_or_keyword: Updated to recognize "subgraph" and "end" as keywords.
  • skip_whitespace_and_comments: This helper is called at the beginning of next_token to consume leading spaces and tabs. Newlines and %% comments are deliberately not consumed here: newlines separate statements and are tokenized as Newline, while comments fall through so that next_token can emit Comment tokens for the parser to attach to the AST.
  • next_token for Comment: When %% is encountered, we now advance past it and then consume characters until a newline or EOF. The text after %% (and before newline/EOF) becomes the lexeme for the Comment token.
  • string_literal for Escaped Characters: The string_literal function is enhanced to correctly handle \" escape sequences, treating the escaped quote as part of the string content rather than a terminator. It also explicitly handles newlines within quoted strings.
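The escape rule from the last bullet can be exercised in isolation. Below is a standalone sketch of the unescaping logic, where the input is the raw text between the quotes and only `\"` and `\\` are treated as escapes, mirroring the lexer above:

```rust
/// Unescape the body of a quoted Mermaid label: `\"` becomes `"` and
/// `\\` becomes `\`; any other backslash sequence is kept verbatim.
fn unescape_label(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\\' {
            match chars.peek() {
                Some('"') => {
                    out.push('"');
                    chars.next(); // consume the escaped quote
                }
                Some('\\') => {
                    out.push('\\');
                    chars.next(); // consume the escaped backslash
                }
                // Unknown escape: keep the backslash as-is.
                _ => out.push('\\'),
            }
        } else {
            out.push(c);
        }
    }
    out
}

fn main() {
    assert_eq!(unescape_label(r#"say \"hi\""#), r#"say "hi""#);
    assert_eq!(unescape_label(r"a\\b"), r"a\b");
    assert_eq!(unescape_label(r"a\nb"), r"a\nb"); // not an escape we process
}
```

Keeping unknown escapes verbatim (rather than erroring) matches the lexer's warn-and-continue policy: the fixer stage can later decide whether to repair them.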

3. Parser Enhancements for Nested Structures and Comments

This is where the main logic for handling recursion and new item types comes in.

src/parser/mod.rs

We will introduce a recursive parsing method for subgraphs and modify the main parsing loop to handle comments and subgraphs.

// src/parser/mod.rs

use crate::lexer::Lexer;
use crate::lexer::token::{Token, TokenKind};
use crate::ast::types::{Ast, DiagramType, FlowchartDirection, AstNode, NodeLabel, NodeShape, AstEdge, EdgeLabel, ArrowType, LineType, GraphItem, AstSubgraph, AstComment};
use crate::diagnostics::emitter::DiagnosticEmitter;
use crate::utils::span::Span;

pub struct Parser<'a> {
    lexer: Lexer<'a>,
    current_token: Token,
    previous_token: Token,
    emitter: &'a mut DiagnosticEmitter,
    errors_encountered: u32,
    // Stack to keep track of open subgraphs or other scopes
    scope_stack: Vec<String>,
}

impl<'a> Parser<'a> {
    pub fn new(lexer: Lexer<'a>, emitter: &'a mut DiagnosticEmitter) -> Self {
        let dummy_token = Token::new(TokenKind::Eof, "".to_string(), Span::new(0, 0, 0, 0));
        let mut parser = Parser {
            lexer,
            current_token: dummy_token.clone(),
            previous_token: dummy_token,
            emitter,
            errors_encountered: 0,
            scope_stack: Vec::new(),
        };
        parser.advance(); // Initialize current_token
        parser
    }

    pub fn parse(&mut self) -> Option<Ast> {
        let start_span = self.current_token.span;

        // 1. Parse diagram type
        let diagram_type = self.parse_diagram_type()?;

        // 2. Parse graph items (nodes, edges, subgraphs, comments)
        let mut items = Vec::new();
        while !self.is_at_end() {
            // Newlines merely separate items; consume them before dispatching
            // so they are not mistaken for parse failures.
            if self.current_token.kind == TokenKind::Newline {
                self.advance();
                continue;
            }
            if let Some(item) = self.parse_graph_item() {
                items.push(item);
            } else {
                // `None` here signals an error (or EOF). Break to avoid an
                // infinite loop; a more robust recovery would synchronize to
                // a known token (e.g. 'subgraph' or an identifier) and keep
                // collecting diagnostics.
                break;
            }
        }

        if self.errors_encountered > 0 {
            None // Don't return AST if there were parsing errors
        } else {
            Some(Ast {
                diagram_type,
                items,
                span: start_span.merge(&self.previous_token.span),
            })
        }
    }

    fn parse_diagram_type(&mut self) -> Option<DiagramType> {
        let diagram_type_span_start = self.current_token.span.start;
        let diagram_type_lexeme = self.current_token.lexeme.clone();

        let kind = self.current_token.kind.clone();
        match kind {
            TokenKind::Graph | TokenKind::Flowchart => {
                self.expect_token_kind_and_advance(&[TokenKind::Graph, TokenKind::Flowchart], "Expected 'graph' or 'flowchart' declaration.")?;
                // Parse direction if present
                if let Some(direction) = self.parse_flowchart_direction() {
                    return Some(DiagramType::Flowchart(direction));
                }
                // Default to TD if no direction specified
                Some(DiagramType::Flowchart(FlowchartDirection::TopToBottom))
            }
            TokenKind::Sequence => {
                self.expect_token_kind_and_advance(&[TokenKind::Sequence], "Expected 'sequenceDiagram' declaration.")?;
                Some(DiagramType::Sequence)
            }
            TokenKind::Class => { // Assuming 'classDiagram' is tokenized as Class
                self.expect_token_kind_and_advance(&[TokenKind::Class], "Expected 'classDiagram' declaration.")?;
                Some(DiagramType::Class)
            }
            _ => {
                self.emit_error(
                    format!("Expected a diagram type declaration (e.g., 'graph', 'flowchart', 'sequenceDiagram', 'classDiagram'), found '{}'.", diagram_type_lexeme),
                    diagram_type_span_start,
                    self.current_token.span.end,
                    "E0101",
                    "Start your Mermaid diagram with a valid declaration.",
                );
                None
            }
        }
    }

    fn parse_flowchart_direction(&mut self) -> Option<FlowchartDirection> {
        match self.current_token.kind {
            TokenKind::TD => { self.advance(); Some(FlowchartDirection::TopToBottom) }
            TokenKind::BT => { self.advance(); Some(FlowchartDirection::BottomToTop) }
            TokenKind::RL => { self.advance(); Some(FlowchartDirection::RightToLeft) }
            TokenKind::LR => { self.advance(); Some(FlowchartDirection::LeftToRight) }
            _ => None, // No direction specified, or it's another token
        }
    }

    // NEW: Centralized parsing of graph items
    fn parse_graph_item(&mut self) -> Option<GraphItem> {
        match self.current_token.kind {
            TokenKind::Comment => { // NEW: Handle comments
                let comment_text = self.current_token.lexeme.clone();
                let span = self.current_token.span;
                self.advance(); // Consume the comment token
                Some(GraphItem::Comment(AstComment::new(comment_text, span)))
            }
            TokenKind::Subgraph => { // NEW: Handle subgraphs
                self.parse_subgraph().map(GraphItem::Subgraph)
            }
            TokenKind::Identifier => {
                // Could be a node declaration or the start of an edge; peek
                // at the next token to decide. This relies on a one-token
                // lookahead helper (`peek_kind`) on the lexer and an
                // `is_arrow_or_line` predicate on `TokenKind`.
                let next_token_kind = self.lexer.peek_kind();
                if next_token_kind.map_or(false, |k| k.is_arrow_or_line()) {
                    self.parse_edge().map(GraphItem::Edge)
                } else {
                    self.parse_node().map(GraphItem::Node)
                }
            }
            TokenKind::Newline => {
                self.advance(); // Consume newline, it's usually just separating items
                None // Newlines themselves are not graph items
            }
            TokenKind::Eof => None, // Reached end, no more items
            _ => {
                // Attempt error recovery: skip tokens until a known item start or EOF
                self.emit_error(
                    format!("Unexpected token '{}'. Expected node, edge, subgraph, or comment.", self.current_token.lexeme),
                    self.current_token.span.start,
                    self.current_token.span.end,
                    "E0102",
                    "Remove this token or correct the syntax to form a valid diagram element.",
                );
                self.synchronize();
                None
            }
        }
    }

    // NEW: Recursive subgraph parsing
    fn parse_subgraph(&mut self) -> Option<AstSubgraph> {
        let start_span = self.current_token.span;
        self.expect_token_kind_and_advance(&[TokenKind::Subgraph], "Expected 'subgraph' keyword.")?;

        let id_token = self.expect_token_kind_and_advance(&[TokenKind::Identifier], "Expected subgraph ID after 'subgraph' keyword.")?;
        let id = id_token.lexeme.clone();

        let label = if self.current_token.kind == TokenKind::OpenBracket {
            self.parse_node_label() // Subgraph labels use node label syntax
        } else {
            None
        };

        // Push subgraph ID onto scope stack
        self.scope_stack.push(id.clone());

        let mut items = Vec::new();
        // Parse items within the subgraph until 'end' or EOF
        while self.current_token.kind != TokenKind::End && !self.is_at_end() {
            if let Some(item) = self.parse_graph_item() {
                items.push(item);
            } else {
                // Error in subgraph item, try to recover
                if self.current_token.kind != TokenKind::End && !self.is_at_end() {
                    self.synchronize(); // Try to skip to a valid state
                }
            }
        }

        let end_span = self.current_token.span;
        self.expect_token_kind_and_advance(&[TokenKind::End], &format!("Expected 'end' to close subgraph '{}'.", id))?;

        // Pop subgraph ID from scope stack
        self.scope_stack.pop();

        Some(AstSubgraph::new(id, label, items, start_span.merge(&end_span)))
    }

    fn parse_node(&mut self) -> Option<AstNode> {
        let start_span = self.current_token.span;
        let id_token = self.expect_token_kind_and_advance(&[TokenKind::Identifier], "Expected node ID.")?;
        let id = id_token.lexeme.clone();

        let mut label: Option<NodeLabel> = None;
        let mut shape: NodeShape = NodeShape::Rectangle; // Default shape

        // Check for node label and shape. Every opening delimiter gets the
        // same treatment: parse the label, then infer the shape from the
        // surrounding bracket tokens.
        match self.current_token.kind {
            TokenKind::OpenBracket      // [Label], [[Label]], [(Label)] ...
            | TokenKind::OpenParen      // (Label), ((Label)) ...
            | TokenKind::OpenBrace      // {Label}, {{Label}}
            | TokenKind::OpenAngle => { // >Label]
                label = self.parse_node_label(); // Handles the different bracket types
                shape = self.determine_node_shape(&id_token.lexeme, &self.previous_token.kind, &self.current_token.kind);
            }
            _ => { /* No label or shape, just a simple node ID */ }
        }

        Some(AstNode::new(id, label, shape, start_span.merge(&self.previous_token.span)))
    }

    fn parse_node_label(&mut self) -> Option<NodeLabel> {
        let start_span = self.current_token.span;
        let open_kind = self.current_token.kind.clone();
        let close_kind = match open_kind {
            TokenKind::OpenParen => TokenKind::CloseParen,
            TokenKind::OpenBracket => TokenKind::CloseBracket,
            TokenKind::OpenBrace => TokenKind::CloseBrace,
            TokenKind::OpenAngle => TokenKind::CloseBracket, // For asymmetric >Label]
            _ => {
                self.emit_error(
                    format!("Expected opening bracket for node label, found '{}'.", self.current_token.lexeme),
                    start_span.start,
                    start_span.end,
                    "E0103",
                    "Use '(', '[', '{', or '>' to open a node label.",
                );
                return None;
            }
        };

        self.advance(); // Consume opening bracket

        let label_text_token = self.expect_token_kind_and_advance(&[TokenKind::Identifier, TokenKind::StringLiteral], "Expected label text or quoted string.")?;
        let label_text = label_text_token.lexeme.clone();

        self.expect_token_kind_and_advance(&[close_kind.clone()], &format!("Expected closing bracket '{}' for node label.", self.token_kind_to_char(close_kind)))?;

        Some(NodeLabel::new(label_text, start_span.merge(&self.previous_token.span)))
    }

    fn parse_edge(&mut self) -> Option<AstEdge> {
        let start_span = self.current_token.span;
        let source_token = self.expect_token_kind_and_advance(&[TokenKind::Identifier], "Expected source node ID for an edge.")?;
        let source = source_token.lexeme.clone();

        let (arrow_type, line_type) = self.parse_arrow_and_line_type()?;
        let edge_label = self.parse_edge_label();

        let target_token = self.expect_token_kind_and_advance(&[TokenKind::Identifier], "Expected target node ID after arrow.")?;
        let target = target_token.lexeme.clone();

        Some(AstEdge::new(source, target, edge_label, arrow_type, line_type, start_span.merge(&target_token.span)))
    }

    fn parse_arrow_and_line_type(&mut self) -> Option<(ArrowType, LineType)> {
        let current_kind = self.current_token.kind.clone();
        let span = self.current_token.span;

        let (arrow_type, line_type) = match current_kind {
            TokenKind::ArrowNormal => (ArrowType::Normal, LineType::Solid),
            TokenKind::ArrowOpen => (ArrowType::Open, LineType::Solid),
            TokenKind::ArrowCross => (ArrowType::Cross, LineType::Solid),
            TokenKind::ArrowBidirectionalNormal => (ArrowType::BidirectionalNormal, LineType::Solid),
            TokenKind::ArrowBidirectionalOpen => (ArrowType::BidirectionalOpen, LineType::Solid),
            TokenKind::ArrowBidirectionalCross => (ArrowType::BidirectionalCross, LineType::Solid),
            TokenKind::ArrowThick => (ArrowType::Thick, LineType::Thick),
            TokenKind::ArrowDot => (ArrowType::Dot, LineType::Dashed),
            TokenKind::ArrowDotThick => (ArrowType::DotThick, LineType::Thick),
            TokenKind::LineSolid => (ArrowType::None, LineType::Solid),
            TokenKind::LineDashed => (ArrowType::None, LineType::Dashed), // e.g. '-.-' in Mermaid; note that '---' is the solid form
            TokenKind::LineThick => (ArrowType::None, LineType::Thick),
            _ => {
                self.emit_error(
                    format!("Expected an arrow or line type, found '{}'.", self.current_token.lexeme),
                    span.start,
                    span.end,
                    "E0104",
                    "Use a valid Mermaid arrow or line syntax (e.g., '-->', '---', '===>').",
                );
                return None;
            }
        };
        self.advance(); // Consume the arrow/line token
        Some((arrow_type, line_type))
    }

    fn parse_edge_label(&mut self) -> Option<EdgeLabel> {
        if self.current_token.kind == TokenKind::Pipe { // |Label| syntax for edges
            let start_span = self.current_token.span;
            self.advance(); // Consume '|'

            let label_text_token = self.expect_token_kind_and_advance(&[TokenKind::Identifier, TokenKind::StringLiteral], "Expected label text for edge.")?;
            let label_text = label_text_token.lexeme.clone();

            self.expect_token_kind_and_advance(&[TokenKind::Pipe], "Expected closing '|' for edge label.")?;
            Some(EdgeLabel::new(label_text, start_span.merge(&self.previous_token.span)))
        } else {
            None
        }
    }


    // Helper to determine node shape based on surrounding tokens
    fn determine_node_shape(&self, _id: &str, open_bracket_kind: &TokenKind, close_bracket_kind: &TokenKind) -> NodeShape {
        match (open_bracket_kind, close_bracket_kind) {
            (TokenKind::OpenBracket, TokenKind::CloseBracket) => {
                // Peek to see if it's [[ ]] or [ ]
                if self.peek_prev_kind() == Some(TokenKind::OpenBracket) && self.peek_next_kind() == Some(TokenKind::CloseBracket) {
                    NodeShape::Subroutine // [[Label]]
                } else {
                    NodeShape::Rectangle // [Label]
                }
            },
            (TokenKind::OpenParen, TokenKind::CloseParen) => {
                // Peek to see if it's (( )) or ( )
                if self.peek_prev_kind() == Some(TokenKind::OpenParen) && self.peek_next_kind() == Some(TokenKind::CloseParen) {
                    NodeShape::Circle // ((Label))
                } else if self.peek_prev_kind() == Some(TokenKind::OpenBracket) && self.peek_next_kind() == Some(TokenKind::CloseParen) {
                    NodeShape::Cylinder // [(Label)]
                } else {
                    NodeShape::RoundedRectangle // (Label)
                }
            },
            (TokenKind::OpenBrace, TokenKind::CloseBrace) => {
                // Peek to see if it's {{ }} or { }
                if self.peek_prev_kind() == Some(TokenKind::OpenBrace) && self.peek_next_kind() == Some(TokenKind::CloseBrace) {
                    NodeShape::Hexagon // {{Label}}
                } else {
                    NodeShape::Rhombus // {Label}
                }
            },
            (TokenKind::OpenAngle, TokenKind::CloseBracket) => NodeShape::Asymmetric, // >Label]
            // Add other complex shapes like parallelogram, trapezoid, etc.
            // This requires more sophisticated lookahead or token patterns.
            // For now, these are basic.
            _ => NodeShape::Unknown, // Fallback for unhandled combinations
        }
    }

    // --- Utility Methods ---

    fn advance(&mut self) {
        self.previous_token = self.current_token.clone();
        self.current_token = self.lexer.next_token();
        // Skip `Newline` tokens unless they are explicitly part of the grammar we're parsing
        // For general item parsing, newlines often just separate statements.
        // We'll keep them in the lexer for accurate span reporting, but skip them here
        // if they're not syntactically meaningful at the current point.
        // This loop ensures we always land on a non-newline token.
        while self.current_token.kind == TokenKind::Newline {
            self.previous_token = self.current_token.clone();
            self.current_token = self.lexer.next_token();
        }
    }

    fn is_at_end(&self) -> bool {
        self.current_token.kind == TokenKind::Eof
    }

    fn expect_token_kind_and_advance(&mut self, expected_kinds: &[TokenKind], message: &str) -> Option<Token> {
        if expected_kinds.contains(&self.current_token.kind) {
            let token = self.current_token.clone();
            self.advance();
            Some(token)
        } else {
            self.emit_error(
                format!("{} Found '{}'.", message, self.current_token.lexeme),
                self.current_token.span.start,
                self.current_token.span.end,
                "E0105",
                &format!("Ensure the correct syntax is used. Expected one of: {}.", expected_kinds.iter().map(|k| format!("{:?}", k)).collect::<Vec<_>>().join(", ")),
            );
            self.synchronize(); // Attempt error recovery
            None
        }
    }

    fn emit_error(&mut self, message: String, start: usize, end: usize, code: &str, help: &str) {
        self.emitter.error(message, start, end, code, help);
        self.errors_encountered += 1;
    }

    /// Error recovery mechanism.
    /// Skips tokens until it finds a token that likely starts a new statement or graph item,
    /// or a token that might help synchronize the parser.
    fn synchronize(&mut self) {
        // Skip until we find a token that typically starts a new statement or block.
        // This prevents cascading errors and allows parsing to continue.
        while !self.is_at_end() {
            match self.current_token.kind {
                TokenKind::Graph | TokenKind::Flowchart | TokenKind::Sequence | TokenKind::Class |
                TokenKind::Subgraph | TokenKind::End | // Subgraph boundaries are good sync points
                TokenKind::Identifier | // Node IDs often start new statements
                TokenKind::Newline => { // Rarely reached, since advance() skips newlines, but harmless as a sync point
                    return;
                }
                _ => {
                    self.advance();
                }
            }
        }
    }

    // Helper to get the kind of the token that was just processed before current_token.
    // Useful for look-behind in shape determination.
    fn peek_prev_kind(&self) -> Option<TokenKind> {
        // This is a bit tricky with `advance` skipping newlines.
        // For simple cases like `((Label))`, it would be `previous_token`'s kind.
        // For more robust look-behind, the lexer might need to expose more.
        // For now, let's assume `previous_token` is sufficient for immediate look-behind.
        Some(self.previous_token.kind.clone())
    }

    // Helper to get the kind of the token *after* the current_token.
    // This requires the lexer to support peeking beyond the current token.
    fn peek_next_kind(&self) -> Option<TokenKind> {
        self.lexer.peek_kind() // Assuming lexer has this method
    }

    fn token_kind_to_char(&self, kind: TokenKind) -> char {
        match kind {
            TokenKind::OpenParen => '(',
            TokenKind::CloseParen => ')',
            TokenKind::OpenBracket => '[',
            TokenKind::CloseBracket => ']',
            TokenKind::OpenBrace => '{',
            TokenKind::CloseBrace => '}',
            TokenKind::OpenAngle => '<',
            TokenKind::CloseAngle => '>',
            TokenKind::Pipe => '|',
            _ => '?', // Fallback
        }
    }
}

// Add a method to Lexer to peek the kind of the *next* token without consuming.
// This is essential for the parser's lookahead logic.
impl<'a> Lexer<'a> {
    pub fn peek_kind(&mut self) -> Option<TokenKind> {
        // Save every piece of mutable lexer state, scan one token, then restore.
        // Caveat: this re-scans the input on every peek, and any diagnostics
        // emitted while scanning the peeked token are NOT rolled back;
        // buffering a lookahead token is the more robust design.
        let current_state = (self.start, self.current, self.line, self.col, self.chars.clone());
        let next_token = self.next_token(); // This advances the lexer state
        let kind = next_token.kind.clone();
        // Restore lexer state
        (self.start, self.current, self.line, self.col, self.chars) = current_state;
        if kind == TokenKind::Eof { None } else { Some(kind) }
    }
}

// Extend TokenKind to easily check if it's an arrow or line
impl TokenKind {
    pub fn is_arrow_or_line(&self) -> bool {
        matches!(self,
            TokenKind::ArrowNormal | TokenKind::ArrowOpen | TokenKind::ArrowCross |
            TokenKind::ArrowBidirectionalNormal | TokenKind::ArrowBidirectionalOpen |
            TokenKind::ArrowBidirectionalCross | TokenKind::ArrowThick |
            TokenKind::ArrowDot | TokenKind::ArrowDotThick |
            TokenKind::LineSolid | TokenKind::LineDashed | TokenKind::LineThick
        )
    }
}

Explanation of Parser Changes:

  • Parser::new: Initializes scope_stack for tracking nested structures.
  • parse Method: The main parse loop now calls parse_graph_item to handle any top-level item.
  • parse_graph_item (NEW): This is the central dispatcher.
    • If TokenKind::Comment is encountered, it consumes the token and creates an AstComment, wrapping it in GraphItem::Comment.
    • If TokenKind::Subgraph is encountered, it calls parse_subgraph (recursive parsing).
    • If TokenKind::Identifier, it performs a lookahead using self.lexer.peek_kind() to differentiate between a node declaration (no arrow/line following) and an edge declaration (arrow/line following).
    • If TokenKind::Newline, it is consumed without producing a GraphItem; newlines merely separate statements, though their spans still matter for source mapping.
    • Includes a robust _ (catch-all) branch for unexpected tokens, emitting an error and calling synchronize() for error recovery.
  • parse_subgraph (NEW):
    • Expects subgraph and its id.
    • Optionally parses a label (using parse_node_label as subgraphs use similar label syntax).
    • Crucially: Pushes the subgraph’s ID onto scope_stack to track nesting.
    • Enters a loop to parse_graph_item recursively, collecting all inner items until end or Eof is met.
    • Expects end to close the subgraph.
    • Pops the subgraph’s ID from scope_stack.
    • Constructs and returns AstSubgraph.
  • parse_node / parse_node_label:
    • parse_node_label now specifically handles TokenKind::StringLiteral which can contain multiline (\n) or escaped characters.
    • determine_node_shape is extended to use peek_prev_kind and peek_next_kind for more complex shapes like [[Label]] or ((Label)). (Note: peek_prev_kind is simple; peek_next_kind requires a new lexer.peek_kind() method).
  • Error Recovery (synchronize): A basic but effective error recovery strategy. When an error is encountered, it skips tokens until it finds a token that’s likely to start a new, valid statement (like graph, subgraph, end, identifier, newline). This helps the parser continue and collect more errors instead of stopping at the first one.
  • Lexer::peek_kind() (NEW): A new method in the Lexer to allow the parser to look ahead at the next token’s kind without consuming it. This is vital for disambiguating grammar rules (e.g., node vs. edge). It works by saving the lexer’s state, performing next_token, getting the kind, then restoring the state.
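The save-and-restore approach described above works, but it re-scans the token on every peek and cannot roll back any diagnostics emitted while scanning. A common alternative is to buffer one token of lookahead. The sketch below is illustrative only, with simplified stand-ins for the `Token`/`TokenKind` types rather than the book's actual definitions:

```rust
// Simplified stand-ins for the book's token types (illustrative only).
#[derive(Clone, Debug, PartialEq)]
enum TokenKind { Identifier, ArrowNormal }

#[derive(Clone, Debug)]
struct Token { kind: TokenKind }

/// A lexer wrapper that buffers one token of lookahead, so peeking never
/// re-scans input and never re-emits diagnostics.
struct BufferedLexer<I: Iterator<Item = Token>> {
    inner: I,
    lookahead: Option<Token>,
}

impl<I: Iterator<Item = Token>> BufferedLexer<I> {
    fn new(inner: I) -> Self {
        Self { inner, lookahead: None }
    }

    /// Non-consuming: fills the buffer on first call, reuses it afterwards.
    fn peek_kind(&mut self) -> Option<TokenKind> {
        if self.lookahead.is_none() {
            self.lookahead = self.inner.next();
        }
        self.lookahead.as_ref().map(|t| t.kind.clone())
    }

    /// Consuming: drains the buffer before pulling from the underlying source.
    fn next_token(&mut self) -> Option<Token> {
        self.lookahead.take().or_else(|| self.inner.next())
    }
}
```

The trade-off: the buffered design needs `next_token` calls to go through the wrapper, but peeking becomes O(1) and side-effect free, which is exactly the "purely non-consuming" property section 6 asks for.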

4. Production Considerations

  • Error Handling for Nested Structures: Missing end for a subgraph is a common error. Our parse_subgraph now explicitly checks for end and emits a diagnostic if it’s missing. The synchronize method helps recover from such structural errors.
  • Performance Optimization: Recursive parsing can be deep for heavily nested diagrams, and extremely deep recursion could overflow the call stack on constrained systems. Typical Mermaid nesting is shallow enough that this is rarely a concern in practice, but a depth limit is cheap insurance. Note that storing items in a heap-allocated Vec keeps the AST data itself off the stack; it is only the parser's recursion depth that grows with nesting.
  • Security: Parsing user-provided input always has security implications. Ensuring our lexer and parser are robust against malformed input, infinite loops (e.g., in string parsing without termination), or excessive resource consumption is key. Our current string parsing handles unterminated strings by emitting an error and returning a partial string, preventing infinite loops. skip_to_end_of_line also prevents runaway comment parsing.
  • Logging and Monitoring: Diagnostics emitted by the DiagnosticEmitter should be logged with appropriate severity (errors, warnings). For complex parsing issues, including the scope_stack content in debug logs could provide valuable context.
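As a concrete hedge against pathological nesting, a depth check at the top of parse_subgraph is cheap. The sketch below is a standalone illustration of the idea; the constant and the function name are ours, not part of the chapter's parser, and in the real parser the current depth would simply be `self.scope_stack.len()` feeding into `emit_error`:

```rust
/// Illustrative depth cap; 64 levels is far beyond any realistic diagram.
const MAX_SUBGRAPH_DEPTH: usize = 64;

/// Returns an error message instead of recursing when nesting is too deep.
/// In the parser this would abort `parse_subgraph` with a diagnostic.
fn check_subgraph_depth(current_depth: usize) -> Result<(), String> {
    if current_depth >= MAX_SUBGRAPH_DEPTH {
        Err(format!(
            "subgraph nesting exceeds {} levels; aborting to avoid stack overflow",
            MAX_SUBGRAPH_DEPTH
        ))
    } else {
        Ok(())
    }
}
```

A guard like this turns a potential crash into an ordinary diagnostic, which fits the chapter's goal of never crashing on malformed input.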

5. Code Review Checkpoint

At this point, we have significantly upgraded our parser:

  • Files Modified:

    • src/ast/types.rs: Added AstSubgraph, AstComment, and updated GraphItem enum.
    • src/lexer/token.rs: Added TokenKind::Subgraph, TokenKind::End, TokenKind::Comment.
    • src/lexer/mod.rs:
      • Updated next_token to recognize subgraph, end, %% comments.
      • Enhanced string_literal for escaped quotes and multiline content.
      • Added skip_whitespace_and_comments and skip_to_end_of_line.
      • Added peek_kind method for parser lookahead.
    • src/parser/mod.rs:
      • Introduced scope_stack for context.
      • Implemented parse_graph_item as a dispatcher.
      • Implemented parse_subgraph for recursive parsing.
      • Enhanced parse_node and parse_node_label for more complex shapes and multiline content.
      • Added determine_node_shape for richer shape recognition.
      • Implemented synchronize for basic error recovery.
      • Updated advance to consistently skip Newline tokens unless specifically needed.
  • Integration: These changes are highly integrated. The lexer feeds the parser with new token types, and the parser builds a more complex AST using the new structures. The DiagnosticEmitter is used throughout the parser to report errors.

The parser can now build an AST for diagrams containing nested subgraphs, multiline node/edge labels, escaped characters within strings, and correctly ignores or stores inline comments.

6. Common Issues & Solutions

  1. Issue: Parser gets stuck in an infinite loop or crashes when encountering malformed input, especially with missing closing end for subgraph.

    • Debugging: Use a debugger to step through parse_subgraph and parse_graph_item. Check the current_token.kind and is_at_end() conditions.
    • Solution: Ensure synchronize() is robust. For subgraphs, if end is missing, the !self.is_at_end() check guarantees the while loop eventually terminates at Eof, at which point an error should be emitted. The synchronize() call after an expect_token_kind_and_advance failure is critical for recovery.
    • Prevention: Always have a default or Eof check in loops, and design expect_token_kind_and_advance to always advance or recover, never leaving the parser stuck.
  2. Issue: Multiline labels or labels with escaped quotes are incorrectly tokenized or parsed, leading to syntax errors or incorrect AST content.

    • Debugging: Inspect the Token produced by the lexer for such labels. Check the lexeme and span. If the token is wrong, the lexer is at fault. If the token is correct but the AST NodeLabel or EdgeLabel is wrong, the parser is at fault.
    • Solution: Double-check string_literal logic in the lexer for handling \", \\, and \n characters. Ensure the parser’s parse_node_label and parse_edge_label correctly extract the text from the StringLiteral token. Mermaid typically uses <br/> for explicit newlines in labels, which would be part of the string literal content. Raw \n in a quoted string might be rendered as a newline by some Mermaid renderers, but <br/> is canonical. Our lexer should preserve \n if present in a quoted string.
    • Prevention: Write specific golden tests for labels with \", \\, and multiline content to validate lexer and parser behavior.
  3. Issue: Node shapes for complex patterns like [[Label]] or ((Label)) are misidentified.

    • Debugging: Trace the execution of determine_node_shape. Check the values of open_bracket_kind, close_bracket_kind, peek_prev_kind(), and peek_next_kind(). The lookahead logic might be flawed or the peek_kind implementation in the lexer might not be providing the correct lookahead.
    • Solution: Refine the determine_node_shape logic. Ensure Lexer::peek_kind() correctly restores the lexer state to avoid unintended token consumption. A robust peek_kind needs to be purely non-consuming.
    • Prevention: Create dedicated tests for each complex node shape, verifying the NodeShape in the resulting AST.
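To make issue 2 concrete, here is the kind of escape-folding loop the lexer's string_literal needs for the content between the quotes. The function name and the exact escape set are illustrative, a sketch of the technique rather than the book's actual lexer code:

```rust
/// Fold `\"`, `\\`, and `\n` escape sequences (found between the quotes of a
/// string literal) into their literal characters. Illustrative helper; the
/// real lexer would do this inline while scanning.
fn unescape_label(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            match chars.next() {
                Some('"') => out.push('"'),
                Some('\\') => out.push('\\'),
                Some('n') => out.push('\n'),
                // Unknown escape: keep it verbatim so no input is lost.
                Some(other) => { out.push('\\'); out.push(other); }
                // Trailing backslash (unterminated escape): keep it.
                None => out.push('\\'),
            }
        } else {
            out.push(c);
        }
    }
    out
}
```

Preserving unknown escapes verbatim, rather than erroring, matches the chapter's philosophy of degrading gracefully and letting the validator decide what is actually wrong.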

7. Testing & Verification

We need to add new tests to our tests directory to cover the advanced parsing features.

tests/parser_advanced.rs (Create this new file)

// tests/parser_advanced.rs

use mermaid_parser::lexer::Lexer;
use mermaid_parser::parser::Parser;
use mermaid_parser::diagnostics::emitter::DiagnosticEmitter;
use mermaid_parser::ast::types::{Ast, DiagramType, FlowchartDirection, GraphItem, AstNode, NodeLabel, NodeShape, AstEdge, ArrowType, LineType, AstSubgraph, AstComment};
use mermaid_parser::utils::span::Span;

#[test]
fn test_parse_nested_subgraph() {
    let source = r#"
        graph TD
            A --> B
            subgraph Sub1["My Subgraph 1"]
                B --> C
                subgraph Sub2[Inner Subgraph]
                    C --> D
                end
                D --> E
            end
            E --> F
    "#;
    let mut emitter = DiagnosticEmitter::new(source);
    let lexer = Lexer::new(source, &mut emitter);
    let mut parser = Parser::new(lexer, &mut emitter);
    let ast = parser.parse().expect("Failed to parse nested subgraph");

    assert_eq!(ast.diagram_type, DiagramType::Flowchart(FlowchartDirection::TopToBottom));
    assert_eq!(ast.items.len(), 3); // A-->B, Sub1, E-->F

    // Verify A --> B
    if let GraphItem::Edge(edge) = &ast.items[0] {
        assert_eq!(edge.source, "A");
        assert_eq!(edge.target, "B");
    } else {
        panic!("Expected an edge at index 0");
    }

    // Verify Sub1
    if let GraphItem::Subgraph(sub1) = &ast.items[1] {
        assert_eq!(sub1.id, "Sub1");
        assert_eq!(sub1.label.as_ref().map(|l| l.text.as_str()), Some("My Subgraph 1")); // compare text only; hard-coded byte spans are brittle
        assert_eq!(sub1.items.len(), 3); // B-->C, Sub2, D-->E

        // Verify B --> C
        if let GraphItem::Edge(edge) = &sub1.items[0] {
            assert_eq!(edge.source, "B");
            assert_eq!(edge.target, "C");
        } else {
            panic!("Expected an edge in Sub1 at index 0");
        }

        // Verify Sub2
        if let GraphItem::Subgraph(sub2) = &sub1.items[1] {
            assert_eq!(sub2.id, "Sub2");
            assert_eq!(sub2.label.as_ref().map(|l| l.text.as_str()), Some("Inner Subgraph")); // compare text only; hard-coded byte spans are brittle
            assert_eq!(sub2.items.len(), 1); // C-->D

            // Verify C --> D
            if let GraphItem::Edge(edge) = &sub2.items[0] {
                assert_eq!(edge.source, "C");
                assert_eq!(edge.target, "D");
            } else {
                panic!("Expected an edge in Sub2 at index 0");
            }
        } else {
            panic!("Expected Subgraph 2 in Sub1 at index 1");
        }

        // Verify D --> E
        if let GraphItem::Edge(edge) = &sub1.items[2] {
            assert_eq!(edge.source, "D");
            assert_eq!(edge.target, "E");
        } else {
            panic!("Expected an edge in Sub1 at index 2");
        }
    } else {
        panic!("Expected Subgraph 1 at index 1");
    }

    // Verify E --> F
    if let GraphItem::Edge(edge) = &ast.items[2] {
        assert_eq!(edge.source, "E");
        assert_eq!(edge.target, "F");
    } else {
        panic!("Expected an edge at index 2");
    }

    assert_eq!(emitter.errors().len(), 0);
}

#[test]
fn test_parse_multiline_label_and_escaped_chars() {
    let source = r#"
        graph TD
            A["This is a\nMultiline\nLabel with \"quotes\""] --> B("Another <br/> Line")
    "#;
    let mut emitter = DiagnosticEmitter::new(source);
    let lexer = Lexer::new(source, &mut emitter);
    let mut parser = Parser::new(lexer, &mut emitter);
    let ast = parser.parse().expect("Failed to parse multiline label");

    // NOTE: this test assumes the parser records labelled edge endpoints
    // (A[...] and B(...)) as node declarations somewhere in the AST; the
    // `get_node_by_id` helper below walks items and subgraphs to find them.
    let edge = ast.items.iter().find_map(|item| match item {
        GraphItem::Edge(edge) => Some(edge),
        _ => None,
    }).expect("Expected an edge item");
    assert_eq!(edge.source, "A");
    assert_eq!(edge.target, "B");

    // Check the source node label: the lexer should have processed the
    // escaped quotes and `\n` sequences inside the quoted string.
    let node_a = ast.items.iter()
        .find_map(|item| get_node_by_id(item, "A"))
        .expect("node A declaration");
    assert_eq!(node_a.label.as_ref().unwrap().text, "This is a\nMultiline\nLabel with \"quotes\"");

    // Check the target node label (with <br/>, Mermaid's canonical line break).
    let node_b = ast.items.iter()
        .find_map(|item| get_node_by_id(item, "B"))
        .expect("node B declaration");
    assert_eq!(node_b.label.as_ref().unwrap().text, "Another <br/> Line");

    assert_eq!(emitter.errors().len(), 0);
}

// Helper to find a node by ID among AST items. This is a simplification for
// testing; a real validator would traverse the AST properly. Note that Rust
// only allows inherent `impl` blocks in the crate that defines the type, so
// in a `tests/` integration test this must be a free function (or an
// extension trait) rather than `impl GraphItem`.
fn get_node_by_id<'a>(item: &'a GraphItem, id: &str) -> Option<&'a AstNode> {
    match item {
        GraphItem::Node(node) if node.id == id => Some(node),
        GraphItem::Subgraph(sub) => sub.items.iter().find_map(|i| get_node_by_id(i, id)),
        // Edges store only node IDs, not node declarations; a labelled
        // declaration must be found elsewhere in the item list.
        _ => None,
    }
}


#[test]
fn test_parse_inline_comments() {
    let source = r#"
        graph TD
        %% This is a top-level comment
            A --> B %% Inline comment on edge
            C[Node C] %% Comment on node
        %% Another comment block
            D --> E
    "#;
    let mut emitter = DiagnosticEmitter::new(source);
    let lexer = Lexer::new(source, &mut emitter);
    let mut parser = Parser::new(lexer, &mut emitter);
    let ast = parser.parse().expect("Failed to parse comments");

    // Full-line %% comments are emitted as Comment tokens and stored as items;
    // trailing (inline) comments are assumed to be dropped by the lexer's
    // skip_whitespace_and_comments, so they do not appear as items.
    assert_eq!(ast.items.len(), 5); // Comment, Edge, Node, Comment, Edge

    if let GraphItem::Comment(comment) = &ast.items[0] {
        assert_eq!(comment.text, "This is a top-level comment");
    } else { panic!("Expected comment 1"); }

    if let GraphItem::Edge(edge) = &ast.items[1] {
        assert_eq!(edge.source, "A");
        assert_eq!(edge.target, "B");
    } else { panic!("Expected edge A-->B"); }

    if let GraphItem::Node(node) = &ast.items[2] {
        assert_eq!(node.id, "C");
        assert_eq!(node.label.as_ref().unwrap().text, "Node C");
    } else { panic!("Expected node C"); }

    if let GraphItem::Comment(comment) = &ast.items[3] {
        assert_eq!(comment.text, "Another comment block");
    } else { panic!("Expected comment 2"); }

    if let GraphItem::Edge(edge) = &ast.items[4] {
        assert_eq!(edge.source, "D");
        assert_eq!(edge.target, "E");
    } else { panic!("Expected edge D-->E"); }

    assert_eq!(emitter.errors().len(), 0);
}

#[test]
fn test_parse_missing_subgraph_end_error_recovery() {
    // Note: the 'end' keyword is genuinely absent here.
    let source = r#"
        graph TD
            A --> B
            subgraph Sub1[My Subgraph 1]
                B --> C
                D --> E
            F --> G
    "#;
    let mut emitter = DiagnosticEmitter::new(source);
    let lexer = Lexer::new(source, &mut emitter);
    let mut parser = Parser::new(lexer, &mut emitter);
    let ast = parser.parse(); // Errors are expected, but recovery should still yield an AST

    assert!(ast.is_some()); // Should still produce an AST despite the error
    assert!(emitter.errors().len() > 0);
    assert!(emitter.errors().iter().any(|e| e.code == "E0105" && e.message.contains("Expected 'end' to close subgraph")));

    let ast_unwrapped = ast.unwrap();

    // Verify A --> B, which was parsed before the broken subgraph.
    assert!(!ast_unwrapped.items.is_empty());
    if let GraphItem::Edge(edge) = &ast_unwrapped.items[0] {
        assert_eq!(edge.source, "A");
        assert_eq!(edge.target, "B");
    } else {
        panic!("Expected an edge at index 0");
    }

    // Because `parse_subgraph` returns None when 'end' is missing, everything
    // swallowed by the unterminated subgraph (B-->C, D-->E, F-->G) is dropped
    // from the AST. A later chapter could improve on this by returning the
    // partial subgraph alongside the diagnostic.
}

// If you keep these tests as an in-crate module (e.g. in
// src/tests/parser_advanced.rs), declare it in `src/lib.rs` like so:
// #[cfg(test)]
// mod tests {
//     mod parser_advanced;
//     // ... other test modules
// }

To run these tests:

  1. Create tests/parser_advanced.rs with the test code above. Cargo compiles every .rs file placed directly under tests/ as its own integration-test crate, so no mod declaration is required there — just import the lexer, parser, and diagnostics types through your crate's public API.
  2. Alternatively, keep the tests as an in-crate unit-test module under src/ and declare it behind #[cfg(test)] in src/lib.rs, as shown in the commented snippet above.
  3. Run cargo test --test parser_advanced to execute only this file, or cargo test to run the full suite.

What should work now:

  • Parsing of basic graph TD diagrams with nodes and edges.
  • Correct recognition and AST representation of subgraph blocks, including arbitrary nesting.
  • Accurate parsing of node and edge labels containing line breaks (both literal \n escapes and <br/> tags) and escaped quotes (\").
  • Lexing and AST representation of %% inline comments.
  • The parser should be more resilient to certain malformed inputs (like missing end for a subgraph) by reporting an error and attempting to continue parsing.
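Putting these features together, a single diagram like the following (the IDs and labels are purely illustrative) exercises nesting, both line-break styles, escaped quotes, and comments, and should now parse without any diagnostics:

```mermaid
graph TD
    %% A top-level comment
    A["First line<br/>Second line"] --> B
    subgraph Outer[Outer group]
        B --> C["She said \"hi\""]
        subgraph Inner[Inner group]
            C --> D
        end
    end
    D --> E
```

Feeding this source through DiagnosticEmitter, Lexer, and Parser as in the tests above makes a convenient end-to-end smoke test for everything built in this chapter.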

8. Summary & Next Steps

In this comprehensive chapter, we significantly advanced the capabilities of our Mermaid parser. We extended our AST to natively support nested subgraphs and comments, enhancing its ability to represent complex diagram structures. Our lexer gained the intelligence to recognize subgraph, end, and %% comment tokens, along with improved handling of escaped characters and multiline content within string literals. The parser was then meticulously refactored to incorporate recursive parsing for subgraphs, a dispatcher for various graph items, and a foundational error recovery mechanism, ensuring robustness against malformed input.

This work is crucial as it allows our tool to process a much wider array of real-world Mermaid diagrams, moving us closer to a production-ready analyzer. The ability to correctly interpret nested structures is fundamental for accurate validation and formatting.

In the next chapter, Chapter 10: Semantic Analysis and Advanced Validation, we will leverage this enriched AST to implement a powerful validation layer. We will focus on detecting semantic errors that go beyond basic syntax, such as duplicate node definitions, undefined nodes in edge declarations, and inconsistent diagram structures. This will involve traversing our AST and applying rules to ensure the logical correctness and integrity of the Mermaid diagram, providing precise, Rust-style diagnostics for any issues found.
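As a small taste of that validation layer, here is a minimal, self-contained sketch of duplicate- and undefined-node detection over flattened ID lists. The Edge struct and both function names are hypothetical simplifications for illustration, not the AST types from this book; the real pass will walk the nested GraphItem tree instead of flat slices.

```rust
use std::collections::HashSet;

// Hypothetical, simplified stand-in for an edge in the AST.
struct Edge {
    source: String,
    target: String,
}

/// Node IDs that are defined more than once, in first-seen order.
fn duplicate_ids(ids: &[&str]) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut dups = Vec::new();
    for id in ids {
        // `insert` returns false when the ID was already present.
        if !seen.insert(*id) && !dups.iter().any(|d| d == id) {
            dups.push(id.to_string());
        }
    }
    dups
}

/// Node IDs referenced by edges but never defined, in first-seen order.
fn undefined_nodes(defined: &[&str], edges: &[Edge]) -> Vec<String> {
    let known: HashSet<&str> = defined.iter().copied().collect();
    let mut missing = Vec::new();
    for e in edges {
        for id in [e.source.as_str(), e.target.as_str()] {
            if !known.contains(id) && !missing.iter().any(|m| m == id) {
                missing.push(id.to_string());
            }
        }
    }
    missing
}
```

In Chapter 10 these checks will emit proper diagnostics (with spans and error codes) through the DiagnosticEmitter rather than returning plain Vec<String> values.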