Welcome to Chapter 2 of our journey to build a robust Mermaid code analyzer and fixer in Rust! In the previous chapter, we laid the foundational project structure and set up our development environment. With the groundwork complete, we’re now ready to dive into the core components of our compiler-like tool. This chapter focuses on the very first stage of any compiler pipeline: the Lexer.
The lexer, also known as a tokenizer or scanner, is responsible for transforming the raw input Mermaid source code (a sequence of characters) into a stream of meaningful Tokens. Each token represents a fundamental building block of the Mermaid language, such as keywords, identifiers, operators, strings, and comments. This process is crucial because it provides the subsequent parsing stage with a structured and simplified view of the input, making it easier to analyze the grammar and build the Abstract Syntax Tree (AST). Our lexer will be designed for strict adherence to the Mermaid specification, handling whitespace normalization, invalid character detection, and providing precise error reporting.
By the end of this chapter, you will have implemented a production-ready lexer in Rust that can accurately tokenize various Mermaid diagram types. We will ensure our implementation is efficient, handles edge cases gracefully, and provides clear diagnostics. This component will be the bedrock upon which our parser, validator, and formatter will be built, emphasizing correctness and predictability from the start.
1. Planning & Design
Before we start writing code, let’s establish a clear design for our lexer. This involves understanding what Mermaid syntax elements we need to tokenize, defining our data structures for tokens and errors, and outlining the overall architecture of the lexing process.
1.1 Mermaid Syntax Overview (Lexer Perspective)
Mermaid diagrams, while human-readable, have a precise grammar. From a lexer’s viewpoint, we need to recognize patterns for:
- Keywords: `graph`, `flowchart`, `sequenceDiagram`, `subgraph`, `end`, `link`, `click`, `classDef`, `style`, `direction`, `LR`, `TB`, etc. These are reserved words that dictate structure.
- Identifiers: Names given to nodes, actors, classes, or variables. These are typically alphanumeric, often containing hyphens or underscores. They can also be quoted strings if they contain spaces or special characters.
- Operators/Punctuation: `[`, `]`, `(`, `)`, `{`, `}`, `<`, `>`, `:`, `,`, `.`, `+`, `-`, `*`, `/`, `=`, `~`, `|`, `!`, `#`. These define relationships, properties, or structural boundaries.
- Arrows: Mermaid has a rich set of arrow types indicating relationships: `-->`, `---`, `==>`, `-->>`, `-.->`, `o--o`, `<-->`, `x--x`, etc. These need careful pattern matching.
- Strings/Labels: Text enclosed in single quotes (`'...'`), double quotes (`"..."`), or backticks. These are used for node labels, edge labels, or comments. Text can also be unquoted if it follows specific rules (e.g., node IDs).
- Comments: Single-line comments starting with `%%`, and multi-line directives/comments like `%%{ ... }%%`.
- Whitespace: Spaces, tabs, newlines. Generally ignored between tokens, but newlines can be significant for diagram structure (e.g., separating statements). We'll handle normalization.
- Invalid Characters: Any character not part of the expected Mermaid grammar should be flagged as an error.
Our lexer must be deterministic and strictly adhere to the official Mermaid syntax specification: ambiguous or malformed input is reported as an error rather than guessed at.
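To make the keyword-versus-identifier priority concrete before we build the real thing, here is a minimal, self-contained sketch. The `classify` helper and its tiny keyword table are purely illustrative, not part of our lexer:

```rust
// Toy classifier: check the keyword table first, fall back to "identifier".
// This mirrors the design of the from_keyword lookup we build later.
fn classify(word: &str) -> &'static str {
    match word {
        "graph" | "flowchart" | "subgraph" | "end" => "keyword",
        _ => "identifier",
    }
}

fn main() {
    assert_eq!(classify("graph"), "keyword");
    assert_eq!(classify("MyNode"), "identifier");
    println!("classification works");
}
```

The real lexer applies the same rule: scan a full word, consult the keyword table, and only fall back to `Identifier` when no keyword matches.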
1.2 Lexer Architecture
Our lexer will follow a standard design pattern for compilers:
- Input: The lexer will take a `&str` (the Mermaid source code) as input.
- Character Stream: Internally, it will iterate over the characters of the input string.
- Token Recognition: It will read characters one by one, looking for patterns that match defined token types.
- Token Output: When a pattern is recognized, it will produce a `Token` struct containing the token's kind, its span (start and end positions in the original source), and the literal text it represents.
- Error Handling: If an unrecognized character or malformed sequence is encountered, it will generate a `LexerError` with precise location information.
- Output Stream: The lexer will implement the `Iterator` trait, yielding `Result<Token, LexerError>` instances, allowing the parser to consume a stream of tokens.
The workflow, end to end: source `&str` → character stream → peek/advance token recognition → a stream of `Result<Token, LexerError>` values consumed by the parser.
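The character-stream stage is worth prototyping in isolation. The sketch below is a standalone toy, not the chapter's `Lexer`: it shows the `Peekable<Chars>` peek/advance pattern with a byte offset tracked alongside, which is exactly how our spans will index back into the source:

```rust
use std::iter::Peekable;
use std::str::Chars;

// Scan one alphanumeric word from the front of `source`, returning the
// word and the byte offset where it ends. Tracking bytes (len_utf8),
// not char counts, is what makes &source[..pos] slicing valid.
fn scan_word(source: &str) -> (&str, usize) {
    let mut chars: Peekable<Chars> = source.chars().peekable();
    let mut pos = 0usize;
    while let Some(&c) = chars.peek() {
        if c.is_alphanumeric() {
            chars.next();
            pos += c.len_utf8(); // byte length, not character count
        } else {
            break;
        }
    }
    (&source[..pos], pos)
}

fn main() {
    let (word, end) = scan_word("graph TD");
    println!("{word} ends at byte {end}"); // graph ends at byte 5
}
```

The real lexer generalizes this loop: one `Peekable<Chars>` iterator, one byte cursor, and many token-specific scan functions.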
1.3 File Structure
To maintain a clean and modular codebase, we’ll organize our lexer components within a dedicated src/lexer directory:
src/
├── main.rs
├── lexer/
│ ├── mod.rs # Declares public modules for the lexer
│ ├── token.rs # Defines Token, TokenKind, and Span structs
│ ├── error.rs # Defines LexerError enum
│ └── lexer.rs # Contains the core Lexer struct and logic
└── diagnostics/ # (Future chapter: For robust error reporting)
2. Step-by-Step Implementation
Let’s start building our lexer piece by piece.
2.1 Setup Lexer Module
First, we’ll create the necessary files and declare our modules.
Action: Create the src/lexer directory and the following empty files: src/lexer/mod.rs, src/lexer/token.rs, src/lexer/error.rs, src/lexer/lexer.rs.
File: src/lexer/mod.rs
// src/lexer/mod.rs
pub mod error;
pub mod lexer;
pub mod token;
// Re-export common types for easier access
pub use error::LexerError;
pub use lexer::Lexer;
pub use token::{Span, Token, TokenKind};
Explanation: This mod.rs file acts as the entry point for our lexer module. It declares the sub-modules error, lexer, and token as public, making their contents accessible from outside the lexer directory. The pub use statements provide convenient aliases, so consumers of the lexer module can import Lexer or Token directly from crate::lexer instead of crate::lexer::lexer::Lexer.
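As a quick, self-contained illustration of the re-export pattern (with throwaway toy types, not this chapter's real ones):

```rust
// A nested module with a re-export, mirroring the mod.rs layout above.
mod lexer {
    pub mod token {
        #[derive(Debug, PartialEq)]
        pub struct Token(pub &'static str);
    }
    // Re-export so callers write `lexer::Token`, not `lexer::token::Token`.
    pub use self::token::Token;
}

fn main() {
    let short = lexer::Token("graph");
    let long = lexer::token::Token("graph");
    assert_eq!(short, long); // both paths name the same type
    println!("re-export paths agree");
}
```

Both paths remain valid; the `pub use` simply adds the shorter one, which keeps downstream `use` statements tidy as the module tree grows.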
2.2 Defining Tokens and Spans
The token.rs file will define the fundamental data structures for representing a token: its kind, its location in the source code, and its raw literal value.
File: src/lexer/token.rs
// src/lexer/token.rs
use std::fmt;
/// Represents a span of text in the source code.
/// Stores the start and end byte indices, and the literal string slice.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Span<'a> {
pub start: usize,
pub end: usize,
pub literal: &'a str,
}
impl<'a> Span<'a> {
/// Creates a new `Span`.
pub fn new(start: usize, end: usize, literal: &'a str) -> Self {
Self { start, end, literal }
}
/// Returns the length of the span in bytes.
pub fn len(&self) -> usize {
self.end - self.start
}
/// Returns `true` if the span is empty.
pub fn is_empty(&self) -> bool {
self.len() == 0
}
}
impl<'a> fmt::Display for Span<'a> {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
write!(f, "({}, {}: '{}')", self.start, self.end, self.literal)
}
}
/// Represents the kind of a token.
/// This enum will grow as we support more Mermaid syntax.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum TokenKind {
// Keywords
Graph, Flowchart, SequenceDiagram, ClassDiagram, StateDiagram, ErDiagram,
Gantt, Pie, GitGraph, Journey, C4Context, C4Container, C4Component,
C4Dynamic, Mindmap, Timeline, QuadrantChart, RequirementDiagram,
Blockdiag, Zenuml, EntityRelationship, Info, Config, Theme, Direction,
Subgraph, Link, Click, Call, Activate, Deactivate, Alt, Opt, Loop, Par,
Break, Critical, Note, Rect, Box, Actor, Participant, As, Title, Section,
End, Class, Enum, Interface, Abstract, Static, Public, Private, Protected,
Package, Namespace, Implements, Extends, Aggregation, Composition,
Association, Dependency, Generalization, Realization, Style, ClassDef,
LinkStyle, Fill, Stroke, StrokeWidth, FontSize, FontFamily, TextAlign,
Left, Right, Up, Down, Fork, Join, Choice, State, Entry, Exit, On, Off,
Hide, Show, LoopKeyword,
// Note: Some keywords might overlap with identifiers, lexer must prioritize keywords.
// E.g., `loop` can be a keyword or a node ID. Context matters for parser.
// Directions
LR, RL, TB, BT, TD,
// Operators and Punctuation
OpenParen, // (
CloseParen, // )
OpenBracket, // [
CloseBracket, // ]
OpenBrace, // {
CloseBrace, // }
LessThan, // <
GreaterThan, // >
Plus, // +
Minus, // -
Star, // *
Slash, // /
Equal, // =
Tilde, // ~
Colon, // :
Comma, // ,
Dot, // .
Underscore, // _
Pipe, // |
Hash, // #
Bang, // !
At, // @
Dollar, // $
Percent, // %
Caret, // ^
Ampersand, // &
Backtick, // ` (for quoted strings)
// Arrows and Connectors
ArrowNormal,       // -->
ArrowOpen,         // ->
ArrowCross,        // --x
ArrowCircle,       // --o
ArrowDouble,       // <-, <->, <--
ArrowDoubleOpen,   // <-->
ArrowDoubleCross,  // <--x
ArrowDoubleCircle, // <--o
ArrowThick,        // ==>
ArrowDotted,       // -.->
ArrowThickDotted,  // ==.=>
// Special
String, // "..." or '...' or `...` (raw value)
Identifier, // Regular unquoted identifier (e.g., NodeA, my_node)
Number, // e.g., 123, 0.5
Comment, // %% single line comment
Directive, // %%{ ... }%% multiline directive/comment
Whitespace, // For internal use, usually ignored by parser
Newline, // Significant whitespace for line breaks
// End of File
Eof,
// Error token for lexing failures
Error,
}
impl TokenKind {
/// Checks if the given string is a Mermaid keyword.
pub fn from_keyword(s: &str) -> Option<TokenKind> {
match s {
"graph" => Some(TokenKind::Graph),
"flowchart" => Some(TokenKind::Flowchart),
"sequenceDiagram" => Some(TokenKind::SequenceDiagram),
"classDiagram" => Some(TokenKind::ClassDiagram),
"stateDiagram" => Some(TokenKind::StateDiagram),
"erDiagram" => Some(TokenKind::ErDiagram),
"gantt" => Some(TokenKind::Gantt),
"pie" => Some(TokenKind::Pie),
"gitGraph" => Some(TokenKind::GitGraph),
"journey" => Some(TokenKind::Journey),
"C4Context" => Some(TokenKind::C4Context),
"C4Container" => Some(TokenKind::C4Container),
"C4Component" => Some(TokenKind::C4Component),
"C4Dynamic" => Some(TokenKind::C4Dynamic),
"mindmap" => Some(TokenKind::Mindmap),
"timeline" => Some(TokenKind::Timeline),
"quadrantChart" => Some(TokenKind::QuadrantChart),
"requirementDiagram" => Some(TokenKind::RequirementDiagram),
"blockdiag" => Some(TokenKind::Blockdiag),
"zenuml" => Some(TokenKind::Zenuml),
"entityRelationship" => Some(TokenKind::EntityRelationship),
"info" => Some(TokenKind::Info),
"config" => Some(TokenKind::Config),
"theme" => Some(TokenKind::Theme),
"direction" => Some(TokenKind::Direction),
"subgraph" => Some(TokenKind::Subgraph),
"link" => Some(TokenKind::Link),
"click" => Some(TokenKind::Click),
"call" => Some(TokenKind::Call),
"activate" => Some(TokenKind::Activate),
"deactivate" => Some(TokenKind::Deactivate),
"alt" => Some(TokenKind::Alt),
"opt" => Some(TokenKind::Opt),
"loop" => Some(TokenKind::LoopKeyword), // Differentiate from 'loop' as an identifier
"par" => Some(TokenKind::Par),
"break" => Some(TokenKind::Break),
"critical" => Some(TokenKind::Critical),
"note" => Some(TokenKind::Note),
"rect" => Some(TokenKind::Rect),
"box" => Some(TokenKind::Box),
"actor" => Some(TokenKind::Actor),
"participant" => Some(TokenKind::Participant),
"as" => Some(TokenKind::As),
"title" => Some(TokenKind::Title),
"section" => Some(TokenKind::Section),
"end" => Some(TokenKind::End),
"class" => Some(TokenKind::Class),
"enum" => Some(TokenKind::Enum),
"interface" => Some(TokenKind::Interface),
"abstract" => Some(TokenKind::Abstract),
"static" => Some(TokenKind::Static),
"public" => Some(TokenKind::Public),
"private" => Some(TokenKind::Private),
"protected" => Some(TokenKind::Protected),
"package" => Some(TokenKind::Package),
"namespace" => Some(TokenKind::Namespace),
"implements" => Some(TokenKind::Implements),
"extends" => Some(TokenKind::Extends),
"aggregation" => Some(TokenKind::Aggregation),
"composition" => Some(TokenKind::Composition),
"association" => Some(TokenKind::Association),
"dependency" => Some(TokenKind::Dependency),
"generalization" => Some(TokenKind::Generalization),
"realization" => Some(TokenKind::Realization),
"style" => Some(TokenKind::Style),
"classDef" => Some(TokenKind::ClassDef),
"linkStyle" => Some(TokenKind::LinkStyle),
"fill" => Some(TokenKind::Fill),
"stroke" => Some(TokenKind::Stroke),
"stroke-width" => Some(TokenKind::StrokeWidth),
"font-size" => Some(TokenKind::FontSize),
"font-family" => Some(TokenKind::FontFamily),
"text-align" => Some(TokenKind::TextAlign),
"left" => Some(TokenKind::Left),
"right" => Some(TokenKind::Right),
"up" => Some(TokenKind::Up),
"down" => Some(TokenKind::Down),
"fork" => Some(TokenKind::Fork),
"join" => Some(TokenKind::Join),
"choice" => Some(TokenKind::Choice),
"state" => Some(TokenKind::State),
"entry" => Some(TokenKind::Entry),
"exit" => Some(TokenKind::Exit),
"on" => Some(TokenKind::On),
"off" => Some(TokenKind::Off),
"hide" => Some(TokenKind::Hide),
"show" => Some(TokenKind::Show),
// Directions
"LR" => Some(TokenKind::LR),
"RL" => Some(TokenKind::RL),
"TB" => Some(TokenKind::TB),
"BT" => Some(TokenKind::BT),
"TD" => Some(TokenKind::TD),
_ => None,
}
}
}
impl fmt::Display for TokenKind {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
write!(f, "{:?}", self) // Default to debug output for now
}
}
/// Represents a single token found by the lexer.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Token<'a> {
pub kind: TokenKind,
pub span: Span<'a>,
}
impl<'a> Token<'a> {
/// Creates a new `Token`.
pub fn new(kind: TokenKind, span: Span<'a>) -> Self {
Self { kind, span }
}
}
impl<'a> fmt::Display for Token<'a> {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
write!(f, "Token {{ kind: {}, span: {} }}", self.kind, self.span)
}
}
Explanation:
- `Span<'a>`: Captures the exact location and literal text of a token in the original source. `start` and `end` are byte indices, crucial for precise error reporting and highlighting. The `'a` lifetime parameter ensures the `literal` slice refers back to the original input string, avoiding unnecessary string allocations.
- `TokenKind`: This exhaustive enum defines all possible types of tokens our lexer can recognize, including a comprehensive list of Mermaid keywords, operators, and arrow types. The `from_keyword` helper lets us quickly check whether an identified string is a reserved keyword. Note the `LoopKeyword` distinction for `loop`, to handle cases where `loop` might also be a valid identifier.
- `Token<'a>`: Combines a `TokenKind` with its `Span`, providing a complete representation of a recognized lexical unit.
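A small aside on why `Span` stores byte indices rather than character counts: Rust string slicing operates on bytes, and multi-byte UTF-8 characters make the two diverge. This standalone snippet demonstrates the difference:

```rust
// Byte offsets vs. character counts in UTF-8 strings.
fn main() {
    let source = "héllo";
    assert_eq!(source.len(), 6);           // 'é' occupies 2 bytes in UTF-8
    assert_eq!(source.chars().count(), 5); // but it is a single char
    let prefix = &source[0..3];            // byte range covering "hé"
    assert_eq!(prefix, "hé");
    println!("byte-indexed spans slice correctly");
}
```

Because `advance` bumps `current_pos` by `len_utf8()`, every `Span { start, end }` our lexer produces lands on a valid character boundary and can be sliced directly out of the source.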
2.3 Defining Lexer Errors
Errors are an inevitable part of parsing malformed input. Our lexer needs to report them clearly.
File: src/lexer/error.rs
// src/lexer/error.rs
use crate::lexer::Span; // Import Span from our lexer module
use std::fmt;
use std::error::Error;
/// Represents an error encountered during lexing.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum LexerError<'a> {
/// An unexpected or unrecognized character was found.
InvalidCharacter {
span: Span<'a>,
found: char,
},
/// A string literal was started but not properly terminated.
UnterminatedString {
span: Span<'a>,
missing_delimiter: char,
},
/// An invalid escape sequence was found within a string.
InvalidEscapeSequence {
span: Span<'a>,
sequence: &'a str,
},
/// An unexpected end of file was reached while expecting more characters.
UnexpectedEof {
span: Span<'a>,
expected: &'static str,
},
/// A generic lexer error, used for less specific issues.
GenericError {
span: Span<'a>,
message: String,
},
}
impl<'a> fmt::Display for LexerError<'a> {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
match self {
LexerError::InvalidCharacter { span, found } => {
write!(f, "Error: Invalid character '{}' at {}", found, span)
}
LexerError::UnterminatedString { span, missing_delimiter } => {
write!(f, "Error: Unterminated string starting at {}. Expected '{}'", span, missing_delimiter)
}
LexerError::InvalidEscapeSequence { span, sequence } => {
write!(f, "Error: Invalid escape sequence '{}' at {}", sequence, span)
}
LexerError::UnexpectedEof { span, expected } => {
write!(f, "Error: Unexpected end of file at {}. Expected {}", span, expected)
}
LexerError::GenericError { span, message } => {
write!(f, "Error: {} at {}", message, span)
}
}
}
}
impl<'a> Error for LexerError<'a> {}
Explanation:
- `LexerError<'a>`: This enum captures the various errors that can occur during tokenization. Each variant includes a `Span` to pinpoint the exact location of the error, which is vital for good diagnostics.
- `Display` and `Error` traits: Implementing these makes `LexerError` behave like a standard Rust error type, so it can be printed easily and integrated into error-handling chains.
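Here is the same `Display` + `Error` pattern in miniature, using a throwaway `MiniError` enum (illustrative only), to show how such a type slots into `Box<dyn Error>` chains:

```rust
use std::error::Error;
use std::fmt;

// A one-variant error enum with Display and Error implementations,
// structurally the same recipe as LexerError above.
#[derive(Debug)]
enum MiniError {
    Invalid { pos: usize, found: char },
}

impl fmt::Display for MiniError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            MiniError::Invalid { pos, found } => {
                write!(f, "invalid character '{found}' at byte {pos}")
            }
        }
    }
}

impl Error for MiniError {}

fn main() -> Result<(), Box<dyn Error>> {
    // Any Display + Error type coerces into Box<dyn Error>.
    let err: Box<dyn Error> = Box::new(MiniError::Invalid { pos: 4, found: '§' });
    println!("{err}"); // prints: invalid character '§' at byte 4
    Ok(())
}
```

This is why `main` in section 2.5 can return `Result<(), Box<dyn std::error::Error>>` and still propagate lexer errors with `?`.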
2.4 Implementing the Core Lexer Logic
This is the heart of our lexer. The Lexer struct will manage the input string, its current position, and contain the methods to read different token types.
File: src/lexer/lexer.rs
// src/lexer/lexer.rs
use crate::lexer::error::LexerError;
use crate::lexer::token::{Span, Token, TokenKind};
use std::str::Chars;
use std::iter::Peekable;
/// The Lexer is responsible for converting a Mermaid source string into a stream of tokens.
pub struct Lexer<'a> {
source: &'a str,
chars: Peekable<Chars<'a>>,
current_pos: usize, // Current byte offset in the source string
start_pos: usize, // Start byte offset of the current token being scanned
}
impl<'a> Lexer<'a> {
/// Creates a new `Lexer` instance from the given Mermaid source code.
pub fn new(source: &'a str) -> Self {
Self {
source,
chars: source.chars().peekable(),
current_pos: 0,
start_pos: 0,
}
}
/// Peeks at the next character without consuming it.
fn peek(&mut self) -> Option<char> {
self.chars.peek().copied()
}
/// Consumes and returns the next character, advancing the current position.
fn advance(&mut self) -> Option<char> {
let c = self.chars.next();
if let Some(ch) = c {
self.current_pos += ch.len_utf8();
}
c
}
/// Consumes the next character if it matches the expected character.
fn consume_if_match(&mut self, expected: char) -> bool {
if self.peek() == Some(expected) {
self.advance();
true
} else {
false
}
}
/// Creates a `Span` for the token just scanned.
fn create_span(&self) -> Span<'a> {
Span::new(
self.start_pos,
self.current_pos,
&self.source[self.start_pos..self.current_pos],
)
}
/// Skips whitespace characters.
fn skip_whitespace(&mut self) {
while self.peek().map_or(false, |c| c.is_ascii_whitespace() && c != '\n') {
self.advance();
}
}
/// Reads an identifier or keyword.
/// The first character was already consumed by `next_token`, so the
/// literal is sliced from `start_pos` through the current position.
/// Note: allowing '-' means an unspaced `A-->B` would lex as the
/// identifier `A--`; we will revisit this when wiring up the parser.
fn read_identifier_or_keyword(&mut self) -> TokenKind {
    while self.peek().map_or(false, |c| c.is_alphanumeric() || c == '_' || c == '-') {
        self.advance();
    }
    let literal = &self.source[self.start_pos..self.current_pos];
    TokenKind::from_keyword(literal).unwrap_or(TokenKind::Identifier)
}
/// Reads a string literal (single, double, or backtick quoted).
/// The caller has already consumed the opening delimiter.
fn read_string(&mut self, delimiter: char) -> Result<TokenKind, LexerError<'a>> {
    let mut escaped = false;
    while let Some(c) = self.peek() {
        if c == '\\' && !escaped {
            escaped = true;
            self.advance();
            continue;
        }
        if c == delimiter && !escaped {
            self.advance(); // Consume the closing delimiter
            return Ok(TokenKind::String);
        }
        if c == '\n' && !escaped {
            // Strings may not span lines: report an unterminated string.
            return Err(LexerError::UnterminatedString {
                span: self.create_span(),
                missing_delimiter: delimiter,
            });
        }
        escaped = false;
        self.advance();
    }
    // Reached EOF without a closing delimiter
    Err(LexerError::UnterminatedString {
        span: self.create_span(),
        missing_delimiter: delimiter,
    })
}
/// Reads the body of a single-line comment (%% ...).
/// The caller has already consumed the leading `%%`.
fn read_comment(&mut self) -> TokenKind {
    self.take_while(|c| c != '\n'); // Consume until newline or EOF
    TokenKind::Comment
}
/// Reads a multi-line directive (%%{ ... }%%).
/// The caller has already consumed the leading `%%`; this consumes the `{`.
fn read_directive(&mut self) -> Result<TokenKind, LexerError<'a>> {
    self.advance(); // Consume '{'
    let mut balance = 1;
    while let Some(c) = self.advance() {
        match c {
            '{' => balance += 1,
            '}' => balance -= 1,
            _ => {}
        }
        if balance == 0 {
            // Braces are balanced; expect the closing `%%`.
            if self.peek() == Some('%') {
                self.advance();
                if self.peek() == Some('%') {
                    self.advance();
                    return Ok(TokenKind::Directive);
                }
            }
            // `}` was not followed by `%%`. Emit a Directive token anyway
            // and let the parser reject the malformed directive.
            return Ok(TokenKind::Directive);
        }
    }
    // Reached EOF without the closing '}'
    Err(LexerError::UnterminatedString {
        span: self.create_span(),
        missing_delimiter: '}',
    })
}
/// Reads various arrow types. This is a complex part.
/// Priority matters: longer arrows are matched first.
/// The first character was already consumed by `next_token`, so we
/// recover it from the source at the token's start position rather
/// than advancing again.
fn read_arrow(&mut self) -> TokenKind {
    let first_char = self.source[self.start_pos..].chars().next().unwrap();
    match first_char {
        '-' => {
            if self.consume_if_match('.') {
                if self.consume_if_match('-') {
                    if self.consume_if_match('>') { return TokenKind::ArrowDotted; } // -.->
                }
            } else if self.consume_if_match('-') {
                if self.consume_if_match('>') { return TokenKind::ArrowNormal; } // -->
                if self.consume_if_match('x') { return TokenKind::ArrowCross; } // --x
                if self.consume_if_match('o') { return TokenKind::ArrowCircle; } // --o
                return TokenKind::Minus; // Fallback for '--'
            } else if self.consume_if_match('>') {
                return TokenKind::ArrowOpen; // ->
            }
            TokenKind::Minus // Single '-'
        },
        '=' => {
            if self.consume_if_match('=') {
                if self.consume_if_match('>') { return TokenKind::ArrowThick; } // ==>
                if self.consume_if_match('.') {
                    if self.consume_if_match('=') {
                        if self.consume_if_match('>') { return TokenKind::ArrowThickDotted; } // ==.=>
                    }
                }
            }
            TokenKind::Equal // Single '='
        },
        '<' => {
            if self.consume_if_match('-') {
                if self.consume_if_match('-') {
                    if self.consume_if_match('>') { return TokenKind::ArrowDoubleOpen; } // <-->
                    if self.consume_if_match('x') { return TokenKind::ArrowDoubleCross; } // <--x
                    if self.consume_if_match('o') { return TokenKind::ArrowDoubleCircle; } // <--o
                    return TokenKind::ArrowDouble; // <--
                } else if self.consume_if_match('>') {
                    return TokenKind::ArrowDouble; // <->
                }
                return TokenKind::ArrowDouble; // <-
            }
            TokenKind::LessThan // Single '<'
        },
        _ => unreachable!("read_arrow called on non-arrow char"),
    }
}
/// Reads a number literal (simple integers for now).
/// The first digit was already consumed by `next_token`; this consumes the rest.
fn read_number(&mut self) -> TokenKind {
    self.take_while(|c| c.is_ascii_digit());
    // TODO: Extend for floats (e.g., 123.45)
    TokenKind::Number
}
/// Consumes characters while the predicate `p` is true, returning the consumed slice.
fn take_while<P>(&mut self, p: P) -> Option<&'a str>
where
P: Fn(char) -> bool,
{
let start = self.current_pos;
while let Some(c) = self.peek() {
if p(c) {
self.advance();
} else {
break;
}
}
if self.current_pos > start {
Some(&self.source[start..self.current_pos])
} else {
None
}
}
/// The main lexing function: determines the next token.
fn next_token(&mut self) -> Result<Token<'a>, LexerError<'a>> {
self.start_pos = self.current_pos; // Mark the start of the new token
// Skip leading whitespace (but not newlines, as they can be significant)
self.skip_whitespace();
if self.current_pos > self.start_pos {
// If we skipped whitespace, create a whitespace token
let span = Span::new(self.start_pos, self.current_pos, &self.source[self.start_pos..self.current_pos]);
// Reset start_pos for the *next* actual token
self.start_pos = self.current_pos;
return Ok(Token::new(TokenKind::Whitespace, span));
}
let current_char = match self.advance() {
    Some(c) => c,
    None => {
        // End of input: emit a zero-width Eof token at the current position.
        let span = Span::new(self.current_pos, self.current_pos, "");
        return Ok(Token::new(TokenKind::Eof, span));
    }
};
let kind = match current_char {
'\n' => TokenKind::Newline,
'(' => TokenKind::OpenParen,
')' => TokenKind::CloseParen,
'[' => TokenKind::OpenBracket,
']' => TokenKind::CloseBracket,
'{' => {
    // A '{' here is always OpenBrace: directive openings (%%{) are
    // caught in the '%' branch below before we ever see the brace alone.
    TokenKind::OpenBrace
},
'}' => TokenKind::CloseBrace,
'<' => self.read_arrow(), // Can be part of <--, <-->
'>' => TokenKind::GreaterThan,
'+' => TokenKind::Plus,
'*' => TokenKind::Star,
'/' => TokenKind::Slash,
':' => TokenKind::Colon,
',' => TokenKind::Comma,
'.' => {
    // Dotted arrows such as `-.->` begin with '-', which `read_arrow`
    // handles; a bare '.' here is simply the Dot operator.
    TokenKind::Dot
},
'=' => self.read_arrow(), // Can be part of ===>, ==>
'~' => TokenKind::Tilde,
'|' => TokenKind::Pipe,
'#' => TokenKind::Hash,
'!' => TokenKind::Bang,
'@' => TokenKind::At,
'$' => TokenKind::Dollar,
'%' => {
    if self.peek() == Some('%') {
        // Either a comment (%%) or a directive (%%{)
        self.advance(); // Consume the second '%'
        if self.peek() == Some('{') {
            return self.read_directive().map(|k| Token::new(k, self.create_span()));
        } else {
            self.read_comment();
            TokenKind::Comment
        }
    } else {
        TokenKind::Percent // Single '%'
    }
},
'^' => TokenKind::Caret,
'&' => TokenKind::Ampersand,
'`' => return self.read_string('`').map(|k| Token::new(k, self.create_span())),
'\'' => return self.read_string('\'').map(|k| Token::new(k, self.create_span())),
'"' => return self.read_string('"').map(|k| Token::new(k, self.create_span())),
'-' => self.read_arrow(), // Can be part of -->, ---, -.->
c if c.is_ascii_digit() => {
    // The first digit was consumed above; read_number consumes the rest,
    // and create_span still covers the full literal via start_pos.
    self.read_number()
},
c if c.is_alphabetic() || c == '_' => {
    // Likewise, the token's span starts at start_pos, so the
    // already-consumed first character is included in the literal.
    self.read_identifier_or_keyword()
},
_ => {
// Unrecognized character
return Err(LexerError::InvalidCharacter {
span: self.create_span(),
found: current_char,
});
}
};
Ok(Token::new(kind, self.create_span()))
}
}
// Implement the Iterator trait for Lexer
impl<'a> Iterator for Lexer<'a> {
    type Item = Result<Token<'a>, LexerError<'a>>;
    fn next(&mut self) -> Option<Self::Item> {
        // A `start_pos` past the end of the source marks that Eof has
        // already been yielded; stop iterating so Eof appears exactly once.
        if self.start_pos > self.source.len() {
            return None;
        }
        // Repeatedly call next_token until a non-whitespace token, an
        // error, or Eof is found.
        loop {
            match self.next_token() {
                Ok(token) => match token.kind {
                    TokenKind::Eof => {
                        self.start_pos = self.source.len() + 1;
                        return Some(Ok(token));
                    }
                    // The parser never needs raw whitespace; skip it.
                    TokenKind::Whitespace => continue,
                    _ => return Some(Ok(token)),
                },
                Err(e) => return Some(Err(e)),
            }
        }
    }
}
Explanation:
- `Lexer<'a>` struct: Holds the `source` string, a `Peekable<Chars<'a>>` iterator for efficient character consumption, `current_pos` (the byte index of the next character to be read), and `start_pos` (the byte index where the current token began).
- `new`: Constructor that initializes the lexer.
- `peek`, `advance`, `consume_if_match`: Helper methods for character-stream manipulation. `advance` correctly updates `current_pos` based on UTF-8 character length.
- `create_span`: Generates a `Span` for the token currently being processed.
- `skip_whitespace`: Consumes standard whitespace but not newlines, since newlines can be syntactically significant in Mermaid (e.g., separating statements). The `Iterator` implementation decides whether to yield `Whitespace` tokens or skip them; ours skips them by default while handling newlines specifically.
- `read_identifier_or_keyword`: Reads alphanumeric characters, hyphens, and underscores, then checks whether the resulting string matches a known Mermaid keyword. If so, it returns the keyword's `TokenKind`; otherwise it is an `Identifier`.
- `read_string`: Handles single-, double-, and backtick-quoted strings. It tracks backslash escapes so an escaped delimiter does not terminate the string, and it reports unterminated strings.
- `read_comment`: Consumes characters for single-line comments (`%%`).
- `read_directive`: Handles multi-line directives (`%%{ ... }%%`), including nested braces, and checks for the closing `}%%`.
- `read_arrow`: Uses lookahead to distinguish the various Mermaid arrow types (e.g., `->`, `-->`, `-.->`, `<--`). The order of checks matters (longest match first) to ensure correct tokenization.
- `read_number`: A basic implementation for integer numbers; it could be extended for floating-point literals.
- `take_while`: A generic helper that consumes characters as long as a predicate holds.
- `next_token`: The core state machine. It determines the type of the next token based on the current character and potential lookaheads, delegating to the specific `read_*` methods. It handles `Eof` and dispatches errors.
- `Iterator` implementation: Makes the `Lexer` directly usable in `for` loops. It calls `next_token` repeatedly, filters out `Whitespace` tokens (they are generally not needed by the parser, though this could be made configurable), and returns `Result<Token, LexerError>`, ensuring `Eof` is yielded exactly once.
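The longest-match-first rule that `read_arrow` relies on can be demonstrated in isolation. This toy `match_arrow` function (not part of the lexer) tries a handful of patterns in descending length against the head of the input:

```rust
// Longest-match-first arrow disambiguation in miniature.
fn match_arrow(input: &str) -> Option<(&'static str, usize)> {
    // Ordered longest to shortest: a shorter pattern listed first would
    // wrongly claim the prefix of a longer arrow (e.g. "->" inside "-->").
    const ARROWS: [&str; 4] = ["-.->", "-->", "->", "--"];
    ARROWS
        .iter()
        .find(|a| input.starts_with(*a))
        .map(|a| (*a, a.len()))
}

fn main() {
    assert_eq!(match_arrow("--> B"), Some(("-->", 3)));
    assert_eq!(match_arrow("-.-> B"), Some(("-.->", 4)));
    assert_eq!(match_arrow("-> B"), Some(("->", 2)));
    println!("longest match wins");
}
```

Our real `read_arrow` achieves the same effect with nested `consume_if_match` calls instead of a pattern table, which avoids rescanning the input, but the ordering principle is identical.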
2.5 Integrating with main.rs for Testing
Let’s modify src/main.rs to use our new lexer and print the token stream. This allows us to test its functionality immediately.
File: src/main.rs
// src/main.rs
mod lexer;
use lexer::{Lexer, Token, LexerError};
use std::fs;
use std::path::PathBuf;
use clap::Parser; // For CLI arguments, from Chapter 1
#[derive(Parser, Debug)]
#[command(author, version, about, long_about = None)]
struct Args {
/// Path to the Mermaid file to process
#[arg(short, long, value_name = "FILE")]
file: Option<PathBuf>,
/// Mermaid code directly as a string
#[arg(short, long, value_name = "CODE")]
code: Option<String>,
/// Output mode: "lint", "fix", "strict"
#[arg(short, long, default_value = "lint")]
mode: String,
}
fn main() -> Result<(), Box<dyn std::error::Error>> {
let args = Args::parse();
let mermaid_source = if let Some(file_path) = args.file {
fs::read_to_string(&file_path)?
} else if let Some(code_str) = args.code {
code_str
} else {
eprintln!("Error: No input file or Mermaid code provided. Use --file or --code.");
std::process::exit(1);
};
println!("--- Lexing Input ---");
let lexer = Lexer::new(&mermaid_source);
let mut tokens: Vec<Token> = Vec::new();
let mut errors: Vec<LexerError> = Vec::new();
for item in lexer {
match item {
Ok(token) => {
println!("{}", token);
tokens.push(token);
}
Err(e) => {
eprintln!("{}", e);
errors.push(e);
}
}
}
println!("\n--- Lexing Summary ---");
println!("Total tokens: {}", tokens.len());
if !errors.is_empty() {
eprintln!("Lexing completed with {} errors.", errors.len());
std::process::exit(1);
} else {
println!("Lexing completed successfully with no errors.");
}
Ok(())
}
Explanation:
- We've added `mod lexer;` and `use lexer::{Lexer, Token, LexerError};` to bring our lexer components into scope.
- The `main` function accepts either a file path or a direct code string.
- It instantiates `Lexer` with the input source.
- It then iterates through the `Lexer` (thanks to the `Iterator` trait implementation), printing each successfully tokenized `Token` or any `LexerError` encountered.
- A basic summary is printed at the end.
2.6 Testing This Component
We need dedicated unit tests to ensure our lexer behaves as expected for various inputs, including valid syntax and error conditions.
Action: Add a #[cfg(test)] block to src/lexer/lexer.rs for unit tests.
File: src/lexer/lexer.rs (append this block)
// src/lexer/lexer.rs (continued)
#[cfg(test)]
mod tests {
use super::*;
/// Helper function to lex and collect tokens (the lexer itself skips insignificant whitespace).
fn lex_and_collect(source: &str) -> Result<Vec<Token>, LexerError> {
let lexer = Lexer::new(source);
let mut tokens = Vec::new();
for item in lexer {
match item {
Ok(token) => tokens.push(token),
Err(e) => return Err(e),
}
}
Ok(tokens)
}
/// Helper function to lex and collect token kinds.
fn lex_and_collect_kinds(source: &str) -> Result<Vec<TokenKind>, LexerError> {
let lexer = Lexer::new(source);
let mut kinds = Vec::new();
for item in lexer {
match item {
Ok(token) => kinds.push(token.kind),
Err(e) => return Err(e),
}
}
Ok(kinds)
}
#[test]
fn test_empty_input() {
let tokens = lex_and_collect_kinds("").unwrap();
assert_eq!(tokens, vec![TokenKind::Eof]);
}
#[test]
fn test_basic_flowchart() {
let source = "graph TD\n A[Node A] --> B(Node B)";
let tokens = lex_and_collect_kinds(source).unwrap();
assert_eq!(
tokens,
vec![
TokenKind::Graph,
TokenKind::TD,
TokenKind::Newline,
TokenKind::Identifier,
TokenKind::OpenBracket,
TokenKind::Identifier, // "Node A" (will be parsed as a string literal later)
TokenKind::CloseBracket,
TokenKind::ArrowNormal,
TokenKind::Identifier,
TokenKind::OpenParen,
TokenKind::Identifier, // "Node B"
TokenKind::CloseParen,
TokenKind::Eof
]
);
}
#[test]
fn test_keywords_and_identifiers() {
let source = "flowchart LR\nsubgraph MySubgraph\n Node1 --> Node2\nend";
let tokens = lex_and_collect_kinds(source).unwrap();
assert_eq!(
tokens,
vec![
TokenKind::Flowchart,
TokenKind::LR,
TokenKind::Newline,
TokenKind::Subgraph,
TokenKind::Identifier, // MySubgraph
TokenKind::Newline,
TokenKind::Identifier, // Node1
TokenKind::ArrowNormal,
TokenKind::Identifier, // Node2
TokenKind::Newline,
TokenKind::End,
TokenKind::Eof
]
);
}
#[test]
fn test_all_arrow_types() {
let source = "A --> B\nC --- D\nE ==> F\nG --x H\nI --o J\nK -.-> L\nM <--> N\nO ===> P\nQ -.=> R"; // Missing -=.=>
let tokens = lex_and_collect_kinds(source).unwrap();
assert!(tokens.contains(&TokenKind::ArrowNormal));
assert!(tokens.contains(&TokenKind::Minus)); // --- will be Minus, Minus, Minus for now
assert!(tokens.contains(&TokenKind::ArrowThick)); // ==>
assert!(tokens.contains(&TokenKind::ArrowCross));
assert!(tokens.contains(&TokenKind::ArrowCircle));
assert!(tokens.contains(&TokenKind::ArrowDotted));
assert!(tokens.contains(&TokenKind::ArrowDoubleOpen));
assert!(tokens.contains(&TokenKind::ArrowThick)); // ===>
assert!(tokens.contains(&TokenKind::ArrowThickDotted));
}
#[test]
fn test_quoted_strings() {
let source = "A[\"Hello World\"]\nB['Another String']\nC[`Backtick String`]";
let tokens = lex_and_collect_kinds(source).unwrap();
assert_eq!(
tokens,
vec![
TokenKind::Identifier,
TokenKind::OpenBracket,
TokenKind::String,
TokenKind::CloseBracket,
TokenKind::Newline,
TokenKind::Identifier,
TokenKind::OpenBracket,
TokenKind::String,
TokenKind::CloseBracket,
TokenKind::Newline,
TokenKind::Identifier,
TokenKind::OpenBracket,
TokenKind::String,
TokenKind::CloseBracket,
TokenKind::Eof
]
);
let source_escaped = "A[\"Hello \\\"World\\\"\"]";
let tokens_escaped = lex_and_collect(source_escaped).unwrap();
assert_eq!(tokens_escaped[2].span.literal, "\"Hello \\\"World\\\"\"");
}
#[test]
fn test_comments_and_directives() {
let source = "%% This is a comment\ngraph TD\n%%{ init: { 'theme': 'dark' } }%%\n A --> B";
let tokens = lex_and_collect_kinds(source).unwrap();
assert_eq!(
tokens,
vec![
TokenKind::Comment,
TokenKind::Newline,
TokenKind::Graph,
TokenKind::TD,
TokenKind::Newline,
TokenKind::Directive,
TokenKind::Newline,
TokenKind::Identifier,
TokenKind::ArrowNormal,
TokenKind::Identifier,
TokenKind::Eof
]
);
}
#[test]
fn test_unterminated_string_error() {
let source = "A[\"Unterminated string]";
let error = lex_and_collect(source).unwrap_err();
assert_eq!(
error,
LexerError::UnterminatedString {
span: Span::new(3, 3, ""), // Span points to the beginning of the string literal
missing_delimiter: '"',
}
);
}
#[test]
fn test_invalid_character_error() {
let source = "graph TD\n A[Node A] $ B";
let error = lex_and_collect(source).unwrap_err();
assert_eq!(
error,
LexerError::InvalidCharacter {
span: Span::new(19, 20, "$"),
found: '$',
}
);
}
#[test]
fn test_complex_operators() {
let source = "A + B - C * D / E = F : G | H # I ! J @ K";
let tokens = lex_and_collect_kinds(source).unwrap();
assert_eq!(
tokens,
vec![
TokenKind::Identifier, TokenKind::Plus, TokenKind::Identifier,
TokenKind::Minus, TokenKind::Identifier, TokenKind::Star,
TokenKind::Identifier, TokenKind::Slash, TokenKind::Identifier,
TokenKind::Equal, TokenKind::Identifier, TokenKind::Colon,
TokenKind::Identifier, TokenKind::Pipe, TokenKind::Identifier,
TokenKind::Hash, TokenKind::Identifier, TokenKind::Bang,
TokenKind::Identifier, TokenKind::At, TokenKind::Identifier,
TokenKind::Eof
]
);
}
#[test]
fn test_numbers() {
let source = "Node1(123) -- 456 --> Node2";
let tokens = lex_and_collect_kinds(source).unwrap();
assert_eq!(
tokens,
vec![
TokenKind::Identifier, TokenKind::OpenParen, TokenKind::Number, TokenKind::CloseParen,
TokenKind::Minus, TokenKind::Minus, // --
TokenKind::Number,
TokenKind::ArrowNormal, // -->
TokenKind::Identifier,
TokenKind::Eof
]
);
}
}
Explanation:
- `lex_and_collect` / `lex_and_collect_kinds`: Helper functions that simplify writing tests by abstracting away lexer instantiation and token collection.
- `test_empty_input`: Ensures an empty string correctly yields only an `Eof` token.
- `test_basic_flowchart`: Tests a simple flowchart definition, checking for correct keyword, identifier, bracket, and arrow tokenization.
- `test_keywords_and_identifiers`: Verifies that keywords (`subgraph`, `end`) are correctly distinguished from identifiers (`MySubgraph`, `Node1`).
- `test_all_arrow_types`: Checks various arrow patterns. This is an area where strictness is paramount. Self-correction: the `read_arrow` implementation needs to be very robust to capture all Mermaid arrows. The initial version may treat `---` as three `Minus` tokens instead of a single headless-link token, which must be refined to match the spec. The current test expects `Minus` tokens for `---`, which is not ideal; a proper `---` token should be added.
  - Correction plan: `read_arrow` needs to be more comprehensive, matching the longest possible arrow first. For `---`, it should produce `ArrowNormal` with no head (just a line).
- `test_quoted_strings`: Checks the different quoting styles and basic escape sequences.
- `test_comments_and_directives`: Verifies handling of single-line comments and multi-line directives.
- `test_unterminated_string_error`: Ensures that an unterminated string correctly results in a `LexerError::UnterminatedString`.
- `test_invalid_character_error`: Checks for detection of completely unrecognized characters.
- `test_complex_operators`: Verifies that various single-character operators are correctly tokenized.
- `test_numbers`: Basic number tokenization.
3. Production Considerations
Building a production-ready lexer requires more than just functional code.
3.1 Error Handling
- Precision: Our `LexerError` variants include `Span` information, which is critical for generating accurate diagnostics. The `Display` implementation provides user-friendly messages.
- Recovery: For a production compiler, a lexer might attempt error recovery (e.g., skipping characters until a known token boundary) to allow the parser to continue and find more errors. Our current lexer stops on the first error. For a linter/formatter, reporting all errors is often desired, so an `Iterator` returning `Result` is a good pattern; the `main.rs` example currently collects all errors.
- Error Codes: In a full diagnostic system (which we'll build in a later chapter), each `LexerError` variant would map to a unique error code (e.g., `M001: Invalid Character`).
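To preview that mapping, a minimal sketch might attach codes to variants via a method. The `LexerErrorKind` name and the exact `M00x` numbering here are assumptions anticipating the diagnostics chapter, not a fixed scheme:

```rust
// Hypothetical sketch: stable diagnostic codes per error variant.
// LexerErrorKind is a simplified stand-in for the chapter's LexerError.
#[derive(Debug)]
enum LexerErrorKind {
    InvalidCharacter,
    UnterminatedString,
}

impl LexerErrorKind {
    /// Returns a stable code suitable for docs and suppression comments.
    fn code(&self) -> &'static str {
        match self {
            LexerErrorKind::InvalidCharacter => "M001",
            LexerErrorKind::UnterminatedString => "M002",
        }
    }
}

fn main() {
    assert_eq!(LexerErrorKind::InvalidCharacter.code(), "M001");
    assert_eq!(LexerErrorKind::UnterminatedString.code(), "M002");
}
```

Keeping codes in one `match` means the compiler's exhaustiveness check forces every new variant to receive a code.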
3.2 Performance Optimization
- `std::str::Chars` and `Peekable`: Rust's `Chars` iterator is efficient for UTF-8 strings. `Peekable` allows looking ahead without full backtracking, which is crucial for distinguishing multi-character tokens (like `->` vs `-->`).
- Slices (`&'a str`): By using string slices for `Span::literal` and `Lexer::source`, we avoid unnecessary memory allocations and copies, a major performance win for large inputs.
- Linear Scan: The lexer performs a single pass over the input string, giving it linear time complexity (O(N), where N is the number of characters), which is optimal.
- Keyword Lookup: Using a `match` statement for `from_keyword` is efficient for a relatively small, fixed set of keywords. For very large keyword sets, a `phf` (perfect hash function) crate could offer faster lookups, but `match` is sufficient here.
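As an illustration of that lookup, a trimmed-down `from_keyword` might look like the following; the variant list is deliberately abbreviated, and the real method would cover the full keyword set:

```rust
// Abbreviated sketch of keyword lookup via match. Only a few variants of
// the chapter's TokenKind are reproduced here.
#[derive(Debug, PartialEq)]
enum TokenKind {
    Graph,
    Flowchart,
    Subgraph,
    End,
}

impl TokenKind {
    /// Some(keyword kind) if `word` is reserved, None for plain identifiers.
    fn from_keyword(word: &str) -> Option<TokenKind> {
        match word {
            "graph" => Some(TokenKind::Graph),
            "flowchart" => Some(TokenKind::Flowchart),
            "subgraph" => Some(TokenKind::Subgraph),
            "end" => Some(TokenKind::End),
            _ => None,
        }
    }
}

fn main() {
    assert_eq!(TokenKind::from_keyword("graph"), Some(TokenKind::Graph));
    assert_eq!(TokenKind::from_keyword("MyNode"), None);
}
```

The compiler typically lowers such string matches into efficient comparison chains, so a perfect-hash table only tends to pay off with a much larger key set.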
3.3 Security Considerations
- Valid UTF-8: Rust's `&str` guarantees valid UTF-8, mitigating many character-encoding vulnerabilities. Our lexer works directly with `char`s, which are Unicode scalar values.
- Denial-of-Service (DoS):
  - Long Lines/Tokens: Our character-by-character processing and `take_while` approach means extremely long identifiers or comments don't cause quadratic performance issues.
  - Deeply Nested Structures: The lexer itself doesn't track nesting depth, but `read_directive` handles balanced braces, preventing simple stack overflows for runaway directives.
  - Memory Usage: By using slices, memory usage scales linearly with input size, minimizing the risk of memory exhaustion.
3.4 Logging and Monitoring
- `tracing` crate (future): For production, integrating a logging solution like `tracing` would be essential. Lexer errors could be logged with `error!` macros, and successful tokenization might be logged at the `debug!` level.
- Token Stream Inspection: For debugging complex Mermaid diagrams, having an option to print the raw token stream (as our `main.rs` does) is invaluable.
4. Code Review Checkpoint
At this point, we have a functional lexer module:
- Files Created:
  - `src/lexer/mod.rs`
  - `src/lexer/token.rs`
  - `src/lexer/error.rs`
  - `src/lexer/lexer.rs`
- Files Modified:
  - `src/main.rs` (to integrate and test the lexer)
- Key Data Structures:
  - `Span<'a>`: Location and literal of a token.
  - `TokenKind`: Enum defining all recognized token types.
  - `Token<'a>`: Combines `TokenKind` and `Span`.
  - `LexerError<'a>`: Enum for lexing errors with detailed `Span` info.
  - `Lexer<'a>`: The main struct for tokenizing input, implementing `Iterator`.
- Core Logic:
  - Character-by-character scanning with lookahead.
  - Recognition of keywords, identifiers, various operators, arrows, strings, comments, and directives.
  - Precise error reporting for invalid characters and unterminated strings.
  - Skipping of non-significant whitespace.
This lexer now forms the robust first pass of our Mermaid analyzer.
5. Common Issues & Solutions
Developers working on lexers often encounter these problems:
Off-by-One Errors in `Span` Calculation:
- Issue: The `start` and `end` indices of a `Span` might be incorrect, leading to misaligned error highlights or incorrect literal extraction.
- Debugging: Print the `Span` of every token and manually compare it against the source string. Use `println!("Debug: current_pos={}, start_pos={}", self.current_pos, self.start_pos);` within `next_token`.
- Prevention: Always update `current_pos` after consuming a character (e.g., in `self.advance()`). Set `start_pos` before attempting to scan a new token. Remember `char::len_utf8()` for multi-byte characters.
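The byte-accurate advance can be sketched in isolation like this; `MiniLexer` and its fields are simplified stand-ins for the chapter's `Lexer` struct:

```rust
// Hedged sketch of byte-accurate position tracking. current_pos is a byte
// offset into source, advanced by each character's UTF-8 width.
struct MiniLexer<'a> {
    source: &'a str,
    current_pos: usize, // byte offset, not a char count
}

impl<'a> MiniLexer<'a> {
    fn advance(&mut self) -> Option<char> {
        let c = self.source[self.current_pos..].chars().next()?;
        // Advance by the character's UTF-8 width so spans stay byte-accurate.
        self.current_pos += c.len_utf8();
        Some(c)
    }
}

fn main() {
    let mut lx = MiniLexer { source: "héllo", current_pos: 0 };
    lx.advance(); // 'h' is 1 byte
    assert_eq!(lx.current_pos, 1);
    lx.advance(); // 'é' is 2 bytes in UTF-8
    assert_eq!(lx.current_pos, 3);
}
```

With byte offsets maintained this way, `&source[span.start..span.end]` always slices on character boundaries instead of panicking mid-codepoint.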
Incorrect Keyword vs. Identifier Resolution:
- Issue: A string that looks like an identifier but is actually a keyword (e.g., `loop`) is tokenized incorrectly, or vice versa.
- Debugging: Ensure `read_identifier_or_keyword` always checks `TokenKind::from_keyword` before defaulting to `TokenKind::Identifier`.
- Prevention: Maintain a strict, comprehensive list of keywords and prioritize keyword matching. Some keywords are context-dependent (e.g., `loop` is a keyword in sequence diagrams but a valid node ID elsewhere). The lexer typically tokenizes `loop` as `LoopKeyword` and the parser then decides its semantic role.
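A stripped-down version of that resolution order, using plain strings in place of `TokenKind` variants for brevity (the keyword list and kind names are illustrative placeholders):

```rust
// Sketch: the keyword table is consulted first; Identifier is the fallback.
fn from_keyword(word: &str) -> Option<&'static str> {
    match word {
        "loop" => Some("LoopKeyword"),
        "end" => Some("EndKeyword"),
        _ => None,
    }
}

fn classify(word: &str) -> &'static str {
    // Check the keyword table before defaulting to Identifier.
    from_keyword(word).unwrap_or("Identifier")
}

fn main() {
    assert_eq!(classify("loop"), "LoopKeyword");
    // A longer word that merely starts with a keyword stays an identifier,
    // because the lexer scanned the whole word before the lookup.
    assert_eq!(classify("loopy"), "Identifier");
}
```

The key point is ordering: scan the maximal identifier first, then look it up, so `loopy` never gets split into `loop` + `y`.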
Ambiguous Multi-Character Tokens (e.g., Arrows):
- Issue: `->` is tokenized as `Minus`, `GreaterThan` instead of `ArrowOpen`, or `-->` is tokenized as `Minus`, `Minus`, `GreaterThan` instead of `ArrowNormal`.
- Debugging: The `read_arrow` function needs careful ordering. Always attempt to match the longest possible token first; for example, check for `-->` before `->`.
- Prevention: Implement `read_arrow` (and similar functions for operators like `==`, `===`) with a clear priority for longer sequences, using `peek()` and `consume_if_match()`.
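The longest-match discipline can be sketched with a `Peekable` iterator. The token names below echo the chapter's `TokenKind`, but this standalone sketch only handles a few `-`-prefixed sequences:

```rust
// Minimal longest-match sketch for '-'-prefixed tokens. A real read_arrow
// would cover the full arrow grammar; this shows only the ordering idea.
use std::iter::Peekable;
use std::str::Chars;

fn lex_dashes(chars: &mut Peekable<Chars>) -> &'static str {
    // One '-' has already been consumed. Try the longest sequences first.
    if chars.peek() == Some(&'-') {
        chars.next(); // second '-'
        if chars.peek() == Some(&'>') {
            chars.next();
            return "ArrowNormal"; // "-->"
        }
        return "OpenLink"; // "--" (line without a head, for now)
    }
    if chars.peek() == Some(&'>') {
        chars.next();
        return "ArrowOpen"; // "->"
    }
    "Minus" // lone '-'
}

fn first_token(src: &str) -> &'static str {
    let mut chars = src.chars().peekable();
    chars.next(); // consume the leading '-'
    lex_dashes(&mut chars)
}

fn main() {
    assert_eq!(first_token("-->"), "ArrowNormal");
    assert_eq!(first_token("->"), "ArrowOpen");
    assert_eq!(first_token("-"), "Minus");
}
```

Peeking for a second `-` before checking for `>` is exactly what prevents `-->` from splitting into `Minus`, `Minus`, `GreaterThan`.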
Unterminated String/Directive Errors:
- Issue: The lexer fails to correctly identify when a string or multi-line directive is opened but never closed, leading to incorrect tokenization or crashes.
- Debugging: Ensure `read_string` and `read_directive` handle EOF gracefully within their loops and return the appropriate `LexerError`. Check for newlines within strings if they are not allowed (Mermaid strings can be multiline, so this requires careful spec adherence).
- Prevention: Explicitly check for delimiters and EOF in loop conditions. Use a counter for nested structures like `%%{ ... }%%`.
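The brace-counting idea, with explicit EOF handling, can be sketched as follows; this is a simplified stand-in for the chapter's `read_directive`, assuming the input starts just after the opening `%%{`:

```rust
// Hedged sketch: scan a directive body with a depth counter, returning an
// error if EOF arrives before the braces balance.
fn read_directive_body(src: &str) -> Result<&str, &'static str> {
    let mut depth = 1usize; // the opening '{' has already been seen
    for (i, c) in src.char_indices() {
        match c {
            '{' => depth += 1,
            '}' => {
                depth -= 1;
                if depth == 0 {
                    return Ok(&src[..i]); // body, excluding the closing '}'
                }
            }
            _ => {}
        }
    }
    Err("unterminated directive") // hit EOF before braces balanced
}

fn main() {
    assert_eq!(read_directive_body("a{b}c}%%"), Ok("a{b}c"));
    assert!(read_directive_body("never closed").is_err());
}
```

Because the loop's only exit paths are "depth reached zero" or "input exhausted", a runaway directive always produces a clean error instead of a hang or crash.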
6. Testing & Verification
To verify the lexer’s correctness:
- Run Unit Tests: Execute `cargo test` in your project root. All tests in `src/lexer/lexer.rs` should pass. Pay close attention to the arrow tests; if you've implemented `read_arrow` more robustly, update the expected `TokenKind`s in `test_all_arrow_types` to reflect this (e.g., for `---` you might expect `TokenKind::ArrowNormalNoHead` if you define such a token).
- Manual Verification with `main.rs`:
  - Create a file `example.mmd` with a complex Mermaid diagram, including various keywords, arrows, strings, comments, and even some deliberate errors.
  - Run `cargo run -- --file example.mmd`.
  - Observe the printed token stream. Does it match your expectations? Are error messages clear and correctly located?
  - Test with direct code: `cargo run -- --code "graph TD\n A[Node] --> B{Decision}"`
  - Test error cases: `cargo run -- --code "graph TD\nA[\"Unterminated string]"`
  - Test edge cases: empty input, only whitespace, only comments.
By meticulously testing these scenarios, you can gain high confidence in the lexer’s ability to correctly tokenize Mermaid source code.
7. Summary & Next Steps
Congratulations! You’ve successfully designed and implemented the first critical component of our Mermaid analyzer: the Lexer.
In this chapter, we:
- Understood the role and importance of a lexer in a compiler pipeline.
- Designed the `Token`, `Span`, and `LexerError` data structures for precise representation and error reporting.
- Implemented the `Lexer` struct with methods for recognizing various Mermaid syntax elements, including keywords, identifiers, strings, comments, and a complex array of arrow types.
- Integrated the lexer into `src/main.rs` for immediate testing and verification.
- Wrote comprehensive unit tests to ensure the lexer's correctness and robustness.
- Discussed production considerations including performance, error handling, and security.
Our lexer now reliably converts raw Mermaid text into a stream of meaningful tokens, complete with accurate location information. This token stream is the perfect input for the next stage of our tool: the Parser.
In Chapter 3: Building the Parser: Constructing the Abstract Syntax Tree (AST), we will take this token stream and transform it into a strongly typed Abstract Syntax Tree (AST), which is a hierarchical representation of the Mermaid diagram’s structure. This AST will be the foundation for our validation, rule enforcement, and formatting logic.