Welcome to Chapter 2 of our journey to build a robust Mermaid code analyzer and fixer in Rust! In the previous chapter, we laid the foundational project structure and set up our development environment. With the groundwork complete, we’re now ready to dive into the core components of our compiler-like tool. This chapter focuses on the very first stage of any compiler pipeline: the Lexer.
The lexer, also known as a tokenizer or scanner, is responsible for transforming the raw input Mermaid source code (a sequence of characters) into a stream of meaningful Tokens. Each token represents a fundamental building block of the Mermaid language, such as keywords, identifiers, operators, strings, and comments. This process is crucial because it provides the subsequent parsing stage with a structured and simplified view of the input, making it easier to analyze the grammar and build the Abstract Syntax Tree (AST). Our lexer will be designed for strict adherence to the Mermaid specification, handling whitespace normalization, invalid character detection, and providing precise error reporting.
By the end of this chapter, you will have implemented a production-ready lexer in Rust that can accurately tokenize various Mermaid diagram types. We will ensure our implementation is efficient, handles edge cases gracefully, and provides clear diagnostics. This component will be the bedrock upon which our parser, validator, and formatter will be built, emphasizing correctness and predictability from the start.
1. Planning & Design
Before we start writing code, let’s establish a clear design for our lexer. This involves understanding what Mermaid syntax elements we need to tokenize, defining our data structures for tokens and errors, and outlining the overall architecture of the lexing process.
1.1 Mermaid Syntax Overview (Lexer Perspective)
Mermaid diagrams, while human-readable, have a precise grammar. From a lexer’s viewpoint, we need to recognize patterns for:
- Keywords: `graph`, `flowchart`, `sequenceDiagram`, `subgraph`, `end`, `link`, `click`, `classDef`, `style`, `direction`, `LR`, `TB`, etc. These are reserved words that dictate structure.
- Identifiers: Names given to nodes, actors, classes, or variables. These are typically alphanumeric, often containing hyphens or underscores. They can also be quoted strings if they contain spaces or special characters.
- Operators/Punctuation: `[`, `]`, `(`, `)`, `{`, `}`, `<`, `>`, `:`, `,`, `.`, `+`, `-`, `*`, `/`, `=`, `~`, `|`, `!`, `#`. These define relationships, properties, or structural boundaries.
- Arrows: Mermaid has a rich set of arrow types indicating relationships: `-->`, `---`, `==>`, `-->>`, `-.->`, `o--o`, `<-->`, `x--x`, etc. These need careful pattern matching.
- Strings/Labels: Text enclosed in single quotes (`'...'`), double quotes (`"..."`), or backticks. These are used for node labels, edge labels, or comments. Text can also be unquoted if it follows specific rules (e.g., node IDs).
- Comments: Single-line comments starting with `%%`, and multi-line directives/comments like `%%{ ... }%%`.
- Whitespace: Spaces, tabs, newlines. Generally ignored between tokens, but newlines can be significant for diagram structure (e.g., separating statements). We'll handle normalization.
- Invalid Characters: Any character not part of the expected Mermaid grammar should be flagged as an error.
Our lexer must be deterministic and strictly adhere to the official Mermaid syntax specification: ambiguous or malformed input is reported as an error rather than guessed at.
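To make the keyword-versus-identifier priority concrete before we build the real thing, here is a minimal, self-contained sketch. The `classify` helper and its tiny keyword table are purely illustrative, not part of our lexer:

```rust
// Toy classifier: check the keyword table first, fall back to "identifier".
// This mirrors the design of the from_keyword lookup we build later.
fn classify(word: &str) -> &'static str {
    match word {
        "graph" | "flowchart" | "subgraph" | "end" => "keyword",
        _ => "identifier",
    }
}

fn main() {
    assert_eq!(classify("graph"), "keyword");
    assert_eq!(classify("MyNode"), "identifier");
    println!("classification works");
}
```

The real lexer applies the same rule: scan a full word, consult the keyword table, and only fall back to `Identifier` when no keyword matches.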
1.2 Lexer Architecture
Our lexer will follow a standard design pattern for compilers:
- Input: The lexer will take a `&str` (the Mermaid source code) as input.
- Character Stream: Internally, it will iterate over the characters of the input string.
- Token Recognition: It will read characters one by one, looking for patterns that match defined token types.
- Token Output: When a pattern is recognized, it will produce a `Token` struct containing the token's kind, its span (start and end positions in the original source), and the literal text it represents.
- Error Handling: If an unrecognized character or malformed sequence is encountered, it will generate a `LexerError` with precise location information.
- Output Stream: The lexer will implement the `Iterator` trait, yielding `Result<Token, LexerError>` instances, allowing the parser to consume a stream of tokens.
The workflow, end to end: source `&str` → character stream → peek/advance token recognition → a stream of `Result<Token, LexerError>` values consumed by the parser.
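The character-stream stage is worth prototyping in isolation. The sketch below is a standalone toy, not the chapter's `Lexer`: it shows the `Peekable<Chars>` peek/advance pattern with a byte offset tracked alongside, which is exactly how our spans will index back into the source:

```rust
use std::iter::Peekable;
use std::str::Chars;

// Scan one alphanumeric word from the front of `source`, returning the
// word and the byte offset where it ends. Tracking bytes (len_utf8),
// not char counts, is what makes &source[..pos] slicing valid.
fn scan_word(source: &str) -> (&str, usize) {
    let mut chars: Peekable<Chars> = source.chars().peekable();
    let mut pos = 0usize;
    while let Some(&c) = chars.peek() {
        if c.is_alphanumeric() {
            chars.next();
            pos += c.len_utf8(); // byte length, not character count
        } else {
            break;
        }
    }
    (&source[..pos], pos)
}

fn main() {
    let (word, end) = scan_word("graph TD");
    println!("{word} ends at byte {end}"); // graph ends at byte 5
}
```

The real lexer generalizes this loop: one `Peekable<Chars>` iterator, one byte cursor, and many token-specific scan functions.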
1.3 File Structure
To maintain a clean and modular codebase, we’ll organize our lexer components within a dedicated src/lexer directory:
src/
├── main.rs
├── lexer/
│ ├── mod.rs # Declares public modules for the lexer
│ ├── token.rs # Defines Token, TokenKind, and Span structs
│ ├── error.rs # Defines LexerError enum
│ └── lexer.rs # Contains the core Lexer struct and logic
└── diagnostics/ # (Future chapter: For robust error reporting)
2. Step-by-Step Implementation
Let’s start building our lexer piece by piece.
2.1 Setup Lexer Module
First, we’ll create the necessary files and declare our modules.
Action: Create the src/lexer directory and the following empty files: src/lexer/mod.rs, src/lexer/token.rs, src/lexer/error.rs, src/lexer/lexer.rs.
File: src/lexer/mod.rs
// src/lexer/mod.rs
pub mod error;
pub mod lexer;
pub mod token;
// Re-export common types for easier access
pub use error::LexerError;
pub use lexer::Lexer;
pub use token::{Span, Token, TokenKind};
Explanation: This mod.rs file acts as the entry point for our lexer module. It declares the sub-modules error, lexer, and token as public, making their contents accessible from outside the lexer directory. The pub use statements provide convenient aliases, so consumers of the lexer module can import Lexer or Token directly from crate::lexer instead of crate::lexer::lexer::Lexer.
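As a quick, self-contained illustration of the re-export pattern (with throwaway toy types, not this chapter's real ones):

```rust
// A nested module with a re-export, mirroring the mod.rs layout above.
mod lexer {
    pub mod token {
        #[derive(Debug, PartialEq)]
        pub struct Token(pub &'static str);
    }
    // Re-export so callers write `lexer::Token`, not `lexer::token::Token`.
    pub use self::token::Token;
}

fn main() {
    let short = lexer::Token("graph");
    let long = lexer::token::Token("graph");
    assert_eq!(short, long); // both paths name the same type
    println!("re-export paths agree");
}
```

Both paths remain valid; the `pub use` simply adds the shorter one, which keeps downstream `use` statements tidy as the module tree grows.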
2.2 Defining Tokens and Spans
The token.rs file will define the fundamental data structures for representing a token: its kind, its location in the source code, and its raw literal value.
File: src/lexer/token.rs
// src/lexer/token.rs
use std::fmt;
/// Represents a span of text in the source code.
/// Stores the start and end byte indices, and the literal string slice.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Span<'a> {
pub start: usize,
pub end: usize,
pub literal: &'a str,
}
impl<'a> Span<'a> {
/// Creates a new `Span`.
pub fn new(start: usize, end: usize, literal: &'a str) -> Self {
Self { start, end, literal }
}
/// Returns the length of the span in bytes.
pub fn len(&self) -> usize {
self.end - self.start
}
/// Returns `true` if the span is empty.
pub fn is_empty(&self) -> bool {
self.len() == 0
}
}
impl<'a> fmt::Display for Span<'a> {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
write!(f, "({}, {}: '{}')", self.start, self.end, self.literal)
}
}
/// Represents the kind of a token.
/// This enum will grow as we support more Mermaid syntax.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum TokenKind {
// Keywords
Graph, Flowchart, SequenceDiagram, ClassDiagram, StateDiagram, ErDiagram,
Gantt, Pie, GitGraph, Journey, C4Context, C4Container, C4Component,
C4Dynamic, Mindmap, Timeline, QuadrantChart, RequirementDiagram,
Blockdiag, Zenuml, EntityRelationship, Info, Config, Theme, Direction,
Subgraph, Link, Click, Call, Activate, Deactivate, Alt, Opt, Loop, Par,
Break, Critical, Note, Rect, Box, Actor, Participant, As, Title, Section,
End, Class, Enum, Interface, Abstract, Static, Public, Private, Protected,
Package, Namespace, Implements, Extends, Aggregation, Composition,
Association, Dependency, Generalization, Realization, Style, ClassDef,
LinkStyle, Fill, Stroke, StrokeWidth, FontSize, FontFamily, TextAlign,
Left, Right, Up, Down, Fork, Join, Choice, State, Entry, Exit, On, Off,
Hide, Show, LoopKeyword,
// Note: Some keywords might overlap with identifiers, lexer must prioritize keywords.
// E.g., `loop` can be a keyword or a node ID. Context matters for parser.
// Directions
LR, RL, TB, BT, TD,
// Operators and Punctuation
OpenParen, // (
CloseParen, // )
OpenBracket, // [
CloseBracket, // ]
OpenBrace, // {
CloseBrace, // }
LessThan, // <
GreaterThan, // >
Plus, // +
Minus, // -
Star, // *
Slash, // /
Equal, // =
Tilde, // ~
Colon, // :
Comma, // ,
Dot, // .
Underscore, // _
Pipe, // |
Hash, // #
Bang, // !
At, // @
Dollar, // $
Percent, // %
Caret, // ^
Ampersand, // &
Backtick, // ` (for quoted strings)
// Arrows and Connectors
ArrowNormal,       // -->
ArrowOpen,         // ->
ArrowCross,        // --x
ArrowCircle,       // --o
ArrowDouble,       // <-, <->, <--
ArrowDoubleOpen,   // <-->
ArrowDoubleCross,  // <--x
ArrowDoubleCircle, // <--o
ArrowThick,        // ==>
ArrowDotted,       // -.->
ArrowThickDotted,  // ==.=>
// Special
String, // "..." or '...' or `...` (raw value)
Identifier, // Regular unquoted identifier (e.g., NodeA, my_node)
Number, // e.g., 123, 0.5
Comment, // %% single line comment
Directive, // %%{ ... }%% multiline directive/comment
Whitespace, // For internal use, usually ignored by parser
Newline, // Significant whitespace for line breaks
// End of File
Eof,
// Error token for lexing failures
Error,
}
impl TokenKind {
/// Checks if the given string is a Mermaid keyword.
pub fn from_keyword(s: &str) -> Option<TokenKind> {
match s {
"graph" => Some(TokenKind::Graph),
"flowchart" => Some(TokenKind::Flowchart),
"sequenceDiagram" => Some(TokenKind::SequenceDiagram),
"classDiagram" => Some(TokenKind::ClassDiagram),
"stateDiagram" => Some(TokenKind::StateDiagram),
"erDiagram" => Some(TokenKind::ErDiagram),
"gantt" => Some(TokenKind::Gantt),
"pie" => Some(TokenKind::Pie),
"gitGraph" => Some(TokenKind::GitGraph),
"journey" => Some(TokenKind::Journey),
"C4Context" => Some(TokenKind::C4Context),
"C4Container" => Some(TokenKind::C4Container),
"C4Component" => Some(TokenKind::C4Component),
"C4Dynamic" => Some(TokenKind::C4Dynamic),
"mindmap" => Some(TokenKind::Mindmap),
"timeline" => Some(TokenKind::Timeline),
"quadrantChart" => Some(TokenKind::QuadrantChart),
"requirementDiagram" => Some(TokenKind::RequirementDiagram),
"blockdiag" => Some(TokenKind::Blockdiag),
"zenuml" => Some(TokenKind::Zenuml),
"entityRelationship" => Some(TokenKind::EntityRelationship),
"info" => Some(TokenKind::Info),
"config" => Some(TokenKind::Config),
"theme" => Some(TokenKind::Theme),
"direction" => Some(TokenKind::Direction),
"subgraph" => Some(TokenKind::Subgraph),
"link" => Some(TokenKind::Link),
"click" => Some(TokenKind::Click),
"call" => Some(TokenKind::Call),
"activate" => Some(TokenKind::Activate),
"deactivate" => Some(TokenKind::Deactivate),
"alt" => Some(TokenKind::Alt),
"opt" => Some(TokenKind::Opt),
"loop" => Some(TokenKind::LoopKeyword), // Differentiate from 'loop' as an identifier
"par" => Some(TokenKind::Par),
"break" => Some(TokenKind::Break),
"critical" => Some(TokenKind::Critical),
"note" => Some(TokenKind::Note),
"rect" => Some(TokenKind::Rect),
"box" => Some(TokenKind::Box),
"actor" => Some(TokenKind::Actor),
"participant" => Some(TokenKind::Participant),
"as" => Some(TokenKind::As),
"title" => Some(TokenKind::Title),
"section" => Some(TokenKind::Section),
"end" => Some(TokenKind::End),
"class" => Some(TokenKind::Class),
"enum" => Some(TokenKind::Enum),
"interface" => Some(TokenKind::Interface),
"abstract" => Some(TokenKind::Abstract),
"static" => Some(TokenKind::Static),
"public" => Some(TokenKind::Public),
"private" => Some(TokenKind::Private),
"protected" => Some(TokenKind::Protected),
"package" => Some(TokenKind::Package),
"namespace" => Some(TokenKind::Namespace),
"implements" => Some(TokenKind::Implements),
"extends" => Some(TokenKind::Extends),
"aggregation" => Some(TokenKind::Aggregation),
"composition" => Some(TokenKind::Composition),
"association" => Some(TokenKind::Association),
"dependency" => Some(TokenKind::Dependency),
"generalization" => Some(TokenKind::Generalization),
"realization" => Some(TokenKind::Realization),
"style" => Some(TokenKind::Style),
"classDef" => Some(TokenKind::ClassDef),
"linkStyle" => Some(TokenKind::LinkStyle),
"fill" => Some(TokenKind::Fill),
"stroke" => Some(TokenKind::Stroke),
"stroke-width" => Some(TokenKind::StrokeWidth),
"font-size" => Some(TokenKind::FontSize),
"font-family" => Some(TokenKind::FontFamily),
"text-align" => Some(TokenKind::TextAlign),
"left" => Some(TokenKind::Left),
"right" => Some(TokenKind::Right),
"up" => Some(TokenKind::Up),
"down" => Some(TokenKind::Down),
"fork" => Some(TokenKind::Fork),
"join" => Some(TokenKind::Join),
"choice" => Some(TokenKind::Choice),
"state" => Some(TokenKind::State),
"entry" => Some(TokenKind::Entry),
"exit" => Some(TokenKind::Exit),
"on" => Some(TokenKind::On),
"off" => Some(TokenKind::Off),
"hide" => Some(TokenKind::Hide),
"show" => Some(TokenKind::Show),
// Directions
"LR" => Some(TokenKind::LR),
"RL" => Some(TokenKind::RL),
"TB" => Some(TokenKind::TB),
"BT" => Some(TokenKind::BT),
"TD" => Some(TokenKind::TD),
_ => None,
}
}
}
impl fmt::Display for TokenKind {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
write!(f, "{:?}", self) // Default to debug output for now
}
}
/// Represents a single token found by the lexer.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Token<'a> {
pub kind: TokenKind,
pub span: Span<'a>,
}
impl<'a> Token<'a> {
/// Creates a new `Token`.
pub fn new(kind: TokenKind, span: Span<'a>) -> Self {
Self { kind, span }
}
}
impl<'a> fmt::Display for Token<'a> {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
write!(f, "Token {{ kind: {}, span: {} }}", self.kind, self.span)
}
}
Explanation:
- `Span<'a>`: Captures the exact location and literal text of a token in the original source. `start` and `end` are byte indices, crucial for precise error reporting and highlighting. The `'a` lifetime parameter ensures the `literal` slice refers back to the original input string, avoiding unnecessary string allocations.
- `TokenKind`: This exhaustive enum defines all possible types of tokens our lexer can recognize, including a comprehensive list of Mermaid keywords, operators, and arrow types. The `from_keyword` helper lets us quickly check whether an identified string is a reserved keyword. Note the `LoopKeyword` distinction for `loop`, to handle cases where `loop` might also be a valid identifier.
- `Token<'a>`: Combines a `TokenKind` with its `Span`, providing a complete representation of a recognized lexical unit.
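A small aside on why `Span` stores byte indices rather than character counts: Rust string slicing operates on bytes, and multi-byte UTF-8 characters make the two diverge. This standalone snippet demonstrates the difference:

```rust
// Byte offsets vs. character counts in UTF-8 strings.
fn main() {
    let source = "héllo";
    assert_eq!(source.len(), 6);           // 'é' occupies 2 bytes in UTF-8
    assert_eq!(source.chars().count(), 5); // but it is a single char
    let prefix = &source[0..3];            // byte range covering "hé"
    assert_eq!(prefix, "hé");
    println!("byte-indexed spans slice correctly");
}
```

Because `advance` bumps `current_pos` by `len_utf8()`, every `Span { start, end }` our lexer produces lands on a valid character boundary and can be sliced directly out of the source.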
2.3 Defining Lexer Errors
Errors are an inevitable part of parsing malformed input. Our lexer needs to report them clearly.
File: src/lexer/error.rs
// src/lexer/error.rs
use crate::lexer::Span; // Import Span from our lexer module
use std::fmt;
use std::error::Error;
/// Represents an error encountered during lexing.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum LexerError<'a> {
/// An unexpected or unrecognized character was found.
InvalidCharacter {
span: Span<'a>,
found: char,
},
/// A string literal was started but not properly terminated.
UnterminatedString {
span: Span<'a>,
missing_delimiter: char,
},
/// An invalid escape sequence was found within a string.
InvalidEscapeSequence {
span: Span<'a>,
sequence: &'a str,
},
/// An unexpected end of file was reached while expecting more characters.
UnexpectedEof {
span: Span<'a>,
expected: &'static str,
},
/// A generic lexer error, used for less specific issues.
GenericError {
span: Span<'a>,
message: String,
},
}
impl<'a> fmt::Display for LexerError<'a> {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
match self {
LexerError::InvalidCharacter { span, found } => {
write!(f, "Error: Invalid character '{}' at {}", found, span)
}
LexerError::UnterminatedString { span, missing_delimiter } => {
write!(f, "Error: Unterminated string starting at {}. Expected '{}'", span, missing_delimiter)
}
LexerError::InvalidEscapeSequence { span, sequence } => {
write!(f, "Error: Invalid escape sequence '{}' at {}", sequence, span)
}
LexerError::UnexpectedEof { span, expected } => {
write!(f, "Error: Unexpected end of file at {}. Expected {}", span, expected)
}
LexerError::GenericError { span, message } => {
write!(f, "Error: {} at {}", message, span)
}
}
}
}
impl<'a> Error for LexerError<'a> {}
Explanation:
- `LexerError<'a>`: This enum captures the various errors that can occur during tokenization. Each variant includes a `Span` to pinpoint the exact location of the error, which is vital for good diagnostics.
- `Display` and `Error` traits: Implementing these makes `LexerError` behave like a standard Rust error type, so it can be printed easily and integrated into error-handling chains.
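Here is the same `Display` + `Error` pattern in miniature, using a throwaway `MiniError` enum (illustrative only), to show how such a type slots into `Box<dyn Error>` chains:

```rust
use std::error::Error;
use std::fmt;

// A one-variant error enum with Display and Error implementations,
// structurally the same recipe as LexerError above.
#[derive(Debug)]
enum MiniError {
    Invalid { pos: usize, found: char },
}

impl fmt::Display for MiniError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            MiniError::Invalid { pos, found } => {
                write!(f, "invalid character '{found}' at byte {pos}")
            }
        }
    }
}

impl Error for MiniError {}

fn main() -> Result<(), Box<dyn Error>> {
    // Any Display + Error type coerces into Box<dyn Error>.
    let err: Box<dyn Error> = Box::new(MiniError::Invalid { pos: 4, found: '§' });
    println!("{err}"); // prints: invalid character '§' at byte 4
    Ok(())
}
```

This is why `main` in section 2.5 can return `Result<(), Box<dyn std::error::Error>>` and still propagate lexer errors with `?`.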
2.4 Implementing the Core Lexer Logic
This is the heart of our lexer. The Lexer struct will manage the input string, its current position, and contain the methods to read different token types.
File: src/lexer/lexer.rs
// src/lexer/lexer.rs
use crate::lexer::error::LexerError;
use crate::lexer::token::{Span, Token, TokenKind};
use std::str::Chars;
use std::iter::Peekable;
/// The Lexer is responsible for converting a Mermaid source string into a stream of tokens.
pub struct Lexer<'a> {
source: &'a str,
chars: Peekable<Chars<'a>>,
current_pos: usize, // Current byte offset in the source string
start_pos: usize, // Start byte offset of the current token being scanned
}
impl<'a> Lexer<'a> {
/// Creates a new `Lexer` instance from the given Mermaid source code.
pub fn new(source: &'a str) -> Self {
Self {
source,
chars: source.chars().peekable(),
current_pos: 0,
start_pos: 0,
}
}
/// Peeks at the next character without consuming it.
fn peek(&mut self) -> Option<char> {
self.chars.peek().copied()
}
/// Consumes and returns the next character, advancing the current position.
fn advance(&mut self) -> Option<char> {
let c = self.chars.next();
if let Some(ch) = c {
self.current_pos += ch.len_utf8();
}
c
}
/// Consumes the next character if it matches the expected character.
fn consume_if_match(&mut self, expected: char) -> bool {
if self.peek() == Some(expected) {
self.advance();
true
} else {
false
}
}
/// Creates a `Span` for the token just scanned.
fn create_span(&self) -> Span<'a> {
Span::new(
self.start_pos,
self.current_pos,
&self.source[self.start_pos..self.current_pos],
)
}
/// Skips whitespace characters.
fn skip_whitespace(&mut self) {
while self.peek().map_or(false, |c| c.is_ascii_whitespace() && c != '\n') {
self.advance();
}
}
/// Reads an identifier or keyword.
/// The first character was already consumed by `next_token`, so the
/// literal is sliced from `start_pos` through the current position.
/// Note: allowing '-' means an unspaced `A-->B` would lex as the
/// identifier `A--`; we will revisit this when wiring up the parser.
fn read_identifier_or_keyword(&mut self) -> TokenKind {
    while self.peek().map_or(false, |c| c.is_alphanumeric() || c == '_' || c == '-') {
        self.advance();
    }
    let literal = &self.source[self.start_pos..self.current_pos];
    TokenKind::from_keyword(literal).unwrap_or(TokenKind::Identifier)
}
/// Reads a string literal (single, double, or backtick quoted).
/// The caller has already consumed the opening delimiter.
fn read_string(&mut self, delimiter: char) -> Result<TokenKind, LexerError<'a>> {
    let mut escaped = false;
    while let Some(c) = self.peek() {
        if c == '\\' && !escaped {
            escaped = true;
            self.advance();
            continue;
        }
        if c == delimiter && !escaped {
            self.advance(); // Consume the closing delimiter
            return Ok(TokenKind::String);
        }
        if c == '\n' && !escaped {
            // Strings may not span lines: report an unterminated string.
            return Err(LexerError::UnterminatedString {
                span: self.create_span(),
                missing_delimiter: delimiter,
            });
        }
        escaped = false;
        self.advance();
    }
    // Reached EOF without a closing delimiter
    Err(LexerError::UnterminatedString {
        span: self.create_span(),
        missing_delimiter: delimiter,
    })
}
/// Reads the body of a single-line comment (%% ...).
/// The caller has already consumed the leading `%%`.
fn read_comment(&mut self) -> TokenKind {
    self.take_while(|c| c != '\n'); // Consume until newline or EOF
    TokenKind::Comment
}
/// Reads a multi-line directive (%%{ ... }%%).
/// The caller has already consumed the leading `%%`; this consumes the `{`.
fn read_directive(&mut self) -> Result<TokenKind, LexerError<'a>> {
    self.advance(); // Consume '{'
    let mut balance = 1;
    while let Some(c) = self.advance() {
        match c {
            '{' => balance += 1,
            '}' => balance -= 1,
            _ => {}
        }
        if balance == 0 {
            // Braces are balanced; expect the closing `%%`.
            if self.peek() == Some('%') {
                self.advance();
                if self.peek() == Some('%') {
                    self.advance();
                    return Ok(TokenKind::Directive);
                }
            }
            // `}` was not followed by `%%`. Emit a Directive token anyway
            // and let the parser reject the malformed directive.
            return Ok(TokenKind::Directive);
        }
    }
    // Reached EOF without the closing '}'
    Err(LexerError::UnterminatedString {
        span: self.create_span(),
        missing_delimiter: '}',
    })
}
/// Reads various arrow types. This is a complex part.
/// Priority matters: longer arrows are matched first.
/// The first character was already consumed by `next_token`, so we
/// recover it from the source at the token's start position rather
/// than advancing again.
fn read_arrow(&mut self) -> TokenKind {
    let first_char = self.source[self.start_pos..].chars().next().unwrap();
    match first_char {
        '-' => {
            if self.consume_if_match('.') {
                if self.consume_if_match('-') {
                    if self.consume_if_match('>') { return TokenKind::ArrowDotted; } // -.->
                }
            } else if self.consume_if_match('-') {
                if self.consume_if_match('>') { return TokenKind::ArrowNormal; } // -->
                if self.consume_if_match('x') { return TokenKind::ArrowCross; } // --x
                if self.consume_if_match('o') { return TokenKind::ArrowCircle; } // --o
                return TokenKind::Minus; // Fallback for '--'
            } else if self.consume_if_match('>') {
                return TokenKind::ArrowOpen; // ->
            }
            TokenKind::Minus // Single '-'
        },
        '=' => {
            if self.consume_if_match('=') {
                if self.consume_if_match('>') { return TokenKind::ArrowThick; } // ==>
                if self.consume_if_match('.') {
                    if self.consume_if_match('=') {
                        if self.consume_if_match('>') { return TokenKind::ArrowThickDotted; } // ==.=>
                    }
                }
            }
            TokenKind::Equal // Single '='
        },
        '<' => {
            if self.consume_if_match('-') {
                if self.consume_if_match('-') {
                    if self.consume_if_match('>') { return TokenKind::ArrowDoubleOpen; } // <-->
                    if self.consume_if_match('x') { return TokenKind::ArrowDoubleCross; } // <--x
                    if self.consume_if_match('o') { return TokenKind::ArrowDoubleCircle; } // <--o
                    return TokenKind::ArrowDouble; // <--
                } else if self.consume_if_match('>') {
                    return TokenKind::ArrowDouble; // <->
                }
                return TokenKind::ArrowDouble; // <-
            }
            TokenKind::LessThan // Single '<'
        },
        _ => unreachable!("read_arrow called on non-arrow char"),
    }
}
/// Reads a number literal (simple integers for now).
/// The first digit was already consumed by `next_token`; this consumes the rest.
fn read_number(&mut self) -> TokenKind {
    self.take_while(|c| c.is_ascii_digit());
    // TODO: Extend for floats (e.g., 123.45)
    TokenKind::Number
}
/// Consumes characters while the predicate `p` is true, returning the consumed slice.
fn take_while<P>(&mut self, p: P) -> Option<&'a str>
where
P: Fn(char) -> bool,
{
let start = self.current_pos;
while let Some(c) = self.peek() {
if p(c) {
self.advance();
} else {
break;
}
}
if self.current_pos > start {
Some(&self.source[start..self.current_pos])
} else {
None
}
}
/// The main lexing function: determines the next token.
fn next_token(&mut self) -> Result<Token<'a>, LexerError<'a>> {
self.start_pos = self.current_pos; // Mark the start of the new token
// Skip leading whitespace (but not newlines, as they can be significant)
self.skip_whitespace();
if self.current_pos > self.start_pos {
// If we skipped whitespace, create a whitespace token
let span = Span::new(self.start_pos, self.current_pos, &self.source[self.start_pos..self.current_pos]);
// Reset start_pos for the *next* actual token
self.start_pos = self.current_pos;
return Ok(Token::new(TokenKind::Whitespace, span));
}
let current_char = match self.advance() {
    Some(c) => c,
    None => {
        // End of input: emit a zero-width Eof token at the current position.
        let span = Span::new(self.current_pos, self.current_pos, "");
        return Ok(Token::new(TokenKind::Eof, span));
    }
};
let kind = match current_char {
'\n' => TokenKind::Newline,
'(' => TokenKind::OpenParen,
')' => TokenKind::CloseParen,
'[' => TokenKind::OpenBracket,
']' => TokenKind::CloseBracket,
'{' => {
    // A '{' here is always OpenBrace: directive openings (%%{) are
    // caught in the '%' branch below before we ever see the brace alone.
    TokenKind::OpenBrace
},
'}' => TokenKind::CloseBrace,
'<' => self.read_arrow(), // Can be part of <--, <-->
'>' => TokenKind::GreaterThan,
'+' => TokenKind::Plus,
'*' => TokenKind::Star,
'/' => TokenKind::Slash,
':' => TokenKind::Colon,
',' => TokenKind::Comma,
'.' => {
    // Dotted arrows such as `-.->` begin with '-', which `read_arrow`
    // handles; a bare '.' here is simply the Dot operator.
    TokenKind::Dot
},
'=' => self.read_arrow(), // Can be part of ===>, ==>
'~' => TokenKind::Tilde,
'|' => TokenKind::Pipe,
'#' => TokenKind::Hash,
'!' => TokenKind::Bang,
'@' => TokenKind::At,
'$' => TokenKind::Dollar,
'%' => {
    if self.peek() == Some('%') {
        // Either a comment (%%) or a directive (%%{)
        self.advance(); // Consume the second '%'
        if self.peek() == Some('{') {
            return self.read_directive().map(|k| Token::new(k, self.create_span()));
        } else {
            self.read_comment();
            TokenKind::Comment
        }
    } else {
        TokenKind::Percent // Single '%'
    }
},
'^' => TokenKind::Caret,
'&' => TokenKind::Ampersand,
'`' => return self.read_string('`').map(|k| Token::new(k, self.create_span())),
'\'' => return self.read_string('\'').map(|k| Token::new(k, self.create_span())),
'"' => return self.read_string('"').map(|k| Token::new(k, self.create_span())),
'-' => self.read_arrow(), // Can be part of -->, ---, -.->
c if c.is_ascii_digit() => {
    // The first digit was consumed above; read_number consumes the rest,
    // and create_span still covers the full literal via start_pos.
    self.read_number()
},
c if c.is_alphabetic() || c == '_' => {
    // Likewise, the token's span starts at start_pos, so the
    // already-consumed first character is included in the literal.
    self.read_identifier_or_keyword()
},
_ => {
// Unrecognized character
return Err(LexerError::InvalidCharacter {
span: self.create_span(),
found: current_char,
});
}
};
Ok(Token::new(kind, self.create_span()))
}
}
// Implement the Iterator trait for Lexer
impl<'a> Iterator for Lexer<'a> {
    type Item = Result<Token<'a>, LexerError<'a>>;
    fn next(&mut self) -> Option<Self::Item> {
        // A `start_pos` past the end of the source marks that Eof has
        // already been yielded; stop iterating so Eof appears exactly once.
        if self.start_pos > self.source.len() {
            return None;
        }
        // Repeatedly call next_token until a non-whitespace token, an
        // error, or Eof is found.
        loop {
            match self.next_token() {
                Ok(token) => match token.kind {
                    TokenKind::Eof => {
                        self.start_pos = self.source.len() + 1;
                        return Some(Ok(token));
                    }
                    // The parser never needs raw whitespace; skip it.
                    TokenKind::Whitespace => continue,
                    _ => return Some(Ok(token)),
                },
                Err(e) => return Some(Err(e)),
            }
        }
    }
}
Explanation:
- `Lexer<'a>` struct: Holds the `source` string, a `Peekable<Chars<'a>>` iterator for efficient character consumption, `current_pos` (the byte index of the next character to be read), and `start_pos` (the byte index where the current token began).
- `new`: Constructor that initializes the lexer.
- `peek`, `advance`, `consume_if_match`: Helper methods for character-stream manipulation. `advance` correctly updates `current_pos` based on UTF-8 character length.
- `create_span`: Generates a `Span` for the token currently being processed.
- `skip_whitespace`: Consumes standard whitespace but not newlines, since newlines can be syntactically significant in Mermaid (e.g., separating statements). The `Iterator` implementation decides whether to yield `Whitespace` tokens or skip them; ours skips them by default while handling newlines specifically.
- `read_identifier_or_keyword`: Reads alphanumeric characters, hyphens, and underscores, then checks whether the resulting string matches a known Mermaid keyword. If so, it returns the keyword's `TokenKind`; otherwise it is an `Identifier`.
- `read_string`: Handles single-, double-, and backtick-quoted strings. It tracks backslash escapes so an escaped delimiter does not terminate the string, and it reports unterminated strings.
- `read_comment`: Consumes characters for single-line comments (`%%`).
- `read_directive`: Handles multi-line directives (`%%{ ... }%%`), including nested braces, and checks for the closing `}%%`.
- `read_arrow`: Uses lookahead to distinguish the various Mermaid arrow types (e.g., `->`, `-->`, `-.->`, `<--`). The order of checks matters (longest match first) to ensure correct tokenization.
- `read_number`: A basic implementation for integer numbers; it could be extended for floating-point literals.
- `take_while`: A generic helper that consumes characters as long as a predicate holds.
- `next_token`: The core state machine. It determines the type of the next token based on the current character and potential lookaheads, delegating to the specific `read_*` methods. It handles `Eof` and dispatches errors.
- `Iterator` implementation: Makes the `Lexer` directly usable in `for` loops. It calls `next_token` repeatedly, filters out `Whitespace` tokens (they are generally not needed by the parser, though this could be made configurable), and returns `Result<Token, LexerError>`, ensuring `Eof` is yielded exactly once.
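The longest-match-first rule that `read_arrow` relies on can be demonstrated in isolation. This toy `match_arrow` function (not part of the lexer) tries a handful of patterns in descending length against the head of the input:

```rust
// Longest-match-first arrow disambiguation in miniature.
fn match_arrow(input: &str) -> Option<(&'static str, usize)> {
    // Ordered longest to shortest: a shorter pattern listed first would
    // wrongly claim the prefix of a longer arrow (e.g. "->" inside "-->").
    const ARROWS: [&str; 4] = ["-.->", "-->", "->", "--"];
    ARROWS
        .iter()
        .find(|a| input.starts_with(*a))
        .map(|a| (*a, a.len()))
}

fn main() {
    assert_eq!(match_arrow("--> B"), Some(("-->", 3)));
    assert_eq!(match_arrow("-.-> B"), Some(("-.->", 4)));
    assert_eq!(match_arrow("-> B"), Some(("->", 2)));
    println!("longest match wins");
}
```

Our real `read_arrow` achieves the same effect with nested `consume_if_match` calls instead of a pattern table, which avoids rescanning the input, but the ordering principle is identical.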
2.5 Integrating with main.rs for Testing
Let’s modify src/main.rs to use our new lexer and print the token stream. This allows us to test its functionality immediately.
File: src/main.rs
// src/main.rs
mod lexer;
use lexer::{Lexer, Token, LexerError};
use std::fs;
use std::path::PathBuf;
use clap::Parser; // For CLI arguments, from Chapter 1
#[derive(Parser, Debug)]
#[command(author, version, about, long_about = None)]
struct Args {
/// Path to the Mermaid file to process
#[arg(short, long, value_name = "FILE")]
file: Option<PathBuf>,
/// Mermaid code directly as a string
#[arg(short, long, value_name = "CODE")]
code: Option<String>,
/// Output mode: "lint", "fix", "strict"
#[arg(short, long, default_value = "lint")]
mode: String,
}
fn main() -> Result<(), Box<dyn std::error::Error>> {
let args = Args::parse();
let mermaid_source = if let Some(file_path) = args.file {
fs::read_to_string(&file_path)?
} else if let Some(code_str) = args.code {
code_str
} else {
eprintln!("Error: No input file or Mermaid code provided. Use --file or --code.");
std::process::exit(1);
};
println!("--- Lexing Input ---");
let lexer = Lexer::new(&mermaid_source);
let mut tokens: Vec<Token> = Vec::new();
let mut errors: Vec<LexerError> = Vec::new();
for item in lexer {
match item {
Ok(token) => {
println!("{}", token);
tokens.push(token);
}
Err(e) => {
eprintln!("{}", e);
errors.push(e);
}
}
}
println!("\n--- Lexing Summary ---");
println!("Total tokens: {}", tokens.len());
if !errors.is_empty() {
eprintln!("Lexing completed with {} errors.", errors.len());
std::process::exit(1);
} else {
println!("Lexing completed successfully with no errors.");
}
Ok(())
}
Explanation:
- We've added `mod lexer;` and `use lexer::{Lexer, Token, LexerError};` to bring our lexer components into scope.
- The `main` function accepts either a file path or a direct code string.
- It instantiates `Lexer` with the input source.
- It then iterates through the `Lexer` (thanks to the `Iterator` trait implementation), printing each successfully tokenized `Token` or any `LexerError` encountered.
- A basic summary is printed at the end.
2.6 Testing This Component
We need dedicated unit tests to ensure our lexer behaves as expected for various inputs, including valid syntax and error conditions.
Action: Add a #[cfg(test)] block to src/lexer/lexer.rs for unit tests.
File: src/lexer/lexer.rs (append this block)
// src/lexer/lexer.rs (continued)
#[cfg(test)]
mod tests {
use super::*;
/// Helper function to lex and collect tokens (the lexer itself skips insignificant whitespace).
fn lex_and_collect(source: &str) -> Result<Vec<Token>, LexerError> {
let lexer = Lexer::new(source);
let mut tokens = Vec::new();
for item in lexer {
match item {
Ok(token) => tokens.push(token),
Err(e) => return Err(e),
}
}
Ok(tokens)
}
/// Helper function to lex and collect token kinds.
fn lex_and_collect_kinds(source: &str) -> Result<Vec<TokenKind>, LexerError> {
let lexer = Lexer::new(source);
let mut kinds = Vec::new();
for item in lexer {
match item {
Ok(token) => kinds.push(token.kind),
Err(e) => return Err(e),
}
}
Ok(kinds)
}
#[test]
fn test_empty_input() {
let tokens = lex_and_collect_kinds("").unwrap();
assert_eq!(tokens, vec![TokenKind::Eof]);
}
#[test]
fn test_basic_flowchart() {
let source = "graph TD\n A[Node A] --> B(Node B)";
let tokens = lex_and_collect_kinds(source).unwrap();
assert_eq!(
tokens,
vec![
TokenKind::Graph,
TokenKind::TD,
TokenKind::Newline,
TokenKind::Identifier,
TokenKind::OpenBracket,
TokenKind::Identifier, // "Node A" (will be parsed as a string literal later)
TokenKind::CloseBracket,
TokenKind::ArrowNormal,
TokenKind::Identifier,
TokenKind::OpenParen,
TokenKind::Identifier, // "Node B"
TokenKind::CloseParen,
TokenKind::Eof
]
);
}
#[test]
fn test_keywords_and_identifiers() {
let source = "flowchart LR\nsubgraph MySubgraph\n Node1 --> Node2\nend";
let tokens = lex_and_collect_kinds(source).unwrap();
assert_eq!(
tokens,
vec![
TokenKind::Flowchart,
TokenKind::LR,
TokenKind::Newline,
TokenKind::Subgraph,
TokenKind::Identifier, // MySubgraph
TokenKind::Newline,
TokenKind::Identifier, // Node1
TokenKind::ArrowNormal,
TokenKind::Identifier, // Node2
TokenKind::Newline,
TokenKind::End,
TokenKind::Eof
]
);
}
#[test]
fn test_all_arrow_types() {
let source = "A --> B\nC --- D\nE ==> F\nG --x H\nI --o J\nK -.-> L\nM <--> N\nO ===> P\nQ -.=> R"; // Missing -=.=>
let tokens = lex_and_collect_kinds(source).unwrap();
assert!(tokens.contains(&TokenKind::ArrowNormal));
assert!(tokens.contains(&TokenKind::Minus)); // --- will be Minus, Minus, Minus for now
assert!(tokens.contains(&TokenKind::ArrowThick)); // ==>
assert!(tokens.contains(&TokenKind::ArrowCross));
assert!(tokens.contains(&TokenKind::ArrowCircle));
assert!(tokens.contains(&TokenKind::ArrowDotted));
assert!(tokens.contains(&TokenKind::ArrowDoubleOpen));
assert!(tokens.contains(&TokenKind::ArrowThick)); // ===>
assert!(tokens.contains(&TokenKind::ArrowThickDotted));
}
#[test]
fn test_quoted_strings() {
let source = "A[\"Hello World\"]\nB['Another String']\nC[`Backtick String`]";
let tokens = lex_and_collect_kinds(source).unwrap();
assert_eq!(
tokens,
vec![
TokenKind::Identifier,
TokenKind::OpenBracket,
TokenKind::String,
TokenKind::CloseBracket,
TokenKind::Newline,
TokenKind::Identifier,
TokenKind::OpenBracket,
TokenKind::String,
TokenKind::CloseBracket,
TokenKind::Newline,
TokenKind::Identifier,
TokenKind::OpenBracket,
TokenKind::String,
TokenKind::CloseBracket,
TokenKind::Eof
]
);
let source_escaped = "A[\"Hello \\\"World\\\"\"]";
let tokens_escaped = lex_and_collect(source_escaped).unwrap();
assert_eq!(tokens_escaped[2].span.literal, "\"Hello \\\"World\\\"\"");
}
#[test]
fn test_comments_and_directives() {
let source = "%% This is a comment\ngraph TD\n%%{ init: { 'theme': 'dark' } }%%\n A --> B";
let tokens = lex_and_collect_kinds(source).unwrap();
assert_eq!(
tokens,
vec![
TokenKind::Comment,
TokenKind::Newline,
TokenKind::Graph,
TokenKind::TD,
TokenKind::Newline,
TokenKind::Directive,
TokenKind::Newline,
TokenKind::Identifier,
TokenKind::ArrowNormal,
TokenKind::Identifier,
TokenKind::Eof
]
);
}
#[test]
fn test_unterminated_string_error() {
let source = "A[\"Unterminated string]";
let error = lex_and_collect(source).unwrap_err();
assert_eq!(
error,
LexerError::UnterminatedString {
span: Span::new(3, 3, ""), // Span points to the beginning of the string literal
missing_delimiter: '"',
}
);
}
#[test]
fn test_invalid_character_error() {
let source = "graph TD\n A[Node A] $ B";
let error = lex_and_collect(source).unwrap_err();
assert_eq!(
error,
LexerError::InvalidCharacter {
span: Span::new(19, 20, "$"),
found: '$',
}
);
}
#[test]
fn test_complex_operators() {
let source = "A + B - C * D / E = F : G | H # I ! J @ K";
let tokens = lex_and_collect_kinds(source).unwrap();
assert_eq!(
tokens,
vec![
TokenKind::Identifier, TokenKind::Plus, TokenKind::Identifier,
TokenKind::Minus, TokenKind::Identifier, TokenKind::Star,
TokenKind::Identifier, TokenKind::Slash, TokenKind::Identifier,
TokenKind::Equal, TokenKind::Identifier, TokenKind::Colon,
TokenKind::Identifier, TokenKind::Pipe, TokenKind::Identifier,
TokenKind::Hash, TokenKind::Identifier, TokenKind::Bang,
TokenKind::Identifier, TokenKind::At, TokenKind::Identifier,
TokenKind::Eof
]
);
}
#[test]
fn test_numbers() {
let source = "Node1(123) -- 456 --> Node2";
let tokens = lex_and_collect_kinds(source).unwrap();
assert_eq!(
tokens,
vec![
TokenKind::Identifier, TokenKind::OpenParen, TokenKind::Number, TokenKind::CloseParen,
TokenKind::Minus, TokenKind::Minus, // --
TokenKind::Number,
TokenKind::ArrowNormal, // -->
TokenKind::Identifier,
TokenKind::Eof
]
);
}
}
Explanation:
- `lex_and_collect` / `lex_and_collect_kinds`: Helper functions that simplify writing tests by abstracting away lexer instantiation and token collection.
- `test_empty_input`: Ensures an empty string correctly yields only an `Eof` token.
- `test_basic_flowchart`: Tests a simple flowchart definition, checking for correct keyword, identifier, bracket, and arrow tokenization.
- `test_keywords_and_identifiers`: Verifies that keywords (`subgraph`, `end`) are correctly distinguished from identifiers (`MySubgraph`, `Node1`).
- `test_all_arrow_types`: Checks various arrow patterns. This is an area where strictness is paramount. Self-correction: the `read_arrow` implementation needs to be very robust to capture all Mermaid arrows. The initial version may treat `---` as three `Minus` tokens instead of a single headless-link token, which must be refined to match the spec. The current test expects `Minus` tokens for `---`, which is not ideal; a proper `---` token should be added.
  - Correction plan: `read_arrow` needs to be more comprehensive, matching the longest possible arrow first. For `---`, it should produce `ArrowNormal` with no head (just a line).
- `test_quoted_strings`: Checks the different quoting styles and basic escape sequences.
- `test_comments_and_directives`: Verifies handling of single-line comments and multi-line directives.
- `test_unterminated_string_error`: Ensures that an unterminated string correctly results in a `LexerError::UnterminatedString`.
- `test_invalid_character_error`: Checks for detection of completely unrecognized characters.
- `test_complex_operators`: Verifies that various single-character operators are correctly tokenized.
- `test_numbers`: Basic number tokenization.
3. Production Considerations
Building a production-ready lexer requires more than just functional code.
3.1 Error Handling
- Precision: Our `LexerError` variants include `Span` information, which is critical for generating accurate diagnostics. The `Display` implementation provides user-friendly messages.
- Recovery: For a production compiler, a lexer might attempt error recovery (e.g., skipping characters until a known token boundary) to allow the parser to continue and find more errors. Our current lexer stops on the first error. For a linter/formatter, reporting all errors is often desired, so an `Iterator` returning `Result` is a good pattern; the `main.rs` example currently collects all errors.
- Error Codes: In a full diagnostic system (which we'll build in a later chapter), each `LexerError` variant would map to a unique error code (e.g., `M001: Invalid Character`).
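To preview that mapping, a minimal sketch might attach codes to variants via a method. The `LexerErrorKind` name and the exact `M00x` numbering here are assumptions anticipating the diagnostics chapter, not a fixed scheme:

```rust
// Hypothetical sketch: stable diagnostic codes per error variant.
// LexerErrorKind is a simplified stand-in for the chapter's LexerError.
#[derive(Debug)]
enum LexerErrorKind {
    InvalidCharacter,
    UnterminatedString,
}

impl LexerErrorKind {
    /// Returns a stable code suitable for docs and suppression comments.
    fn code(&self) -> &'static str {
        match self {
            LexerErrorKind::InvalidCharacter => "M001",
            LexerErrorKind::UnterminatedString => "M002",
        }
    }
}

fn main() {
    assert_eq!(LexerErrorKind::InvalidCharacter.code(), "M001");
    assert_eq!(LexerErrorKind::UnterminatedString.code(), "M002");
}
```

Keeping codes in one `match` means the compiler's exhaustiveness check forces every new variant to receive a code.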
3.2 Performance Optimization
- `std::str::Chars` and `Peekable`: Rust's `Chars` iterator is efficient for UTF-8 strings. `Peekable` allows looking ahead without full backtracking, which is crucial for distinguishing multi-character tokens (like `->` vs `-->`).
- Slices (`&'a str`): By using string slices for `Span::literal` and `Lexer::source`, we avoid unnecessary memory allocations and copies, a major performance win for large inputs.
- Linear Scan: The lexer performs a single pass over the input string, giving it linear time complexity (O(N), where N is the number of characters), which is optimal.
- Keyword Lookup: Using a `match` statement for `from_keyword` is efficient for a relatively small, fixed set of keywords. For very large keyword sets, a `phf` (perfect hash function) crate could offer faster lookups, but `match` is sufficient here.
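As an illustration of that lookup, a trimmed-down `from_keyword` might look like the following; the variant list is deliberately abbreviated, and the real method would cover the full keyword set:

```rust
// Abbreviated sketch of keyword lookup via match. Only a few variants of
// the chapter's TokenKind are reproduced here.
#[derive(Debug, PartialEq)]
enum TokenKind {
    Graph,
    Flowchart,
    Subgraph,
    End,
}

impl TokenKind {
    /// Some(keyword kind) if `word` is reserved, None for plain identifiers.
    fn from_keyword(word: &str) -> Option<TokenKind> {
        match word {
            "graph" => Some(TokenKind::Graph),
            "flowchart" => Some(TokenKind::Flowchart),
            "subgraph" => Some(TokenKind::Subgraph),
            "end" => Some(TokenKind::End),
            _ => None,
        }
    }
}

fn main() {
    assert_eq!(TokenKind::from_keyword("graph"), Some(TokenKind::Graph));
    assert_eq!(TokenKind::from_keyword("MyNode"), None);
}
```

The compiler typically lowers such string matches into efficient comparison chains, so a perfect-hash table only tends to pay off with a much larger key set.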
3.3 Security Considerations
- Valid UTF-8: Rust's `&str` guarantees valid UTF-8, mitigating many character-encoding vulnerabilities. Our lexer works directly with `char`s, which are Unicode scalar values.
- Denial-of-Service (DoS):
  - Long Lines/Tokens: Our character-by-character processing and `take_while` approach means extremely long identifiers or comments don't cause quadratic performance issues.
  - Deeply Nested Structures: The lexer itself doesn't track nesting depth, but `read_directive` handles balanced braces, preventing simple stack overflows for runaway directives.
  - Memory Usage: By using slices, memory usage scales linearly with input size, minimizing the risk of memory exhaustion.
3.4 Logging and Monitoring
- `tracing` crate (future): For production, integrating a logging solution like `tracing` would be essential. Lexer errors could be logged with `error!` macros, and successful tokenization might be logged at the `debug!` level.
- Token Stream Inspection: For debugging complex Mermaid diagrams, having an option to print the raw token stream (as our `main.rs` does) is invaluable.
4. Code Review Checkpoint
At this point, we have a functional lexer module:
- Files Created:
  - `src/lexer/mod.rs`
  - `src/lexer/token.rs`
  - `src/lexer/error.rs`
  - `src/lexer/lexer.rs`
- Files Modified:
  - `src/main.rs` (to integrate and test the lexer)
- Key Data Structures:
  - `Span<'a>`: Location and literal of a token.
  - `TokenKind`: Enum defining all recognized token types.
  - `Token<'a>`: Combines `TokenKind` and `Span`.
  - `LexerError<'a>`: Enum for lexing errors with detailed `Span` info.
  - `Lexer<'a>`: The main struct for tokenizing input, implementing `Iterator`.
- Core Logic:
  - Character-by-character scanning with lookahead.
  - Recognition of keywords, identifiers, various operators, arrows, strings, comments, and directives.
  - Precise error reporting for invalid characters and unterminated strings.
  - Skipping of non-significant whitespace.
This lexer now forms the robust first pass of our Mermaid analyzer.
5. Common Issues & Solutions
Developers working on lexers often encounter these problems:
Off-by-One Errors in `Span` Calculation:
- Issue: The `start` and `end` indices of a `Span` might be incorrect, leading to misaligned error highlights or incorrect literal extraction.
- Debugging: Print the `Span` of every token and manually compare it against the source string. Use `println!("Debug: current_pos={}, start_pos={}", self.current_pos, self.start_pos);` within `next_token`.
- Prevention: Always update `current_pos` after consuming a character (e.g., in `self.advance()`). Set `start_pos` before attempting to scan a new token. Remember `char::len_utf8()` for multi-byte characters.
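The byte-accurate advance can be sketched in isolation like this; `MiniLexer` and its fields are simplified stand-ins for the chapter's `Lexer` struct:

```rust
// Hedged sketch of byte-accurate position tracking. current_pos is a byte
// offset into source, advanced by each character's UTF-8 width.
struct MiniLexer<'a> {
    source: &'a str,
    current_pos: usize, // byte offset, not a char count
}

impl<'a> MiniLexer<'a> {
    fn advance(&mut self) -> Option<char> {
        let c = self.source[self.current_pos..].chars().next()?;
        // Advance by the character's UTF-8 width so spans stay byte-accurate.
        self.current_pos += c.len_utf8();
        Some(c)
    }
}

fn main() {
    let mut lx = MiniLexer { source: "héllo", current_pos: 0 };
    lx.advance(); // 'h' is 1 byte
    assert_eq!(lx.current_pos, 1);
    lx.advance(); // 'é' is 2 bytes in UTF-8
    assert_eq!(lx.current_pos, 3);
}
```

With byte offsets maintained this way, `&source[span.start..span.end]` always slices on character boundaries instead of panicking mid-codepoint.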
Incorrect Keyword vs. Identifier Resolution:
- Issue: A string that looks like an identifier but is actually a keyword (e.g., `loop`) is tokenized incorrectly, or vice versa.
- Debugging: Ensure `read_identifier_or_keyword` always checks `TokenKind::from_keyword` before defaulting to `TokenKind::Identifier`.
- Prevention: Maintain a strict, comprehensive list of keywords and prioritize keyword matching. Some keywords are context-dependent (e.g., `loop` is a keyword in sequence diagrams but a valid node ID elsewhere). The lexer typically tokenizes `loop` as `LoopKeyword` and the parser then decides its semantic role.
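A stripped-down version of that resolution order, using plain strings in place of `TokenKind` variants for brevity (the keyword list and kind names are illustrative placeholders):

```rust
// Sketch: the keyword table is consulted first; Identifier is the fallback.
fn from_keyword(word: &str) -> Option<&'static str> {
    match word {
        "loop" => Some("LoopKeyword"),
        "end" => Some("EndKeyword"),
        _ => None,
    }
}

fn classify(word: &str) -> &'static str {
    // Check the keyword table before defaulting to Identifier.
    from_keyword(word).unwrap_or("Identifier")
}

fn main() {
    assert_eq!(classify("loop"), "LoopKeyword");
    // A longer word that merely starts with a keyword stays an identifier,
    // because the lexer scanned the whole word before the lookup.
    assert_eq!(classify("loopy"), "Identifier");
}
```

The key point is ordering: scan the maximal identifier first, then look it up, so `loopy` never gets split into `loop` + `y`.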
Ambiguous Multi-Character Tokens (e.g., Arrows):
- Issue: `->` is tokenized as `Minus`, `GreaterThan` instead of `ArrowOpen`, or `-->` is tokenized as `Minus`, `Minus`, `GreaterThan` instead of `ArrowNormal`.
- Debugging: The `read_arrow` function needs careful ordering. Always attempt to match the longest possible token first; for example, check for `-->` before `->`.
- Prevention: Implement `read_arrow` (and similar functions for operators like `==`, `===`) with a clear priority for longer sequences, using `peek()` and `consume_if_match()`.
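The longest-match discipline can be sketched with a `Peekable` iterator. The token names below echo the chapter's `TokenKind`, but this standalone sketch only handles a few `-`-prefixed sequences:

```rust
// Minimal longest-match sketch for '-'-prefixed tokens. A real read_arrow
// would cover the full arrow grammar; this shows only the ordering idea.
use std::iter::Peekable;
use std::str::Chars;

fn lex_dashes(chars: &mut Peekable<Chars>) -> &'static str {
    // One '-' has already been consumed. Try the longest sequences first.
    if chars.peek() == Some(&'-') {
        chars.next(); // second '-'
        if chars.peek() == Some(&'>') {
            chars.next();
            return "ArrowNormal"; // "-->"
        }
        return "OpenLink"; // "--" (line without a head, for now)
    }
    if chars.peek() == Some(&'>') {
        chars.next();
        return "ArrowOpen"; // "->"
    }
    "Minus" // lone '-'
}

fn first_token(src: &str) -> &'static str {
    let mut chars = src.chars().peekable();
    chars.next(); // consume the leading '-'
    lex_dashes(&mut chars)
}

fn main() {
    assert_eq!(first_token("-->"), "ArrowNormal");
    assert_eq!(first_token("->"), "ArrowOpen");
    assert_eq!(first_token("-"), "Minus");
}
```

Peeking for a second `-` before checking for `>` is exactly what prevents `-->` from splitting into `Minus`, `Minus`, `GreaterThan`.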
Unterminated String/Directive Errors:
- Issue: The lexer fails to correctly identify when a string or multi-line directive is opened but never closed, leading to incorrect tokenization or crashes.
- Debugging: Ensure `read_string` and `read_directive` handle EOF gracefully within their loops and return the appropriate `LexerError`. Check for newlines within strings if they are not allowed (Mermaid strings can be multiline, so this requires careful spec adherence).
- Prevention: Explicitly check for delimiters and EOF in loop conditions. Use a counter for nested structures like `%%{ ... }%%`.
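The brace-counting idea, with explicit EOF handling, can be sketched as follows; this is a simplified stand-in for the chapter's `read_directive`, assuming the input starts just after the opening `%%{`:

```rust
// Hedged sketch: scan a directive body with a depth counter, returning an
// error if EOF arrives before the braces balance.
fn read_directive_body(src: &str) -> Result<&str, &'static str> {
    let mut depth = 1usize; // the opening '{' has already been seen
    for (i, c) in src.char_indices() {
        match c {
            '{' => depth += 1,
            '}' => {
                depth -= 1;
                if depth == 0 {
                    return Ok(&src[..i]); // body, excluding the closing '}'
                }
            }
            _ => {}
        }
    }
    Err("unterminated directive") // hit EOF before braces balanced
}

fn main() {
    assert_eq!(read_directive_body("a{b}c}%%"), Ok("a{b}c"));
    assert!(read_directive_body("never closed").is_err());
}
```

Because the loop's only exit paths are "depth reached zero" or "input exhausted", a runaway directive always produces a clean error instead of a hang or crash.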
6. Testing & Verification
To verify the lexer’s correctness:
- Run Unit Tests: Execute `cargo test` in your project root. All tests in `src/lexer/lexer.rs` should pass. Pay close attention to the arrow tests; if you've implemented `read_arrow` more robustly, update the expected `TokenKind`s in `test_all_arrow_types` to reflect this (e.g., for `---` you might expect `TokenKind::ArrowNormalNoHead` if you define such a token).
- Manual Verification with `main.rs`:
  - Create a file `example.mmd` with a complex Mermaid diagram, including various keywords, arrows, strings, comments, and even some deliberate errors.
  - Run `cargo run -- --file example.mmd`.
  - Observe the printed token stream. Does it match your expectations? Are error messages clear and correctly located?
  - Test with direct code: `cargo run -- --code "graph TD\n A[Node] --> B{Decision}"`
  - Test error cases: `cargo run -- --code "graph TD\nA[\"Unterminated string]"`
  - Test edge cases: empty input, only whitespace, only comments.
By meticulously testing these scenarios, you can gain high confidence in the lexer’s ability to correctly tokenize Mermaid source code.
7. Summary & Next Steps
Congratulations! You’ve successfully designed and implemented the first critical component of our Mermaid analyzer: the Lexer.
In this chapter, we:
- Understood the role and importance of a lexer in a compiler pipeline.
- Designed the `Token`, `Span`, and `LexerError` data structures for precise representation and error reporting.
- Implemented the `Lexer` struct with methods for recognizing various Mermaid syntax elements, including keywords, identifiers, strings, comments, and a complex array of arrow types.
- Integrated the lexer into `src/main.rs` for immediate testing and verification.
- Wrote comprehensive unit tests to ensure the lexer's correctness and robustness.
- Discussed production considerations including performance, error handling, and security.
Our lexer now reliably converts raw Mermaid text into a stream of meaningful tokens, complete with accurate location information. This token stream is the perfect input for the next stage of our tool: the Parser.
In Chapter 3: Building the Parser: Constructing the Abstract Syntax Tree (AST), we will take this token stream and transform it into a strongly typed Abstract Syntax Tree (AST), which is a hierarchical representation of the Mermaid diagram’s structure. This AST will be the foundation for our validation, rule enforcement, and formatting logic.