Welcome to Chapter 10! So far, we’ve meticulously built a robust Mermaid code analyzer, validator, and fixer with a strong emphasis on correctness and maintainability. Our lexer, parser, AST, and rule engine are designed to be strict and deterministic. However, as diagrams grow in complexity and size, performance and memory footprint become critical concerns, especially for a production-grade CLI tool that might process thousands of lines of Mermaid code or be integrated into CI/CD pipelines.
In this chapter, we will shift our focus to optimizing the performance of our mermaid-tool and enabling it to handle large input files efficiently through streaming. We’ll explore techniques like zero-copy parsing, arena allocation for our Abstract Syntax Tree (AST), and leveraging Rust’s io::Read and BufReader for streaming input. The goal is to achieve near-linear parsing complexity with minimal memory allocations, ensuring our tool remains fast and resource-efficient even with very large Mermaid diagrams.
By the end of this chapter, you will have a mermaid-tool that not only correctly processes and validates Mermaid code but also does so with high performance and low memory consumption, capable of handling streaming input without loading entire files into memory. We will also set up a robust benchmarking framework to measure and track our performance improvements.
Planning & Design
Optimizing for performance often involves trade-offs. Our primary goals are speed (low CPU time) and memory efficiency (low RAM usage). For a compiler-like tool, the lexing and parsing phases are typically the most performance-sensitive.
Performance Bottlenecks & Strategies
Input Reading: Reading large files entirely into memory can be slow and consume excessive RAM.
Strategy: Use std::io::BufReader to read input in chunks. This allows us to process files larger than available memory and reduces syscall overhead.
Lexing (Tokenization): Frequent string allocations for each token and repeated character scanning can be costly.
Strategy: Implement “zero-copy” lexing by having tokens refer to slices of the original input buffer (&'a str or TokenSpan) instead of allocating new strings. This requires careful lifetime management.
Strategy: Use optimized byte-searching functions (e.g., memchr from the memchr crate) for faster token boundary detection.
Parsing & AST Construction: Each AST node often involves a heap allocation. For large ASTs, this leads to many small, scattered allocations, impacting cache locality and increasing memory overhead.
Strategy: Utilize an Arena allocator (e.g., bumpalo crate). An arena allocates memory in large blocks and doles out smaller chunks from it. This dramatically reduces allocation overhead, improves cache locality, and simplifies deallocation (just drop the arena). AST nodes will store references (&'a str) to strings within the arena or the original input buffer.
String Handling in AST: If AST nodes own String data, there’s a lot of copying.
Strategy: Use Cow<'a, str> (Clone-on-Write) for string data in AST nodes. This allows borrowing a slice from the input buffer when possible, and only allocating a new String when modifications are necessary (e.g., during rule application that fixes labels).
Data Structures: Using Vec can be efficient, but frequent reallocations can occur.
Strategy: Pre-allocate Vec capacities where reasonable (e.g., Vec::with_capacity). Consider SmallVec for cases where a Vec is typically small but might grow.
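To make the `Cow<'a, str>` strategy concrete before we apply it to the AST, here is a minimal std-only sketch (the `normalize_label` name is illustrative, not part of our codebase): a label is returned borrowed when it is already clean, and a new `String` is allocated only when a fix actually has to be applied.

```rust
use std::borrow::Cow;

/// Returns the label unchanged (borrowed) when it is already trimmed,
/// and allocates a new String only when a fix is required.
fn normalize_label(raw: &str) -> Cow<'_, str> {
    let trimmed = raw.trim();
    if trimmed.len() == raw.len() {
        Cow::Borrowed(raw) // zero-copy: no allocation on the happy path
    } else {
        Cow::Owned(trimmed.to_string())
    }
}

fn main() {
    let clean = normalize_label("NodeA");
    let fixed = normalize_label("  NodeA  ");
    assert!(matches!(clean, Cow::Borrowed(_)));
    assert!(matches!(fixed, Cow::Owned(_)));
    assert_eq!(fixed, "NodeA");
    println!("labels normalized without unnecessary allocation");
}
```

Since well-formed diagrams dominate in practice, the borrowed path is taken for almost every label, which is exactly the allocation profile we want for rule application.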
Architecture Diagram: Optimized Compiler Pipeline
Let’s visualize the optimized data flow:
flowchart TD
A[Input Source: File/Stdin] --> B[BufReader]
B -->|Chunked Input Stream| C{Lexer}
C -->|"Zero-Copy Tokens (TokenSpan)"| D{Parser}
D -->|"AST Nodes (Arena-Allocated)"| E[AST]
E --> F[Validator]
E --> G[Rule Engine]
G --> H[Formatter]
H --> I[Output: Corrected Mermaid/Diagnostics]
subgraph Optimizations
C -.->|memchr| C
D -.->|bumpalo::Arena| D
E -.->|"Cow<'a, str>"| E
end
style A fill:#f9f,stroke:#333,stroke-width:2px
style I fill:#f9f,stroke:#333,stroke-width:2px
style C fill:#ccf,stroke:#333,stroke-width:2px
style D fill:#ccf,stroke:#333,stroke-width:2px
style E fill:#ccf,stroke:#333,stroke-width:2px
style F fill:#cfc,stroke:#333,stroke-width:2px
style G fill:#cfc,stroke:#333,stroke-width:2px
style H fill:#cfc,stroke:#333,stroke-width:2px
style Optimizations fill:#eee,stroke:#999,stroke-dasharray: 5 5
File Structure Changes
We will primarily modify existing files: src/lexer.rs, src/parser.rs, and src/ast.rs. We’ll also add a new file for benchmarking.
First, let’s add the necessary crates to our Cargo.toml: bumpalo for arena allocation, memchr for fast byte searching, and criterion for benchmarking.
Open Cargo.toml and add these dependencies:
# Cargo.toml
[dependencies]
# Existing dependencies
# ...
clap = { version = "4.5", features = ["derive"] }
miette = { version = "7.2", features = ["fancy", "supports-color"] }
thiserror = "1.0"
log = "0.4"
env_logger = "0.11"

# New dependencies for performance
bumpalo = "3.16" # For arena allocation
memchr = "2.7"   # For fast byte searching

[dev-dependencies]
criterion = { version = "0.5", features = ["html_reports"] }

[[bench]]
name = "parsing_benchmark"
harness = false
Explanation:
bumpalo: A fast, thread-local arena allocator. Ideal for ASTs where nodes have a shared lifetime.
memchr: Provides highly optimized routines for searching for bytes or byte sequences in memory.
criterion: A robust benchmarking harness for Rust. The html_reports feature generates interactive HTML reports.
The [[bench]] section tells Cargo about our benchmark target and disables the default test harness for it, allowing criterion to take over.
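Criterion will give us statistically sound numbers once the bench file exists; in the meantime, a crude `std::time::Instant` check is a useful sanity test for the order of magnitude. This is a throwaway sketch: the byte-counting loop merely stands in for the real lexing workload.

```rust
use std::time::Instant;

fn main() {
    // Synthetic workload: a repeated minimal diagram (~150 KB).
    let input = "graph TD\nA-->B\n".repeat(10_000);

    let start = Instant::now();
    // Stand-in for driving the lexer: a single linear pass over the bytes.
    let newlines = input.bytes().filter(|&b| b == b'\n').count();
    let elapsed = start.elapsed();

    assert_eq!(newlines, 20_000);
    println!("scanned {} bytes in {:?} ({} newlines)", input.len(), elapsed, newlines);
}
```

Unlike this one-shot timing, criterion runs the workload many times, discards warm-up iterations, and reports confidence intervals, which is why we adopt it for the real benchmarks.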
Our current lexer reads the entire input string. We need to refactor it to work with a BufReader and return TokenSpans that refer to the original buffer.
a) Modify src/lexer.rs - TokenSpan and Token
We’ll introduce a TokenSpan struct that holds the start and end byte indices within the input buffer. This replaces storing String directly in Token.
// src/lexer.rs
use std::borrow::Cow;
use std::fmt;
use std::io::{self, BufReader, Read};
use std::ops::Range;

use memchr::memchr; // Import memchr
/// Represents a span of bytes in the original input source.
/// This is crucial for zero-copy lexing, as tokens will refer to these spans
/// instead of owning duplicated string data.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct TokenSpan {
    /// Start byte index (inclusive)
    pub start: usize,
    /// End byte index (exclusive)
    pub end: usize,
}

impl TokenSpan {
    /// Creates a new `TokenSpan`.
    pub fn new(start: usize, end: usize) -> Self {
        Self { start, end }
    }

    /// Returns the length of the span.
    pub fn len(&self) -> usize {
        self.end - self.start
    }

    /// Checks if the span is empty.
    pub fn is_empty(&self) -> bool {
        self.len() == 0
    }

    /// Resolves the span to a string slice from the given source.
    pub fn resolve<'a>(&self, source: &'a str) -> &'a str {
        &source[self.start..self.end]
    }
}

// Existing TokenKind enum (truncated for brevity)
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum TokenKind {
    // Keywords
    Graph, Flowchart, Sequence, Class, State, ErDiagram, GitGraph,
    Gantt, Pie, C4Diagram, Journey, Requirement,
    // Operators/Delimiters
    ArrowLeft, ArrowRight, ArrowBoth, ArrowOpen, ArrowX,
    Colon, Semicolon, Comma, Dot,
    // Brackets
    ParenOpen, ParenClose, BracketOpen, BracketClose,
    CurlyOpen, CurlyClose, SquareOpen, SquareClose,
    // Identifiers and Literals
    Identifier,
    StringLiteral, // e.g., "This is a label"
    NumberLiteral, // e.g., 123
    // Whitespace and Comments
    Whitespace, Comment, Newline,
    // Special
    Eof, Invalid,
}

/// Represents a single token produced by the lexer.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Token {
    pub kind: TokenKind,
    pub span: TokenSpan, // a TokenSpan instead of an owned String
}

impl Token {
    pub fn new(kind: TokenKind, span: TokenSpan) -> Self {
        Self { kind, span }
    }
}

// ... (rest of the file will be modified)
Explanation:
TokenSpan: This struct is the core of our zero-copy strategy. Instead of Token directly holding String data, it now holds a TokenSpan which is just a pair of usize indices. This means no heap allocations are made for token values during lexing.
resolve: A helper method to get the actual &str slice from the original input when needed.
memchr: Imported for efficient byte searching.
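The `memchr` crate's central function is `memchr::memchr(needle, haystack)`, which returns `Option<usize>` with the index of the first matching byte. The std-only sketch below (the `find_byte` helper is hypothetical) shows the access pattern the lexer can use to jump straight to a line boundary; `memchr::memchr` is a drop-in replacement for the iterator search, with SIMD-accelerated scanning.

```rust
/// Finds the next occurrence of `needle` in `haystack` at or after `from`.
/// Stand-in for `memchr::memchr`, which performs the same search much faster.
fn find_byte(needle: u8, haystack: &[u8], from: usize) -> Option<usize> {
    haystack[from..].iter().position(|&b| b == needle).map(|i| from + i)
}

fn main() {
    let src = b"graph TD\nA-->B\n";
    // Jump to the end of each line instead of testing every byte
    // inside the tokenizer's hot loop.
    let first_eol = find_byte(b'\n', src, 0);
    let second_eol = find_byte(b'\n', src, 9);
    assert_eq!(first_eol, Some(8));
    assert_eq!(second_eol, Some(14));
    println!("line boundaries at {:?} and {:?}", first_eol, second_eol);
}
```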
b) Modify src/lexer.rs - Lexer struct and next_token
The Lexer struct needs to be refactored to work with a BufReader and maintain an internal buffer, and to produce Tokens with TokenSpans.
// src/lexer.rs
// ... (imports and TokenSpan, TokenKind, Token structs from above)
/// Represents a lexical error.
#[derive(Debug, miette::Diagnostic, thiserror::Error)]
#[error("Lexical error")]
#[diagnostic(code(mermaid::lexer::error))]
pub struct LexerError {
    #[source_code]
    pub src: miette::NamedSource<String>,
    #[label("This character is not allowed")]
    pub bad_char_span: miette::SourceSpan,
    #[help]
    pub help: Option<String>,
}

/// The Lexer for Mermaid diagrams.
/// It reads from an `io::Read` source, buffers it, and produces tokens.
pub struct Lexer<R: Read> {
    reader: BufReader<R>,
    /// Internal buffer for chunks of input.
    buffer: Vec<u8>,
    /// Current position within `buffer`.
    cursor: usize,
    /// Total bytes consumed since the start of the stream.
    offset: usize,
    /// Byte offsets at which each line begins.
    line_start_offsets: Vec<usize>,
    current_line: usize,
    eof_reached: bool,
}

impl<R: Read> Lexer<R> {
    /// Creates a new Lexer instance.
    pub fn new(reader: R) -> Self {
        Self {
            reader: BufReader::new(reader),
            buffer: Vec::with_capacity(4096), // initial buffer capacity
            cursor: 0,
            offset: 0,
            line_start_offsets: vec![0], // the first line starts at offset 0
            current_line: 1,
            eof_reached: false,
        }
    }

    /// Refills the internal buffer once it has been exhausted.
    /// Returns `true` if data is available, `false` once EOF is reached.
    fn fill_buffer(&mut self) -> io::Result<bool> {
        if self.cursor >= self.buffer.len() {
            self.buffer.clear();
            self.cursor = 0;
            // Simplification: read the rest of the stream in one call. A fully
            // chunked strategy would read fixed-size blocks and carry over the
            // bytes of any token that straddles a refill boundary.
            let bytes_read = self.reader.read_to_end(&mut self.buffer)?;
            if bytes_read == 0 {
                self.eof_reached = true;
                return Ok(false); // EOF reached and buffer is empty
            }
        }
        Ok(true)
    }

    /// Peeks at the byte at the current cursor position (`None` at EOF).
    fn peek(&mut self) -> io::Result<Option<u8>> {
        self.fill_buffer()?;
        if self.cursor < self.buffer.len() {
            Ok(Some(self.buffer[self.cursor]))
        } else {
            Ok(None)
        }
    }

    /// Consumes the byte at the cursor and advances (`None` at EOF).
    fn consume(&mut self) -> io::Result<Option<u8>> {
        self.fill_buffer()?;
        if self.cursor < self.buffer.len() {
            let byte = self.buffer[self.cursor];
            self.cursor += 1;
            self.offset += 1; // keep the absolute stream offset in sync
            Ok(Some(byte))
        } else {
            Ok(None)
        }
    }

    /// Consumes bytes while `predicate` holds; returns the covered span.
    fn consume_while(&mut self, predicate: impl Fn(u8) -> bool) -> io::Result<TokenSpan> {
        let start_offset = self.offset;
        while let Some(b) = self.peek()? {
            if predicate(b) {
                self.consume()?;
            } else {
                break;
            }
        }
        Ok(TokenSpan::new(start_offset, self.offset))
    }

    /// Consumes `expected` if it is the next byte; reports whether it did.
    fn consume_char(&mut self, expected: u8) -> io::Result<bool> {
        if let Some(c) = self.peek()? {
            if c == expected {
                self.consume()?;
                return Ok(true);
            }
        }
        Ok(false)
    }

    /// Matches a raw byte slice against the keyword table.
    fn try_match_keyword(&self, slice: &[u8]) -> Option<TokenKind> {
        match slice {
            b"graph" => Some(TokenKind::Graph),
            b"flowchart" => Some(TokenKind::Flowchart),
            b"sequence" => Some(TokenKind::Sequence),
            b"class" => Some(TokenKind::Class),
            b"state" => Some(TokenKind::State),
            b"erDiagram" => Some(TokenKind::ErDiagram),
            b"gitGraph" => Some(TokenKind::GitGraph),
            b"gantt" => Some(TokenKind::Gantt),
            b"pie" => Some(TokenKind::Pie),
            b"c4diagram" => Some(TokenKind::C4Diagram),
            b"journey" => Some(TokenKind::Journey),
            b"requirement" => Some(TokenKind::Requirement),
            _ => None,
        }
    }

    /// Wraps an `io::Error` in a `LexerError` pointing at `offset`.
    /// The source text is a placeholder; the CLI layer, which owns the
    /// full input, substitutes the real `NamedSource` for display.
    fn io_err(&self, offset: usize, e: io::Error) -> LexerError {
        LexerError {
            src: miette::NamedSource::new("input", String::new()),
            bad_char_span: miette::SourceSpan::from(offset..offset + 1),
            help: Some(format!("I/O error: {}", e)),
        }
    }

    /// Lexes the next token from the input stream.
    pub fn next_token(&mut self) -> Result<Token, LexerError> {
        loop {
            let start_offset = self.offset;
            let start_cursor = self.cursor;
            let current_byte = match self.peek() {
                Ok(Some(b)) => b,
                Ok(None) if self.eof_reached => {
                    return Ok(Token::new(
                        TokenKind::Eof,
                        TokenSpan::new(start_offset, start_offset),
                    ))
                }
                Ok(None) => continue, // buffer was empty but not EOF; retry
                Err(e) => return Err(self.io_err(start_offset, e)),
            };
            let token_kind: TokenKind;
            match current_byte {
                b' ' | b'\t' => {
                    // We emit whitespace tokens for full fidelity; the parser
                    // simply skips them.
                    self.consume_while(|b| b == b' ' || b == b'\t')
                        .map_err(|e| self.io_err(start_offset, e))?;
                    token_kind = TokenKind::Whitespace;
                }
                b'\n' | b'\r' => {
                    self.consume().map_err(|e| self.io_err(start_offset, e))?;
                    // Fold CRLF into a single Newline token.
                    if current_byte == b'\r' {
                        self.consume_char(b'\n')
                            .map_err(|e| self.io_err(start_offset, e))?;
                    }
                    self.current_line += 1;
                    self.line_start_offsets.push(self.offset);
                    return Ok(Token::new(
                        TokenKind::Newline,
                        TokenSpan::new(start_offset, self.offset),
                    ));
                }
                b'-' => {
                    // Arrow family: `->`, `-->`, `--->`, `-.->`. Dash runs that
                    // do not complete an arrow fall back to Identifier; a
                    // stricter lexer would push bytes back or scan ahead.
                    self.consume().map_err(|e| self.io_err(start_offset, e))?;
                    if self.consume_char(b'-').map_err(|e| self.io_err(start_offset, e))? {
                        if self.consume_char(b'>').map_err(|e| self.io_err(start_offset, e))? {
                            token_kind = TokenKind::ArrowRight; // `-->`
                        } else if self.consume_char(b'-').map_err(|e| self.io_err(start_offset, e))? {
                            if self.consume_char(b'>').map_err(|e| self.io_err(start_offset, e))? {
                                token_kind = TokenKind::ArrowRight; // `--->`
                            } else {
                                token_kind = TokenKind::Identifier; // bare `---`
                            }
                        } else {
                            token_kind = TokenKind::Identifier; // bare `--`
                        }
                    } else if self.consume_char(b'.').map_err(|e| self.io_err(start_offset, e))? {
                        if self.consume_char(b'-').map_err(|e| self.io_err(start_offset, e))?
                            && self.consume_char(b'>').map_err(|e| self.io_err(start_offset, e))?
                        {
                            token_kind = TokenKind::ArrowRight; // `-.->`
                        } else {
                            token_kind = TokenKind::Identifier; // `-.` without a head
                        }
                    } else if self.consume_char(b'>').map_err(|e| self.io_err(start_offset, e))? {
                        token_kind = TokenKind::ArrowRight; // `->`
                    } else {
                        token_kind = TokenKind::Identifier; // lone `-`
                    }
                }
                b'<' => {
                    // `<->`, `<--`, `<---`
                    self.consume().map_err(|e| self.io_err(start_offset, e))?;
                    if self.consume_char(b'-').map_err(|e| self.io_err(start_offset, e))? {
                        if self.consume_char(b'>').map_err(|e| self.io_err(start_offset, e))? {
                            token_kind = TokenKind::ArrowBoth; // `<->`
                        } else if self.consume_char(b'-').map_err(|e| self.io_err(start_offset, e))? {
                            // Optional third dash: `<---`
                            self.consume_char(b'-').map_err(|e| self.io_err(start_offset, e))?;
                            token_kind = TokenKind::ArrowLeft; // `<--` or `<---`
                        } else {
                            token_kind = TokenKind::Identifier; // `<-` alone
                        }
                    } else {
                        token_kind = TokenKind::Identifier; // lone `<`
                    }
                }
                b'=' => {
                    // `==>` is a thick arrow; `=` and `==` fall back to Identifier.
                    self.consume().map_err(|e| self.io_err(start_offset, e))?;
                    if self.consume_char(b'=').map_err(|e| self.io_err(start_offset, e))?
                        && self.consume_char(b'>').map_err(|e| self.io_err(start_offset, e))?
                    {
                        token_kind = TokenKind::ArrowRight; // `==>`
                    } else {
                        token_kind = TokenKind::Identifier;
                    }
                }
                b'x' => {
                    // `x--x` (cross on both ends)
                    self.consume().map_err(|e| self.io_err(start_offset, e))?;
                    if self.consume_char(b'-').map_err(|e| self.io_err(start_offset, e))?
                        && self.consume_char(b'-').map_err(|e| self.io_err(start_offset, e))?
                        && self.consume_char(b'x').map_err(|e| self.io_err(start_offset, e))?
                    {
                        token_kind = TokenKind::ArrowX;
                    } else {
                        token_kind = TokenKind::Identifier; // just `x`, `x-`, etc.
                    }
                }
                b'o' => {
                    // `o--o` (circles on both ends) or `o--` (open arrow tail)
                    self.consume().map_err(|e| self.io_err(start_offset, e))?;
                    if self.consume_char(b'-').map_err(|e| self.io_err(start_offset, e))?
                        && self.consume_char(b'-').map_err(|e| self.io_err(start_offset, e))?
                    {
                        if self.consume_char(b'o').map_err(|e| self.io_err(start_offset, e))? {
                            token_kind = TokenKind::ArrowBoth; // `o--o`
                        } else {
                            token_kind = TokenKind::ArrowOpen; // `o--`
                        }
                    } else {
                        token_kind = TokenKind::Identifier; // just `o` or `o-`
                    }
                }
                b':' => { self.consume().map_err(|e| self.io_err(start_offset, e))?; token_kind = TokenKind::Colon; }
                b';' => { self.consume().map_err(|e| self.io_err(start_offset, e))?; token_kind = TokenKind::Semicolon; }
                b',' => { self.consume().map_err(|e| self.io_err(start_offset, e))?; token_kind = TokenKind::Comma; }
                b'.' => { self.consume().map_err(|e| self.io_err(start_offset, e))?; token_kind = TokenKind::Dot; }
                b'(' => { self.consume().map_err(|e| self.io_err(start_offset, e))?; token_kind = TokenKind::ParenOpen; }
                b')' => { self.consume().map_err(|e| self.io_err(start_offset, e))?; token_kind = TokenKind::ParenClose; }
                b'[' => { self.consume().map_err(|e| self.io_err(start_offset, e))?; token_kind = TokenKind::BracketOpen; }
                b']' => { self.consume().map_err(|e| self.io_err(start_offset, e))?; token_kind = TokenKind::BracketClose; }
                b'{' => { self.consume().map_err(|e| self.io_err(start_offset, e))?; token_kind = TokenKind::CurlyOpen; }
                b'}' => { self.consume().map_err(|e| self.io_err(start_offset, e))?; token_kind = TokenKind::CurlyClose; }
                b'"' => {
                    // String literal; the span covers the contents without quotes.
                    self.consume().map_err(|e| self.io_err(start_offset, e))?; // opening quote
                    let string_start_offset = self.offset;
                    let mut escaped = false;
                    while let Some(b) = self.peek().map_err(|e| self.io_err(start_offset, e))? {
                        if b == b'"' && !escaped {
                            break;
                        }
                        escaped = if b == b'\\' { !escaped } else { false };
                        self.consume().map_err(|e| self.io_err(start_offset, e))?;
                    }
                    let string_end_offset = self.offset;
                    if self.consume_char(b'"').map_err(|e| self.io_err(start_offset, e))? {
                        return Ok(Token::new(
                            TokenKind::StringLiteral,
                            TokenSpan::new(string_start_offset, string_end_offset),
                        ));
                    }
                    // Unclosed string literal
                    return Err(LexerError {
                        src: miette::NamedSource::new("input", String::new()),
                        bad_char_span: miette::SourceSpan::from(start_offset..self.offset),
                        help: Some("Unclosed string literal. Expected '\"'.".to_string()),
                    });
                }
                b'%' => {
                    // Comments start with `%%` and run to the end of the line;
                    // the span covers the text after the `%%` marker.
                    self.consume().map_err(|e| self.io_err(start_offset, e))?;
                    if self.consume_char(b'%').map_err(|e| self.io_err(start_offset, e))? {
                        let comment_start_offset = self.offset;
                        self.consume_while(|b| b != b'\n' && b != b'\r')
                            .map_err(|e| self.io_err(start_offset, e))?;
                        return Ok(Token::new(
                            TokenKind::Comment,
                            TokenSpan::new(comment_start_offset, self.offset),
                        ));
                    }
                    // A single '%' is not valid Mermaid.
                    return Err(LexerError {
                        src: miette::NamedSource::new("input", String::new()),
                        bad_char_span: miette::SourceSpan::from(start_offset..self.offset),
                        help: Some("Unexpected character '%'. Mermaid comments start with '%%'.".to_string()),
                    });
                }
                b'0'..=b'9' => {
                    // Number literal
                    let span = self
                        .consume_while(|b| b.is_ascii_digit())
                        .map_err(|e| self.io_err(start_offset, e))?;
                    return Ok(Token::new(TokenKind::NumberLiteral, span));
                }
                _ => {
                    // Identifier or invalid character. Note that `-` and `.` are
                    // deliberately excluded from the identifier set so arrows and
                    // dots lex as their own tokens.
                    let span = self
                        .consume_while(|b| b.is_ascii_alphanumeric() || b == b'_')
                        .map_err(|e| self.io_err(start_offset, e))?;
                    if span.is_empty() {
                        // No identifier characters: consume and report the bad byte.
                        self.consume().map_err(|e| self.io_err(start_offset, e))?;
                        return Err(LexerError {
                            src: miette::NamedSource::new("input", String::new()),
                            bad_char_span: miette::SourceSpan::from(start_offset..self.offset),
                            help: Some(format!("Unexpected character: '{}'", current_byte as char)),
                        });
                    }
                    // Check the raw bytes against the keyword table; no String
                    // is allocated for the comparison.
                    let slice = &self.buffer[start_cursor..self.cursor];
                    token_kind = match self.try_match_keyword(slice) {
                        Some(kind) => kind,
                        None => TokenKind::Identifier,
                    };
                }
            }
            return Ok(Token::new(token_kind, TokenSpan::new(start_offset, self.offset)));
        }
    }

    /// Maps a byte offset to a 1-based (line, column) pair.
    pub fn get_line_col(&self, offset: usize) -> (usize, usize) {
        let line_index = self
            .line_start_offsets
            .iter()
            .rposition(|&start| start <= offset)
            .unwrap_or(0);
        let line_start_offset = self.line_start_offsets[line_index];
        (line_index + 1, offset - line_start_offset + 1)
    }
}
Explanation of Lexer Changes:
Generics R: Read: The Lexer now takes any type that implements std::io::Read, making it flexible for files, stdin, or memory buffers.
BufReader: Wraps the R with BufReader for efficient chunked reading.
buffer: Vec<u8>: An internal buffer stores chunks of bytes read from BufReader.
cursor and offset: cursor tracks the position within buffer; offset tracks the total number of bytes consumed since the start of the input stream.
fill_buffer: This critical method ensures buffer always has data if available. It reads from reader into buffer when cursor reaches the end.
peek and consume: These methods operate on the buffer and automatically call fill_buffer when needed. They return io::Result<Option<u8>> to handle I/O errors and EOF.
consume_while: A helper that consumes bytes as long as a predicate is true, returning the TokenSpan.
TokenSpan usage: Instead of token.value = "...", we now have token.span = TokenSpan::new(start_offset, self.offset). This is the zero-copy part.
Keyword Matching: Keywords are now matched against &[u8] slices from the buffer, avoiding String conversions until absolutely necessary.
Error Handling: I/O errors are propagated as LexerError that wraps io::Error. miette::NamedSource is a placeholder for now, as the actual source content is not stored in the lexer (it’s streamed). This will be handled by the CLI layer which holds the full source.
Arrow and Literal Handling: The logic for identifying complex arrows (-->, x--x, etc.) and string/number literals is updated to use peek and consume on bytes.
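One detail worth calling out: diagnostics need line and column numbers, and the lexer recovers them from line_start_offsets alone. Isolated as a free function for illustration (the `line_col` helper below is hypothetical, mirroring get_line_col), the lookup is just a reverse scan for the last line start at or before the offset:

```rust
/// Maps a byte offset to a 1-based (line, column) pair, given the byte
/// offsets at which each line starts. Mirrors the lexer's bookkeeping.
fn line_col(line_starts: &[usize], offset: usize) -> (usize, usize) {
    let line = line_starts
        .iter()
        .rposition(|&start| start <= offset)
        .unwrap_or(0);
    (line + 1, offset - line_starts[line] + 1)
}

fn main() {
    // For "graph TD\nA-->B", line 2 starts at byte offset 9.
    let line_starts = [0, 9];
    assert_eq!(line_col(&line_starts, 0), (1, 1));  // 'g'
    assert_eq!(line_col(&line_starts, 9), (2, 1));  // 'A'
    assert_eq!(line_col(&line_starts, 10), (2, 2)); // first '-'
    println!("offset-to-position mapping works");
}
```

Because line starts are recorded in increasing order, this could use `binary_search` instead of `rposition` for very large inputs; the linear reverse scan keeps the sketch simple.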
c) Testing This Component
To test the lexer, we can create a small test in src/lexer.rs that reads from a Cursor (which implements io::Read).
// src/lexer.rs (add this at the bottom, inside `#[cfg(test)] mod tests { ... }`)
#[cfg(test)]
mod tests {
    use super::*;
    use std::io::Cursor;

    fn get_lexer(input: &str) -> Lexer<Cursor<Vec<u8>>> {
        Lexer::new(Cursor::new(input.as_bytes().to_vec()))
    }

    #[test]
    fn test_lexer_empty_input() {
        let mut lexer = get_lexer("");
        let token = lexer.next_token().unwrap();
        assert_eq!(token.kind, TokenKind::Eof);
        assert_eq!(token.span, TokenSpan::new(0, 0));
    }

    #[test]
    fn test_lexer_simple_flowchart() {
        let input = "graph TD\n A-->B";
        let mut lexer = get_lexer(input);
        let source_str = input; // For resolving spans

        let token = lexer.next_token().unwrap();
        assert_eq!(token.kind, TokenKind::Graph);
        assert_eq!(token.span.resolve(source_str), "graph");

        let token = lexer.next_token().unwrap();
        assert_eq!(token.kind, TokenKind::Whitespace);
        assert_eq!(token.span.resolve(source_str), " ");

        let token = lexer.next_token().unwrap();
        assert_eq!(token.kind, TokenKind::Identifier);
        assert_eq!(token.span.resolve(source_str), "TD");

        let token = lexer.next_token().unwrap();
        assert_eq!(token.kind, TokenKind::Newline);

        let token = lexer.next_token().unwrap();
        assert_eq!(token.kind, TokenKind::Whitespace);
        assert_eq!(token.span.resolve(source_str), " ");

        let token = lexer.next_token().unwrap();
        assert_eq!(token.kind, TokenKind::Identifier);
        assert_eq!(token.span.resolve(source_str), "A");

        let token = lexer.next_token().unwrap();
        assert_eq!(token.kind, TokenKind::ArrowRight);
        assert_eq!(token.span.resolve(source_str), "-->");

        let token = lexer.next_token().unwrap();
        assert_eq!(token.kind, TokenKind::Identifier);
        assert_eq!(token.span.resolve(source_str), "B");

        let token = lexer.next_token().unwrap();
        assert_eq!(token.kind, TokenKind::Eof);
    }

    #[test]
    fn test_lexer_arrows() {
        let input = "A-->B\nC<--D\nE<->F\nG==>H\nI x--x J\nK o--o L\nM o-- N";
        let mut lexer = get_lexer(input);

        // A-->B
        lexer.next_token().unwrap(); // A
        assert_eq!(lexer.next_token().unwrap().kind, TokenKind::ArrowRight);
        lexer.next_token().unwrap(); // B
        lexer.next_token().unwrap(); // Newline

        // C<--D
        lexer.next_token().unwrap(); // C
        assert_eq!(lexer.next_token().unwrap().kind, TokenKind::ArrowLeft);
        lexer.next_token().unwrap(); // D
        lexer.next_token().unwrap(); // Newline

        // E<->F
        lexer.next_token().unwrap(); // E
        assert_eq!(lexer.next_token().unwrap().kind, TokenKind::ArrowBoth);
        lexer.next_token().unwrap(); // F
        lexer.next_token().unwrap(); // Newline

        // G==>H
        lexer.next_token().unwrap(); // G
        assert_eq!(lexer.next_token().unwrap().kind, TokenKind::ArrowRight);
        lexer.next_token().unwrap(); // H
        lexer.next_token().unwrap(); // Newline

        // I x--x J
        lexer.next_token().unwrap(); // I
        lexer.next_token().unwrap(); // Whitespace
        assert_eq!(lexer.next_token().unwrap().kind, TokenKind::ArrowX);
        lexer.next_token().unwrap(); // Whitespace
        lexer.next_token().unwrap(); // J
        lexer.next_token().unwrap(); // Newline

        // K o--o L
        lexer.next_token().unwrap(); // K
        lexer.next_token().unwrap(); // Whitespace
        assert_eq!(lexer.next_token().unwrap().kind, TokenKind::ArrowBoth); // o--o
        lexer.next_token().unwrap(); // Whitespace
        lexer.next_token().unwrap(); // L
        lexer.next_token().unwrap(); // Newline

        // M o-- N
        lexer.next_token().unwrap(); // M
        lexer.next_token().unwrap(); // Whitespace
        assert_eq!(lexer.next_token().unwrap().kind, TokenKind::ArrowOpen); // o--
        lexer.next_token().unwrap(); // Whitespace
        lexer.next_token().unwrap(); // N
        lexer.next_token().unwrap(); // Eof
    }

    #[test]
    fn test_lexer_string_literal() {
        let input = "graph TD\nA[\"Hello World\"]";
        let mut lexer = get_lexer(input);
        let source_str = input;
        lexer.next_token().unwrap(); // graph
        lexer.next_token().unwrap(); // Whitespace
        lexer.next_token().unwrap(); // TD
        lexer.next_token().unwrap(); // Newline
        lexer.next_token().unwrap(); // A
        lexer.next_token().unwrap(); // [
        let token = lexer.next_token().unwrap();
        assert_eq!(token.kind, TokenKind::StringLiteral);
        assert_eq!(token.span.resolve(source_str), "Hello World");
        lexer.next_token().unwrap(); // ]
        lexer.next_token().unwrap(); // Eof
    }

    #[test]
    fn test_lexer_comment() {
        let input = "graph TD\n%% This is a comment\nA-->B";
        let mut lexer = get_lexer(input);
        let source_str = input;
        lexer.next_token().unwrap(); // graph
        lexer.next_token().unwrap(); // Whitespace
        lexer.next_token().unwrap(); // TD
        lexer.next_token().unwrap(); // Newline
        let token = lexer.next_token().unwrap();
        assert_eq!(token.kind, TokenKind::Comment);
        assert_eq!(token.span.resolve(source_str), " This is a comment");
        lexer.next_token().unwrap(); // Newline
        lexer.next_token().unwrap(); // A
        lexer.next_token().unwrap(); // ArrowRight
        lexer.next_token().unwrap(); // B
        lexer.next_token().unwrap(); // Eof
    }

    #[test]
    fn test_lexer_unclosed_string() {
        let input = "graph TD\nA[\"Unclosed string]";
        let mut lexer = get_lexer(input);
        let _ = lexer.next_token().unwrap(); // graph
        let _ = lexer.next_token().unwrap(); // Whitespace
        let _ = lexer.next_token().unwrap(); // TD
        let _ = lexer.next_token().unwrap(); // Newline
        let _ = lexer.next_token().unwrap(); // A
        let _ = lexer.next_token().unwrap(); // [
        let error = lexer.next_token();
        assert!(error.is_err());
        assert_eq!(
            error.unwrap_err().help.unwrap(),
            "Unclosed string literal. Expected '\"'."
        );
    }

    #[test]
    fn test_lexer_invalid_char() {
        let input = "graph TD\nA#B";
        let mut lexer = get_lexer(input);
        let _ = lexer.next_token().unwrap(); // graph
        let _ = lexer.next_token().unwrap(); // Whitespace
        let _ = lexer.next_token().unwrap(); // TD
        let _ = lexer.next_token().unwrap(); // Newline
        let _ = lexer.next_token().unwrap(); // A
        let error = lexer.next_token();
        assert!(error.is_err());
        assert_eq!(error.unwrap_err().help.unwrap(), "Unexpected character: '#'");
    }
}
Debugging Tips:
Use dbg!(&token) to inspect tokens and their spans.
Step through the next_token method with a debugger to see how peek and consume interact with the buffer.
Ensure TokenSpan::resolve correctly extracts the string slice from your test input.
3. Core Implementation: Parser with Arena Allocation
Now we need to update our AST nodes and Parser to use a bumpalo arena (the Bump type) for allocation and Cow<'a, str> for strings where appropriate. This ties our AST to a lifetime 'a, typically the lifetime of the input &str or of the arena itself.
a) Modify src/ast.rs - Lifetimes and Cow
We’ll introduce a lifetime parameter 'a to our AST structs and use Cow<'a, str> for string fields.
// src/ast.rs
use std::borrow::Cow;

use crate::lexer::TokenSpan; // Import TokenSpan

/// The top-level representation of a Mermaid diagram.
#[derive(Debug, PartialEq, Eq, Clone)]
pub struct Diagram<'a> {
    pub diagram_type: DiagramType,
    pub statements: Vec<Statement<'a>>,
    pub span: TokenSpan,
}

#[derive(Debug, PartialEq, Eq, Clone)]
pub enum DiagramType {
    Flowchart,
    Sequence,
    Class,
    // ... other diagram types
}

/// Represents a single statement in the diagram (e.g., node definition, edge, subgraph).
#[derive(Debug, PartialEq, Eq, Clone)]
pub enum Statement<'a> {
    Node(Node<'a>),
    Edge(Edge<'a>),
    Subgraph(Subgraph<'a>),
    Direction(Direction),
    Comment(TokenSpan), // Comments are just spans
    // ... other statement types
}

/// Represents a node in a diagram.
#[derive(Debug, PartialEq, Eq, Clone)]
pub struct Node<'a> {
    pub id: Cow<'a, str>,            // Node ID, typically borrowed
    pub label: Option<Cow<'a, str>>, // Node label, can be borrowed or owned if fixed
    pub shape: NodeShape,
    pub span: TokenSpan,
}

#[derive(Debug, PartialEq, Eq, Clone)]
pub enum NodeShape {
    Rectangle,
    RoundedRectangle,
    // ... other shapes
    Default, // No explicit shape defined
}

/// Represents an edge between two nodes.
#[derive(Debug, PartialEq, Eq, Clone)]
pub struct Edge<'a> {
    pub from: Cow<'a, str>,          // Source node ID
    pub to: Cow<'a, str>,            // Target node ID
    pub label: Option<Cow<'a, str>>, // Edge label
    pub arrow_type: ArrowType,
    pub span: TokenSpan,
}

#[derive(Debug, PartialEq, Eq, Clone)]
pub enum ArrowType {
    Normal,
    Open,
    Cross,
    // ... other arrow types
}

/// Represents a subgraph.
#[derive(Debug, PartialEq, Eq, Clone)]
pub struct Subgraph<'a> {
    pub id: Cow<'a, str>,
    pub title: Option<Cow<'a, str>>,
    pub statements: Vec<Statement<'a>>,
    pub span: TokenSpan,
}

#[derive(Debug, PartialEq, Eq, Clone)]
pub enum Direction {
    TD,
    TB,
    LR,
    RL,
    BT,
}

// ... Add other AST structs as needed, ensuring they also use lifetimes and Cow<'a, str>.
// For example, if you have a `ClassMember` in a Class Diagram AST:
// pub struct ClassMember<'a> {
// pub visibility: Option<Cow<'a, str>>,
// pub member_type: Cow<'a, str>,
// pub name: Cow<'a, str>,
// pub span: TokenSpan,
// }
Explanation of AST Changes:
Lifetime Parameter 'a: All AST structs that contain borrowed data (slices from the input or arena) now have a lifetime parameter 'a. This ties the AST’s lifetime to the input source or the arena.
Cow<'a, str>: Instead of String, we use Cow<'a, str>.
Cow::Borrowed(&'a str): If the string data comes directly from the input buffer and doesn’t need modification, we can borrow it. This is zero-cost.
Cow::Owned(String): If the string needs to be modified (e.g., a rule fixes a label by adding quotes), we can convert it to Cow::Owned(String). This incurs a heap allocation, but only when strictly necessary.
TokenSpan: AST nodes now store TokenSpan for their overall position, allowing precise error reporting and source mapping.
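To make the Borrowed/Owned distinction concrete, here is a small self-contained sketch. The `normalize_label` helper is hypothetical, purely for illustration, and not part of our parser: a label that needs no change is returned as a zero-copy borrow, and a heap allocation happens only when the text is actually rewritten.

```rust
use std::borrow::Cow;

// Hypothetical helper, for illustration only: quote a label that contains
// spaces. Labels that need no change come back as Cow::Borrowed and never
// allocate; only a label that is actually rewritten becomes Cow::Owned.
fn normalize_label(label: &str) -> Cow<'_, str> {
    if label.contains(' ') && !label.starts_with('"') {
        Cow::Owned(format!("\"{}\"", label)) // heap allocation, only when needed
    } else {
        Cow::Borrowed(label) // zero-cost view into the original input
    }
}

fn main() {
    let source = "A[Hello World]";
    let plain = normalize_label(&source[..1]);   // "A": unchanged, borrowed
    let fixed = normalize_label(&source[2..13]); // "Hello World": quoted, owned
    assert!(matches!(plain, Cow::Borrowed(_)));
    assert!(matches!(fixed, Cow::Owned(_)));
    println!("{} / {}", plain, fixed);
}
```

The same pattern is what lets a fixer rule in a later pass replace a single label with `Cow::Owned` while every untouched field keeps borrowing from the input.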
b) Modify src/parser.rs - Parser struct and AST construction
The Parser will now hold a shared reference to a bumpalo::Bump arena and use it when allocating AST nodes.
// src/parser.rs
use crate::ast::*; // Import all AST types
use crate::lexer::{Lexer, LexerError, Token, TokenKind, TokenSpan};
use bumpalo::Bump; // Import Bump (the arena)
use miette::{Diagnostic, NamedSource, SourceSpan};
use std::borrow::Cow;
use std::collections::HashMap;
use thiserror::Error;

/// Represents a parsing error.
#[derive(Debug, Clone, PartialEq, Eq, Diagnostic, Error)]
#[error("Parsing error")]
#[diagnostic(code(mermaid::parser::error))]
pub struct ParserError {
    #[source_code]
    pub src: NamedSource,
    #[label("Here")]
    pub bad_token_span: SourceSpan,
    pub expected: Option<String>,
    pub found: Option<String>,
    pub help: Option<String>,
}

/// The Parser for Mermaid diagrams.
/// It takes a lexer and an arena allocator to build the AST.
pub struct Parser<'a, R: std::io::Read> {
    lexer: Lexer<R>,
    current_token: Option<Token>,
    arena: &'a Bump,         // Arena allocator for AST nodes
    source_content: &'a str, // Entire source content, for span resolution
}

impl<'a, R: std::io::Read> Parser<'a, R> {
    /// Creates a new Parser instance.
    pub fn new(lexer: Lexer<R>, arena: &'a Bump, source_content: &'a str) -> Result<Self, LexerError> {
        let mut parser = Self {
            lexer,
            current_token: None,
            arena,
            source_content,
        };
        parser.advance()?; // Prime the parser with the first token
        Ok(parser)
    }

    /// Advances the lexer and updates the current token.
    fn advance(&mut self) -> Result<(), LexerError> {
        loop {
            let token = self.lexer.next_token()?;
            if token.kind == TokenKind::Whitespace {
                // Skip whitespace tokens; they matter for formatting, not for parsing.
                continue;
            }
            self.current_token = Some(token);
            break;
        }
        Ok(())
    }

    /// Peeks at the current token without consuming it.
    fn peek(&self) -> &Token {
        self.current_token
            .as_ref()
            .expect("Parser's current_token should always be Some except at EOF or after an error")
    }

    /// Consumes the current token and returns it.
    fn consume(&mut self) -> Result<Token, ParserError> {
        let token = self
            .current_token
            .take()
            .expect("Failed to consume token: current_token was None");
        self.advance().map_err(|e| self.lexer_error_to_parser_error(e))?;
        Ok(token)
    }

    /// Consumes the current token if its kind matches the expected kind.
    /// Returns the consumed token or a ParserError.
    fn expect(&mut self, expected_kind: TokenKind) -> Result<Token, ParserError> {
        let current_token = self.peek();
        if current_token.kind == expected_kind {
            self.consume()
        } else {
            let found_kind = current_token.kind.clone();
            let span = current_token.span;
            Err(ParserError {
                src: NamedSource::new("input", self.source_content.to_string()),
                bad_token_span: span.into(),
                expected: Some(format!("{:?}", expected_kind)),
                found: Some(format!("{:?}", found_kind)),
                help: Some(format!(
                    "Expected token {:?}, but found {:?}",
                    expected_kind, found_kind
                )),
            })
        }
    }

    /// Helper to convert LexerError to ParserError, adding source content.
    fn lexer_error_to_parser_error(&self, lex_err: LexerError) -> ParserError {
        ParserError {
            src: NamedSource::new("input", self.source_content.to_string()),
            bad_token_span: lex_err.bad_char_span,
            expected: None,
            found: None,
            help: lex_err.help,
        }
    }

    /// Parses the entire Mermaid diagram.
    pub fn parse(&mut self) -> Result<Diagram<'a>, ParserError> {
        let start_span = self.peek().span;
        let diagram_type_token = self.expect_any_keyword()?; // e.g., 'graph', 'flowchart'
        let diagram_type = self.parse_diagram_type(&diagram_type_token)?;
        let mut statements = Vec::new();
        while self.peek().kind != TokenKind::Eof {
            // Newlines separate statements but are not added to the AST.
            if self.peek().kind == TokenKind::Newline {
                self.consume()?;
                continue;
            }
            if self.peek().kind == TokenKind::Comment {
                let comment_token = self.consume()?;
                statements.push(Statement::Comment(comment_token.span));
                continue;
            }
            // Try to parse different statement types.
            if let Ok(node) = self.parse_node_definition() {
                statements.push(Statement::Node(node));
            } else if let Ok(edge) = self.parse_edge() {
                statements.push(Statement::Edge(edge));
            } else if let Ok(subgraph) = self.parse_subgraph() {
                statements.push(Statement::Subgraph(subgraph));
            } else if let Ok(direction) = self.parse_direction() {
                statements.push(Statement::Direction(direction));
            } else {
                // If we can't parse anything, it's an unexpected token.
                let bad_token = self.consume()?;
                return Err(ParserError {
                    src: NamedSource::new("input", self.source_content.to_string()),
                    bad_token_span: bad_token.span.into(),
                    expected: Some("node, edge, subgraph, or direction".to_string()),
                    found: Some(format!("{:?}", bad_token.kind)),
                    help: Some("Unexpected token in diagram body. Expected a statement.".to_string()),
                });
            }
        }
        let end_span = self.peek().span; // Eof token span
        let diagram_span = TokenSpan::new(start_span.start, end_span.end);
        Ok(Diagram {
            diagram_type,
            statements,
            span: diagram_span,
        })
    }

    /// Expects a diagram type keyword (e.g., 'graph', 'flowchart').
    fn expect_any_keyword(&mut self) -> Result<Token, ParserError> {
        let token = self.peek();
        match token.kind {
            TokenKind::Graph
            | TokenKind::Flowchart
            | TokenKind::Sequence
            | TokenKind::Class
            | TokenKind::State
            | TokenKind::ErDiagram
            | TokenKind::GitGraph
            | TokenKind::Gantt
            | TokenKind::Pie
            | TokenKind::C4Diagram
            | TokenKind::Journey
            | TokenKind::Requirement => self.consume(),
            _ => Err(ParserError {
                src: NamedSource::new("input", self.source_content.to_string()),
                bad_token_span: token.span.into(),
                expected: Some("diagram type keyword (e.g., 'graph', 'flowchart')".to_string()),
                found: Some(format!("{:?}", token.kind)),
                help: Some("Mermaid diagram must start with a diagram type declaration.".to_string()),
            }),
        }
    }

    fn parse_diagram_type(&self, token: &Token) -> Result<DiagramType, ParserError> {
        match token.kind {
            TokenKind::Graph | TokenKind::Flowchart => Ok(DiagramType::Flowchart),
            TokenKind::Sequence => Ok(DiagramType::Sequence),
            TokenKind::Class => Ok(DiagramType::Class),
            // ... map other keywords to DiagramType
            _ => Err(ParserError {
                src: NamedSource::new("input", self.source_content.to_string()),
                bad_token_span: token.span.into(),
                expected: Some("valid diagram type keyword".to_string()),
                found: Some(format!("{:?}", token.kind)),
                help: Some("Unsupported or invalid diagram type.".to_string()),
            }),
        }
    }

    /// Parses a node definition (e.g., `A[Label]`).
    fn parse_node_definition(&mut self) -> Result<Node<'a>, ParserError> {
        let start_span = self.peek().span;
        let id_token = self.expect(TokenKind::Identifier)?;
        let id = Cow::Borrowed(id_token.span.resolve(self.source_content));
        let mut label: Option<Cow<'a, str>> = None;
        let mut shape = NodeShape::Default;
        if self.peek().kind == TokenKind::BracketOpen {
            self.consume()?; // Consume '['
            // Labels can sometimes be unquoted identifiers.
            let label_token = self
                .expect(TokenKind::StringLiteral)
                .or_else(|_| self.expect(TokenKind::Identifier))?;
            label = Some(Cow::Borrowed(label_token.span.resolve(self.source_content)));
            self.expect(TokenKind::BracketClose)?; // Consume ']'
            shape = NodeShape::Rectangle; // Default shape for bracketed labels
        }
        // ... extend with other shapes like `((Label))` for rounded, `{Label}` for cylinder, etc.
        let end_span = if let Some(last_token) = &self.current_token {
            last_token.span
        } else {
            id_token.span // Fallback if no more tokens after ID
        };
        let node_span = TokenSpan::new(start_span.start, end_span.end);
        Ok(Node {
            id,
            label,
            shape,
            span: node_span,
        })
    }

    /// Parses an edge definition (e.g., `A-->B`).
    fn parse_edge(&mut self) -> Result<Edge<'a>, ParserError> {
        let start_span = self.peek().span;
        let from_token = self.expect(TokenKind::Identifier)?;
        let from = Cow::Borrowed(from_token.span.resolve(self.source_content));
        let arrow_token = self.expect_any_arrow_type()?;
        let arrow_type = self.parse_arrow_type(&arrow_token)?;
        let to_token = self.expect(TokenKind::Identifier)?;
        let to = Cow::Borrowed(to_token.span.resolve(self.source_content));
        let edge_span = TokenSpan::new(start_span.start, to_token.span.end);
        Ok(Edge {
            from,
            to,
            label: None, // Add label parsing later if needed
            arrow_type,
            span: edge_span,
        })
    }

    fn expect_any_arrow_type(&mut self) -> Result<Token, ParserError> {
        let token = self.peek();
        match token.kind {
            TokenKind::ArrowRight
            | TokenKind::ArrowLeft
            | TokenKind::ArrowBoth
            | TokenKind::ArrowOpen
            | TokenKind::ArrowX => self.consume(),
            _ => Err(ParserError {
                src: NamedSource::new("input", self.source_content.to_string()),
                bad_token_span: token.span.into(),
                expected: Some("an arrow type (e.g., '-->', '<--')".to_string()),
                found: Some(format!("{:?}", token.kind)),
                help: Some("Expected an arrow to connect nodes.".to_string()),
            }),
        }
    }

    fn parse_arrow_type(&self, token: &Token) -> Result<ArrowType, ParserError> {
        match token.kind {
            // For now, treat left/right/both as Normal; direction is handled by from/to.
            TokenKind::ArrowRight | TokenKind::ArrowLeft | TokenKind::ArrowBoth => Ok(ArrowType::Normal),
            TokenKind::ArrowOpen => Ok(ArrowType::Open),
            TokenKind::ArrowX => Ok(ArrowType::Cross),
            _ => unreachable!("Should only be called with arrow tokens"),
        }
    }

    /// Parses a subgraph definition. (Simplified for brevity.)
    fn parse_subgraph(&mut self) -> Result<Subgraph<'a>, ParserError> {
        // This is a simplified placeholder; real subgraph parsing is recursive.
        let start_span = self.peek().span;
        self.expect(TokenKind::Subgraph)?; // Expect 'subgraph' keyword
        let id_token = self.expect(TokenKind::Identifier)?;
        let id = Cow::Borrowed(id_token.span.resolve(self.source_content));

        // Optional title
        let mut title: Option<Cow<'a, str>> = None;
        if self.peek().kind == TokenKind::StringLiteral {
            let title_token = self.consume()?;
            title = Some(Cow::Borrowed(title_token.span.resolve(self.source_content)));
        }
        self.expect(TokenKind::Newline)?; // Subgraph content starts on a new line

        let mut statements = Vec::new();
        // Consume statements until 'end' or Eof.
        while self.peek().kind != TokenKind::Eof && self.peek().kind != TokenKind::End {
            if self.peek().kind == TokenKind::Newline {
                self.consume()?;
                continue;
            }
            // Recursively parse inner statements (nodes, edges, nested subgraphs).
            if let Ok(node) = self.parse_node_definition() {
                statements.push(Statement::Node(node));
            } else if let Ok(edge) = self.parse_edge() {
                statements.push(Statement::Edge(edge));
            } else if let Ok(inner_subgraph) = self.parse_subgraph() {
                // Recursive call
                statements.push(Statement::Subgraph(inner_subgraph));
            } else if self.peek().kind == TokenKind::Comment {
                let comment_token = self.consume()?;
                statements.push(Statement::Comment(comment_token.span));
            } else if let Ok(direction) = self.parse_direction() {
                statements.push(Statement::Direction(direction));
            } else {
                // If we can't parse anything, break or error out.
                // For now, assume valid subgraph content or break on unknown.
                break;
            }
        }
        // Ensure the 'end' keyword is present.
        let end_token = self.expect(TokenKind::End)?;
        let subgraph_span = TokenSpan::new(start_span.start, end_token.span.end);
        Ok(Subgraph {
            id,
            title,
            statements,
            span: subgraph_span,
        })
    }

    /// Parses a direction statement (e.g., `TD`).
    fn parse_direction(&mut self) -> Result<Direction, ParserError> {
        let token = self.peek();
        match token.kind {
            // TD, LR, etc. are identifiers.
            TokenKind::Identifier => {
                let id_str = token.span.resolve(self.source_content);
                let direction = match id_str {
                    "TD" | "TB" => Direction::TD,
                    "LR" => Direction::LR,
                    "RL" => Direction::RL,
                    "BT" => Direction::BT,
                    _ => {
                        return Err(ParserError {
                            src: NamedSource::new("input", self.source_content.to_string()),
                            bad_token_span: token.span.into(),
                            expected: Some("valid direction (TD, LR, etc.)".to_string()),
                            found: Some(id_str.to_string()),
                            help: Some("Invalid diagram direction.".to_string()),
                        })
                    }
                };
                self.consume()?;
                Ok(direction)
            }
            _ => Err(ParserError {
                src: NamedSource::new("input", self.source_content.to_string()),
                bad_token_span: token.span.into(),
                expected: Some("direction identifier (TD, LR, etc.)".to_string()),
                found: Some(format!("{:?}", token.kind)),
                help: Some("Expected a diagram direction.".to_string()),
            }),
        }
    }
}
Explanation of Parser Changes:
Lifetime 'a and Bump: The Parser now holds a reference &'a Bump (the arena allocator) and a source_content: &'a str which is a reference to the entire input string. This source_content is crucial for TokenSpan::resolve.
Parser::new: Now takes the arena and source_content as arguments.
Cow::Borrowed: When creating Nodes, Edges, Subgraphs, the id, label, from, to, title fields are initialized with Cow::Borrowed(token.span.resolve(self.source_content)). This means the AST nodes directly reference parts of the original input string, avoiding new string allocations.
Error Handling: ParserError now carries the source content, enabling richer diagnostic reporting.
Allocation: The Node, Edge, and Subgraph structs are constructed on the stack and then moved into the statements Vec; the Cow variants handle the borrowing of string data. The Vec itself still grows on the heap. bumpalo would come into play if we allocated individual AST nodes directly on the arena via arena.alloc(Node { ... }); for true arena allocation, the Diagram, Statement, Node, Edge, and Subgraph values themselves would live on the arena, and Vec<Statement<'a>> would become an arena-backed slice such as &'a [Statement<'a>]. For this chapter, Cow and TokenSpan are the primary zero-copy mechanisms, and we keep Vec for simplicity: Vec allocations are typically efficient enough for collections of structured data, and the primary string data is Cow::Borrowed.
c) Testing This Component
Let’s add some basic parser tests. We’ll need to provide a Bump arena and the source content.
// src/parser.rs (add this at the bottom, inside `#[cfg(test)] mod tests { ... }`)
#[cfg(test)]
mod tests {
    use super::*;
    use bumpalo::Bump;
    use std::io::Cursor;

    fn parse_test_input(input: &str) -> Result<Diagram<'_>, ParserError> {
        let arena = Bump::new();
        let lexer = Lexer::new(Cursor::new(input.as_bytes().to_vec()));
        let mut parser = Parser::new(lexer, &arena, input).unwrap();
        parser.parse()
    }

    #[test]
    fn test_parse_simple_flowchart() {
        let input = "graph TD\n A-->B";
        let diagram = parse_test_input(input).unwrap();
        assert_eq!(diagram.diagram_type, DiagramType::Flowchart);
        assert_eq!(diagram.statements.len(), 2); // Direction(TD) and Edge(A-->B)
        if let Statement::Direction(dir) = &diagram.statements[0] {
            assert_eq!(*dir, Direction::TD);
        } else {
            panic!("Expected Direction statement");
        }
        if let Statement::Edge(edge) = &diagram.statements[1] {
            assert_eq!(edge.from, Cow::Borrowed("A"));
            assert_eq!(edge.to, Cow::Borrowed("B"));
            assert_eq!(edge.arrow_type, ArrowType::Normal);
        } else {
            panic!("Expected Edge statement");
        }
    }

    #[test]
    fn test_parse_node_with_label() {
        let input = "flowchart LR\n Node1[My Label]";
        let diagram = parse_test_input(input).unwrap();
        assert_eq!(diagram.diagram_type, DiagramType::Flowchart);
        assert_eq!(diagram.statements.len(), 2);
        if let Statement::Node(node) = &diagram.statements[1] {
            assert_eq!(node.id, Cow::Borrowed("Node1"));
            assert_eq!(node.label, Some(Cow::Borrowed("My Label")));
            assert_eq!(node.shape, NodeShape::Rectangle);
        } else {
            panic!("Expected Node statement");
        }
    }

    #[test]
    fn test_parse_subgraph() {
        let input = r#"graph TD
subgraph My Subgraph
    A --> B
end"#;
        let diagram = parse_test_input(input).unwrap();
        assert_eq!(diagram.diagram_type, DiagramType::Flowchart);
        assert_eq!(diagram.statements.len(), 2);
        if let Statement::Subgraph(subgraph) = &diagram.statements[1] {
            assert_eq!(subgraph.id, Cow::Borrowed("My")); // Simplified ID for now
            assert_eq!(subgraph.title, Some(Cow::Borrowed(" Subgraph"))); // Simplified title for now
            assert_eq!(subgraph.statements.len(), 1); // A --> B inside
            if let Statement::Edge(edge) = &subgraph.statements[0] {
                assert_eq!(edge.from, Cow::Borrowed("A"));
                assert_eq!(edge.to, Cow::Borrowed("B"));
            } else {
                panic!("Expected Edge inside subgraph");
            }
        } else {
            panic!("Expected Subgraph statement");
        }
    }

    #[test]
    fn test_parse_invalid_start() {
        let input = "A-->B"; // Missing graph/flowchart declaration
        let error = parse_test_input(input);
        assert!(error.is_err());
        let err = error.unwrap_err();
        assert_eq!(
            err.help.unwrap(),
            "Mermaid diagram must start with a diagram type declaration.".to_string()
        );
    }
}
Debugging Tips:
Print the Diagram struct after parsing with dbg!(&diagram).
Carefully check the Cow::Borrowed values to ensure they correctly reference the input string.
If you encounter lifetime errors, double-check that all Cow and &'a str fields are consistently using the same lifetime parameter 'a and that the Bump arena is passed correctly.
4. Core Implementation: Benchmarking Setup
We’ll use criterion to measure the performance of our lexer and parser.
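Criterion benchmarks require a [[bench]] target with the default libtest harness disabled. A typical Cargo.toml addition looks like the following sketch (the criterion version shown is illustrative; pick a current release):

```toml
[dev-dependencies]
criterion = "0.5"

[[bench]]
name = "parsing_benchmark"
harness = false
```

With `harness = false`, `cargo bench` hands control to criterion's own `criterion_main!` entry point instead of the built-in test runner.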
a) Create benches/parsing_benchmark.rs
// benches/parsing_benchmark.rs
use bumpalo::Bump;
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use mermaid_tool::lexer::Lexer; // Assuming `mermaid_tool` is your crate name
use mermaid_tool::parser::Parser;
use std::fs;
use std::io::Cursor;

// A small sample for quick checks
const SMALL_MERMAID: &str = r#"
graph TD
    A[Start] --> B(Process)
    B --> C{Decision?}
    C -- Yes --> D[End]
    C -- No --> E[Retry]
    E --> B
"#;

// A larger sample for more realistic benchmarks
const LARGE_MERMAID_PATH: &str = "benches/large_diagram.mmd";

fn lexer_benchmark(c: &mut Criterion) {
    let small_input = SMALL_MERMAID.to_string();
    let large_input = fs::read_to_string(LARGE_MERMAID_PATH)
        .expect("Failed to read large diagram file. Create one at benches/large_diagram.mmd");

    let mut group = c.benchmark_group("Lexer Performance");
    group.sample_size(100); // A larger sample size gives more stable results

    group.bench_function("lexer_small_input", |b| {
        b.iter(|| {
            let cursor = Cursor::new(black_box(small_input.as_bytes().to_vec()));
            let mut lexer = Lexer::new(cursor);
            let mut tokens = Vec::new();
            while lexer.peek().unwrap().is_some() {
                // Check for EOF without consuming
                tokens.push(lexer.next_token().unwrap());
            }
            black_box(tokens);
        })
    });
    group.bench_function("lexer_large_input", |b| {
        b.iter(|| {
            let cursor = Cursor::new(black_box(large_input.as_bytes().to_vec()));
            let mut lexer = Lexer::new(cursor);
            let mut tokens = Vec::new();
            while lexer.peek().unwrap().is_some() {
                tokens.push(lexer.next_token().unwrap());
            }
            black_box(tokens);
        })
    });
    group.finish();
}

fn parser_benchmark(c: &mut Criterion) {
    let small_input = SMALL_MERMAID.to_string();
    let large_input = fs::read_to_string(LARGE_MERMAID_PATH)
        .expect("Failed to read large diagram file. Create one at benches/large_diagram.mmd");

    let mut group = c.benchmark_group("Parser Performance");
    group.sample_size(100);

    group.bench_function("parser_small_input", |b| {
        b.iter(|| {
            let arena = Bump::new(); // Allocate a fresh arena for each iteration
            let cursor = Cursor::new(black_box(small_input.as_bytes().to_vec()));
            let lexer = Lexer::new(cursor);
            let mut parser = Parser::new(lexer, &arena, &small_input).unwrap();
            black_box(parser.parse().unwrap());
        })
    });
    group.bench_function("parser_large_input", |b| {
        b.iter(|| {
            let arena = Bump::new();
            let cursor = Cursor::new(black_box(large_input.as_bytes().to_vec()));
            let lexer = Lexer::new(cursor);
            let mut parser = Parser::new(lexer, &arena, &large_input).unwrap();
            black_box(parser.parse().unwrap());
        })
    });
    group.finish();
}

criterion_group!(benches, lexer_benchmark, parser_benchmark);
criterion_main!(benches);
b) Create a large diagram file
Create benches/large_diagram.mmd with a substantial Mermaid diagram. You can generate one programmatically or paste a large example.
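If you prefer to generate the file programmatically, a small throwaway sketch like the following works; the `generate_large_diagram` helper is our own one-off, not part of the tool:

```rust
use std::fmt::Write as _;

// Generates a flowchart with `n` chained edges: N0 --> N1 --> ... --> Nn.
fn generate_large_diagram(n: usize) -> String {
    let mut out = String::from("graph TD\n");
    for i in 0..n {
        // `writeln!` into a String cannot fail, so unwrap is safe here.
        writeln!(out, "    N{} --> N{}", i, i + 1).unwrap();
    }
    out
}

fn main() {
    let diagram = generate_large_diagram(10_000);
    assert!(diagram.starts_with("graph TD\n"));
    assert_eq!(diagram.lines().count(), 10_001); // header + 10_000 edge lines
    // Write it out for the benchmarks, e.g.:
    // std::fs::write("benches/large_diagram.mmd", &diagram).unwrap();
}
```

A 10,000-edge chain is a reasonable stress case; vary `n` to see how lexing and parsing time scale with input size.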