Developer Guide
pulldown-cmark uses a two-pass parsing strategy with a pull parser architecture to parse Markdown efficiently into a stream of events that can be rendered as HTML. This guide explains the library's internals for developers who want to contribute or to understand how it works.
High-Level Architecture
The parser operates in two main passes:
- First Pass (Block Structure): The first pass scans the document and builds a tree structure representing block-level elements such as paragraphs, lists, and code blocks. This establishes the hierarchical structure of the document.
- Second Pass (Inline Processing): The second pass processes inline elements such as emphasis, links, and code spans within the blocks identified by the first pass. This is done in a streaming fashion as events are requested.
The library uses a pull parser design, which means:
- Instead of pushing events to a callback or building a complete AST, it provides an iterator interface that lets consumers pull events as needed
- It enables flexible transformation of the event stream before rendering
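The pull model described above can be sketched with a toy iterator. This is only an illustration using the standard library: `ToyEvent` and `ToyParser` are hypothetical names invented for this example and are not pulldown-cmark's types. Each non-empty line is treated as a paragraph, and nothing is parsed until the consumer asks for the next event.

```rust
// Toy illustration only: these types are hypothetical, not pulldown-cmark's.
#[derive(Debug, PartialEq)]
enum ToyEvent<'a> {
    Start,         // a paragraph opens
    Text(&'a str), // paragraph content, borrowed from the input
    End,           // the paragraph closes
}

/// Treats every non-empty line as one paragraph and yields its events lazily.
struct ToyParser<'a> {
    lines: std::str::Lines<'a>,
    // Events already produced for the current paragraph, stored in reverse order.
    pending: Vec<ToyEvent<'a>>,
}

impl<'a> ToyParser<'a> {
    fn new(input: &'a str) -> Self {
        ToyParser { lines: input.lines(), pending: Vec::new() }
    }
}

impl<'a> Iterator for ToyParser<'a> {
    type Item = ToyEvent<'a>;

    fn next(&mut self) -> Option<Self::Item> {
        // Hand out queued events first.
        if let Some(ev) = self.pending.pop() {
            return Some(ev);
        }
        // Otherwise read only as much input as the next paragraph needs.
        loop {
            let line = self.lines.next()?.trim();
            if line.is_empty() {
                continue;
            }
            self.pending.push(ToyEvent::End);
            self.pending.push(ToyEvent::Text(line));
            return Some(ToyEvent::Start);
        }
    }
}

fn main() {
    let parser = ToyParser::new("hello\n\nworld");
    // The consumer decides how far to pull; the second paragraph is never parsed here.
    let first: Vec<_> = parser.take(3).collect();
    assert_eq!(
        first,
        vec![ToyEvent::Start, ToyEvent::Text("hello"), ToyEvent::End]
    );
}
```

Because the parser is an ordinary `Iterator`, the consumer can stop early, filter, or map events without an intermediate AST being built.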
Key components:
Parser
: The main entry point that implements the Iterator trait for Events
Tree
: A Vec-based data structure that holds the block structure
Event
: An enum representing the different Markdown elements
HtmlWriter
: Renders the event stream as HTML
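To make the idea of a Vec-based tree concrete, here is a minimal sketch in which nodes live in a single `Vec` and refer to each other by index. The names `VecTree`, `Node`, `append`, and `children` are assumptions made for this example; the library's actual layout is defined in tree.rs and may differ.

```rust
// Hypothetical sketch of a Vec-backed tree; names and fields are assumptions.
#[derive(Debug)]
struct Node<T> {
    item: T,
    child: Option<usize>, // index of the first child, if any
    next: Option<usize>,  // index of the next sibling, if any
}

#[derive(Debug)]
struct VecTree<T> {
    nodes: Vec<Node<T>>,
}

impl<T> VecTree<T> {
    fn new() -> Self {
        VecTree { nodes: Vec::new() }
    }

    /// Stores `item` and, if `parent` is given, links it as that node's last child.
    /// Nodes added without a parent are stored but not linked to anything.
    fn append(&mut self, parent: Option<usize>, item: T) -> usize {
        let ix = self.nodes.len();
        self.nodes.push(Node { item, child: None, next: None });
        if let Some(p) = parent {
            let first = self.nodes[p].child;
            match first {
                None => self.nodes[p].child = Some(ix),
                Some(mut cur) => {
                    // Walk the sibling chain to its end, then link the new node.
                    while let Some(n) = self.nodes[cur].next {
                        cur = n;
                    }
                    self.nodes[cur].next = Some(ix);
                }
            }
        }
        ix
    }

    /// Returns the indices of the direct children of `parent`, in order.
    fn children(&self, parent: usize) -> Vec<usize> {
        let mut out = Vec::new();
        let mut cur = self.nodes[parent].child;
        while let Some(ix) = cur {
            out.push(ix);
            cur = self.nodes[ix].next;
        }
        out
    }
}

fn main() {
    let mut tree = VecTree::new();
    let doc = tree.append(None, "document");
    let list = tree.append(Some(doc), "list");
    tree.append(Some(doc), "paragraph");
    tree.append(Some(list), "item");
    for ix in tree.children(doc) {
        println!("{}", tree.nodes[ix].item);
    }
}
```

Index-based links keep all nodes in one contiguous allocation, which is one plausible reason to prefer a `Vec` over a tree of individually boxed nodes.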
Performance Characteristics
The parser is designed for high performance:
- Parsing time is intended to be linear in the size of the input text
- String handling uses copy-on-write semantics to avoid unnecessary allocations
- SIMD optimizations are available for scanning text on x86_64
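The copy-on-write idea can be illustrated with the standard library's `std::borrow::Cow`, used here as a stand-in for the library's own string types (which, per the directory listing, live in strings.rs). The helper `expand_tabs` is a hypothetical function written for this example; it allocates only when the input actually needs to change.

```rust
use std::borrow::Cow;

/// Replaces each tab with a single space, allocating only if a tab is present.
/// (Hypothetical helper for illustration; not part of pulldown-cmark.)
fn expand_tabs(input: &str) -> Cow<'_, str> {
    if input.contains('\t') {
        // The text must change, so an owned copy is unavoidable.
        Cow::Owned(input.replace('\t', " "))
    } else {
        // Nothing to change: hand back a borrow of the original text.
        Cow::Borrowed(input)
    }
}

fn main() {
    let unchanged = expand_tabs("plain text");
    let changed = expand_tabs("a\tb");
    assert!(matches!(unchanged, Cow::Borrowed(_)));
    assert!(matches!(changed, Cow::Owned(_)));
    assert_eq!(changed, "a b");
}
```

Since most Markdown text passes through a parser unmodified, the borrowed case is likely to be the common one, and that is where the allocation savings would come from.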
Extending the Parser
The parser can be extended in several ways:
- New syntax extensions can be added by implementing new scan functions
- The event stream can be transformed using Iterator adaptors
- Custom renderers can be built by consuming events
- The HTML renderer can be customized through options
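Two of the extension points above, transforming the event stream with Iterator adaptors and building a custom renderer, can be sketched with a toy event type. `ToyEvent`, `soft_breaks_to_spaces`, and `render_plain` are hypothetical names for this example only; with the real library the same shape would apply to the events produced by `Parser`.

```rust
// Toy stand-in for an event type; not pulldown-cmark's `Event`.
#[derive(Debug, Clone, PartialEq)]
enum ToyEvent {
    Text(String),
    SoftBreak,
}

/// Adaptor step: rewrite every soft break as a single space.
fn soft_breaks_to_spaces(
    events: impl Iterator<Item = ToyEvent>,
) -> impl Iterator<Item = ToyEvent> {
    events.map(|ev| match ev {
        ToyEvent::SoftBreak => ToyEvent::Text(" ".to_string()),
        other => other,
    })
}

/// Custom renderer: consume events and produce upper-cased plain text.
fn render_plain(events: impl Iterator<Item = ToyEvent>) -> String {
    let mut out = String::new();
    for ev in events {
        match ev {
            ToyEvent::Text(t) => out.push_str(&t.to_uppercase()),
            ToyEvent::SoftBreak => out.push('\n'),
        }
    }
    out
}

fn main() {
    let events = vec![
        ToyEvent::Text("hello".to_string()),
        ToyEvent::SoftBreak,
        ToyEvent::Text("world".to_string()),
    ];
    // Transform first, then render; neither step knows about the other.
    let rendered = render_plain(soft_breaks_to_spaces(events.into_iter()));
    assert_eq!(rendered, "HELLO WORLD");
}
```

Keeping the transformation and the renderer as separate functions over an iterator means either can be swapped out independently.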
Directory Structure
src/
  firstpass.rs - First pass block structure parsing
  scanners.rs - Low-level text scanning functions
  parse.rs - Main parser implementation
  html.rs - HTML renderer
  tree.rs - Tree data structure
  entities.rs - HTML entity handling
  strings.rs - String types and utilities
Subsequent chapters cover each of these components in detail: