Block Structure Parsing

The first pass of pulldown-cmark's parsing process handles block-level elements and constructs the basic document structure.

It roughly corresponds to Phase 1 of the CommonMark spec appendix "A Parsing Strategy". This chapter explains how the block parsing works in pulldown-cmark.

Overview

Block parsing is implemented in firstpass.rs and has two main responsibilities:

Identifying block-level elements like paragraphs, lists, and code blocks
Building a tree structure representing the nesting of these blocks

The block parser operates line-by-line, maintaining a stack of currently open blocks (called the "spine") and handling both container blocks (like blockquotes) and leaf blocks (like paragraphs).

Block Types

The main block types handled by the first pass are:

Container blocks:
- Block quotes
- Lists (ordered and unordered)
- List items
- Footnote definitions
Leaf blocks:
- Paragraphs
- Headings (ATX and Setext style)
- Code blocks (fenced and indented)
- HTML blocks
- Thematic breaks (horizontal rules)
- Tables (with GFM extension)

The Parsing Process

The block parsing process works like this:

Input text is processed line by line
For each line:
- Check if it continues any blocks from the current spine
- Scan for the start of any new blocks
- Handle transitions between blocks
- Track indentation and container prefixes
Build tree nodes for each block encountered
Handle tight/loose list detection

Here's a simplified example of how a nested list is parsed:

- First item
  - Nested item
    with continuation
- Second item

The parser:

Recognizes the first - as starting a list and list item
Sees the next - as starting a nested list
Identifies the indented line as continuing the nested item
Recognizes the unindented - as closing the nested list

Tree Construction

The block structure is stored in a Tree<Item> where each node contains:

#![allow(unused)]
fn main() {
struct Item {
    start: usize,      // Start byte offset
    end: usize,        // End byte offset  
    body: ItemBody,    // Type and attributes
}

struct Node<T> {
    child: Option<TreeIndex>,  // First child node
    next: Option<TreeIndex>,   // Next sibling
    item: T,                   // Node data (T = Item in our tree)
}
}

The tree is built incrementally as blocks are parsed. Key operations:

push(): Move down into a new block's children
pop(): Move back up to the parent block
append(): Add a new sibling block
truncate_siblings(): End open blocks at a certain point

Container Block Handling

Container blocks like blockquotes and lists require special handling:

Track container prefixes (>, -, 1., etc)
Calculate correct indentation levels
Handle lazy continuation lines
Determine tight/loose status for lists

The LineStart struct helps manage this by:

Tracking indentation and remaining space
Scanning container markers
Handling tab stops correctly

Leaf Block Processing

Leaf blocks are handled by specific scanner functions that:

Identify the block type
Calculate its bounds
Handle internal structure like table columns
Manage transitions between blocks

For example, table parsing:

#![allow(unused)]
fn main() {
pub(crate) fn scan_table_head(data: &[u8]) -> (usize, Vec<Alignment>) {
    // Check initial conditions
    let (mut i, spaces) = calc_indent(data, 4);
    if spaces > 3 || i == data.len() {
        return (0, vec![]);
    }
    
    // Parse cells and alignments
    let mut cols = vec![];
    let mut active_col = Alignment::None;
    // ...

    // Return parsed structure
    (i, cols)
}
}

Error Recovery

The parser is designed to be robust and recover from invalid syntax:

Malformed containers fall back to paragraphs
Invalid indentation is normalized
Unclosed blocks are implicitly closed
HTML parsing has fallback modes

This ensures it can handle real-world Markdown without failing.

Interfacing with Inline Parsing

The block parser prepares for inline parsing by:

Identifying inline-containing blocks
Marking potential inline boundaries (e.g. MaybeLinkOpen)
Providing context (like table cells)
Tracking source positions

The tree structure is then used by the inline parser to process inline elements within the appropriate blocks.

Implementation Notes

Some key implementation details:

Line scanning is optimized using SIMD on x86_64
The tree structure uses indexed nodes to avoid lifetimes
Container context is maintained in a stack-like structure
Source positions are tracked for use by the inline parser and OffsetIter

The block parser aims to be:

Fast for common cases
Memory efficient
Robust against bad input
Compliant with CommonMark

pulldown-cmark guide