This is a working draft. To take part in this work, join the Markdown discussion list.
Copyright © 2008 Michel Fortin.
You are free to share and make adaptations of this document under the terms
of the Creative Commons Attribution 2.5 Canada License.
This specification defines how to read a Markdown Extra document, how to construct the document model, and how to translate it to HTML. This document aims at being a superset of the Markdown Syntax Documentation from Daring Fireball.
This is an early draft! Implementors should be reminded that this documentation is not stable. If you wish to implement this spec, you should join the Markdown discussion list to be aware of the latest directions and development.
This specification intends to become a reference on how to parse Markdown Extra documents. In the absence of more precise Markdown Syntax Documentation, it is also the intention that this specification can be used as a reference for how to parse a plain Markdown document.
The first goal of this document is not to add new features, nor redefine how Markdown or Markdown Extra documents should be parsed, but to specify the syntax in a way which can improve interoperability between implementations while breaking the smallest number of existing Markdown and Markdown Extra documents.
[To be added]
This section is non-normative.
Markdown was originally two things: a lightweight markup syntax introduced in 2004 by John Gruber for writing on the web, and a converter tool of the same name also written by John Gruber. Markdown Extra is a syntax based on Markdown that extends it with new "extra" features such as tables and definition lists.
While Markdown and, to some extent, Markdown Extra became widely popular as a formatting tool for blog entries and other web documents, it became apparent that the syntax specification was inadequate to create fully interoperable implementations of Markdown. Using the original implementation as a reference could provide some guidance, but obvious bugs prevented its output from being a trustworthy reference.
This is the syntax specification for Markdown Extra which aims at fully defining how to parse the initial text document to build the document model. Since Markdown Extra is based on Markdown, it can also be used as a reference as to how to do the same for a Markdown document if desired; this specification aims at making this easy.
This section is non-normative.
This specification describes the Markdown Extra document model and how to parse a stream of characters to create the document model.
This specification imposes no requirement about how the document model should be implemented programmatically.
This specification also suggests how to serialize the document model to other formats, such as HTML4. There is no requirement about the given output, only conventions which implementers are encouraged to follow.
Note: should there be requirements about the output?
This specification comprises three main sections:
Parsers for the Markdown Extra syntax must parse documents as described in this specification, or in a way that produces the same document model. Since this specification has no requirement as to how the document model is represented inside a program, implementors are also free to completely bypass the model and generate the output directly as long as the output accurately represents the model of the given document.
Markdown Extra doesn't define any conformance requirements for documents. Any character input can form a Markdown Extra document and be sent to the parser, which should parse it to the end and produce a Markdown Extra document model.
The Markdown Extra document model is a tree structure where the root is the document itself, and children of the root are various syntax elements. Most syntax elements may contain other syntax elements as their children. For instance, a list usually has many list item children; a paragraph may have many text nodes, code span nodes, link nodes, etc. interleaved.
Here is a sample Markdown document:
Some text and [a link][1]
<hr class="section-separator">
* List item 1
* List item 2
[1]: http://example.com "Example web site"
This document starts with a paragraph containing some text and a link with some more text in it, followed by an HTML block containing a single hr HTML element, followed by a list containing two items each having some text in them, and ending with a link reference.
This document's tree could be illustrated like this:
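As a minimal sketch only, the tree described above can be built and printed in Python. The `Node` class, the `render` helper, and the node kind names are assumptions made for illustration; this specification does not prescribe any representation of the document model.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node: an element kind, optional text, and children."""
    kind: str
    text: str = ""
    children: list = field(default_factory=list)


def render(node, depth=0):
    """Return the tree as a list of lines, indented two spaces per level."""
    label = node.kind + (f": {node.text}" if node.text else "")
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# The sample document: a paragraph, an HTML block, a list, a link reference.
sample = Node("document", children=[
    Node("paragraph", children=[
        Node("text", "Some text and "),
        Node("link", "reference 1", children=[Node("text", "a link")]),
    ]),
    Node("html block", '<hr class="section-separator">'),
    Node("list", children=[
        Node("list item", children=[Node("text", "List item 1")]),
        Node("list item", children=[Node("text", "List item 2")]),
    ]),
    Node("link reference", '1 -> http://example.com "Example web site"'),
])

print("\n".join(render(sample)))
```

The root is the only node without a parent, and every other element hangs beneath it.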
Each Markdown Extra document has one and only one document root containing the whole content of the document.
Link references do not appear in the final output, but allow reference-style link and image span elements to be given attributes by referencing them from elsewhere in the document.
A link reference is alone on a line. It begins with the reference name inside square brackets, optionally followed by a space or a no-break space, then a colon, a URI (either enclosed in angle brackets or not), and an optional title enclosed in single or double quotes, or in parentheses (the title can be preceded by a newline).
The reference name is matched case-insensitively.
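The prose description above can be approximated with a regular expression. This is a hedged sketch, not the normative grammar: the names `LINK_REFERENCE` and `parse_link_reference` are hypothetical, leading indentation is not handled, and folding the name to one case is just one assumed way to get case-insensitive matching.

```python
import re

LINK_REFERENCE = re.compile(
    r'\[([^\]\n]+)\]'          # reference name inside square brackets
    r'[ \u00A0]?:'             # optional space or no-break space, then a colon
    r'[ \t]*<?([^\s<>]+)>?'    # URI, with or without angle brackets
    r'(?:[ \t]*\n?[ \t]*'      # the optional title may start on the next line
    r'''(?:"([^"\n]*)"|'([^'\n]*)'|\(([^)\n]*)\)))?'''
    r'[ \t]*$'
)


def parse_link_reference(source):
    """Return a dict with name, url and title, or None if nothing matches."""
    m = LINK_REFERENCE.match(source)
    if m is None:
        return None
    # Only one of the three title alternatives can have matched.
    title = next((g for g in m.group(3, 4, 5) if g is not None), None)
    # Assumption: case-insensitive lookup is done by case-folding the name.
    return {"name": m.group(1).casefold(), "url": m.group(2), "title": title}


print(parse_link_reference('[1]: http://example.com "Example web site"'))
```

Note that the angle brackets and the title delimiters are kept out of the captured values.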
Abbreviation definitions denote words which are an abbreviated form of another word or group of words. The "abbreviated word" will be matched (case sensitively) against the text in each text node and, if found, enclosed in an abbreviation element.
An abbreviation definition starts with an asterisk, followed by the abbreviated word inside square brackets, optionally followed by one space or no-break space, a colon and one or more words giving the full meaning of the abbreviation.
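A minimal sketch of recognizing one abbreviation definition line, assuming a regular-expression reading of the sentence above. The names `ABBREVIATION_DEFINITION` and `parse_abbreviation_definition` are hypothetical and not part of this specification.

```python
import re

ABBREVIATION_DEFINITION = re.compile(
    r'\*\[([^\]\n]+)\]'   # asterisk, then the abbreviated word in square brackets
    r'[ \u00A0]?:'        # optional space or no-break space, then a colon
    r'[ \t]*(\S.*)$'      # one or more words giving the full meaning
)


def parse_abbreviation_definition(line):
    """Return (abbreviated word, full meaning), or None if the line does not match."""
    m = ABBREVIATION_DEFINITION.match(line)
    if m is None:
        return None
    return m.group(1), m.group(2).rstrip()


print(parse_abbreviation_definition('*[HTML]: HyperText Markup Language'))
```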
Footnote definitions provide the content of a footnote to be used when a footnote marker is found with a matching reference name.
A footnote definition starts with a footnote reference enclosed in square brackets with a caret character (^) just after the opening bracket, an optional space or no-break space character, and one or more block-level elements each having their first line, and optionally other lines, indented by one tab-length.
A paragraph starts after a blank line and ends at the first blank line. Newline characters in the paragraph are considered to be soft-wrapped, meaning that they bear no significance unless they are part of a hard line break element.
A blockquote represents a quotation of a section of text from another source.
Blockquotes are created by prefixing paragraphs, or other block elements, with a right-pointing angle bracket (>). You can nest blockquotes by adding more than one level of right-pointing angle brackets.
Inside a blockquote, you can prefix every line with an angle bracket, or only those lines starting a new block element. Contiguous block elements are considered to be inside the same blockquote if they share the same number of starting brackets.
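As an illustration, the nesting level of a line and the removal of one quoting level can be sketched as follows. The prefix pattern (an angle bracket optionally followed by one space) and the helper names `quote_depth` and `strip_blockquote_level` are assumptions made for this example.

```python
import re

# Assumed prefix shape: one angle bracket optionally followed by one space.
QUOTE_PREFIX = re.compile(r'>[ ]?')


def quote_depth(line):
    """Count how many blockquote prefixes start the line."""
    depth, pos = 0, 0
    while True:
        m = QUOTE_PREFIX.match(line, pos)
        if m is None:
            return depth
        depth, pos = depth + 1, m.end()


def strip_blockquote_level(lines):
    """Remove one level of prefix; unprefixed (lazy) lines are kept as they are."""
    stripped = []
    for line in lines:
        m = QUOTE_PREFIX.match(line)
        stripped.append(line[m.end():] if m else line)
    return stripped


quoted = ["> Quote line one", "continues lazily", "> > Nested quote"]
print(strip_blockquote_level(quoted))
```

Applying `strip_blockquote_level` once per nesting level leaves the inner content ready to be parsed as ordinary block elements.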
Headers come in two forms:
[Description forthcoming]
Hard line breaks are represented in Markdown source by two spaces preceding a newline character. This can be useful if you need to force a line break somewhere, as when writing an address:
Santa Claus
North Pole
Canada
H0H 0H0
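A minimal sketch of how a serializer might turn this convention into HTML. The function name `render_hard_breaks`, the `<br />` output form, and the tolerance for more than two trailing spaces are assumptions; the example assumes each address line except the last ends with two spaces in the source.

```python
import re


def render_hard_breaks(paragraph_source):
    """Replace each run of two or more spaces before a newline with a <br /> tag."""
    return re.sub(r' {2,}\n', '<br />\n', paragraph_source)


# Every line but the last ends with two spaces (hypothetical source text).
address = "Santa Claus  \nNorth Pole  \nCanada  \nH0H 0H0"
print(render_hard_breaks(address))
```

A newline that is not preceded by two spaces is left untouched, which keeps it soft-wrapped.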
[Description forthcoming]
Abbreviation elements are found by scanning the content of text elements for abbreviations defined in the document's abbreviation definitions.
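The scan can be sketched as splitting a text node into plain strings and abbreviation nodes. The helper name `split_abbreviations`, the tuple shape of the abbreviation node, and the use of word boundaries are assumptions for illustration; matching is case-sensitive, as the abbreviation definition section requires.

```python
import re


def split_abbreviations(text, definitions):
    """Split text into strings and ('abbr', word, meaning) tuples.

    `definitions` maps each abbreviated word to its full meaning.
    """
    if not definitions:
        return [text]
    # Try longer words first so they win over shorter ones sharing a prefix.
    words = sorted(definitions, key=len, reverse=True)
    pattern = re.compile(r'\b(' + '|'.join(re.escape(w) for w in words) + r')\b')
    nodes = []
    # With one capturing group, re.split puts the matches at odd indexes.
    for index, part in enumerate(pattern.split(text)):
        if index % 2 == 1:
            nodes.append(('abbr', part, definitions[part]))
        elif part:
            nodes.append(part)
    return nodes


definitions = {"HTML": "HyperText Markup Language"}
print(split_abbreviations("The HTML spec and html tags", definitions))
```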
[Description forthcoming]
This section explains how to build the document model from a character stream following the Markdown Extra syntax.
The following are definitions for basic syntax concepts which are reused at many places in the parsing section.
One of:
Editor note: Should we extend this to include other Unicode spaces as well? Candidates include:
Any character not matched by space
First match of:
The end-of-line construct is part of the line it ends and must not be counted as matching the line following it.
Editor note: Perhaps we should follow the lead of XML 1.1 and add the following to our list:
Any of:
Editor note: This should be updated when/if the space construct is updated to add more characters
Any of:
Editor note: This should be updated when/if the space construct is updated to add more characters
A run of one or more characters having at least one non-space character, and excluding any end-of-line.
A sequence of:
A sequence of:
A run of one or more characters, excluding any end-of-line and U+005D Closing Square Bracket.
Editor note: Should we allow closing square brackets inside when they're correctly balanced with opening ones?
A run of one or more characters, excluding any end-of-line and U+007D Right Curly Bracket.
Editor note: Should we allow closing square brackets inside when they're correctly balanced with opening ones?
[To be defined]
[To be defined]
[To be defined]
First of:
The extracted IRI is created from item 1.2 or item 2, depending on which one could actually match. The extracted IRI is stripped of any end-of-line inside it.
Editor note: 1.2 should be revised to clarify how the no blankline requirement can be parsed.
Using the block element generator, attempt to create one element.
Using the block element generator, attempt to create one element, but change the hard-block-context-line-prefix rule so that it matches the empty string when applied to the first line.
Editor note: This may need some clarification.
A sequence of:
Using the current context-line-prefix stack of the block element generator, attempt to match each rule in the stack in sequence, starting from the first-inserted rule to the last one. If any rule fails to match, this rule fails; otherwise, it matches.
If the stack is empty, this rule always matches without consuming any character.
Using the current context-line-prefix stack of the block element generator, attempt to optionally match each rule in the stack in sequence, starting from the first-inserted rule and stopping at the first one that doesn't match.
This rule never fails to match.
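The two prefix rules above can be sketched side by side. This is an illustration only: it assumes each rule in the stack is a compiled regular expression, and the function names `match_hard_prefix` and `match_soft_prefix` are hypothetical.

```python
import re


def match_hard_prefix(stack, line):
    """Match every rule in insertion order; return the count of characters
    consumed, or None as soon as one rule fails."""
    pos = 0
    for rule in stack:
        m = rule.match(line, pos)
        if m is None:
            return None
        pos = m.end()
    return pos


def match_soft_prefix(stack, line):
    """Match rules in insertion order, stopping at the first failure.
    Never fails; returns the count of characters consumed (possibly zero)."""
    pos = 0
    for rule in stack:
        m = rule.match(line, pos)
        if m is None:
            break
        pos = m.end()
    return pos


# Hypothetical stack for a blockquote nested inside another blockquote.
blockquote = re.compile(r'> ?')
stack = [blockquote, blockquote]
print(match_hard_prefix(stack, "> > text"))  # both levels present
print(match_hard_prefix(stack, "> text"))    # second level missing
print(match_soft_prefix(stack, "> text"))    # soft rule still succeeds
```

With an empty stack, both functions return zero, which corresponds to matching without consuming any character.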
A Markdown Extra character stream is parsable using three generators, one for each of the three element categories of the document model. Parsing a document is done in three steps:
Editor note: I think the document element generator and the block element generator steps could be merged eventually.
The span element generator step could also be merged if we change parsing of certain span elements to not depend on previously-encountered document elements (link references, footnotes and abbreviation definitions). But doing this without causing breaking changes is perhaps not doable.
With this generator, the whole document is scanned for document elements. At the start of each line, the parser checks if the line matches one of the three following constructs. If it does, it creates the corresponding element, and attempts to match again starting on the first line not part of the previous match.
A sequence of:
Creates a footnote definition element with item 2 filling the reference name attribute and elements generated by parsing item 8 becoming the content.
A sequence of:
Creates an abbreviation definition element with item 2 filling the abbreviated word attribute and item 7 becoming the textual content.
A sequence of:
Creates a link reference element with item 2 filling the reference name attribute, the extracted IRI from item 7 filling the URL attribute, and item 9.2 filling the title attribute.
Editor note: Needs rephrasing to exclude the optional angle brackets around the URL, and the quotes or parentheses around the title, from being part of the actual content of attributes on the link reference element.
After the document element generator, the document source is given to the block element generator after being stripped of all lines which were part of a match that produced a document element in this generator.
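A hedged sketch of this first pass: lines that produce a document element are collected as elements and withheld from the source handed to the block element generator. The driver `run_document_element_generator`, the matcher protocol (a function returning an element and a consumed line count, or None), and the sample matcher are all assumptions made for this example.

```python
import re

ABBREVIATION = re.compile(r'\*\[([^\]]+)\][ \u00A0]?:[ \t]*(.*)$')


def match_abbreviation(lines, index):
    """Hypothetical matcher: returns (element, consumed line count) or None."""
    m = ABBREVIATION.match(lines[index])
    if m is None:
        return None
    return ("abbreviation definition", m.group(1), m.group(2)), 1


def run_document_element_generator(lines, matchers):
    """Return (document elements, lines left for the block element generator)."""
    elements, remaining = [], []
    index = 0
    while index < len(lines):
        for matcher in matchers:
            result = matcher(lines, index)
            if result is not None:
                element, consumed = result
                elements.append(element)
                index += consumed  # resume on the first line after the match
                break
        else:
            remaining.append(lines[index])  # no construct matched this line
            index += 1
    return elements, remaining


source = ["The HTML spec.", "", "*[HTML]: HyperText Markup Language"]
elements, remaining = run_document_element_generator(source, [match_abbreviation])
print(elements)
print(remaining)
```

Matchers for footnote definitions and link references would plug into the same list; a matcher spanning several lines simply reports a larger consumed count.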
The block element generator is run on the whole document after the document element generator. It is also invoked when block elements need to be parsed outside of the main context (such as when processing footnotes).
After each newline, the parser checks if the line matches one of the following constructs. If it does, it creates the corresponding element, and attempts to match again starting on the first line not part of the previous match.
The block element generator possesses a context-line-prefix-stack containing a series of rules to be matched before stopping the generator and returning to the previous context. When the generator is told to ignore the context line prefix for the first line (see first-block-element), it means that while parsing the first line, the context-line-prefix-stack should be considered empty.
The block element generator is used as a parsing rule in the grammar of the document element generator and the block element generator. The block element generator matches if one of the following rules matches and creates an element.
A sequence of:
Note: a code block does not need to end with a blank line: any non-blank line not starting with the proper indent ends the code block.
Editor note: Should we really allow soft block content line prefix here? Applying the lazy syntax of a parent container to an indented code block seems rather silly and confusing. Here's an illustration:
> Code block, Line 1
Same code block, Line 2
Creates a code block element with the concatenation of all text lines in items 2.3, 2.4.3, and 2.4.4.2 as the content.
A sequence of:
Creates a code block element with the concatenation of all text lines in item 5 as the content.
A sequence of:
Creates a blockquote element with elements generated by parsing item 5 becoming the content.
A sequence of:
Creates a horizontal rule element.
A sequence of:
Creates a header element where the content is set to the result of applying the span element generator on item 3, the header level attribute is set to one if item 7.1 was matched or two if item 7.2 was matched, and the id attribute (extra) is set to item 4.1.
A sequence of:
Creates a header element where the content is set to the result of applying the span element generator on item 4, the header level attribute is set to the number of characters in item 2, and the id attribute (extra) is set to item 6.1.
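As a rough sketch, a header of this form could be recognized on a single line as shown below. It assumes the conventional Markdown Extra shape `## Title ## {#id}`; the limit of six levels, the characters accepted in an id, and the names `ATX_HEADER` and `parse_atx_header` are assumptions, and the content is returned as raw text rather than passed through the span element generator.

```python
import re

ATX_HEADER = re.compile(
    r'(#{1,6})[ \t]*'                       # opening hashes give the level
    r'(.+?)[ \t]*'                          # header content
    r'#*'                                   # optional closing hashes
    r'(?:[ \t]+\{#([-_:a-zA-Z0-9]+)\})?'    # optional id attribute (extra)
    r'[ \t]*$'
)


def parse_atx_header(line):
    """Return a dict with level, content and id, or None if the line does not match."""
    m = ATX_HEADER.match(line)
    if m is None:
        return None
    return {"level": len(m.group(1)), "content": m.group(2), "id": m.group(3)}


print(parse_atx_header("## Title ## {#intro}"))
```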
[To be defined]
[To be defined]
[To be defined]
A sequence of:
Editor note: Need a way to stop a paragraph when seeing certain constructs which should start other block-level elements without having a blank line, such as lists (when inside another list item), blockquotes, and fenced code blocks.
Creates a paragraph element where the content is obtained by running the span element generator on the concatenation of all text lines from item 2.
[To be defined]
[To be added]