**Note:** should there be requirements about the output?
### 1.2 Structure of this specification {#structure-of-this-specification}
This specification comprises three main sections:
[Document model](#document-model)
: Describes the general structure of a Markdown document, with its various
syntax elements and their properties.
[Parsing](#parsing)
: Defines how a Markdown Extra parser should read a document and extract the
various syntax elements.
[Output](#output)
: Markdown Extra documents are usually meant to be converted to another
format. This non-normative section describes a typical serialization to
HTML/XHTML.
### 1.3 Conformance requirements
Parsers for the Markdown Extra syntax must parse documents as described in this
specification, or in a way that produces the same document model. Since this
specification has no requirement as to how the document model is represented
inside a program, implementors are also free to completely bypass the model
and generate the output directly as long as the output accurately represents
the model of the given document.
Markdown Extra doesn't define any conformance requirements for documents. Any
character input can form a Markdown Extra document and be sent to the parser,
which should parse it to the end and produce a Markdown Extra document model.
2. Document model {#document-model}
---------------------
The Markdown Extra document model is a [tree structure][] where the root is
the document itself, and children of the root are various syntax elements. Most
syntax elements may contain other syntax elements as their children. For
instance a list usually has many list item children; a paragraph may have many
text nodes, code span nodes, link nodes, etc. interleaved.
Here is a sample Markdown document:
[tree structure]: http://en.wikipedia.org/wiki/Tree_structure

    Some text and [a link][1]

    <hr />

    * List item 1
    * List item 2

    [1]: http://example.com "Example web site"

This document starts with a paragraph containing some text and a link with some
more text in it, followed by an HTML block containing a single `hr` HTML
element, followed by a list containing two items each having some text in them,
and ending with a link reference.
This document's tree could be illustrated like this:
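
As a minimal sketch, the tree described above can be written as nested Python data and printed as an outline. The `Node` class, its field names, and the element kind strings are assumptions made for illustration; this specification does not prescribe any in-memory representation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node: an element kind, its attributes, its children."""
    kind: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def render(node, depth=0):
    """Return an indented outline of the tree, one line per node."""
    label = node.kind
    if node.attrs:
        label += " " + repr(node.attrs)
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# The sample document: a paragraph, an HTML block, a list, a link reference.
document = Node("document", children=[
    Node("paragraph", children=[
        Node("text", {"value": "Some text and "}),
        Node("link", {"uri": "http://example.com", "title": "Example web site"},
             [Node("text", {"value": "a link"})]),
    ]),
    Node("html block"),
    Node("list", children=[
        Node("list item", children=[Node("text", {"value": "List item 1"})]),
        Node("list item", children=[Node("text", {"value": "List item 2"})]),
    ]),
    Node("link reference", {"name": "1", "uri": "http://example.com",
                            "title": "Example web site"}),
])

print("\n".join(render(document)))
```
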
### 2.1 The document root {#the-document-root}
Context in which this element may appear:
: At the root of the document model
Content model:
: Any number of document elements and block elements in any order
Special attributes:
: None
Each Markdown Extra document has one and only one document root containing
the whole content of the document.
### 2.2 Document elements {#document-elements}
#### 2.2.1 Link reference {#link-reference}
Context in which this element may appear:
: As a direct child of the document
Content model:
: None
Special attributes:
: Reference name
: URI
: Title (optional)
Link references do not appear in the final output, but allow reference link
and image span elements to be given attributes by referencing them from
elsewhere in the document.
A link reference is alone on a line. It begins with the reference name inside
square brackets, optionally followed by a space or a no-break-space, then a
colon, a URI (either enclosed in angle brackets or not), and an optional title
enclosed in single or double quotes, or in parentheses (which can be preceded
by a newline).
The reference name is matched case-insensitively.
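
Case-insensitive matching could be implemented by normalizing reference names before storing and looking them up. The sketch below assumes a hypothetical `LinkReferences` registry and uses `str.casefold()` for the normalization; the specification does not say which case mapping applies.

```python
class LinkReferences:
    """Hypothetical registry of link references keyed by reference name."""

    def __init__(self):
        self._refs = {}

    @staticmethod
    def _key(name):
        # casefold() is one way to compare case-insensitively; the choice
        # of case mapping is an assumption of this sketch.
        return name.casefold()

    def define(self, name, uri, title=None):
        """Record a link reference; the title is optional."""
        self._refs[self._key(name)] = (uri, title)

    def lookup(self, name):
        """Return (uri, title) for a reference name, or None if undefined."""
        return self._refs.get(self._key(name))


refs = LinkReferences()
refs.define("Example", "http://example.com", "Example web site")
print(refs.lookup("EXAMPLE"))
```
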
#### 2.2.2 Abbreviation definition (extra) {#abbreviation-definition}
Context in which this element may appear:
: As a direct child of the document
Content model:
: One text node
Special attributes:
: Abbreviated word
Abbreviation definitions denote words which are an abbreviated form of another
word or group of words. The "abbreviated word" will be matched (case-sensitively)
against the text in each text node and, if found, enclosed in an abbreviation
element.
An abbreviation definition starts with an asterisk, followed by the
abbreviated word inside square brackets, optionally followed by one space or
no-break-space, a colon and one or more words giving the full meaning of the
abbreviation.
#### 2.2.3 Footnote definition (extra) {#footnote-definition}
Context in which this element may appear:
: As a direct child of the document
Content model:
: One or more block elements
Special attributes:
: Reference name
Footnote definitions provide the content of a footnote to be used when a
footnote marker is found with a matching reference name.
A footnote definition starts with a footnote reference enclosed in square
brackets with a caret character (`^`) just after the opening bracket, an
optional space or no-break-space character, and one or more block-level
elements each having their first line, and optionally other lines, indented
by one tab-length.
### 2.3 Block elements {#block-elements}
#### 2.3.1 Paragraph {#paragraph}
Context in which this element may appear:
: Wherever block elements are allowed
Content model:
: One or more span elements
Special attributes:
: None
A paragraph starts after a blank line and ends at the first blank line. Newline
characters in the paragraph are considered to be soft-wrapped, meaning that
they bear no significance unless they are part of a
[hard line break](#hard-line-break) element.
#### 2.3.2 Blockquote {#blockquote}
Context in which this element may appear:
: Wherever block elements are allowed
Content model:
: One or more block elements
Special attributes:
: None
A blockquote represents a quotation of a section of text from another source.
Blockquotes are created by prefixing paragraphs, or other block elements, with
a right-pointing angle bracket (`>`). You can nest blockquotes by adding more
than one level of right-pointing angle brackets.
Inside a blockquote, you can prefix every line with an angle bracket, or only
those lines starting a new block element. Contiguous block elements are
considered to be inside the same blockquote if they share the same number of
starting brackets.
#### 2.3.3 Header {#header}
Context in which this element may appear:
: Wherever block elements are allowed
Content model:
: One or more span elements
Special attributes:
: Level
: Id (extra)
Headers come in two forms:
[Description forthcoming]
#### 2.3.4 Code block {#code-block}
Context in which this element may appear:
: Wherever block elements are allowed
Content model:
: One text element
Special attributes:
: None
[Description forthcoming]
#### 2.3.5 List {#list}
Context in which this element may appear:
: Wherever block elements are allowed
Content model:
: One or more list items
Special attributes:
: None
Context in which this element may appear:
: Wherever block elements are allowed
Content model:
: One table row containing header table cells followed by zero or more
table rows containing regular table cells.
Special attributes:
: None
[Description forthcoming]
#### 2.3.8 Definition list (extra) {#definition-list}
Context in which this element may appear:
: Wherever block elements are allowed
Content model:
: One table row containing header table cells followed by zero or more
table rows containing regular table cells.
Special attributes:
: None
[Description forthcoming]
### 2.4 Span elements {#span-elements}
#### 2.4.1 Text {#text}
Context in which this element may appear:
: Wherever span elements are allowed
Content model:
: None (A text element doesn't contain other elements, although it contains
text as an attribute).
Special attributes:
: Text value
#### 2.4.2 Emphasis {#emphasis}
Context in which this element may appear:
: Wherever span elements are allowed and there is no emphasis element
as an ancestor.
Content model:
: One or more span elements, but no emphasis element.
Special attributes:
: None
#### 2.4.3 Strong emphasis {#strong-emphasis}
Context in which this element may appear:
: Wherever span elements are allowed and there is no strong emphasis element
as an ancestor.
Content model:
: One or more span elements, but no strong emphasis element.
Special attributes:
: None
[Description forthcoming]
#### 2.4.4 Link {#link}
Context in which this element may appear:
: Wherever span elements are allowed and there is no link element
as an ancestor.
Content model:
: One or more span elements, but no link element.
Special attributes:
: URI
: Title (optional)
#### 2.4.5 Image {#image}
Context in which this element may appear:
: Wherever span elements are allowed.
Content model:
: None.
Special attributes:
: Alternative text
: URI
: Title (optional)
[Description forthcoming]
#### 2.4.6 Hard line break {#hard-line-break}
Context in which this element may appear:
: Wherever span elements are allowed.
Content model:
: None.
Special attributes:
: None
Hard line breaks are represented in Markdown source by having two spaces
preceding a newline character. This can be useful if you need to force a line
break somewhere, as when writing an address:
Santa Claus
North Pole
Canada
H0H 0H0
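
The rule can be sketched with a regular expression that looks for two spaces directly before a line ending. The function name `split_hard_breaks` is hypothetical, and treating three or more trailing spaces the same way as two is an assumption of this sketch.

```python
import re

# Two U+0020 spaces immediately before a line ending, as described above.
# Accepting more than two trailing spaces is an assumption.
HARD_BREAK = re.compile(r" {2,}(?:\r\n|\r|\n)")


def split_hard_breaks(text):
    """Split paragraph text into the pieces separated by hard line breaks."""
    return HARD_BREAK.split(text)


address = "Santa Claus  \nNorth Pole  \nCanada  \nH0H 0H0"
print(split_hard_breaks(address))
```

A newline without the two preceding spaces stays inside its piece, which matches the soft-wrapping behaviour of paragraphs.
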
#### 2.4.7 Character entity {#character-entity}
Context in which this element may appear:
: Wherever span elements are allowed.
Content model:
: None.
Special attributes:
: Represented character
#### 2.4.8 Abbreviation (extra) {#abbreviation}
Context in which this element may appear:
: Wherever span elements are allowed.
Content model:
: One text element.
Special attributes:
: Title
Abbreviation elements are found by scanning the content of text elements for
abbreviations defined in the document's [abbreviation definitions](#abbreviation-definition).
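
One possible way to perform that scan is sketched below. The function name `split_abbreviations` and its return shape are hypothetical. Matching is case-sensitive, as stated for abbreviation definitions; requiring word boundaries and preferring longer abbreviations are assumptions.

```python
import re


def split_abbreviations(text, definitions):
    """Split a text value into plain strings and (word, title) tuples.

    `definitions` maps each abbreviated word to its full meaning.
    """
    if not definitions:
        return [text]
    # Longest words first so a longer abbreviation wins over its prefix
    # (an assumption of this sketch).
    words = sorted(definitions, key=len, reverse=True)
    # Requiring word boundaries around a match is also an assumption.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, words)) + r")\b")
    parts = []
    pos = 0
    for match in pattern.finditer(text):
        if match.start() > pos:
            parts.append(text[pos:match.start()])
        word = match.group(1)
        parts.append((word, definitions[word]))
        pos = match.end()
    if pos < len(text):
        parts.append(text[pos:])
    return parts


print(split_abbreviations("An IRI is not an iri",
                          {"IRI": "Internationalized Resource Identifier"}))
```
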
#### 2.4.9 Footnote marker (extra) {#footnote-marker}
Context in which this element may appear:
: Wherever span elements are allowed.
Content model:
: None.
Special attributes:
: Footnote content
[Description forthcoming]
3. Parsing {#parsing}
----------
This section explains how to build the document model from a character stream
following the Markdown Extra syntax.
### 3.1 Common constructs {#common-constructs}
The following are definitions for basic syntax concepts which are reused
at many places in the parsing section.
space
: One of:
* U+0009 Tabulation
* U+0020 Space
**Editor note:** Should we extend this to include other Unicode spaces
as well? Candidates include:
* U+2000 En Quad
* U+2001 Em Quad
* U+2002 En Space
* U+2003 Em Space
* U+2004 Three-Per-Em Space
* U+2005 Four-Per-Em Space
* U+2006 Six-Per-Em Space
* U+2007 Figure Space
* U+2008 Punctuation Space
* U+2009 Thin Space
* U+200A Hair Space
* U+205F Medium Mathematical Space
* U+3000 Ideographic Space
non-space
: Any character not matched by [space](#space)
end-of-line
: First match of:
1. U+000D Carriage Return (CR) followed by U+000A Line Feed (LF)
2. U+000D Carriage Return (CR)
3. U+000A Line Feed (LF)
4. End of file
The end-of-line construct is part of the line it ends and must not be
counted as matching the line following it.
**Editor note:** Perhaps we should follow the lead of XML 1.1 and add
the following to our list:
* U+000D Carriage Return (CR) followed by U+0085 Next Line (NEL)
* U+0085 Next Line (NEL)
* U+2028 Line Separator
indent
: Any of:
* One U+0009 Tabulation
* Four U+0020 Space
**Editor note:** This should be updated when/if the [space](#space)
construct is updated to add more characters
insignificant-indent
: Any of:
* One, two, or three U+0020 Space
**Editor note:** This should be updated when/if the [space](#space)
construct is updated to add more characters
textrun
: A run of one or more characters having at least one [non-space](#non-space)
character, and excluding any [end-of-line](#end-of-line).
blankline
: A sequence of:
1. Zero or more [space](#space)
2. One [end-of-line](#end-of-line)
textline
: A sequence of:
1. One [textrun](#textrun)
2. One [end-of-line](#end-of-line)
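
The line-level constructs above (end-of-line, blankline, textline, and indent) can be sketched as small helpers. The function names are hypothetical, and only U+0009 and U+0020 count as spaces, following the current definition of space.

```python
import re

# The three character sequences of end-of-line, tried in the listed order.
END_OF_LINE = re.compile(r"\r\n|\r|\n")
SPACE = " \t"


def split_lines(source):
    """Split source into lines, keeping each end-of-line with the line it ends."""
    lines = []
    pos = 0
    for match in END_OF_LINE.finditer(source):
        lines.append(source[pos:match.end()])
        pos = match.end()
    if pos < len(source):
        lines.append(source[pos:])  # last line, ended by end of file
    return lines


def is_blankline(line):
    """True if the line holds only spaces before its end-of-line."""
    return line.rstrip("\r\n").strip(SPACE) == ""


def is_textline(line):
    """True if the line holds at least one non-space character."""
    return not is_blankline(line)


def strip_indent(line):
    """Remove one indent (a tab or four spaces); return None if absent."""
    if line.startswith("\t"):
        return line[1:]
    if line.startswith("    "):
        return line[4:]
    return None


print(split_lines("first\r\nsecond\rthird\nlast"))
```
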
refname
: A run of one or more characters, excluding any
[end-of-line](#end-of-line) and U+005D Closing Square Bracket.
**Editor note:** Should we allow closing square brackets inside when
they're correctly balanced with opening ones?
identifier
: A run of one or more characters, excluding any
[end-of-line](#end-of-line) and U+007D Right Curly Bracket.
**Editor note:** Should we allow closing curly brackets inside when
they're correctly balanced with opening ones?
quoted-textrun
: [To be defined]
singlequoted-textrun
: [To be defined]
parenthesed-textrun
: [To be defined]
url
: First of:
1. A sequence of:
1. "<"
2. Zero or more characters, but no ">" and no [blankline](#blankline)
3. ">"
2. One or more [non-space](#non-space) character
The extracted IRI is created from item 1.2 or item 2, depending on which
one actually matched. Any [end-of-line](#end-of-line) inside the extracted
IRI is stripped from it.
**Editor note:** 1.2 should be revised to clarify how the no blankline
requirement can be parsed.
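
A minimal sketch of the url construct follows. The function name `match_url` is hypothetical. The blank line check inside angle brackets is one possible reading of item 1.2, and excluding end-of-line characters from the bare form is an assumption.

```python
import re

# Item 1: "<", any characters except ">", then ">".
BRACKETED = re.compile(r"<([^>]*)>")
# Item 2: one or more non-space characters. End-of-line characters are
# also excluded here, which is an assumption of this sketch.
BARE = re.compile(r"[^ \t\r\n]+")
END_OF_LINE = re.compile(r"\r\n|\r|\n")


def has_blankline(inner):
    """True if the bracketed text contains a line made only of spaces."""
    segments = END_OF_LINE.split(inner)
    # The first and last segments share a line with "<" or ">".
    return any(seg.strip(" \t") == "" for seg in segments[1:-1])


def match_url(text, pos=0):
    """Return (iri, end) for a url starting at pos, or None if none matches."""
    m = BRACKETED.match(text, pos)
    if m and not has_blankline(m.group(1)):
        # The extracted IRI is stripped of any end-of-line inside it.
        return END_OF_LINE.sub("", m.group(1)), m.end()
    m = BARE.match(text, pos)
    if m:
        return m.group(0), m.end()
    return None


print(match_url('<http://example.com/a b> "Title"'))
```
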
block-element
: Using the [block element generator](#block-element-generator), attempt to
create one element.
first-block-element
: Using the [block element generator](#block-element-generator), attempt to
create one element, but change the hard-block-context-line-prefix rule
so that it matches the empty string when applied to the first line.
**Editor note:** This may need some clarification.
block-element-run
: A sequence of:
1. One [first-block-element](#first-block-element)
2. Zero or more [block-element](#block-element)
hard-block-context-line-prefix
: Using the current context-line-prefix stack of the block element
generator, attempt to match each rule in the stack in sequence, starting
from the first-inserted rule to the last one. If any match fails, this rule
fails to match; otherwise, it matches.
If the stack is empty, always matches without consuming any character.
soft-block-context-line-prefix
: Using the current context-line-prefix stack of the block element
generator, attempt to optionally match each rule in the stack in sequence,
starting from the first-inserted rule and stopping at first one that
doesn't match.
This rule never fails to match.
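
The difference between the hard and soft prefix rules can be sketched as follows. Representing stack entries as compiled patterns is an assumption, as are the names `INDENT`, `QUOTE`, `match_hard_prefix`, and `match_soft_prefix`; the two patterns stand for the prefixes pushed by the footnote definition and blockquote constructs.

```python
import re

# Hypothetical stack entries: each rule is a pattern matched at the
# current position in the line.
INDENT = re.compile(r"\t| {4}")       # one indent
QUOTE = re.compile(r" {0,3}>[ \t]?")  # blockquote line prefix


def match_hard_prefix(stack, line, pos=0):
    """Match every rule in insertion order; return the new position or None."""
    for rule in stack:
        m = rule.match(line, pos)
        if m is None:
            return None  # one rule failed, so the whole prefix fails
        pos = m.end()
    return pos  # an empty stack matches without consuming anything


def match_soft_prefix(stack, line, pos=0):
    """Match rules in insertion order until one fails; never fails itself."""
    for rule in stack:
        m = rule.match(line, pos)
        if m is None:
            break
        pos = m.end()
    return pos


stack = [QUOTE, INDENT]  # an indented context nested inside a blockquote
print(match_hard_prefix(stack, "> \tcode"), match_soft_prefix(stack, "> text"))
```
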
### 3.2 Parsing a document {#parsing-a-document}
A Markdown Extra character stream is parsable using three generators, one for
each of the three element categories of the document model. Parsing a document
is done in three steps:
1. Running the [document element generator](#document-element-generator)
on the whole document.
2. Running the [block element generator](#block-element-generator) on the
document ignoring the lines used to create document elements.
3. Running the [span element generator](#span-element-generator) on all
elements with text content flagged as needing span-level processing.
**Editor note:** I think the document element generator and the block element
generator steps could be merged eventually.
The span element generator step could also be merged if we change parsing of
certain span elements to not depend on previously-encountered document
elements (link references, footnotes and abbreviation definitions). But doing
this without causing breaking changes is perhaps not doable.
#### 3.2.1 Document element generator {#document-element-generator}
With this generator, the whole document is scanned for
[document elements](#document-elements). At the start of each line, the parser
checks if the line matches one of the three following constructs. If it does,
it creates the corresponding element, and attempts to match again starting on
the first line not part of the previous match.
Footnote definition (extra)
: A sequence of:
1. "[^"
2. One [refname](#refname)
3. "]"
4. One optional [space](#space)
5. ":"
6. Zero or more [space](#space)
7. One optional [end-of-line](#end-of-line)
8. One [block-element-run](#block-element-run) by pushing the
following sequence to the context-line-prefix stack:
1. One [indent](#indent)
Creates a [footnote definition](#footnote-definition) element with
item 2 filling the reference name attribute and elements generated by
parsing item 8 becoming the content.
Abbreviation definition (extra)
: A sequence of:
1. "*["
2. One [refname](#refname)
3. "]"
4. One optional [space](#space)
5. ":"
6. Zero or more [space](#space)
7. One [textrun](#textrun)
Creates an [abbreviation definition](#abbreviation-definition) element
with item 2 filling the abbreviated word attribute and item 7 becoming
the textual content.
Link reference
: A sequence of:
1. "["
2. One [refname](#refname)
3. "]"
4. One optional [space](#space)
5. ":"
6. Zero or more [space](#space)
7. One [url](#url)
8. Zero or more [space](#space)
9. The optional sequence:
1. One of:
* One [end-of-line](#end-of-line)
* One or more [space](#space)
2. One of:
* One [quoted-textrun](#quoted-textrun)
* One [singlequoted-textrun](#singlequoted-textrun)
* One [parenthesed-textrun](#parenthesed-textrun)
Creates a [link reference](#link-reference) element with item 2 filling
the reference name attribute, the extracted IRI from item 7 filling the
URI attribute, and item 9.2 filling the title attribute.
**Editor note:** Needs rephrasing to exclude optional angle brackets around
the url, and quotes or parentheses around the title, from being part of the
actual content of attributes on the link reference element.
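
The link reference construct could be approximated with a single regular expression, as in this sketch. The quoted, single-quoted, and parenthesized title forms are not yet defined by this specification, so the patterns used for them are assumptions; `parse_link_reference` is a hypothetical name. Angle brackets and title delimiters are kept out of the attribute values.

```python
import re

LINK_REFERENCE = re.compile(
    r"""
    \[(?P<name>[^\]\r\n]+)\]          # items 1-3: refname in square brackets
    [ \t]?:[ \t]*                     # items 4-6
    (?:<(?P<bracketed>[^>\r\n]*)>     # item 7: url in angle brackets
      |(?P<bare>[^ \t\r\n]+))         #         or a run of non-space characters
    [ \t]*                            # item 8
    (?:                               # item 9: optional title
      (?:\r\n|\r|\n|[ \t]+)[ \t]*     # 9.1 (indentation after a line ending is assumed)
      (?:"(?P<dq>[^"\r\n]*)"
        |'(?P<sq>[^'\r\n]*)'
        |\((?P<paren>[^)\r\n]*)\))
    )?
    [ \t]*$
    """,
    re.VERBOSE,
)


def parse_link_reference(text):
    """Return the attributes of a link reference, or None if text is not one."""
    m = LINK_REFERENCE.match(text)
    if m is None:
        return None
    uri = m.group("bracketed")
    if uri is None:
        uri = m.group("bare")
    title = next(
        (m.group(g) for g in ("dq", "sq", "paren") if m.group(g) is not None),
        None,
    )
    return {"name": m.group("name"), "uri": uri, "title": title}


print(parse_link_reference('[1]: http://example.com "Example web site"'))
```
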
After the document element generator, the document source is given to the
[block element generator](#block-element-generator) after being stripped of
all lines which were part of a match that produced a document element in
this generator.
#### 3.2.2 Block element generator {#block-element-generator}
The block element generator is run on the whole document after the
[document element generator](#document-element-generator). It is also
invoked when block elements need to be parsed outside of the main context
(such as when processing footnotes).
After each newline, the parser checks if the line matches one of the
following constructs. If it does, it creates the corresponding element, and
attempts to match again starting on the first line not part of the previous
match.
The block element generator possesses a context-line-prefix stack containing
a series of rules to be matched before stopping the generator and returning to
the previous context. When the generator is told to ignore the context line
prefix for the first line (see [first-block-element](#first-block-element)),
it means that while parsing the first line, the context-line-prefix stack
should be considered empty.
The block element generator is used as a parsing rule in the grammar of
the document element generator and the block element generator. The block
element generator matches if one of the following rules matches and creates
an element.
Code block
: A sequence of:
1. Zero or more [blankline](#blankline)
2. One or more sequences of:
1. One [hard-block-context-line-prefix](#hard-block-context-line-prefix)
2. One [indent](#indent)
3. One [textline](#textline)
4. Zero or more sequences of:
1. One [soft-block-context-line-prefix](#soft-block-context-line-prefix)
2. [blankline](#blankline)
5. Zero or more sequences of:
1. One [soft-block-context-line-prefix](#soft-block-context-line-prefix)
2. One [indent](#indent)
3. One [textline](#textline)
4. Zero or more sequences of:
1. One [soft-block-context-line-prefix](#soft-block-context-line-prefix)
2. [blankline](#blankline)
**Note:** a code block does not need to end with a blank line: any
non-blank line not starting with the proper indent ends the code block.
**Editor note:** Should we really allow the soft block context line prefix
here? Applying the lazy syntax of a parent container to an indented code
block seems rather silly and confusing. Here's an illustration:
~~~
> Code block, Line 1
Same code block, Line 2
~~~
Creates a [code block](#code-block) element with the concatenation of all
text lines in items 2.3, 2.4.3, and 2.4.4.2 as the content.
Fenced code block (extra)
: A sequence of:
1. Zero or more [blankline](#blankline)
2. One [hard-block-context-line-prefix](#hard-block-context-line-prefix)
3. Three or more "~"
4. Zero or more [space](#space)
5. One [end-of-line](#end-of-line)
6. One [hard-block-context-line-prefix](#hard-block-context-line-prefix)
7. Zero or more of the following sequence, stopping at the first line
capable of satisfying the remaining parts of the enclosing sequence:
1. One [hard-block-context-line-prefix](#hard-block-context-line-prefix)
2. One [textline](#textline)
8. One [hard-block-context-line-prefix](#hard-block-context-line-prefix)
9. Same number of "~" as found in item 3
10. Zero or more [space](#space)
11. One [end-of-line](#end-of-line)
Creates a [code block](#code-block) element with the concatenation of all
text lines in item 7 as the content.
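
The fenced code block construct can be sketched as a function over a list of lines. This sketch assumes an empty context-line-prefix stack and lines already stripped of their end-of-line; `match_fenced_code_block` is a hypothetical name. Blank lines inside the fence are kept, which is more permissive than the textline requirement as written.

```python
import re

# Three or more "~", then optional spaces up to the end of the line.
FENCE = re.compile(r"(~{3,})[ \t]*$")


def match_fenced_code_block(lines, start=0):
    """Try to match a fenced code block at lines[start].

    Returns (content, next_index), or None if the construct does not match.
    """
    i = start
    while i < len(lines) and lines[i].strip(" \t") == "":
        i += 1  # leading blank lines
    if i == len(lines):
        return None
    opening = FENCE.match(lines[i])
    if opening is None:
        return None
    fence = opening.group(1)
    for j in range(i + 1, len(lines)):
        closing = FENCE.match(lines[j])
        # The closing fence must have the same number of "~" as the opening.
        if closing and closing.group(1) == fence:
            return "\n".join(lines[i + 1:j]), j + 1
    return None  # no closing fence was found


print(match_fenced_code_block(["~~~", "code", "~~~~", "more", "~~~", "after"]))
```

A line of four tildes inside a three-tilde fence does not close the block, since the counts differ.
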
Blockquote
: A sequence of:
1. Zero or more [blankline](#blankline)
2. Zero or one [insignificant-indent](#insignificant-indent)
3. ">"
4. Zero or one [space](#space)
5. One [block-element-run](#block-element-run) by pushing the following
sequence to the context-line-prefix stack:
1. Zero or one [insignificant-indent](#insignificant-indent)
2. ">"
3. Zero or one [space](#space)
Creates a [blockquote](#blockquote) element with elements generated by
parsing item 5 becoming the content.
Horizontal Rule
: A sequence of:
1. Zero or more [blankline](#blankline)
2. One [hard-block-context-line-prefix](#hard-block-context-line-prefix)
3. One of "_", "-", "*".
4. Two or more sequences of:
1. Zero, one, or two [space](#space)
2. One character identical to the one found in item 3 above.
5. Zero or more [space](#space)
6. One [end-of-line](#end-of-line)
Creates a [horizontal rule](#horizontal-rule) element.
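
Items 3 to 6 of the horizontal rule construct map directly onto a regular expression with a backreference. The sketch below assumes an empty context-line-prefix stack and a line without its end-of-line; `is_horizontal_rule` is a hypothetical name.

```python
import re

# One of "_", "-", "*", then at least two more of the same character,
# each preceded by zero, one, or two spaces, then optional trailing spaces.
HORIZONTAL_RULE = re.compile(r"([_*-])(?:[ \t]{0,2}\1){2,}[ \t]*$")


def is_horizontal_rule(line):
    """True if the line matches the horizontal rule construct."""
    return HORIZONTAL_RULE.match(line) is not None


for sample in ["***", "- - -", "_  _  _  ", "--", "-*-", "*   *   *"]:
    print(repr(sample), is_horizontal_rule(sample))
```
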
Header, Setext-style
: A sequence of:
1. Zero or more [blankline](#blankline)
2. One [hard-block-context-line-prefix](#hard-block-context-line-prefix)
3. One [textrun](#textrun)
4. (Extra) Zero or one:
1. One "{#"
2. One or more [identifier](#identifier)
3. One "}"
4. Zero or more [space](#space)
5. One [end-of-line](#end-of-line)
6. One [hard-block-context-line-prefix](#hard-block-context-line-prefix)
7. One of:
1. One or more "="
2. One or more "-"
8. Zero or more [space](#space)
9. One [end-of-line](#end-of-line)
Creates a [header](#header) element where the content is set to the result
of applying the span element generator on item 3, the header level
attribute is set to one if item 7.1 was matched or two if item 7.2 was
matched, and the id attribute (extra) is set to item 4.2.
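
A Setext-style header spans two lines, so a sketch of the construct takes both. It assumes an empty context-line-prefix stack and lines without their end-of-line; `match_setext_header` is a hypothetical name, and trimming spaces around the title text is an assumption.

```python
import re

# A text run, optionally followed by an id such as {#some-id}.
TITLE = re.compile(
    r"(?P<text>.*?\S.*?)(?:[ \t]*\{#(?P<id>[^}\r\n]+)\}[ \t]*)?$"
)
# One or more "=" (level one) or one or more "-" (level two).
UNDERLINE = re.compile(r"(=+|-+)[ \t]*$")


def match_setext_header(title_line, underline_line):
    """Return the header attributes, or None if the two lines do not match."""
    title = TITLE.match(title_line)
    underline = UNDERLINE.match(underline_line)
    if title is None or underline is None:
        return None
    level = 1 if underline.group(1).startswith("=") else 2
    return {
        "level": level,
        "text": title.group("text").strip(" \t"),
        "id": title.group("id"),
    }


print(match_setext_header("2. Document model {#document-model}",
                          "---------------------"))
```
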
Header, Atx-style
: A sequence of:
1. Zero or more [blankline](#blankline)
2. One [hard-block-context-line-prefix](#hard-block-context-line-prefix)
3. One or more "#"
4. One [textrun](#textrun)
5. Zero or more "#"
6. (Extra) Zero or one:
1. One "{#"
2. One or more [identifier](#identifier)
3. One "}"
4. Zero or more [space](#space)
7. One [end-of-line](#end-of-line)
Creates a [header](#header) element where the content is set to the result
of applying the span element generator on item 4, the header level
attribute is set to the number of characters in item 3, and the id
attribute (extra) is set to item 6.2.
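
The Atx-style header construct fits on one line and can be sketched with one regular expression. As before, this assumes an empty context-line-prefix stack and a line without its end-of-line; `match_atx_header` is a hypothetical name, and trimming spaces around the title text is an assumption.

```python
import re

ATX_HEADER = re.compile(
    r"(?P<hashes>#+)"                        # item 3: one or more "#"
    r"(?P<text>.*?\S.*?)"                    # item 4: the text run
    r"[ \t]*#*[ \t]*"                        # item 5: optional closing "#"
    r"(?:\{#(?P<id>[^}\r\n]+)\}[ \t]*)?$"    # item 6: optional id
)


def match_atx_header(line):
    """Return the header attributes, or None if the line is not a header."""
    m = ATX_HEADER.match(line)
    if m is None:
        return None
    return {
        "level": len(m.group("hashes")),  # number of "#" characters
        "text": m.group("text").strip(" \t"),
        "id": m.group("id"),
    }


print(match_atx_header("### 2.1 The document root {#the-document-root}"))
```
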
List
: [To be defined]
Definition List (extra)
: [To be defined]
Table (extra)
: [To be defined]
Paragraph
: A sequence of:
1. Zero or more [blankline](#blankline)
2. One [hard-block-context-line-prefix](#hard-block-context-line-prefix)
3. One [textline](#textline)
4. Zero or more sequences of:
1. One [soft-block-context-line-prefix](#soft-block-context-line-prefix)
2. One [textline](#textline)
3. One [blankline](#blankline)
**Editor note:** Need a way to stop a paragraph when seeing certain
constructs which should start other block-level elements without having
a blank line, such as lists (when inside another list item), blockquotes,
and fenced code blocks.
Creates a [paragraph](#paragraph) element where the content is obtained by
running the span element generator on the concatenation of all text lines
from items 3 and 4.2.
#### 3.2.3 Span element generator {#span-element-generator}
[To be defined]
4. Output {#output}
--------------
### 4.1 HTML Serialization {#html-serialization}
[To be added]
*[IRI]: Internationalized Resource Identifier