Markdown Parser Implemented using the 'MD4C' Library.
md4r
Provides an R wrapper for the MD4C (Markdown for C) library. Functions exist for CommonMark-compliant markdown parsing, along with support for other common markdown extensions (e.g. GitHub flavored markdown and LaTeX equations). The package also provides a number of high-level functions for exploring and manipulating markdown ASTs, as well as translating and displaying the documents.
Installation
Install md4r from CRAN:
install.packages("md4r")
or install the latest development version from GitHub:
remotes::install_github("rundel/md4r")
Example
We will start with a simple example of parsing a markdown file using the basic CommonMark dialect.
md_file = system.file("examples/commonmark.md", package = "md4r")
readLines(md_file) |> cat(sep='\n')
#> ## Try CommonMark
#>
#> You can try CommonMark here. This dingus is powered by
#> [commonmark.js](https://github.com/commonmark/commonmark.js), the
#> JavaScript reference implementation.
#>
#> 1. item one
#> 2. item two
#>    - sublist
#>    - sublist
This file (or markdown text) can be processed using the parse_md() function, which creates an abstract syntax tree (AST) representation of the document as nested lists with custom S3 classes:
library(md4r)
(md = parse_md(md_file))
#> md_block_doc [flags: "MD_DIALECT_COMMONMARK"]
#> ├── md_block_h [level: 2]
#> │   └── md_text_normal - "Try CommonMark"
#> ├── md_block_p
#> │   ├── md_text_normal - "You can try CommonMark here. This dingus is powered by"
#> │   ├── md_text_softbreak
#> │   ├── md_span_a [title: "", href: "https://github.com/commonmark/commonmark.js"]
#> │   │   └── md_text_normal - "commonmark.js"
#> │   ├── md_text_normal - ", the"
#> │   ├── md_text_softbreak
#> │   └── md_text_normal - "JavaScript reference implementation."
#> └── md_block_ol [start: 1, tight: 1, mark_delimiter: "."]
#>     ├── md_block_li
#>     │   └── md_text_normal - "item one"
#>     └── md_block_li
#>         ├── md_text_normal - "item two"
#>         └── md_block_ul [tight: 1, mark: "-"]
#>             ├── md_block_li
#>             │   └── md_text_normal - "sublist"
#>             └── md_block_li
#>                 └── md_text_normal - "sublist"
str(md)
#> List of 3
#> $ :List of 1
#> ..$ : 'md_text_normal' chr "Try CommonMark"
#> ..- attr(*, "level")= num 2
#> ..- attr(*, "class")= chr [1:3] "md_block_h" "md_block" "md_node"
#> $ :List of 6
#> ..$ : 'md_text_normal' chr "You can try CommonMark here. This dingus is powered by"
#> ..$ : list()
#> .. ..- attr(*, "class")= chr [1:3] "md_text_softbreak" "md_text" "md_node"
#> ..$ :List of 1
#> .. ..$ : 'md_text_normal' chr "commonmark.js"
#> .. ..- attr(*, "title")= chr ""
#> .. ..- attr(*, "href")= chr "https://github.com/commonmark/commonmark.js"
#> .. ..- attr(*, "class")= chr [1:3] "md_span_a" "md_span" "md_node"
#> ..$ : 'md_text_normal' chr ", the"
#> ..$ : list()
#> .. ..- attr(*, "class")= chr [1:3] "md_text_softbreak" "md_text" "md_node"
#> ..$ : 'md_text_normal' chr "JavaScript reference implementation."
#> ..- attr(*, "class")= chr [1:3] "md_block_p" "md_block" "md_node"
#> $ :List of 2
...
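Since every node is an ordinary R list or character vector carrying S3 classes and attributes, base R accessors can be used to inspect it. The following is a small sketch that assumes md4r is installed and reuses the example file from above; the values shown in the comments are read off the str() output rather than being a guarantee of the package's behaviour.

```r
library(md4r)

md_file = system.file("examples/commonmark.md", package = "md4r")
md = parse_md(md_file)

# The first child of the document is the level 2 heading
heading = md[[1]]
class(heading)
#> [1] "md_block_h" "md_block"   "md_node"
attr(heading, "level")
#> [1] 2

# Nodes can be tested by class, e.g. to tell blocks apart from spans
inherits(heading, "md_block")
#> [1] TRUE

# The third child of the paragraph is the link span; its target is an attribute
link = md[[2]][[3]]
attr(link, "href")
#> [1] "https://github.com/commonmark/commonmark.js"
```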
Because the AST is just a collection of R lists, we can use subsetting to extract specific elements of the document:
parse_md(md_file)[[1]]
#> md_block_h [level: 2]
#> └── md_text_normal - "Try CommonMark"
parse_md(md_file)[[2]]
#> md_block_p
#> ├── md_text_normal - "You can try CommonMark here. This dingus is powered by"
#> ├── md_text_softbreak
#> ├── md_span_a [title: "", href: "https://github.com/commonmark/commonmark.js"]
#> │   └── md_text_normal - "commonmark.js"
#> ├── md_text_normal - ", the"
#> ├── md_text_softbreak
#> └── md_text_normal - "JavaScript reference implementation."
parse_md(md_file)[[3]]
#> md_block_ol [start: 1, tight: 1, mark_delimiter: "."]
#>     ├── md_block_li
#>     │   └── md_text_normal - "item one"
#>     └── md_block_li
#>         ├── md_text_normal - "item two"
#>         └── md_block_ul [tight: 1, mark: "-"]
#>             ├── md_block_li
#>             │   └── md_text_normal - "sublist"
#>             └── md_block_li
#>                 └── md_text_normal - "sublist"
More advanced tools like rapply() can also be used, for example to extract the text content:
rapply(md, as.character, "md_text")
#> [1] "Try CommonMark"
#> [2] "You can try CommonMark here. This dingus is powered by"
#> [3] "commonmark.js"
#> [4] ", the"
#> [5] "JavaScript reference implementation."
#> [6] "item one"
#> [7] "item two"
#> [8] "sublist"
#> [9] "sublist"
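rapply() only applies its function to the leaves of the list, so it does not see attributes stored on container nodes such as links. For those cases a small recursive walker is one option. The sketch below is not part of md4r: collect_hrefs() is a hypothetical helper, and the hand-built doc object is a stand-in that mimics the classes and attributes shown in the str() output so that the example runs without the package. The class vector given to the stand-in document node is an assumption.

```r
# Hypothetical helper (not part of md4r): recursively collect the "href"
# attribute of every node that inherits from "md_span_a".
collect_hrefs = function(node) {
  found = character()
  if (inherits(node, "md_span_a")) {
    found = attr(node, "href", exact = TRUE)
  }
  if (is.list(node)) {
    # Container nodes are lists of child nodes, so recurse into each child
    for (child in node) {
      found = c(found, collect_hrefs(child))
    }
  }
  found
}

# Hand-built stand-in for a parsed document, mirroring the str() output above
text_node = function(x) {
  structure(x, class = c("md_text_normal", "md_text", "md_node"))
}
link = structure(
  list(text_node("commonmark.js")),
  title = "",
  href = "https://github.com/commonmark/commonmark.js",
  class = c("md_span_a", "md_span", "md_node")
)
para = structure(
  list(text_node("This dingus is powered by"), link, text_node(", the")),
  class = c("md_block_p", "md_block", "md_node")
)
# The class vector of the document node is an assumption
doc = structure(list(para), class = c("md_block_doc", "md_block", "md_node"))

collect_hrefs(doc)
#> [1] "https://github.com/commonmark/commonmark.js"
```

Because the walker relies only on list structure, classes and attributes, the same call would be expected to work on the result of parse_md() as well.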
Additionally, the AST and any of its components can be converted back into markdown
to_md(md) |> cat(sep='\n')
#> ## Try CommonMark
#> You can try CommonMark here. This dingus is powered by
#> [commonmark.js](<https://github.com/commonmark/commonmark.js>), the
#> JavaScript reference implementation.
#>
#> 1. item one
#> 2. item two
#>    - sublist
#>    - sublist
or into HTML
to_html(md) |> cat(sep='\n')
Try CommonMark
You can try CommonMark here. This dingus is powered by commonmark.js , the JavaScript reference implementation.
- item one
- item two
- sublist
- sublist
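Both conversion functions are piped into cat(sep = '\n') above, which suggests they return character vectors that can be written to disk with base R. A minimal sketch, assuming md4r is installed: the temporary output path is chosen purely for illustration, and converting only the ordered list relies on the statement above that any component can be converted.

```r
library(md4r)

md_file = system.file("examples/commonmark.md", package = "md4r")
md = parse_md(md_file)

# Convert a single component (the ordered list) back into markdown
to_md(md[[3]]) |> cat(sep = "\n")

# Render the whole document to HTML and save it to a temporary file
html_file = tempfile(fileext = ".html")
writeLines(to_html(md), html_file)
file.exists(html_file)
```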