Markdown parser

2023-09-16 - 5 minute read

Part 1

While traveling today I thought about ways to structure my markdown parser.

What I want is a function that takes a string of markdown-formatted text and returns a string of HTML elements.

My plan is to solve this using a markdown line feed and symbol classes representing the different possible elements. These symbols are, for example, headings, list items, code snippets, paragraphs and links. Each symbol will know how to render itself and will be able to tell whether it's the right symbol for the current feed line.
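As a rough sketch of that idea, each symbol could expose two things: a check for whether a line belongs to it, and a render method. The names `Symbol`, `matches` and `render` are my own placeholders here, and the `<h1>`/`<li>` output is just an assumption about what the final HTML might look like.

```python
from abc import ABC, abstractmethod


class Symbol(ABC):
    """Base class for one kind of markdown element (hypothetical name)."""

    @staticmethod
    @abstractmethod
    def matches(line: str) -> bool:
        """Return True if this symbol wants to claim the given line."""

    @abstractmethod
    def render(self) -> str:
        """Return this symbol as an HTML string."""


class Heading(Symbol):
    def __init__(self, text: str) -> None:
        self.text = text

    @staticmethod
    def matches(line: str) -> bool:
        # Only top-level headings are handled in this sketch.
        return line.startswith("# ")

    def render(self) -> str:
        return f"<h1>{self.text}</h1>"


class ListItem(Symbol):
    def __init__(self, text: str) -> None:
        self.text = text

    @staticmethod
    def matches(line: str) -> bool:
        return line.startswith("- ")

    def render(self) -> str:
        return f"<li>{self.text}</li>"
```

With this shape, `Heading.matches("# Release notes")` answers the "is this line mine?" question without creating an instance, and `Heading("Release notes").render()` produces the HTML.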

Let's say we have a markdown file that looks something like this:


# Release notes

Release 1.0.0 contains the following changes:
- We're now able to properly render markdown as html.

The line feed consists of the lines above and can be read one line at a time. For now we're ignoring the complexity of recursively parsing nested symbols. We know that the parsing is complete when the line feed is empty.
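A line feed like that could be little more than a queue of lines. This is a minimal sketch; the class name `LineFeed` and the method names `peek`, `claim` and `is_empty` are assumptions of mine, not a fixed design.

```python
from collections import deque


class LineFeed:
    """Queue of markdown lines, read from the front (hypothetical name)."""

    def __init__(self, text: str) -> None:
        self._lines = deque(text.splitlines())

    def is_empty(self) -> bool:
        return not self._lines

    def peek(self) -> str:
        # Look at the next line without removing it.
        return self._lines[0]

    def claim(self) -> str:
        # Remove the next line from the feed and return it.
        return self._lines.popleft()


feed = LineFeed(
    "# Release notes\n"
    "\n"
    "Release 1.0.0 contains the following changes:\n"
    "- We're now able to properly render markdown as html."
)

seen = []
while not feed.is_empty():  # parsing is done once the feed is empty
    seen.append(feed.claim())
```

Keeping `peek` and `claim` separate means a line only leaves the feed when something actually takes it.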

We take the first line, # Release notes, and iterate through our list of symbols, asking each one whether it's interested in this line. When we arrive at the heading symbol, it identifies the "#" and says yes, so we stop the iteration.

We pass the first line to the symbol, alongside the line feed. The symbol can now decide whether it's happy with the one line it claimed or wants to continue reading from the line feed. Symbols can peek at the line feed; a line is not removed unless the symbol claims it. In this case the heading is a single-line symbol, so it strips the "#" heading syntax from the line it claimed and saves the rest to render as a heading later. The symbol's static function then returns an instance of a heading symbol, which we can save to a list for later rendering.
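The loop described above might look roughly like this. It is a sketch under a few assumptions: the feed is a plain `deque`, `matches` and `parse` are placeholder names for the static functions, and a `Paragraph` symbol is included only to show a multi-line symbol peeking before it claims more lines. Lines that no symbol wants (blank lines here) are simply dropped.

```python
from collections import deque


class Heading:
    def __init__(self, text):
        self.text = text

    @staticmethod
    def matches(line):
        return line.startswith("# ")

    @staticmethod
    def parse(line, feed):
        # Single-line symbol: strip the "# " syntax and leave the feed alone.
        return Heading(line[2:].strip())


class Paragraph:
    def __init__(self, lines):
        self.lines = lines

    @staticmethod
    def matches(line):
        return bool(line.strip())

    @staticmethod
    def parse(line, feed):
        lines = [line]
        # Peek at the next line; only claim it if it still belongs here.
        while feed and feed[0].strip() and not Heading.matches(feed[0]):
            lines.append(feed.popleft())
        return Paragraph(lines)


SYMBOLS = [Heading, Paragraph]  # order matters: the first match wins


def parse(text):
    feed = deque(text.splitlines())
    parsed = []
    while feed:  # parsing is complete when the feed is empty
        line = feed.popleft()
        for symbol in SYMBOLS:
            if symbol.matches(line):
                parsed.append(symbol.parse(line, feed))
                break  # stop iterating once a symbol says yes
    return parsed


symbols = parse("# Release notes\n\nFirst line\nsecond line\n\n# Next")
```

For this input the result is a `Heading`, a `Paragraph` holding both text lines, and a second `Heading`.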

Calling render on a symbol should result in an HTML string composed of itself plus any child symbols, recursively.
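A recursive render could be sketched like this. The `tag` attribute, the `children` list and the class names are illustrative assumptions; the only real point is that a symbol wraps its own text plus whatever its children render.

```python
from html import escape


class Symbol:
    tag = "span"  # subclasses override this with their HTML tag

    def __init__(self, text="", children=None):
        self.text = text
        self.children = children or []

    def render(self):
        # Own (escaped) text first, then every child rendered recursively.
        inner = escape(self.text) + "".join(c.render() for c in self.children)
        return f"<{self.tag}>{inner}</{self.tag}>"


class UnorderedList(Symbol):
    tag = "ul"


class ListItem(Symbol):
    tag = "li"


class Bold(Symbol):
    tag = "strong"


tree = UnorderedList(children=[
    ListItem("plain"),
    ListItem(children=[Bold("nested")]),
])
html = tree.render()
# html == "<ul><li>plain</li><li><strong>nested</strong></li></ul>"
```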

Part 2

The markdown parser is kind of going according to plan. I realised that with the approach I decided to take, I'm going to need a couple more steps than just going line by line and deciding what type of element each line should be.

Let's take an example:


Some **bold** text and then some _cursive_ text

This is not just a text line: it contains both bold and cursive (italic) text.

So in addition to parsing the markdown file line by line, which I will refer to as stage 1, I implemented stage 2 processing, whose purpose is to expand the identified elements into child elements. When a text row element containing the example text above is asked to perform its stage 2 processing, it takes its text and runs it through stage 1 processing once more to divide it into new symbols. It then calls stage 2 processing on all its new children to make sure every element has been processed.

In the end we should have gone from TextLine to PlainText, Bold, PlainText, Italic, PlainText.

Here's the result.

Some bold text and then some cursive text
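To make the stage 2 split concrete, here is a small sketch that breaks the example line into PlainText, Bold and Italic children. Note that it swaps in a single regular expression for the inline split instead of re-running stage 1 as described above, it only handles `**bold**` and `_italic_` without nesting, and the `expand` function and the `<strong>`/`<em>` tags are my own assumptions.

```python
import re
from dataclasses import dataclass


@dataclass
class PlainText:
    text: str

    def render(self):
        return self.text


@dataclass
class Bold:
    text: str

    def render(self):
        return f"<strong>{self.text}</strong>"


@dataclass
class Italic:
    text: str

    def render(self):
        return f"<em>{self.text}</em>"


# Matches **bold** or _italic_ spans; anything in between is plain text.
INLINE = re.compile(r"\*\*(?P<bold>.+?)\*\*|_(?P<italic>.+?)_")


def expand(text):
    """Split a text line into inline child symbols (hypothetical helper)."""
    children = []
    pos = 0
    for match in INLINE.finditer(text):
        if match.start() > pos:
            children.append(PlainText(text[pos:match.start()]))
        if match.group("bold") is not None:
            children.append(Bold(match.group("bold")))
        else:
            children.append(Italic(match.group("italic")))
        pos = match.end()
    if pos < len(text):
        children.append(PlainText(text[pos:]))
    return children


children = expand("Some **bold** text and then some _cursive_ text")
html = "".join(child.render() for child in children)
```

Here `children` ends up as five symbols in the order PlainText, Bold, PlainText, Italic, PlainText.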

Next up is stage 3 processing, which will affect elements depending on their position in the list of elements.