scd2html - generate HTML from scdoc source files

As hinted at in the last status update, I felt compelled to create scd2html. There are obviously other options for creating HTML versions of man pages, so I figured it would be worth writing down how I ended up here. There are a few things to be learned along the way, about roff, about scdoc, and - in a plot twist that I bet you did not see coming - about WebAssembly.

The problem space

The problem to be solved can be stated as follows. Find tools that allow me to write man pages for my projects with the following constraints:

  • Have a single source file in an easy syntax
  • Generate a decent roff output to be viewed with man
  • Generate a decent HTML output to be viewed in browsers

Note that these constraints are not well-defined: words like easy and decent leave plenty of room for interpretation. But that’s what I started with.

There are two obvious approaches here. Either use a source format that can be converted to both roff and HTML, or convert to roff first and convert the result to HTML.

The latter approach is often used, but has issues. They stem from the fact that roff is a presentation format, so HTML generated from it loses most of the document structure. Take for example lists: in roff, list items are essentially indented text preceded by a bullet point that’s slightly less indented. That is why mandoc, one of the most popular converters from roff to HTML, does not generate any actual HTML lists (you can inspect for example the source of the lists section in scdoc(5)). Not only is that an accessibility issue, it even breaks the presentation if a list item spans multiple lines (you can see it by looking at the same section while making the window very narrow).
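
To make the difference concrete, here is a minimal sketch in Python. It is not what scdoc or mandoc literally emit; the function names are made up for illustration, and the roff side simply uses the man(7) `.IP` macro (an indented paragraph with a tag) for each item. The point is that the roff output carries no "this is a list" information, while the HTML output does.

```python
from html import escape


def list_to_roff(items):
    """Render items the presentational way: one indented paragraph
    per item, tagged with a bullet glyph via the man(7) .IP macro.
    Simplified: does not escape item text that starts with '.'."""
    lines = []
    for item in items:
        lines.append(r".IP \(bu 4")  # bullet tag, 4 columns of indent
        lines.append(item)
    return "\n".join(lines)


def list_to_html(items):
    """Render the same items as a semantic HTML list."""
    body = "\n".join(f"  <li>{escape(item)}</li>" for item in items)
    return f"<ul>\n{body}\n</ul>"


if __name__ == "__main__":
    items = ["first item", "second item"]
    print(list_to_roff(items))
    print(list_to_html(items))
```

A converter that only sees the first output has to guess that consecutive tagged paragraphs were meant to be a list; a converter that starts from the source never has to guess.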

At some point I was almost willing to accept this, as I really liked the HTML that Pandoc generates from roff. However, I started looking into using Pandoc’s autolink_bare_uris extension to generate clickable links in the HTML output. Unfortunately, I discovered that it takes a very “liberal” approach to parsing email addresses. That approach essentially rules out all lists.sr.ht mailing lists, so I decided that I did not want to work with Pandoc any more.

So I would prefer to generate the HTML from something that is not roff. I have tried AsciiDoc, but it does a bit too much. I like neither the syntax nor the default HTML output (which e.g. insists on including JS). Apparently AsciiDoc even added some Markdown compatibility now, but in my opinion that just makes it even more confusing.

In praise of: scdoc

I had already started using scdoc for some projects. In my opinion, it’s amazing. Both the language and the tool are laser-focused on a single task: provide a simple syntax and generate man pages from it. The syntax takes cues from Markdown, making it easy to remember the basics. Where it deviates (e.g. tables), it can be explained in a few paragraphs.

Writing something that generates HTML from scdoc files was an obvious candidate. The only “problem” was that the code has the same laser-focus on its single job. I recommend that you read it, really. Even though some of the syntax constructs are not trivial to parse, the code looks very simple. When trying to change anything, though, you realize the complex machinery that it adds up to. Nothing can just be removed, and adding something is more difficult than it initially looks.

Enter scd2html

As you already know, I ended up doing it anyway. That it worked is mostly owed to the clarity of the original code, my acceptance of the fact that the new code is much uglier, and to some extent a compromise on features to not mess it up even more.

The features that scd2html brings are:

  • scdoc as input format - allows me to stick to scdoc (see above)
  • Generate terse HTML - the output uses semantically meaningful tags (<header>, <section>, lists), and very little inline styling. Even unstyled, the output represents a reasonable result, except for tables.
  • Automatic links - a re2c-based detection of links and emails was added, automatically generating proper hyperlinks.
  • Section anchors - you can link to individual sections of a man page. A feature quite necessary for larger man pages.
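
To illustrate the last two features, here is a rough sketch in Python. It is not how scd2html does it: the real tool uses a re2c-generated scanner in C, whereas this uses Python's `re` module as a stand-in. The patterns, the ID scheme, the markup and all function names are assumptions made up for this example. The email pattern deliberately accepts `~` and `/` in the local part, since lists.sr.ht addresses contain both.

```python
import re
from html import escape

# Hypothetical patterns, far simpler than a real scanner would need.
# The URL must not end in punctuation, so "see https://x.org." works.
URL_RE = r"https?://[^\s<>]*[^\s<>.,;:!?)]"
# Local part allows '~' and '/', as used by lists.sr.ht addresses.
EMAIL_RE = r"[A-Za-z0-9._%+~/-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"
TOKEN_RE = re.compile(f"(?P<url>{URL_RE})|(?P<email>{EMAIL_RE})")


def autolink(text):
    """Escape text for HTML and wrap URLs and emails in <a> tags."""
    out = []
    pos = 0
    for m in TOKEN_RE.finditer(text):
        out.append(escape(text[pos:m.start()]))
        target = m.group(0)
        href = target if m.group("url") else "mailto:" + target
        out.append(f'<a href="{escape(href)}">{escape(target)}</a>')
        pos = m.end()
    out.append(escape(text[pos:]))
    return "".join(out)


def section_id(title):
    """Derive an anchor ID from a section title (assumed scheme)."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def render_section(title, text):
    """Render one section whose heading links to its own anchor."""
    sid = section_id(title)
    return (
        f'<section id="{sid}">\n'
        f'  <h2><a href="#{sid}">{escape(title)}</a></h2>\n'
        f"  <p>{autolink(text)}</p>\n"
        f"</section>"
    )


if __name__ == "__main__":
    print(render_section(
        "SEE ALSO",
        "Write to ~someone/some-list@lists.sr.ht or see https://example.org.",
    ))
```

The address and URL in the usage example are placeholders. With anchors like these, a link such as `page.html#see-also` jumps straight to the section.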

The styling is “take it or leave it”. A fragment can be generated by passing the -f flag, which can then be embedded into an HTML page with custom styling. Examples of the built-in style (inspired by Pandoc) can be seen here: scdoc(1), scdoc(5), vsync(1).
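
Embedding such a fragment could look roughly like the following Python sketch. The template, the stylesheet name and the function names are made up for illustration. It also assumes that scd2html, like scdoc, reads the source from standard input and writes to standard output; only the -f flag itself is taken from the description above.

```python
import subprocess
from html import escape

# Hypothetical page template; bring your own stylesheet.
PAGE_TEMPLATE = """<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>{title}</title>
<link rel="stylesheet" href="{stylesheet}">
</head>
<body>
{fragment}
</body>
</html>
"""


def render_fragment(scdoc_source):
    """Run `scd2html -f` on the given source and return the fragment.
    Assumption: input on stdin, output on stdout, as with scdoc."""
    result = subprocess.run(
        ["scd2html", "-f"],
        input=scdoc_source,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def wrap_fragment(fragment, title, stylesheet="man.css"):
    """Embed an already generated HTML fragment into a full page."""
    return PAGE_TEMPLATE.format(
        title=escape(title),
        stylesheet=escape(stylesheet),
        fragment=fragment,
    )


if __name__ == "__main__":
    # Stand-in fragment so the example runs without scd2html installed.
    print(wrap_fragment("<section><p>Hello</p></section>", "example(1)"))
```

The fragment is inserted verbatim because it is already HTML; only the title and stylesheet path are escaped.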

And now for something completely different

After building this, I thought: wouldn’t it be nice if folks could give this a try on their own files before going forth and compiling this? And I took this as an excuse to play with something I had long wanted to play with: Emscripten.

I am not exactly a fan of JavaScript or WebAssembly, but the idea of running C code in the browser was interesting enough that I wanted to give it a try to understand how it works. Turns out, it’s fairly straightforward. I added a few commits to a wasm branch if you are interested. You can find the result here: bitfehler.net/scd2html. You can simply open a scdoc file from your computer and it will display the HTML generated by scd2html.

I don’t think this is a suitable means of distributing software, but it certainly does make for an interesting toy…

That’s it. As always, I’d love to see your feedback in my public inbox or find me in the #sr.ht.watercooler IRC channel!