
A parser combinator library in TypeScript

Try it on GitHub: https://github.com/coffeemug/ts-parsec

I write a lot of throwaway interpreters to play with programming language design ideas. For these projects, writing a parser is usually the most frustrating part. Parser libraries are hard to learn, easy to forget, and finicky to use. The other option, hand-coding a custom parser for each interpreter, raises the activation energy to start a project high enough that I abandon too many ideas before I try them.

All of this is unsatisfactory. To solve this problem I wrote a parser combinator library for myself in TypeScript. It has these design goals:

Produces recursive descent parsers capable of parsing PEG grammars.
For throwaway projects only. Will never grow big, have complex optimizations, or other fancy features.
Small, so I can understand every detail. The library is under 500 lines of code and took maybe a couple of days to write.
Type safe. The syntax tree types are inferred from the combinators. It’s beautiful and really fun to use.

Example

Here is a simple example:

const digit = range('0', '9');
const lower = range('a', 'z');
const upper = range('A', 'Z');
const alpha = either(lower, upper);
const alnum = either(alpha, digit);

const ident = seq(alpha, many(alnum)).map(([first, rest]) =>
  [first, ...rest].join(""));

You can see how these parsers build on top of each other. I added a map method to support transforming the concrete syntax tree into an AST on the spot. Here seq(alpha, many(alnum)) returns a tuple with an alphabetic character and an array of alphanumeric characters. But we don’t want to deal with that when handling identifiers; we just want a string. I can do that with a simple map.
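For readers curious how combinators like these can fit together, here is a minimal, self-contained sketch. It is not ts-parsec's actual source: the `Result` and `Parser` types, the plain string-plus-position input (instead of the library's stream), and `map` as a standalone function rather than a method are all assumptions made for illustration.

```typescript
// Hypothetical model: a parser takes the input and a position and returns
// either a value plus the new position, or null on failure.
type Result<T> = { value: T; pos: number } | null;
type Parser<T> = (input: string, pos: number) => Result<T>;

// Matches one character between `lo` and `hi`, inclusive.
const range = (lo: string, hi: string): Parser<string> => (input, pos) => {
  if (pos >= input.length) return null;
  const ch = input.charAt(pos);
  return ch >= lo && ch <= hi ? { value: ch, pos: pos + 1 } : null;
};

// Ordered choice, as in PEG: try `a`, and only if it fails try `b`.
const either = <A, B>(a: Parser<A>, b: Parser<B>): Parser<A | B> =>
  (input, pos) => a(input, pos) ?? b(input, pos);

// Runs `a` then `b`; the result type is the tuple [A, B].
const seq = <A, B>(a: Parser<A>, b: Parser<B>): Parser<[A, B]> => (input, pos) => {
  const ra = a(input, pos);
  if (ra === null) return null;
  const rb = b(input, ra.pos);
  if (rb === null) return null;
  const value: [A, B] = [ra.value, rb.value];
  return { value, pos: rb.pos };
};

// Zero or more repetitions of `p`; never fails.
const many = <A>(p: Parser<A>): Parser<A[]> => (input, pos) => {
  const values: A[] = [];
  let r = p(input, pos);
  while (r !== null && r.pos > pos) { // stop if `p` fails or consumes nothing
    values.push(r.value);
    pos = r.pos;
    r = p(input, pos);
  }
  return { value: values, pos };
};

// Transforms the value of a successful parse.
const map = <A, B>(p: Parser<A>, f: (a: A) => B): Parser<B> => (input, pos) => {
  const r = p(input, pos);
  return r === null ? null : { value: f(r.value), pos: r.pos };
};

const digit = range("0", "9");
const alpha = either(range("a", "z"), range("A", "Z"));
const alnum = either(alpha, digit);

// seq(...) is inferred as Parser<[string, string[]]>; map turns it into Parser<string>.
const ident = map(seq(alpha, many(alnum)), ([first, rest]) => [first, ...rest].join(""));
```

In this sketch, ident("x1y2 rest", 0) yields the value "x1y2" at position 4, and ident("9abc", 0) yields null because an identifier can't start with a digit. No type annotations are needed at the use sites; the tuple type flows out of seq and into the map callback.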

Parsers operate on a special stream type that’s mostly irrelevant to the end user. To parse an identifier you’d do this:

const input = "Hello";
const stream = fromString(input);
ident(stream);

Actually, I lied a little. By default the stream automagically skips whitespace. That’s the desired behavior for most higher-order parsers, but when parsing keywords, identifiers, numbers, etc. we want to turn that behavior off. So in practice ident would be defined like this:

// `lex` turns off skipping whitespace for the parser it's wrapping
const ident = lex(seq(alpha, many(alnum))).map(([first, rest]) =>
  [first, ...rest].join(""));
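To make the whitespace behavior concrete, here is a small sketch of one way a stream with a skip-whitespace flag could work. Everything in it is an assumption for illustration, not the library's real internals: the `Stream` shape, the `skipWs` flag, the `skipSpaces` helper, and in particular the choice that lex still skips whitespace before the token and only disables skipping inside it.

```typescript
// Hypothetical stream: input text, a cursor, and a whitespace-skipping flag.
type Stream = { text: string; pos: number; skipWs: boolean };
// In this sketch a parser mutates the stream and returns null on failure.
type Parser<T> = (s: Stream) => T | null;

const fromString = (text: string): Stream => ({ text, pos: 0, skipWs: true });

const skipSpaces = (s: Stream): void => {
  while (s.pos < s.text.length && /\s/.test(s.text.charAt(s.pos))) s.pos++;
};

// Matches one character in [lo, hi], skipping leading whitespace when enabled.
const range = (lo: string, hi: string): Parser<string> => (s) => {
  const start = s.pos;
  if (s.skipWs) skipSpaces(s);
  const ch = s.text.charAt(s.pos);
  if (ch !== "" && ch >= lo && ch <= hi) {
    s.pos++;
    return ch;
  }
  s.pos = start; // backtrack on failure
  return null;
};

// Zero or more repetitions (assumes `p` consumes input whenever it succeeds).
const many = <T>(p: Parser<T>): Parser<T[]> => (s) => {
  const out: T[] = [];
  for (let v = p(s); v !== null; v = p(s)) out.push(v);
  return out;
};

// Runs `p` with whitespace skipping turned off, then restores the old setting.
const lex = <T>(p: Parser<T>): Parser<T> => (s) => {
  const start = s.pos;
  const saved = s.skipWs;
  if (saved) skipSpaces(s); // whitespace before the token is still skipped
  s.skipWs = false;         // but not between the token's characters
  const v = p(s);
  s.skipWs = saved;
  if (v === null) s.pos = start;
  return v;
};

const word = many(range("a", "z"));       // skips whitespace between letters
const token = lex(many(range("a", "z"))); // stops at the first space
```

On the input "ab cd", word collects all four letters because each character match skips the space in front of it, while token stops after "ab". That difference is why identifiers, keywords, and numbers want the lex wrapper.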

All the usual suspects like seq, either, maybe, some, many, and sepBy are implemented in the library. This turns out to be enough to write parsers for most grammars I may ever want to parse.¹
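As a rough illustration of why a handful of primitives goes a long way, maybe, some, and sepBy can each be written in a few lines on top of a repetition combinator. This is a sketch under assumed types (`Result`, `Parser`, string-plus-position input), not the library's implementation, and the choice that sepBy accepts an empty list is also an assumption.

```typescript
// Hypothetical model, not ts-parsec's real types.
type Result<T> = { value: T; pos: number } | null;
type Parser<T> = (input: string, pos: number) => Result<T>;

// Matches the literal `lit` at the current position.
const str = (lit: string): Parser<string> => (input, pos) =>
  input.startsWith(lit, pos) ? { value: lit, pos: pos + lit.length } : null;

// Zero or more repetitions; never fails.
const many = <T>(p: Parser<T>): Parser<T[]> => (input, pos) => {
  const values: T[] = [];
  for (let r = p(input, pos); r !== null && r.pos > pos; r = p(input, pos)) {
    values.push(r.value);
    pos = r.pos;
  }
  return { value: values, pos };
};

// Optional: yields null without consuming input when `p` fails.
const maybe = <T>(p: Parser<T>): Parser<T | null> => (input, pos) =>
  p(input, pos) ?? { value: null, pos };

// One or more repetitions: `many` that must match at least once.
const some = <T>(p: Parser<T>): Parser<T[]> => (input, pos) => {
  const r = many(p)(input, pos);
  return r !== null && r.value.length > 0 ? r : null;
};

// Items separated by `sep`: a first item, then many (sep item) pairs.
const sepBy = <T, S>(item: Parser<T>, sep: Parser<S>): Parser<T[]> => (input, pos) => {
  const first = item(input, pos);
  if (first === null) return { value: [], pos }; // assumption: empty lists are allowed
  const tail = many<T>((i, p) => {
    const s = sep(i, p);
    return s === null ? null : item(i, s.pos);
  })(input, first.pos);
  return tail === null
    ? { value: [first.value], pos: first.pos }
    : { value: [first.value, ...tail.value], pos: tail.pos };
};

const list = sepBy(str("ab"), str(","));
```

Here list("ab,ab,ab!", 0) produces three items and stops at position 8, and a trailing separator as in "ab,ab," is left unconsumed because the (sep item) pair fails as a unit.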

Calculator

One limitation of recursive descent parsers is that they fall into an infinite loop on left recursion. You can manually rewrite your parser to avoid left recursion, but it’s a pain. This is relevant for toy interpreters because left recursion is the most natural way to express grammars for basic arithmetic. To avoid having to deal with this problem I added a helper parser, binop, for parsing binary operators. Using binop, a calculator grammar looks like this:

const factor = binop(either(str('*'), str('/')), int,
  (op, l, r) => [op, l, r]);

const term = binop(either(str('+'), str('-')), factor,
  (op, l, r) => [op, l, r]);

const input = "1 + 2 * 3";
const stream = fromString(input);

// produces `['+', 1, ['*', 2, 3]]`
term(stream);

(binop is left-associative. There is also a right-associative version, binopr, for binary operators like assignment.)
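One common way to get left associativity without left recursion is to parse a single operand and then loop over (operator, operand) pairs, folding each pair into the result so far. The sketch below shows that technique, plus a right-associative variant that recurses on the right-hand side. It mirrors the argument order used above, but the types, the space-skipping inside `str` and `int`, and the bodies of `binop` and `binopr` are assumptions, not the library's code.

```typescript
// Hypothetical model, not ts-parsec's real types.
type Result<T> = { value: T; pos: number } | null;
type Parser<T> = (input: string, pos: number) => Result<T>;

// Skips spaces, then matches the literal `lit`.
const str = (lit: string): Parser<string> => (input, pos) => {
  while (input.charAt(pos) === " ") pos++;
  return input.startsWith(lit, pos) ? { value: lit, pos: pos + lit.length } : null;
};

// Skips spaces, then matches one or more digits as a number.
const int: Parser<number> = (input, pos) => {
  while (input.charAt(pos) === " ") pos++;
  let end = pos;
  while (end < input.length && input.charAt(end) >= "0" && input.charAt(end) <= "9") end++;
  return end > pos ? { value: Number(input.slice(pos, end)), pos: end } : null;
};

const either = <A, B>(a: Parser<A>, b: Parser<B>): Parser<A | B> =>
  (input, pos) => a(input, pos) ?? b(input, pos);

// Left-associative: iterate instead of recursing on the left.
const binop = <Op, T>(op: Parser<Op>, operand: Parser<T>,
    combine: (op: Op, left: T, right: T) => T): Parser<T> => (input, pos) => {
  const first = operand(input, pos);
  if (first === null) return null;
  let acc = first;
  for (;;) {
    const o = op(input, acc.pos);
    if (o === null) return acc;
    const rhs = operand(input, o.pos);
    if (rhs === null) return acc; // dangling operator: stop before it
    acc = { value: combine(o.value, acc.value, rhs.value), pos: rhs.pos };
  }
};

// Right-associative: recursing on the right-hand side is safe.
const binopr = <Op, T>(op: Parser<Op>, operand: Parser<T>,
    combine: (op: Op, left: T, right: T) => T): Parser<T> => {
  const parse: Parser<T> = (input, pos) => {
    const lhs = operand(input, pos);
    if (lhs === null) return null;
    const o = op(input, lhs.pos);
    if (o === null) return lhs;
    const rhs = parse(input, o.pos);
    if (rhs === null) return lhs;
    return { value: combine(o.value, lhs.value, rhs.value), pos: rhs.pos };
  };
  return parse;
};

type Expr = number | [string, Expr, Expr];
const factor = binop<string, Expr>(either(str("*"), str("/")), int, (op, l, r) => [op, l, r]);
const term = binop<string, Expr>(either(str("+"), str("-")), factor, (op, l, r) => [op, l, r]);
const power = binopr<string, Expr>(str("^"), int, (op, l, r) => [op, l, r]);
```

With these definitions, term on "8 - 3 - 2" groups as ['-', ['-', 8, 3], 2], while power on "2 ^ 3 ^ 2" groups as ['^', 2, ['^', 3, 2]]. Precedence falls out of the nesting: term uses factor as its operand, so multiplication binds tighter than addition.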

Practical problems

There are two painful limitations of this library, one fixable (but not yet fixed), the other inherent to its design.

The fixable problem is that the library produces no error messages whatsoever. It’s structurally set up to handle errors, but I haven’t implemented error reporting yet. So if something goes wrong during parsing, there is no useful information at all. I have some ideas for how to make error reporting easy and really good, but haven’t gotten around to working on this.

The more serious problem is that type-safe parser combinators seem like an elegant, obviously good idea, but they turn out to kind of suck in practice. Maybe I’m not smart enough, or maybe I’m too lazy to properly understand the ins and outs of the TypeScript type system, or maybe I just need to work a little harder to mature the library. But whatever the reason, every time I do semi-advanced type hackery like this, I end up spending more time dealing with weird type errors than actually working on my grammar. It’s all right in this narrow case because it’s fun, I know the ins and outs of the library, and it’s meant for throwaway/toy interpreters. But for a serious project I’d use an ugly, boring, old-school parser generator.

Or, more likely, bite the bullet and code the parser by hand.

¹ There are many grammars this doesn’t parse, but see the design goals. It’s not meant to! Actually, I would argue that if your programming language can’t be parsed with a PEG parser, it’s hard for humans to parse too.

Oct 12, 2024