Why does Rescript syntax not use a parser combinator approach?

Hi there! I asked this question a long time ago on the Discord, and it was mentioned that the team wanted to write a blog post about it. As someone who is very interested in parsers and dev tooling, I’m curious why Rescript chose the approach it did for the syntax parser, as opposed to parser combinators (something like https://github.com/inhabitedtype/angstrom)?

Even before the post, there’s a little bit here: https://github.com/rescript-lang/syntax#why

I can only answer for myself:

Parser tooling such as PEGs, combinators, LALR(1), LR(1), GLR, etc. exists to make it easy to define a grammar and get the ball rolling. I don’t share Jonathan Blow’s opinion here: such tools are often simpler to write with, and far easier to maintain while the language you are working on is still undergoing large changes.

However, they often fall very short as soon as you need to do error reporting. The most important programmer is often the newcomer to a language. They don’t know the grammar or syntax rules, so they are likely to make many mistakes, and they are also likely to have a harder time understanding the error messages. This by itself should be enough reason to get more serious about parsing if the tooling doesn’t support good error reporting.
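
To make the error-reporting point concrete, here is a minimal OCaml sketch (a made-up toy grammar, not the actual ReScript parser): because a hand-rolled recursive-descent parser has the full surrounding context at the point of failure, it can phrase the message in terms a newcomer recognizes instead of emitting a generic “expected one of …” dump.

```ocaml
(* A hypothetical toy parser for `let <name> = <int>` bindings; the token
   and error shapes are invented for illustration. *)
type token = Let | Ident of string | Equals | Int of int | Eof

type state = { mutable tokens : token list; mutable errors : string list }

let peek st = match st.tokens with t :: _ -> t | [] -> Eof
let advance st = match st.tokens with _ :: rest -> st.tokens <- rest | [] -> ()

(* Parse `let <name> = <int>`, reporting friendly, specific errors and
   recovering with placeholder values so parsing can continue. *)
let parse_let_binding st =
  advance st; (* consume `let` *)
  let name =
    match peek st with
    | Ident n -> advance st; n
    | _ ->
      st.errors <- "I was expecting a name after `let`, e.g. `let x = 1`." :: st.errors;
      "_"
  in
  (match peek st with
   | Equals -> advance st
   | _ ->
     st.errors <- (Printf.sprintf "Did you forget a `=` after `%s`?" name) :: st.errors);
  match peek st with
  | Int v -> advance st; (name, v)
  | _ ->
    st.errors <- "A binding needs a value on the right-hand side." :: st.errors;
    (name, 0)

let () =
  (* The programmer typed `let x 1`, forgetting the `=`. *)
  let st = { tokens = [ Let; Ident "x"; Int 1 ]; errors = [] } in
  let name, value = parse_let_binding st in
  Printf.printf "recovered binding: %s = %d\n" name value;
  List.iter print_endline (List.rev st.errors)
  (* prints:
     recovered binding: x = 1
     Did you forget a `=` after `x`? *)
```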

The other major reason is that in many cases, you want to parse a superset of the accepted language. If you look at programming as a dialogue between a compiler and a programmer, you want early feedback in the editor of your choice. This argues that the parser should accept programs with obvious syntax errors, so it can report more problems in the code base faster.
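
As an illustration of the “parse a superset” idea, here is a small sketch (again a toy grammar in OCaml, not ReScript’s implementation): on a syntax error the parser records a diagnostic and resynchronizes at `;`, so a single pass over the file can surface several problems at once.

```ocaml
(* A hypothetical toy grammar: a program is a list of `<num> + <num>;`
   statements. Token and diagnostic shapes are invented for illustration. *)
type token = Num of int | Plus | Semi | Junk of char

type diagnostic = { message : string }

(* Parse every statement we can, collecting diagnostics instead of stopping
   at the first syntax error. *)
let parse_program tokens =
  let diagnostics = ref [] in
  (* On error, skip ahead to the next `;` (the synchronization token). *)
  let rec skip_to_semi = function
    | Semi :: rest -> rest
    | _ :: rest -> skip_to_semi rest
    | [] -> []
  in
  let rec stmts acc = function
    | [] -> (List.rev acc, List.rev !diagnostics)
    | Num a :: Plus :: Num b :: Semi :: rest -> stmts ((a + b) :: acc) rest
    | broken ->
      diagnostics :=
        { message = "Expected a statement like `1 + 2;` here." } :: !diagnostics;
      stmts acc (skip_to_semi broken)
  in
  stmts [] tokens

let () =
  (* The second statement is garbage, but the third one still gets parsed. *)
  let values, diags =
    parse_program
      [ Num 1; Plus; Num 2; Semi; Junk '?'; Semi; Num 3; Plus; Num 4; Semi ]
  in
  Printf.printf "parsed %d statements despite the errors below:\n"
    (List.length values);
  List.iter (fun d -> print_endline d.message) diags
  (* prints:
     parsed 2 statements despite the errors below:
     Expected a statement like `1 + 2;` here. *)
```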

Once a language starts being more set in stone, the changes tend to be smaller in nature, so you can hand-roll the lexer/parser phase without too much worry. And you can start working on the ergonomics of the parser errors.

As for parser combinators, remember that they always target a certain language class. “Classic” combinator parsers target LL(1), and only a handful of languages fit that class. If you are parsing e.g. Pascal, Oberon, or GraphQL, you are in luck, because those languages either fit directly into the LL(1) category or almost do. Most “modern” languages are at the very least LALR(1) or LR(1), so you need special support in your parser combinators for them.
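
For example, the natural left-recursive grammar `expr ::= expr '-' int | int` cannot be transcribed literally into LL-style combinators; it has to be rewritten into an iterative, `chainl`-style form first. The sketch below uses tiny hand-made combinators (not Angstrom’s actual API) to show both the trap and the rewrite.

```ocaml
(* Tiny hand-made combinators over a char list, for illustration only. *)
type 'a parse_fn = char list -> ('a * char list) option

let digit : int parse_fn = function
  | c :: rest when c >= '0' && c <= '9' -> Some (Char.code c - Char.code '0', rest)
  | _ -> None

let char_ c : unit parse_fn = function
  | c' :: rest when c' = c -> Some ((), rest)
  | _ -> None

(* Transcribing the grammar literally would be:

     let rec expr input =
       match expr input with   (* left recursion: expr calls itself on the
                                  same input before consuming anything *)
       ...

   which never terminates. So combinator users rewrite it into the
   chainl-style loop below: parse one operand, then fold over `- <digit>`
   tails, keeping the subtraction left-associative. *)
let expr : int parse_fn = fun input ->
  match digit input with
  | None -> None
  | Some (first, rest) ->
    let rec tails acc rest =
      match char_ '-' rest with
      | Some ((), after_minus) ->
        (match digit after_minus with
         | Some (d, rest') -> tails (acc - d) rest'
         | None -> Some (acc, rest))
      | None -> Some (acc, rest)
    in
    tails first rest

let () =
  match expr [ '7'; '-'; '3'; '-'; '2' ] with
  | Some (v, _) -> Printf.printf "7-3-2 = %d\n" v (* left-associative: prints 2 *)
  | None -> print_endline "parse error"
```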

Yeah, we’re gonna write up a blog post about this soon, now that the parser’s been tested in the real world. Your point about error messages and language complexity is on point; those two constitute our major motivation. The ROI also drops very quickly the moment you start trying to extend some parser tool; you might as well roll your own. Like jblow said, a parser is just another program. Our entire community could do well with more of that mentality.

Note that all of the mainstream Turing-complete languages we’ve looked into use a hand-rolled parser instead of parser generators/combinators. In our prior experience, the latter indeed fall short. They’re somewhat interesting to play around with and to think about, but again, see all the real-world examples.

Not to mention that our current parser has no dependencies: one little self-contained repo we can carry everywhere, nothing to fork, nothing to extend. That’s another general software philosophy we’ve tried to push in the community.

We also get important control over the performance and memory characteristics with a hand-rolled parser. This has proven to be critical. More on this in the upcoming blog post.

Thanks @chenglou. Circling back to this post after about 6 months of parser development and testing just to reference some of the great points you made here. I’ve come around to agreeing with all of them. While parser combinators are indeed neat for how much work they do with so little code, they definitely leave a lot to be desired in terms of error recovery. They also don’t really allow tooling to handle cases like suggestions based on the (currently incorrect) program the programmer is typing.

I’d still be very interested to hear more from the team about some of the wins and learnings (and any shortcomings) from the current parser, as it does seem that Rescript is near the leading edge in a lot of ways here.