Book Reviews

The Definitive ANTLR 4 Reference by Terence Parr

ISBN-13: 978-1934356999
Publisher: Pragmatic Bookshelf
Pages: 328

I've been a professional programmer for 20 years now. During that time a lot of technologies have passed by, often leaving few reasons to mourn their demise (yes, COM, I'm looking at you). The software industry is a tumultuous place to be; new languages and frameworks pop up all the time and it takes effort just to keep up. At the same time, few technologies are genuinely new, and some skills are as useful today as they were two decades ago when I started. Parsing text falls into this category. In fact, I'll claim that parsing is one of the most neglected skills among programmers. Knowing when and how to write a parser is likely to save you a lot of time, bugs and headaches. Trust me - I'd be a rich man if I had a penny for every block of code with nested loops, conditionals and primitive string operations that I've replaced with a simple parser invocation. Unfortunately I've never been compensated that way, so let's leave it there and move on to discuss ANTLR instead.

The Road Towards ANTLR

There are several parser generators to choose from. The best known are probably Lex/Yacc and their free alternatives Flex/Bison. That's where I once started. My first parser used Flex/Bison to interpret a test description and feed the results to an application that generated test data. It was surprisingly straightforward, although with a bit too much boilerplate code for my taste.

However, for simple tasks it feels like overkill to pull in yet another tool with additional build steps, etc. That's why I started to experiment with embedded parser libraries like Boost Spirit for C++ and Instaparse for Clojure. An embedded parser library has the advantage of being just another dependency; no extra tooling is required. It's a strategy I still use. When regular expressions grow too complex I replace them with an embedded parser. It's a low-ceremony refactoring that often simplifies the code a lot. Unfortunately I've never seen an embedded parser library scale beyond simple cases like this. Boost Spirit, for example, doesn't scale in cognitive terms, kills build times, and makes the grammar horrible to debug. Instaparse is more pleasant but too memory hungry for most of my applications.

Over the past year I had to reconsider my parsing strategies. The turning point was when we developed X-Ray for the CodeScene software analysis tool. The X-Ray feature analyses trends and patterns in the evolution of individual source code files. To pull that off, we need to parse source code written in a variety of programming languages. This is where we turned to ANTLR to optimize both run-time performance and, even more important, the implementation costs to support X-Ray for new languages.

The ANTLR Advantage

The ANTLR tool takes a language grammar as input and generates a parser in Java. So far nothing spectacular. However, ANTLR simplifies a lot of steps, which makes writing parsers so much easier. One example is that ANTLR lets you use the same syntax for lexing and parsing. The only differentiator, on a syntactic level, is a naming convention where rules that start with a capital letter are lexer rules. Everything else is a parser rule. It took me some time to get used to, but now I just love that idea.

Another advantage is that ANTLR makes it easy to separate actions on the parse tree from the grammar. Instead of embedding actions as code inside the grammar, ANTLR favors listeners and visitors. The result is a clear separation between the actual grammar and the code that operates on the parse results. The most pleasant surprise with ANTLR, however, was how easy it is to debug a grammar. ANTLR comes with a small set of tools that let you inspect and profile the parse tree visually. Writing parsers has never been this easy.
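
To make that separation concrete, here is a minimal Java sketch of the visitor idea. It does not use the ANTLR runtime: the node types, the Visitor interface and all names below are hypothetical stand-ins that I made up for illustration, and the tree is built by hand instead of by a generated parser. The point is only that two unrelated actions can work on the same tree without either of them appearing in a grammar.

```java
public class VisitorSketch {
    // Hypothetical parse-tree node; a generated parser would supply its own node classes.
    interface Node {
        <T> T accept(Visitor<T> visitor);
    }

    // One callback per node type. The tree knows nothing about what an action does.
    interface Visitor<T> {
        T visitNumber(NumberNode node);
        T visitAdd(AddNode node);
    }

    static final class NumberNode implements Node {
        final int value;
        NumberNode(int value) { this.value = value; }
        public <T> T accept(Visitor<T> visitor) { return visitor.visitNumber(this); }
    }

    static final class AddNode implements Node {
        final Node left;
        final Node right;
        AddNode(Node left, Node right) { this.left = left; this.right = right; }
        public <T> T accept(Visitor<T> visitor) { return visitor.visitAdd(this); }
    }

    // First action: compute the value of the expression.
    static final class Evaluator implements Visitor<Integer> {
        public Integer visitNumber(NumberNode node) { return node.value; }
        public Integer visitAdd(AddNode node) {
            return node.left.accept(this) + node.right.accept(this);
        }
    }

    // Second action on the same tree: render it back as text.
    static final class Printer implements Visitor<String> {
        public String visitNumber(NumberNode node) { return Integer.toString(node.value); }
        public String visitAdd(AddNode node) {
            return "(" + node.left.accept(this) + " + " + node.right.accept(this) + ")";
        }
    }

    static int evaluate(Node tree) { return tree.accept(new Evaluator()); }

    static String print(Node tree) { return tree.accept(new Printer()); }

    public static void main(String[] args) {
        // Hand-built tree for 1 + (2 + 3); a parser would normally produce this.
        Node tree = new AddNode(new NumberNode(1),
                new AddNode(new NumberNode(2), new NumberNode(3)));
        System.out.println(print(tree) + " = " + evaluate(tree));
    }
}
```

Adding a third action, say a complexity counter, would mean writing one more Visitor implementation and leaving both the tree and the grammar untouched.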

Reading The Definitive ANTLR 4 Reference

Once I adapted to the syntax I found it easy to get started with ANTLR. The online documentation is good and there are plenty of examples to emulate and learn from. But of course, after a while a deeper understanding becomes necessary. In my case I wanted to tailor the error handling of the parser. That's why I bought this book.

Terence Parr, the author of ANTLR, has done a wonderful job with The Definitive ANTLR 4 Reference. The book starts out with an introduction to the domain of parser generators and grammars. Terence introduces all the terms you'll need to know, like context-free grammars, recursive-descent parsers and so on. Armed with that knowledge we start to build some simple applications like a calculator. We also get an introduction to the Visitors and Listeners that ANTLR generates for us (for example, you traverse a parse tree with Visitors and react to events, like errors, by attaching one or more Listeners).
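
For readers who haven't met the term, a recursive-descent parser is simply one method per grammar rule, with the methods calling each other the way the rules refer to each other. The Java sketch below is my own hand-written illustration of a tiny calculator, not code from the book and not ANTLR output; the class name, the method names and the three-rule grammar in the comments are all assumptions made for the example.

```java
public class CalcSketch {
    private final String input;
    private int pos = 0;

    private CalcSketch(String input) { this.input = input; }

    // Entry point: parse and evaluate one complete expression.
    // Spaces are stripped first, so error positions refer to the stripped text.
    static int eval(String text) {
        CalcSketch parser = new CalcSketch(text.replace(" ", ""));
        int value = parser.expr();
        if (parser.pos != parser.input.length()) {
            throw new IllegalArgumentException("Unexpected input at position " + parser.pos);
        }
        return value;
    }

    // expr : term (('+' | '-') term)* ;
    private int expr() {
        int value = term();
        while (peek() == '+' || peek() == '-') {
            char op = input.charAt(pos++);
            int rhs = term();
            value = (op == '+') ? value + rhs : value - rhs;
        }
        return value;
    }

    // term : factor (('*' | '/') factor)* ;
    private int term() {
        int value = factor();
        while (peek() == '*' || peek() == '/') {
            char op = input.charAt(pos++);
            int rhs = factor();
            value = (op == '*') ? value * rhs : value / rhs; // integer division
        }
        return value;
    }

    // factor : INT | '(' expr ')' ;
    private int factor() {
        if (peek() == '(') {
            pos++; // consume '('
            int value = expr();
            if (peek() != ')') {
                throw new IllegalArgumentException("Expected ')' at position " + pos);
            }
            pos++; // consume ')'
            return value;
        }
        int start = pos;
        while (Character.isDigit(peek())) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("Expected a number at position " + pos);
        }
        return Integer.parseInt(input.substring(start, pos));
    }

    // Current character, or '\0' once the input is exhausted.
    private char peek() {
        return pos < input.length() ? input.charAt(pos) : '\0';
    }
}
```

Calling CalcSketch.eval("(1 + 2) * 3") walks expr, term and factor in turn. Note how evaluation is tangled into the parsing methods here, which is the style the listeners and visitors are meant to help you avoid.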

I enjoyed the initial chapters as a refresher. It's a solid introduction that I wish I'd read years ago. After that the book picks up the pace as we learn to parse more complex input like a subset of the R language. Along the way Terence shows us how to structure our application and how the different ANTLR concepts come together. To me, the most useful parts of the book are the advanced topics like error recovery, which is covered in a clear and accessible way. If you have any interest in parsers, and you should have, then this is a great book that I highly recommend.

Reviewed December 2016