Building an interpreter from scratch
Creating a programming language has always carried this uber-leet status to me, and now I finally get to explore it.
I came across Crafting Interpreters by Robert Nystrom and was immediately interested. The idea of creating my own programming language has been one of those things that's been bouncing around in my head since high school. To me it carries this elite status that's always felt out of reach, and it comes with a certain level of bragging rights that I don't think can really be matched.
Oh, you can program!? You're welcome... 😎
I don't know Java, which was gonna make this a challenge for me because the first half of the book builds the Lox programming language in Java. Thankfully, the second half of the book is writing a bytecode VM in C. I'm jumping into this having completely skipped the first half of the book and am going straight to the C implementation, aptly named clox. So anyways, this was my experience.
This book is also completely free to read online. It's genuinely outrageous that Robert would release something like this, completely for free, simply for the love of the game. This is the type of generosity that actually makes a difference. Kudos to you Robert! I'll likely order a physical copy because of just how cool this book is, and to support him, but the web version is not only free but incredibly well organized and gorgeous.
The Surprising Simplicity of Lexers
When I first thought about building a lexer (the part that breaks source code into tokens), I expected some elaborate algorithm. Maybe a complex state machine, or some sophisticated special-purpose paradigm designed for exactly this kind of problem.
It's basically just a giant fucking switch statement.
Token scanToken() {
  skipWhitespace();
  scanner.start = scanner.current;

  if (isAtEnd()) return makeToken(TOKEN_EOF);

  char c = advance();
  if (isAlpha(c)) return identifier();
  if (isDigit(c)) return number();

  switch (c) {
    case '(': return makeToken(TOKEN_LEFT_PAREN);
    case ')': return makeToken(TOKEN_RIGHT_PAREN);
    case '{': return makeToken(TOKEN_LEFT_BRACE);
    case '}': return makeToken(TOKEN_RIGHT_BRACE);
    case ';': return makeToken(TOKEN_SEMICOLON);
    case ',': return makeToken(TOKEN_COMMA);
    case '.': return makeToken(TOKEN_DOT);
    case '-': return makeToken(TOKEN_MINUS);
    case '+': return makeToken(TOKEN_PLUS);
    case '/': return makeToken(TOKEN_SLASH);
    case '*': return makeToken(TOKEN_STAR);
    // ...more cases...
  }

  return errorToken("Unexpected character.");
}

"Is it an opening parenthesis? Return TOKEN_LEFT_PAREN. Is it a plus sign? Return TOKEN_PLUS."
Although in hindsight this seems a bit obvious, I was suspicious. This couldn't possibly be how "real" lexers work, right? So I went down a rabbit hole researching other lexer implementations. Turns out this approach is totally legitimate, even in production-grade compilers and interpreters.
There are, of course, tools like Flex or ANTLR that generate lexers for you, but under the hood even those are essentially fancy switch statements (conceptually) with some significant optimizations. The core algorithm really is "look at the current character, decide what to do, repeat."
The Elegant Trie-Like Keyword Recognition
Yet another opportunity for Robby (we're close like that) to remind me that I have no clue what the fuck I'm doing was how he identifies keywords. I almost certainly would have done something different here; I probably would have reached for a hash table or something unreasonably more complex. What he did instead was this:
static TokenType identifierType() {
  switch (scanner.start[0]) {
    case 'a': return checkKeyword(1, 2, "nd", TOKEN_AND);
    case 'c': return checkKeyword(1, 4, "lass", TOKEN_CLASS);
    case 'e': return checkKeyword(1, 3, "lse", TOKEN_ELSE);
    case 'f':
      if (scanner.current - scanner.start > 1) {
        switch (scanner.start[1]) {
          case 'a': return checkKeyword(2, 3, "lse", TOKEN_FALSE);
          case 'o': return checkKeyword(2, 1, "r", TOKEN_FOR);
          case 'u': return checkKeyword(2, 1, "n", TOKEN_FUN);
        }
      }
      break;
    // More cases...
  }
  return TOKEN_IDENTIFIER;
}

It's a hand-coded prefix tree, checking characters one by one. It's surprisingly efficient - you never compare more characters than necessary. For example, if the first letter is 'z', it immediately knows it's not a keyword.
It also doesn't allocate any memory. The token is just a pointer into the source code that's already in memory, rather than a slow strcpy(), which I would inevitably have used.
A Weird Truth About Language Tools
Building this scanner made me realize something strange - despite being used for the most sophisticated software in the world, programming language tools often rely on surprisingly straightforward algorithms.
Take a look at this piece of the parser from the gcc compiler. It looks strikingly familiar if you ask me. Sure, it's (by a substantial margin) more complex than the clox lexer, but it really is the same exact concept!
static struct c_expr
c_parser_expr_no_commas (c_parser *parser, struct c_expr *after,
                         tree omp_atomic_lhs)
{
  struct c_expr lhs, rhs, ret;
  enum tree_code code;
  location_t op_location, exp_location;
  bool save_in_omp_for = c_in_omp_for;
  c_in_omp_for = false;
  gcc_assert (!after || c_dialect_objc ());
  lhs = c_parser_conditional_expression (parser, after, omp_atomic_lhs);
  op_location = c_parser_peek_token (parser)->location;
  switch (c_parser_peek_token (parser)->type)
    {
    case CPP_EQ:
      code = NOP_EXPR;
      break;
    case CPP_MULT_EQ:
      code = MULT_EXPR;
      break;
    case CPP_DIV_EQ:
      code = TRUNC_DIV_EXPR;
      break;
    case CPP_MOD_EQ:
      code = TRUNC_MOD_EXPR;
      break;
    case CPP_PLUS_EQ:
      code = PLUS_EXPR;
      break;
    case CPP_MINUS_EQ:
      code = MINUS_EXPR;
      break;
    case CPP_LSHIFT_EQ:
      code = LSHIFT_EXPR;
      break;
    case CPP_RSHIFT_EQ:
      code = RSHIFT_EXPR;
      break;
    case CPP_AND_EQ:
      code = BIT_AND_EXPR;
      break;
    case CPP_XOR_EQ:
      code = BIT_XOR_EXPR;
      break;
    case CPP_OR_EQ:
      code = BIT_IOR_EXPR;
      break;
    default:
      c_in_omp_for = save_in_omp_for;
      return lhs;
    }
  c_parser_consume_token (parser);
  exp_location = c_parser_peek_token (parser)->location;
  rhs = c_parser_expr_no_commas (parser, NULL, NULL_TREE);
  rhs = convert_lvalue_to_rvalue (exp_location, rhs, true, true);
  ret.value = build_modify_expr (op_location, lhs.value, lhs.original_type,
                                 code, exp_location, rhs.value,
                                 rhs.original_type);
  ret.m_decimal = 0;
  set_c_expr_source_range (&ret, lhs.get_start (), rhs.get_finish ());
  if (code == NOP_EXPR)
    ret.original_code = MODIFY_EXPR;
  else
    {
      suppress_warning (ret.value, OPT_Wparentheses);
      ret.original_code = ERROR_MARK;
    }
  ret.original_type = NULL;
  c_in_omp_for = save_in_omp_for;
  return ret;
}

/* Parse a conditional expression (C90 6.3.15, C99 6.5.15, C11 6.5.15). If
   AFTER is not NULL then it is an Objective-C message expression which is
   the primary-expression starting the expression as an initializer.

   conditional-expression:
     logical-OR-expression
     logical-OR-expression ? expression : conditional-expression

   GNU extensions:

   conditional-expression:
     logical-OR-expression ? : conditional-expression
*/

Find this code on GitHub here.
We're taught that real-world software engineering requires complex design patterns and clever optimizations. But sometimes, the simple approach is the right one. A scanner doesn't need to be complicated, because the problem it solves isn't that complex. It's just pattern matching with, like, two pointers' worth of state.
First Success: Tokenization
With the scanner in place, I can now break source code down into tokens. I know I just got done gushing about how tokenizing is, like, super simple, but this felt like a real accomplishment. I'm still not over the idea that creating a real programming language is uber-leet shit, so seeing this output actually gave me goosebumps.
VAR 'var'
IDENTIFIER 'answer'
EQUAL '='
NUMBER '42'
SEMICOLON ';'
PRINT 'print'
STRING '"The answer is "'
...