

 
Cower
Created : 15 April 2013
Edited : 15 April 2013
System : PSP

Repurposing A Lexer

Take some sad code and make it better

It's been a while since I used the blog thing on here. Been a while since I wrote a blog post in general, though that's usually the case. Anyway, the point of this post is simple: I had an old lexer I wrote for the purpose of lexing BlitzMax source code and repurposed it for other, less nefarious deeds.

So in my engine, I'd like to be able to just load materials from source files. Whether these source files will later be compiled into something that doesn't require expensive parsing is another matter, and not really important. Files with contents like the following, however, do need to be parsed at some point:
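
Something roughly along these lines, say (the block syntax and the material name here are just a sketch; depthwrite and blend are the kind of parameters I mean):

material "glow_pass" {       // hypothetical syntax and name
    blend ONE ONE
    depthwrite false
    // ...more parameters...
}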



I'd originally written Sparse to do this work for me, but Sparse is a little limited in that it's simply a map of key/value entries, optionally contained within further maps. So, while "depthwrite false" works fine in Sparse, because the key is "depthwrite" and the value is "false", it's not really practical for "blend ONE ONE", where the value ends up being the single string "ONE ONE" rather than an array of values. This could be fixed, but the purpose of Sparse is to be extremely small and well-defined, so I didn't want to throw a wrench in that. After all, most Sparse implementations are about 200 to 300 lines; they're not made for extensive parsing work.
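
In data-structure terms, the mismatch looks something like this (a rough sketch with made-up type names, not Sparse's actual API):

#include <map>
#include <string>
#include <vector>

// Roughly what a Sparse document boils down to: each key maps to exactly one
// string value, so "depthwrite" -> "false" is fine. (Nested maps are possible
// in Sparse but omitted here.)
using sparse_map = std::map<std::string, std::string>;

// What a material directive actually wants: a name plus any number of values,
// so "blend ONE ONE" becomes { "blend", { "ONE", "ONE" } }.
struct directive {
    std::string              name;
    std::vector<std::string> args;
};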

The usual thing most people would suggest, then, is "use flex/bison/yacc/peg/greg/treetop", which is all well and good if you're just building a standalone parser. If you're suggesting that to someone who needs to do this quickly and at runtime (e.g., during a loading screen), well, you should sew your mouth shut and never speak again. Ever. These are practical solutions to the problem, but not practical for games unless you're compiling something ahead of time to be used by the engine (entirely practical, by the way). I obviously didn't want to do this.

Which takes us back to the old lexer I had for BMax source: it's actually pretty good at what it does. And it's fast as hell. It's probably slightly slower now that I tore out some of the original code and replaced it with STL niceties (e.g., no more manually allocating an array for tokens), but it's still good enough. If worst comes to worst and profiling shows it's an issue, I can work on it later.

So, getting the lexer working was mostly painless. I tore out a lot of the old tokens and token-pairs (pairs being two tokens that get merged into one when post-processing the tokens after they've all been read), removed some of the old code, and replaced a few other things (e.g., comment parsing, because apostrophes and Rem/EndRem are stupid). Ultimately, it's a very small, generic lexer aimed at parsing C-style stuff.

Given those changes, the above example is easy to parse, but because the lexer spits out tokens generic enough for most other purposes, it's entirely reasonable to also lex something like "local x = [](y) { return y ? "foobar" : "gilgamesh"; }; // wooperton". Technically, it could be used to lex anything from JSON to Lua to Javascript to probably a few different programming languages and a handful of other things (but not brainfuck and not who knows what else). It doesn't care about whitespace, however, so the Python/YAML/etc.-lovers are out in the cold there (though each token does store its line and column, so I suppose it's possible).

Point of all this? Don't throw away your old code. It's very rarely useful, but at least the one time it is, you've got it handy.

And a quick idea of what the lexer output looks like if thrown into a console:
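
In spirit, the dump is just a loop over the tokens printing kind, position, and text; this is a sketch with made-up names rather than the lexer's real output format:

#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical token record; the real names differ, but each token does carry
// its line and column.
enum token_kind { TOK_ID, TOK_ASSIGN, TOK_STRING, TOK_SEMICOLON, TOK_LINE_COMMENT };

struct token {
    token_kind  kind;
    std::string text;   // the token's source text
    size_t      line;
    size_t      column;
};

// Print one line per token: kind, line:column, and the source text.
static void dump_tokens(const std::vector<token> &tokens) {
    for (const token &tok : tokens) {
        std::printf("kind=%d  %zu:%zu  '%s'\n",
                    int(tok.kind), tok.line, tok.column, tok.text.c_str());
    }
}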

 

Comments


Monday, 15 April 2013, 05:19
steve_ancell
The Sparse link is dead; it only contains https:///
Monday, 15 April 2013, 07:21
Cower
Well, fixed that.
Monday, 15 April 2013, 23:06
Cower
Well, I spent the day mostly rewriting a bunch of things in the lexer. So, it now supports UTF-8 extremely well, no longer depends on pointers to source data, no longer depends on token singles or pairs*, and handles escapes and so on inside strings. Incidentally, this cut about 300 lines of code from the lexer, since most of it simply didn't need to exist.
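
The escape handling is the usual sort of loop; something like this sketch of the idea, not the lexer's actual code:

#include <string>

// Sketch: read a double-quoted string, starting just past the opening quote,
// translating the common escapes and copying everything else (including UTF-8
// bytes) straight through.
static std::string read_string(std::string::const_iterator &at,
                               std::string::const_iterator end) {
    std::string out;
    while (at != end && *at != '"') {
        char c = *at++;
        if (c == '\\' && at != end) {
            const char esc = *at++;
            switch (esc) {
            case 'n':  c = '\n'; break;
            case 't':  c = '\t'; break;
            case '"':  c = '"';  break;
            case '\\': c = '\\'; break;
            default:   c = esc;  break; // unknown escape: keep the character
            }
        }
        out += c;
    }
    if (at != end)
        ++at; // skip the closing quote
    return out;
}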

* Pairs and singles were arbitrarily matched ranges of tokens and characters, respectively. A single was an individual string, like "false", optionally matched case-sensitively. The lexer would match these as it was parsing, so it was fairly simple.

Pairs, on the other hand, were a grouping of two tokens within N characters of each other (for BlitzMax, this was 1 character, for stuff like "End Method" and "EndMethod") that would subsequently be merged into a single token. Pairs were something of a processing problem and something of a memory problem.

First, pairs could rewind into each other, so if one pair resulted in a merge, it might then rewind and see if the new token could be merged with the next token again to create yet another token. The way this worked was pretty convenient because you might have the tokens ">", ">", and "=" with zero distance between each other. Because the tokens were merged in reverse, you'd first tell ">" and "=" to combine into TOK_GREATER_EQUAL (">="), then set the rewind flag since it could possibly merge down again. Once the lexer rewinds, it sees you've got ">" and ">=", and you've got a pair that merges those into TOK_SHIFT_RIGHT_ASSIGN, so it does that. It can't rewind again after that merge, so it continues down the list of tokens.
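
Conceptually, the old pair pass boiled down to something like this (a rough sketch with made-up names, not the original code):

#include <cstddef>
#include <iterator>
#include <list>

// Hypothetical token kinds and pair table, for illustration only.
enum token_kind {
    TOK_EQUAL, TOK_GREATER, TOK_GREATER_EQUAL,
    TOK_SHIFT_RIGHT, TOK_SHIFT_RIGHT_ASSIGN,
};

struct token {
    token_kind kind;
    size_t     column;     // column of the token's first character
    size_t     end_column; // one past its last character
};

// A pair: a 'left' token followed by a 'right' token within 'max_distance'
// characters collapses into a single 'merged' token.
struct token_pair {
    token_kind left, right, merged;
    size_t     max_distance;
};

static const token_pair k_pairs[] = {
    { TOK_GREATER, TOK_EQUAL,         TOK_GREATER_EQUAL,      0 },
    { TOK_GREATER, TOK_GREATER_EQUAL, TOK_SHIFT_RIGHT_ASSIGN, 0 },
    { TOK_GREATER, TOK_GREATER,       TOK_SHIFT_RIGHT,        0 },
};

// Walk the token list back to front, merging pairs as they're found. Stepping
// toward the front after a merge plays the role of the rewind described above:
// the freshly merged token is immediately checked against its predecessor.
static void merge_pairs(std::list<token> &tokens) {
    if (tokens.size() < 2)
        return;
    auto it = std::prev(tokens.end(), 2);
    for (;;) {
        auto next = std::next(it);
        for (const token_pair &p : k_pairs) {
            if (it->kind == p.left && next->kind == p.right &&
                next->column - it->end_column <= p.max_distance) {
                it->kind       = p.merged;
                it->end_column = next->end_column;
                tokens.erase(next);
                break;
            }
        }
        if (it == tokens.begin())
            break;
        --it;
    }
}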

From a convenience perspective, this is nice, and it's not terrible for lexing since most tokens don't get merged, so the complexity isn't very high. There was one particular side effect of storing tokens in a vector, though: a merge typically requires a deletion in the middle of the vector, which is slow as hell. Some of it is mitigated by std::move'ing the prior tokens ahead and resizing the vector, but doing that many times over leads to some downright lousy performance (assuming you merge fairly often, this might as well be O(N^2)). So, the vector got ditched for a list, which means no simple random access and so on.
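
The cost difference is just which container that one erase hits; a tiny sketch, illustrative only:

#include <list>
#include <vector>

// Erasing the merged-away token from the middle of a std::vector shifts every
// later element down a slot: O(N) per merge, roughly O(N^2) over a merge-heavy
// pass. Erasing from a std::list just unlinks one node: O(1) per merge.
template <typename Container>
void drop_merged_token(Container &tokens, typename Container::iterator victim) {
    tokens.erase(victim);
}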

The real problem with pairs is that they were originally in the lexer because BlitzMax depends heavily on keywords and the lexer is designed to ignore whitespace whenever possible. So what you end up with is tokens like TOK_END_KW and TOK_REM_KW sitting next to each other, one space apart. Either I would have to postprocess the tokens after running the lexer and merge these, or the lexer could do it for me. I chose the latter option, so pairs were added and the lexer could then merge TOK_END_KW and TOK_REM_KW into TOK_ENDREM_KW. Handy for BlitzMax, not so much for a language that depends mostly on punctuation.

So, the pairs were stripped out in addition to the singles. Most singles were keywords, and since the lexer now only recognizes three keywords (true, false, and null), it's easy to just check for those when reading an identifier and replace the token as needed. Other cases where merges might be useful, like >>, >=, and so on, are easily handled by simply peeking at the next character after encountering the first and adjusting the token/range accordingly. In this case, it's just an increment, because variants of tokens with the same initial character are kept together (so TOK_GREATER would be 1, TOK_GREATER_EQUAL = 2, and TOK_SHIFT_RIGHT = 3):
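
Something like this sketch (the helper and struct are hypothetical; the kind numbering is the one above):

#include <cstddef>

// Variants sharing a first character sit next to each other in the enum, so
// upgrading a token after a peek is just an increment or two.
enum token_kind {
    TOK_GREATER       = 1, // >
    TOK_GREATER_EQUAL = 2, // >=
    TOK_SHIFT_RIGHT   = 3, // >>
};

struct token {
    token_kind kind;
    size_t     length; // number of source characters the token covers
};

// 'at' points just past a '>' that was read; 'end' is one past the source end.
static token read_greater(const char *at, const char *end) {
    token tok = { TOK_GREATER, 1 };
    if (at != end && *at == '=') {
        tok.kind = token_kind(tok.kind + 1); // '>' becomes '>='
        tok.length += 1;
    } else if (at != end && *at == '>') {
        tok.kind = token_kind(tok.kind + 2); // '>' becomes '>>'
        tok.length += 1;
    }
    return tok;
}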



^ That's basically the footnote from hell.