This is a follow-up to How to Build a Regex Engine. This will use what we've developed, and expand on what we've done to create a full-fledged lexer generator.

First, what the heck is a lexer? Briefly, lexers are useful to parsers. Parsers use them to break an input text stream into lexemes tagged with symbols so they can identify the "type" of a particular chunk of text. If you don't know what one is yet, see the previous article from above, because it explains lexing/tokenization. You're really best off starting there in any case. Plus you'll get to go over some neat code in it. As I said, we're building on what we've done there.

Tokenizers/lexers are almost always used for parsers, but they don't have to be. They can be used any time you need to break up text into symbolic pieces.

We're going to build Rolex, the "gold plated" lexer. It has some unique features, hence the "gold plating".

For starters, it can create lexers that have no external dependencies, which is rare or maybe unheard of in the limited. It can generate its entire dependency code as source, and do this in any language that the CodeDOM will reasonably support, so it requires no external libraries. However, you can reference Rolex.exe in your projects like you would any assembly, and your tokenizer code can use that as an external library, if desired.

Next, it has a feature called "block ends" which makes it easy to match things with multicharacter ending conditions, like C block comments, markup comments, and CDATA sections.

It can also hide tokens, like comments and whitespace.

It exposes all of this through an attributed specification format: Notice the presence of both single and double quotes in the above. Single quotes denote a regular expression, while double quotes denote a literal.
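To make the ideas of tagged lexemes, hidden tokens, and block ends concrete, here is a minimal sketch in Python. It is not Rolex's generated code or its specification format; the `Rule`, `Token`, `RULES`, and `tokenize` names, the rule set, and the `#ERROR` symbol are all assumptions invented for illustration. It leans on Python's built-in `re` module rather than the regex engine from the previous article.

```python
import re
from typing import Iterator, NamedTuple, Optional


class Rule(NamedTuple):
    symbol: str                      # name the lexeme is tagged with
    pattern: str                     # regular expression that starts the match
    hidden: bool = False             # hidden tokens are matched but not reported
    block_end: Optional[str] = None  # literal text that terminates the token


class Token(NamedTuple):
    symbol: str
    value: str
    position: int


# Hypothetical rule set, invented for this example.
RULES = [
    Rule("comment", r"/\*", hidden=True, block_end="*/"),
    Rule("whitespace", r"\s+", hidden=True),
    Rule("identifier", r"[A-Za-z_][A-Za-z0-9_]*"),
    Rule("integer", r"[0-9]+"),
    Rule("operator", r"[-+*/=]"),
]


def tokenize(text: str, rules=RULES) -> Iterator[Token]:
    compiled = [(rule, re.compile(rule.pattern)) for rule in rules]
    pos = 0
    while pos < len(text):
        # Pick the longest non-empty match; ties go to the earliest rule.
        best_rule, best_end = None, pos
        for rule, regex in compiled:
            m = regex.match(text, pos)
            if m and m.end() > best_end:
                best_rule, best_end = rule, m.end()
        if best_rule is None:
            # Nothing matched: report a single character as an error token.
            yield Token("#ERROR", text[pos], pos)
            pos += 1
            continue
        end = best_end
        if best_rule.block_end is not None:
            # Block end: keep consuming until the literal terminator is found,
            # or to the end of the input if it never appears.
            stop = text.find(best_rule.block_end, end)
            end = len(text) if stop < 0 else stop + len(best_rule.block_end)
        if not best_rule.hidden:
            yield Token(best_rule.symbol, text[pos:end], pos)
        pos = end
```

Calling `list(tokenize("x = 42 /* note */ + y1"))` yields only the identifier, operator, and integer tokens, because the comment and whitespace rules are marked hidden. The block end spares the comment rule from needing a regular expression that describes everything up to `*/`: the pattern only has to recognize the opening `/*`.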