Compiler Repo

The past couple of weeks have been very hectic for me. I just started my new summer internship, so I needed some time to settle into my new workplace and living space. Although I’ve been lazy about writing my blog, I did not stop working on my compiler. I did have some success putting things together, such as basic functions, arithmetic, and variable declarations, but making these work was not a smooth journey. I was too excited to just start coding and did not have a solid plan for how I would approach the project. I now have a pretty clear idea of my compiler’s whole workflow, but that came only after learning the hard way that you should always have a plan first. So I encourage everyone to plan thoroughly before working on a big project like a compiler.

Anyway, enough rambling; I would like to talk about what I have so far. My compiler breaks down into four big parts: the lexer, syntax tree generator, grammar classifier, and code generator. Today I will talk about the lexer. As we discussed before, the lexer takes a string input and generates corresponding tokens based on the keywords in that input. This is the struct I created for a token, where each token holds the type and the actual value of the keyword:

// Token holds the classified type of a keyword and its raw text value.
type Token struct {
    Type  string
    Value []byte
}

To convert a line of code into a list of tokens, I wrote a very simple function that iterates through the keywords and appends a corresponding token to the result slice. As you can see, the way I catch keywords in a line of code is by splitting it on whitespace. This means code like “foo(a,b)” cannot be processed correctly; instead, you must write it with spaces between the keywords, like “foo ( a , b )”. This is something I will improve in the future, but for now it is good enough to test the rest of my compiler implementation!
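
To make the spacing requirement concrete, here is a tiny standalone snippet showing what strings.Fields (which the tokenizer below relies on) returns for the two styles of input:

package main

import (
	"fmt"
	"strings"
)

func main() {
	// Without spaces, the whole call is a single field, so no individual tokens can be produced.
	fmt.Println(strings.Fields("foo(a,b)")) // [foo(a,b)]

	// With spaces between the keywords, each piece becomes its own field for the tokenizer.
	fmt.Println(strings.Fields("foo ( a , b )")) // [foo ( a , b )]
}

With that caveat out of the way, here is the tokenizer itself: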

// tokenizer splits a line on whitespace and classifies each word into a Token.
func tokenizer(line string) []Token {
	words := strings.Fields(line)
	tokens := []Token{}
	for _, w := range words {
		if isOperator(w) {
			token := Token{
				Type:  "Operator",
				Value: []byte(w),
			}
			tokens = append(tokens, token)
		} else if isDeclaration(w) {
			token := Token{
				Type:  "Declaration",
				Value: []byte(w),
			}
			tokens = append(tokens, token)
		} else if isAssignment(w) {
			token := Token{
				Type:  "Assignment",
				Value: []byte(w),
			}
			tokens = append(tokens, token)
		} .....
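
The helper functions that classify keyword types are nothing fancy. Here is a minimal sketch of what isOperator, isDeclaration, and isAssignment can look like; each one is just a membership check against a small keyword set, and the sets here are only placeholders, not the final list:

func isOperator(w string) bool {
	// Placeholder operator set; parentheses and commas are lumped in here for now.
	switch w {
	case "+", "-", "*", "/", "(", ")", ",":
		return true
	}
	return false
}

func isDeclaration(w string) bool {
	// Placeholder declaration keywords.
	return w == "int" || w == "var"
}

func isAssignment(w string) bool {
	return w == "="
}

Since each helper is just a membership check, adding a new token type later is mostly a matter of adding another keyword list and another branch in the tokenizer.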

Besides those simple helper functions to classify keyword types, this is pretty much all I have for the lexer for now. I will keep you all posted on the progress :)


Ryan Jeon

Co-founder/CTO @ Immigo.io (Techstars 21)