Mariano Gappa
14 Jun 2019
This article aims to be the simplest introduction to constructing an LL(1) parser in Go, in this case for parsing SQL queries. It assumes minimal programming competence (functions, structs, ifs and for-loops).
Here’s the complete parser repository if you want to skip to results:
github.com/marianogappa/sqlparser
To make things simple we’re gonna descope sub-selects, functions, complex nested expressions and other features that all SQL flavours support. Those features get really involved quickly with the strategy we’re gonna use.
A parser has two parts:
1. the Lexical Analysis: a.k.a. the “Tokeniser”
2. the Syntactic Analysis: the creation of the AST

## Lexical Analysis
Let’s define it by example. “Tokenising” the following query:
```
SELECT id, name FROM 'users.csv'
```
Means extracting the “tokens” that form this query. The result of the tokeniser would be something like:
```
[]string{"SELECT", "id", ",", "name", "FROM", "'users.csv'"}
```

## Syntactic Analysis
This part is where we actually look at the tokens, make sure they make sense, and interpret them to construct a resulting `struct` that represents the query in a way that is convenient for the application that is gonna use it (e.g. for executing the query, or syntax-highlighting it). After this step, we’d end up with something like:
```
query{
	Type:      "Select",
	TableName: "users.csv",
	Fields:    []string{"id", "name"},
}
```
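The article never spells out the full definition of that struct, but a minimal version might look like the following sketch (the field names follow the example above; `Type` could equally be a small enum, which is what the tests further down hint at with `query.Select`):

```
// A minimal sketch of the "query struct". A real one would grow extra
// fields (WHERE conditions, UPDATE values, etc.) as the grammar grows.
type Query struct {
	Type      string   // e.g. "Select", "Update", "Delete"
	TableName string   // e.g. "users.csv"
	Fields    []string // e.g. []string{"id", "name"}
}
```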
There are a lot of ways in which parsing can fail, so it’s convenient to do the two steps at the same time and stop as soon as something goes wrong.
## Strategy
We’ll define a `parser` that looks like this:
```
type parser struct {
	sql   string      // The query to parse
	i     int         // Where we are in the query
	query query.Query // The "query struct" we'll build
	step  step        // What's this? Read on...
}
```
```
// Main function that returns the "query struct" or an error
func (p *parser) Parse() (query.Query, error) {}

// A "look-ahead" function that returns the next token to parse
func (p *parser) peek() string {}

// same as peek(), but advancing our "i" index
func (p *parser) pop() string {}
```
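For context, here’s a hypothetical sketch of how calling code could drive this parser. The wrapper name `ParseSQL` is made up for illustration (the actual repository exposes its own entry point), and `fmt`/`log` are assumed to be imported:

```
// Hypothetical convenience wrapper around the parser struct above.
func ParseSQL(sql string) (query.Query, error) {
	p := &parser{sql: sql}
	return p.Parse()
}

func main() {
	q, err := ParseSQL("SELECT id, name FROM 'users.csv'")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%+v\n", q) // the "query struct"
}
```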
Intuitively, the first thing we’d do is peek() the first token. In a basic SQL grammar, only a few initial tokens are valid: SELECT, UPDATE, DELETE, etc.; anything else is already an error. The code would look something like this:
```
switch strings.ToUpper(p.peek()) {
case "SELECT":
	p.query.Type = "SELECT" // start building the "query struct"
	p.pop()
	// TODO continue with SELECT query parsing...
case "UPDATE":
	// TODO handle UPDATE
// TODO other cases...
default:
	return p.query, fmt.Errorf("invalid query type")
}
```
And we can basically fill in the `TODO`s and profit! However, the diligent reader will see that the code for parsing the whole `SELECT` query can get messy really quickly, and we have many types of queries to parse. We’re gonna need some structure.
## Finite-state Machines
[FSMs](https://en.wikipedia.org/wiki/Finite-state_machine) are a super interesting topic, but we’re not here to get a CS degree. Let’s just focus on what we need.
At any given point (instead of “point” let’s call it “node”) in our parsing journey, only a few tokens are valid, and upon finding these tokens we advance to new nodes where different tokens are valid, and so on until we finish parsing our query. We can visualise these node relationships as a directed graph:
![sql_parser_graph.png](https://functionalworks-backend--prod.s3.amazonaws.com/logos/06ab5bd0aaa7ae41d8f685dbe784ee7b)
The node transitions can be defined with a simpler table, though:
![sql_parser_table.png](https://functionalworks-backend--prod.s3.amazonaws.com/logos/8b02698b29f7839bab251ed1003ad72d)
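Each node in that table becomes a value of the `step` type from our parser struct. Here’s a sketch of how those steps could be declared; the names beyond the ones used in the code below are hypothetical, and the full list depends on how much of the grammar you implement:

```
type step int

const (
	stepType step = iota
	stepSelectField
	stepSelectComma
	stepSelectFrom
	stepSelectFromTable
	stepUpdateTable
	stepUpdateSet
	stepUpdateField
	stepUpdateEquals
	stepUpdateValue
	stepUpdateComma
	// ... one step per node in the graph above
)
```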
We can directly translate this table to a really large switch statement. We’ll use that sneaky `parser.step` property we defined before:
```
func (p *parser) Parse() (query.Query, error) {
	p.step = stepType // initial step
	for p.i < len(p.sql) {
		nextToken := p.peek()

		switch p.step {
		case stepType:
			switch nextToken {
			case "UPDATE":
				p.query.Type = "UPDATE"
				p.step = stepUpdateTable
			// TODO cases of other query types
			}
		case stepUpdateSet:
			// ...
		case stepUpdateField:
			// ...
		case stepUpdateComma:
			// ...
		}
		p.pop()
	}
	return p.query, nil
}
```
And there we go! Note that some steps might conditionally cycle back to previous ones, like a comma on the `SELECT` field definition. This strategy scales well for basic parsers. As the grammar grows complex, though, the number of states will increase dramatically, so it can get tedious to write. I recommend testing as you code; more on that below.
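For instance, using the hypothetical step names sketched earlier, the comma step could send the parser back to the field step, which is how a list like `id, name, email` gets consumed. This case would slot into the big switch inside `Parse()`:

```
case stepSelectComma:
	commaOrFrom := p.peek()
	switch commaOrFrom {
	case ",":
		p.step = stepSelectField // cycle back: another field follows
	case "FROM":
		p.step = stepSelectFromTable // next token should be the table name
	default:
		return p.query, fmt.Errorf("expected comma or FROM, got %q", commaOrFrom)
	}
```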
## Peek() implementation
Remember that we need to implement both `peek()` and `pop()`. Since they do almost the same thing, let’s use an auxiliary function to keep things [DRY](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself). Also, `pop()` should advance the index past any trailing whitespace, so that the next `peek()` starts at an actual token.
```
func (p *parser) peek() string {
	peeked, _ := p.peekWithLength()
	return peeked
}

func (p *parser) pop() string {
	peeked, len := p.peekWithLength()
	p.i += len
	p.popWhitespace()
	return peeked
}

func (p *parser) popWhitespace() {
	for ; p.i < len(p.sql) && p.sql[p.i] == ' '; p.i++ {
	}
}
```
Here’s a list of tokens that we might want to peek. Note that longer tokens such as `>=` are listed before their prefixes (`>`), because the matching loop below simply returns the first reserved word that matches:
```
var reservedWords = []string{
	"(", ")", ">=", "<=", "!=", ",", "=", ">", "<",
	"SELECT", "INSERT INTO", "VALUES", "UPDATE",
	"DELETE FROM", "WHERE", "FROM", "SET",
}
```
In addition to those, we might come across quoted strings or plain identifiers (e.g. field names). Here’s a hopefully complete `peekWithLength()` implementation:
```
func (p *parser) peekWithLength() (string, int) {
	if p.i >= len(p.sql) {
		return "", 0
	}
	for _, rWord := range reservedWords {
		token := p.sql[p.i:min(len(p.sql), p.i+len(rWord))]
		upToken := strings.ToUpper(token)
		if upToken == rWord {
			return upToken, len(upToken)
		}
	}
	if p.sql[p.i] == '\'' { // Quoted string
		return p.peekQuotedStringWithLength()
	}
	return p.peekIdentifierWithLength()
}
```
The remaining functions are trivial and left as an exercise for the reader. If you’re curious, the repository linked at the beginning of this article contains the full implementation; I intentionally won’t link it again here.
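If you get stuck, here is one possible sketch of those helpers. The set of characters allowed in identifiers is an assumption (letters, digits, `_`, `*` and `.`); also note that `min` for ints only became a builtin in Go 1.21, so on older versions you’d define it yourself:

```
// min wasn't a builtin before Go 1.21.
func min(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// Assumes p.sql[p.i] is the opening single quote.
func (p *parser) peekQuotedStringWithLength() (string, int) {
	for i := p.i + 1; i < len(p.sql); i++ {
		if p.sql[i] == '\'' {
			// Return the contents without quotes; the length includes both quotes.
			return p.sql[p.i+1 : i], (i - p.i) + 1
		}
	}
	return "", 0 // unterminated string
}

func isIdentifierChar(c byte) bool {
	return c == '_' || c == '*' || c == '.' ||
		(c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')
}

func (p *parser) peekIdentifierWithLength() (string, int) {
	i := p.i
	for i < len(p.sql) && isIdentifierChar(p.sql[i]) {
		i++
	}
	return p.sql[p.i:i], i - p.i
}
```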
## Final validation
The parser might find the end of the string before arriving at a complete query definition. It’s probably a good idea to implement a `parser.validate()` function that looks at the generated “query” struct, and returns an error if it’s incomplete or otherwise wrong.
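A minimal sketch of what such a function could check, assuming the hypothetical query struct fields from earlier (the exact rules depend on which query types you support):

```
func (p *parser) validate() error {
	if p.query.Type == "" {
		return fmt.Errorf("query type cannot be empty")
	}
	if p.query.TableName == "" {
		return fmt.Errorf("table name cannot be empty")
	}
	// Adjust the comparison to whatever convention you used when setting Type.
	if p.query.Type == "SELECT" && len(p.query.Fields) == 0 {
		return fmt.Errorf("at least one SELECT field is required")
	}
	return nil
}
```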
## Testing
Go’s table-driven testing pattern lends itself beautifully to our case:
```
type testCase struct {
	Name     string      // description of the test
	SQL      string      // input sql e.g. "SELECT a FROM 'b'"
	Expected query.Query // expected resulting "query" struct
	Err      error       // expected error result
}
```
Example tests:
```
ts := []testCase{
	{
		Name:     "empty query fails",
		SQL:      "",
		Expected: query.Query{},
		Err:      fmt.Errorf("query type cannot be empty"),
	},
	{
		Name:     "SELECT without FROM fails",
		SQL:      "SELECT",
		Expected: query.Query{Type: query.Select},
		Err:      fmt.Errorf("table name cannot be empty"),
	},
	...
```
Run the test cases like this:
```
for _, tc := range ts {
	t.Run(tc.Name, func(t *testing.T) {
		actual, err := Parse(tc.SQL)
		if tc.Err != nil && err == nil {
			t.Errorf("Error should have been %v", tc.Err)
		}
		if tc.Err == nil && err != nil {
			t.Errorf("Error should have been nil but was %v", err)
		}
		if tc.Err != nil && err != nil {
			require.Equal(t, tc.Err, err, "Unexpected error")
		}
		if len(actual) > 0 {
			require.Equal(t, tc.Expected, actual[0], "Query didn't match expectation")
		}
	})
}
```
I’m using [testify](https://github.com/stretchr/testify) because it provides a diff output when the query structs don’t match.
## Going deeper
This experiment is well-suited for:
- Learning LL(1) parser algorithms
- Parsing simple custom grammars with zero dependencies
However, this approach can get tedious and it’s somewhat limiting. Think about how to parse arbitrarily complex compound expressions (e.g. `sqrt(a) = (1 * (2 + 3))`).
For a more powerful parsing model, look into [parser combinators](https://en.wikipedia.org/wiki/Parser_combinator). If you’d rather generate the parser from a grammar definition, [goyacc](https://godoc.org/golang.org/x/tools/cmd/goyacc) is a popular Go port of yacc.
I recommend this [very interesting talk](https://www.youtube.com/watch?v=HxaD_trXwRE) by Rob Pike on Lexical Scanning.
[Recursive descent parsing](http://www.cs.binghamton.edu/~zdu/parsdemo/recintro.html) is another parsing approach.
## Why I wrote this
Recently, I’ve decided to centralise my data into a repository of CSVs. As a bonus, it’d give me a chance to learn [React](https://reactjs.org/) better by creating a UI for [CRUD](https://en.wikipedia.org/wiki/Create,_read,_update_and_delete)ing the data. When I had to decide on an interface for communicating CRUD operations between frontend and backend, I felt SQL was a natural language for it (and I already know it well).
It seems that, although there are many libraries for reading CSV files with SQL, there aren’t many that support write operations (particularly [DDL statements](https://en.wikipedia.org/wiki/Data_definition_language)). A colleague recommended uploading the files into an in-memory [SQLite database](https://www.sqlite.org/index.html), running the SQL and then exporting back to CSV; a fine idea, since performance wasn’t a concern for me at all. In the end, I thought to myself: didn’t you always want to write a SQL parser? How hard can it be?
Turns out writing a (basic) parser is very simple! But I bet it can appear daunting without a good tutorial that is as simple as can be.
I hope this can be that tutorial for you. [KISS!](https://en.wikipedia.org/wiki/KISS_principle)