Analysis and parsing library for SQL queries.
A library for parsing SQL queries into well-typed data structures, and producing easily quantifiable analyses from said data structures.
Currently this includes support for Hive, Vertica, and Presto dialects of SQL. Parsing for each dialect is provided in their own package, as queryparser-{dialect}
queryparser
Parsing and analysis of Vertica, Hive, and Presto SQL.
Gentle introduction
Queryparser supports parsing for three sql-dialects (Vertica, Hive, and Presto).
Each dialect implements its own tokenization and parsing logic. There is a single abstract syntax tree (AST) for representing queries of any dialect. This AST is defined in Database/Sql/Type.hs and Database/Sql/Type/Query.hs.
The parsing logic produces an AST with table and column identifiers that are "raw" or optionally qualified. Frequently, it is desirable to convert the AST over raw names to an AST over resolved names, where identifiers are fully qualified. This transformation is called "name resolution" or simply "resolution". It requires as input the full list of columns in every table and the full list of tables in every schema, otherwise known as "catalog information".
An example of resolution:
SELECT column_c FROM table_t
is a query with raw namesSELECT schema_s.table_t.column_c FROM schema_s.table_t
is the same query with resolved names
Various utility functions ("analyses") have been defined for ASTs over resolved names, such as:
- what tables appear in the query?
- what columns appear in the query, per clause?
- what is the table/column lineage of a query?
- what sets of columns does a query compare for equality?
The analyses are the main value-add of this library.
Demo
Enough talk, let's show what it can do!
$ stack ghci
...
...> :set prompt "> "
> import Demo
> -- for the purposes of our demo, we have two tables: foo with columns a,b,c and bar with columns x,y,z
> demoAllAnalyses "SELECT * FROM foo" -- note that the SELECT * expands to a,b,c
Tables accessed:
public.foo
Columns accessed by clause:
public.foo.a SELECT
public.foo.b SELECT
public.foo.c SELECT
Joins:
no joins
Table lineage:
no tables modified
> demoAllAnalyses "SELECT * FROM bar" -- and here the SELECT * expands to x,y,z
Tables accessed:
public.bar
Columns accessed by clause:
public.bar.x SELECT
public.bar.y SELECT
public.bar.z SELECT
Joins:
no joins
Table lineage:
no tables modified
> demoAllAnalyses "SELECT x, count(1) FROM foo JOIN bar ON foo.a = bar.y WHERE z IS NOT NULL GROUP BY 1 ORDER BY 2 DESC, b"
Tables accessed:
public.bar
public.foo
Columns accessed by clause:
public.bar.x GROUPBY
public.bar.x SELECT
public.bar.y JOIN
public.bar.z WHERE
public.foo.a JOIN
public.foo.b ORDER
Joins:
public.bar.y <-> public.foo.a
Table lineage:
no tables modified
> -- let's play with some queries that modify table-data!
> demoTableLineage "INSERT INTO foo SELECT * FROM bar"
public.foo after the query depends on public.bar, public.foo before the query
> demoTableLineage "TRUNCATE TABLE foo"
public.foo no longer has data
> demoTableLineage "ALTER TABLE bar, foo RENAME TO baz, bar"
public.bar after the query depends on public.foo before the query
public.baz after the query depends on public.bar before the query
public.foo no longer has data
> -- let's explore a few subtler behaviors of the "joins" analysis (admittedly, something of a misnomer)
> demoJoins "SELECT * FROM foo JOIN bar ON a=x AND b+c = y+z"
public.bar.x <-> public.foo.a
public.bar.y <-> public.foo.b
public.bar.y <-> public.foo.c
public.bar.z <-> public.foo.b
public.bar.z <-> public.foo.c
> demoJoins "SELECT a FROM foo UNION SELECT x FROM bar"
public.bar.x <-> public.foo.a
Spin up your own ghci and paste in your own queries!
Requirements
To build, you need:
- stack
- docker (recommended)
OS X
You can install stack on OS X with brew:
brew install ghc haskell-stack
To install docker, use the default installer found on https://docs.docker.com/mac/
To allow stack to see the docker daemon, add eval "$(docker-machine env default)"
to your bashrc or equivalent. (This is following https://docs.docker.com/machine/reference/env/)
Linux & Other
Follow the directions at https://github.com/commercialhaskell/stack#how-to-install
Installation
Once you've got what you need, check out the repo and change to the directory.
Stack will download other dependencies for you:
stack setup
Now you can build the package with:
stack build
Or run the tests with:
stack test
Or run the benchmarks with:
stack bench
Or pull things up in ghci with:
stack ghci
Contributing
If you'd like to contribute to the repo, use the above installation instructions to get started.
When you're ready, make sure that the code compiles with the Development
flag, i.e.:
stack build --flag queryparser:development
Use
Mostly it boils down to this function:
parse :: Text -> Either ParseError SqlQuery
To parse some sql from the repl,
parse "SELECT 1;"
Areas of future interest
There is substantial room for future work in Queryparser. For more details, see Areas of future interest.