Data Intellect
This blog post gives users with a basic knowledge of SQL a brief introduction to querying q/kdb databases, with a focus on filtering data. If you are experimenting on an on-disk database, please read all the way to the bottom before starting as there are some important points about accessing partitioned data structures!
q has its own syntax for querying tables, based on standard SQL. Most commonly used is the select query which features a familiar syntax to SQL but enables much faster queries and introduces functionality not typically available in SQL.
qsql has two notable differences to SQL:
In addition q features two data types for text fields; strings and symbols.
Symbols enable q to interpret text as a fixed length data type, making lookups much faster vs conventional strings. It is advised to use symbols wherever possible, the exception being large lists of distinct strings (e.g. free text fields).
Let’s start off by creating a q table.
q)tab:([]sym:`GOOG`AMAZ`APPL`GOOG;
price:3.1 1.1 2.1 1;
qty:1200 2000 2500 2500;
side:`b`b`s`s;
tradedBy:("George";"George";"George";"Paul"))
This has created an in-memory table with 5 columns and 4 rows. We are now ready try out some queries.
Selecting from a table is straightforward in q. Note the wildcard * is not necessary, q will select all columns by default.
/SQL equivalent: SELECT * FROM tab
q)select from tab
sym price qty side tradedBy
-----------------------------
GOOG 3.1 1200 b "George"
AMAZ 1.1 2000 b "George"
APPL 2.1 2500 s "George"
GOOG 1 2500 s "Paul"
Next, we can select specific columns from a table
/SQL equivalent: SELECT sym,price,qty FROM tab
q)select sym,price,qty from tab
sym price qty
---------------
GOOG 3.1 1200
AMAZ 1.1 2000
APPL 2.1 2500
GOOG 1 2500
q also allows you to rename columns in select statements using :
. This is equivalent of AS in SQL.
/SQL equivalent: SELECT sym,price AS px,qty AS qt FROM tab
q)select sym,px:price,qt:qty from tab
sym px qt
-------------
GOOG 3.1 1200
AMAZ 1.1 2000
APPL 2.1 2500
GOOG 1 2500
It is also possible to apply any built-in or custom q function to columns. For example here we are calculating value as quantity multiplied by price.
/SQL equivalent: SELECT sym,price*qty AS val FROM tab
q)select sym,val:price*qty from tab
sym val
---------
GOOG 3720
AMAZ 2200
APPL 5250
GOOG 2500
As in SQL, q features a where condition, to allow filtering on rows.
/SQL equivalent: SELECT * FROM tab WHERE sym='GOOG' q)select from tab where sym=`GOOG
sym price qty side tradedBy ----------------------------- GOOG 3.1 1200 b "George" GOOG 1 2500 s "Paul"
Or multiple conditions:
/SQL equivalent: SELECT * FROM tab WHERE sym='GOOG' AND qty=2500
q)select from tab where sym=`GOOG,qty=2500
sym price qty side tradedBy
-----------------------------
GOOG 1 2500 s "Paul"
Note this is similar to but not exactly the same as a logical AND. Here q applies each condition sequentially, filtering the results each time. This means it is most efficient to list the most selective condition first. q supports logical “and” and “or” operations. For simplicity, it is best to wrap each condition in brackets due to the parsing logic of q.
/SQL equivalent: SELECT * FROM tab WHERE price>2 OR side='b'
q)select from tab where (price>2) or (side=`b)
sym price qty side tradedBy
-----------------------------
GOOG 3.1 1200 b "George"
AMAZ 1.1 2000 b "George"
APPL 2.1 2500 s "George"
q supports a wide range of functions which can be applied to where conditions. Including;
q)select from tab where qty<1000
sym price qty side tradedBy
-----------------------------
GOOG 3.1 1200 b "George"
AMAZ 1.1 2000 b "George"
q)select from tab where sym in `AMAZ`APPL
sym price qty side tradedBy ----------------------------- AMAZ 1.1 2000 b "George" APPL 2.1 2500 s "George"
q)select from tab where price within 1.0 3.0
sym price qty side tradedBy
-----------------------------
AMAZ 1.1 2000 b "George"
APPL 2.1 2500 s "George"
GOOG 1 2500 s "Paul"
Filtering on string columns presents a slight challenge in q, as strings are handled as an array of characters. As such, a standard equals throws a length error:
q)select from tab where tradedBy="George"
'length
Instead we can avoid this issue by using the like operator
q)select from tab where tradedBy like "George"
sym price qty side tradedBy
-----------------------------
GOOG 3.1 1200 b "George"
AMAZ 1.1 2000 b "George"
APPL 2.1 2500 s "George"
The like operator can also be used to match against symbol columns, and supports regular expressions
q)select from tab where sym like "A*"
sym price qty side tradedBy
-----------------------------
AMAZ 1.1 2000 b "George"
APPL 2.1 2500 s "George"
q supports 4 regex wildcards:
Working with symbols containing special characters (such as *, @, -, /) can also present a challenge as they also function as operators. To get round this we need to create them as strings and then use `$ to cast to symbols.
We can show this by defining a new table, with exchange codes added to the sym column.
q)tab1:([]sym:`$("GOOG-q";"AMAZ-n";"APPL-n";"GOOG-q");
price:3.1 1.1 2.1 1;
qty:1200 2000 2500 2500;
side:`b`b`s`s;
tradedBy:("George";"George";"George";"Paul"));
q)select from tab1
sym price qty side tradedBy
-------------------------------
GOOG-q 3.1 1200 b "George"
AMAZ-n 1.1 2000 b "George"
APPL-n 2.1 2500 s "George"
GOOG-q 1 2500 s "Paul"
Regular queries will not work due to the – character
q)select from tab1 where sym=`AMAZ-n
'n
We can fix this by casting
q)select from tab1 where sym=`$"AMAZ-n" sym price qty side tradedBy ------------------------------- AMAZ-n 1.1 2000 b "George"
// or for a listq)select from tab1 where sym in `$("AMAZ-n";"APPL-n") sym price qty side tradedBy ------------------------------- AMAZ-n 1.1 2000 b "George"
APPL-n 2.1 2500 s "George"
or we can use like
q)select from tab1 where sym like "AMAZ-n"
sym price qty side tradedBy
-------------------------------
AMAZ-n 1.1 2000 b "George
"
To efficiently query a KDB database we will need some meta information such as the list of tables, the schema of each table and any special properties these tables might have.
Firstly to view all tables we can use the tables command
/mySQL equivalent: show tables
q)tables[]
`tab`tab1
Next we can use the meta command to list the columns (c) and their types (t), as well as any linked tables (f) and any attributes (a) that may have been added to the table (attributes are covered in the next section). See here for a full list of data types.
/mySQL equivalent: describe tab
q)meta tab
c | t f a
--------| -----
sym | s
price | f
qty | j
side | s
tradedBy| C
Decoding this table, we can tell that tab has 5 columns sym, price, qty, side, tradedBy and the corresponding types symbol, float, long, symbol, string. The reason why tradedBy is represented by a capital C is because q views strings as an array of characters. In meta all array types are represented by the capitalised version of the underlying data type.
In q we can apply specific attributes to columns. These attributes allow q to access in-built optimisations to improve lookup speeds.
We can apply attributes using the # operator. Here we are telling q that qty is a sorted column
q)tab2:([]sym:`GOOG`AMAZ`APPL`GOOG; price:3.1 1.1 2.1 1; qty:
`s
#1200 2000 2500 2500; side:
`b`b`s`s
; tradedBy:("George";"George";"George";"Paul"))
Attributes can be viewed using the meta command
q)meta tab2
c | t f a
--------| -----
sym | s
price | f
qty | j s
side | s
tradedBy| C
Having these attributes improves the lookup speed, however, attributes only apply to the first where condition, so these columns should be queried first.
/Good
q)select from tab2 where qty=2500,price>1.5
...
/Bad
q)select from tab2 where price>1.5,qty=2500
...
Because we have marked qty as sorted, q is able to utilise a binary search instead of a linear search to perform the lookup.
So far we have focused on in-memory data. There are three on disk table formats which impact how to query.
To check if a table is partitioned we can use the .Q.qp command
q).Q.qp quotes
1b
Then we read the first row of meta to get the partition domain – date.
q)meta quotes
c | t f a
-----| -----
date | d
sym | s p
time | p
src | s
bid | f
ask | f
bsize| i
asize| i
Important: All queries to a historical partitioned database must filter on the partitioned domain and this must always be the first condition of the where clause. Failure to do this will result in q looping over every available subdirectory causing a significant performance hit and a large memory overhead.
/Good
q)select from quotes where date=2019.08.01,bsize>1000
...
/Very Bad
q)select from quotes where bsize>1000,date=2018.08.01
...
In the positive example q queries only the 2019.08.01 directory, filters on bsize for that date, and pulls in the relevant rows from disk into memory.
In the counter example, q filters on bsize for every date directory, and then applies the date clause before loading in the relevant rows.
Whilst both queries return the same output, the counter example will loop through all date directories instead of just accessing the one we need. This makes the execution time n times longer, where n is the total number of partitions in the database.
Attributes still apply to a partitioned table, however they must be placed as the second condition. The meta of quote shows that sym has the `p attribute, so when filtering we will list this field second in the where clause.
/Good
q)select from quotes where date=2018.08.01,sym=`GOOG,bsize>1000
...
/Bad
q)select from quotes where date=2018.08.01,bsize>1000,sym=`GOOG
...
/Very Bad
q)select from quotes where bsize>1000,sym=`GOOG,date=2018.08.01
...
Thanks for reading! We will follow this up soon with how to apply basic aggregations.
Share this: