An Introduction to qsql

Blog kdb+ 9 Aug 2019

Data Intellect

This blog post gives users with a basic knowledge of SQL a brief introduction to querying q/kdb databases, with a focus on filtering data. If you are experimenting on an on-disk database, please read all the way to the bottom before starting as there are some important points about accessing partitioned data structures!

q has its own syntax for querying tables, based on standard SQL. Most commonly used is the select query which features a familiar syntax to SQL but enables much faster queries and introduces functionality not typically available in SQL.

qsql has two notable differences to SQL:

  • q tables have well defined record orders
  • tables are stored as a collection of columns (so vectorized column operations are fast)

In addition q features two data types for text fields; strings and symbols.

  • “strings” Are interpreted as an array of characters
  • `symbols Are an enumerated type, pointing to a text in a central store

Symbols enable q to interpret text as a fixed length data type, making lookups much faster vs conventional strings. It is advised to use symbols wherever possible, the exception being large lists of distinct strings (e.g. free text fields).

Let’s start off by creating a q table.

q)tab:([]sym:`GOOG`AMAZ`APPL`GOOG;
         price:3.1 1.1 2.1 1;
         qty:1200 2000 2500 2500;
         side:`b`b`s`s;
         tradedBy:("George";"George";"George";"Paul"))

This has created an in-memory table with 5 columns and 4 rows. We are now ready try out some queries.

Basic Selects

Selecting from a table is straightforward in q. Note the wildcard * is not necessary, q will select all columns by default.

/SQL equivalent: SELECT * FROM tab
q)select from tab
sym  price qty  side tradedBy
-----------------------------
GOOG 3.1   1200 b    "George"
AMAZ 1.1   2000 b    "George"
APPL 2.1   2500 s    "George"
GOOG 1     2500 s    "Paul"

Next, we can select specific columns from a table

/SQL equivalent: SELECT sym,price,qty FROM tab
q)select sym,price,qty from tab
sym  price qty
---------------
GOOG 3.1   1200
AMAZ 1.1   2000
APPL 2.1   2500
GOOG 1     2500

q also allows you to rename columns in select statements using : . This is equivalent of AS in SQL.

/SQL equivalent: SELECT sym,price AS px,qty AS qt FROM tab
q)select sym,px:price,qt:qty from tab
sym  px  qt
-------------
GOOG 3.1 1200
AMAZ 1.1 2000
APPL 2.1 2500
GOOG 1   2500

It is also possible to apply any built-in or custom q function to columns. For example here we are calculating value as quantity multiplied by price.

/SQL equivalent: SELECT sym,price*qty AS val FROM tab
q)select sym,val:price*qty from tab
sym  val
---------
GOOG 3720
AMAZ 2200
APPL 5250
GOOG 2500

Filtering

As in SQL, q features a where condition, to allow filtering on rows.

/SQL equivalent: SELECT * FROM tab WHERE sym='GOOG'
q)select from tab where sym=`GOOG
sym  price qty  side tradedBy
-----------------------------
GOOG 3.1   1200 b    "George"
GOOG 1     2500 s    "Paul"

Or multiple conditions:

/SQL equivalent: SELECT * FROM tab WHERE sym='GOOG' AND qty=2500
q)select from tab where sym=`GOOG,qty=2500
sym  price qty  side tradedBy
-----------------------------
GOOG 1     2500 s    "Paul"

Note this is similar to but not exactly the same as a logical AND. Here q applies each condition sequentially, filtering the results each time. This means it is most efficient to list the most selective condition first. q supports logical “and” and “or” operations. For simplicity, it is best to wrap each condition in brackets due to the parsing logic of q.

/SQL equivalent: SELECT * FROM tab WHERE price>2 OR side='b'
q)select from tab where (price>2) or (side=`b)                                                                
sym  price qty  side tradedBy
-----------------------------
GOOG 3.1   1200 b    "George"
AMAZ 1.1   2000 b    "George"
APPL 2.1   2500 s    "George"

q supports a wide range of functions which can be applied to where conditions. Including;

  • Greater/Less than:
q)select from tab where qty<1000
sym  price qty  side tradedBy
-----------------------------
GOOG 3.1   1200 b    "George"
AMAZ 1.1   2000 b    "George"
  • Inclusion in a list:
q)select from tab where sym in `AMAZ`APPL
sym  price qty  side tradedBy
-----------------------------
AMAZ 1.1   2000 b    "George"
APPL 2.1   2500 s    "George" 
  • Within a range:
q)select from tab where price within 1.0 3.0
sym  price qty  side tradedBy
-----------------------------
AMAZ 1.1   2000 b    "George"
APPL 2.1   2500 s    "George"
GOOG 1     2500 s    "Paul" 

Filtering on string columns presents a slight challenge in q, as strings are handled as an array of characters. As such, a standard equals throws a length error:

q)select from tab where tradedBy="George"
'length

Instead we can avoid this issue by using the like operator

q)select from tab where tradedBy like "George"
sym  price qty  side tradedBy
-----------------------------
GOOG 3.1   1200 b    "George"
AMAZ 1.1   2000 b    "George"
APPL 2.1   2500 s    "George"

The like operator can also be used to match against symbol columns, and supports regular expressions

q)select from tab where sym like "A*"
sym  price qty  side tradedBy
-----------------------------
AMAZ 1.1   2000 b    "George"
APPL 2.1   2500 s    "George"

q supports 4 regex wildcards:

  • ? Matches any character
  • * Matches any sequence of characters
  • [] Matches a list of alternative
  • ^ List of alternatives not to be matched

Working with symbols containing special characters (such as *, @, -, /) can also present a challenge as they also function as operators. To get round this we need to create them as strings and then use `$ to cast to symbols.

We can show this by defining a new table, with exchange codes added to the sym column.

q)tab1:([]sym:`$("GOOG-q";"AMAZ-n";"APPL-n";"GOOG-q");
         price:3.1 1.1 2.1 1;
         qty:1200 2000 2500 2500;
         side:`b`b`s`s;
         tradedBy:("George";"George";"George";"Paul"));

q)select from tab1
sym    price qty  side tradedBy
-------------------------------
GOOG-q 3.1   1200 b    "George"
AMAZ-n 1.1   2000 b    "George"
APPL-n 2.1   2500 s    "George"
GOOG-q 1     2500 s    "Paul"

Regular queries will not work due to the – character

q)select from tab1 where sym=`AMAZ-n
 'n

We can fix this by casting

q)select from tab1 where sym=`$"AMAZ-n"
sym    price qty  side tradedBy
-------------------------------
AMAZ-n 1.1   2000 b    "George"
// or for a list
q)select from tab1 where sym in `$("AMAZ-n";"APPL-n")
sym    price qty  side tradedBy
-------------------------------
AMAZ-n 1.1   2000 b    "George"
APPL-n 2.1   2500 s    "George"

or we can use like

q)select from tab1 where sym like "AMAZ-n"
sym    price qty  side tradedBy
-------------------------------
AMAZ-n 1.1   2000 b    "George"

Meta Data

To efficiently query a KDB database we will need some meta information such as the list of tables, the schema of each table and any special properties these tables might have.

Firstly to view all tables we can use the tables command

/mySQL equivalent: show tables
q)tables[]
 `tab`tab1

Next we can use the meta command to list the columns (c) and their types (t), as well as any linked tables (f) and any attributes (a) that may have been added to the table (attributes are covered in the next section). See here for a full list of data types.

/mySQL equivalent: describe tab
q)meta tab
c       | t f a
--------| -----
sym     | s
price   | f
qty     | j
side    | s
tradedBy| C

Decoding this table, we can tell that tab has 5 columns sym, price, qty, side, tradedBy and the corresponding types symbol, float, long, symbol, string. The reason why tradedBy is represented by a capital C is because q views strings as an array of characters. In meta all array types are represented by the capitalised version of the underlying data type.

Attributes

In q we can apply specific attributes to columns. These attributes allow q to access in-built optimisations to improve lookup speeds.

  • s -> sorted attribute
  • u -> unique attribute
  • p -> parted attribute
  • g -> grouped attribute

We can apply attributes using the # operator. Here we are telling q that qty is a sorted column

q)tab2:([]sym:`GOOG`AMAZ`APPL`GOOG;
          price:3.1 1.1 2.1 1;
          qty:`s#1200 2000 2500 2500;
          side:`b`b`s`s;
          tradedBy:("George";"George";"George";"Paul"))

Attributes can be viewed using the meta command

q)meta tab2
c       | t f a
--------| -----
sym     | s
price   | f
qty     | j   s
side    | s
tradedBy| C

Having these attributes improves the lookup speed, however, attributes only apply to the first where condition, so these columns should be queried first.

/Good
q)select from tab2 where qty=2500,price>1.5
...

/Bad
q)select from tab2 where price>1.5,qty=2500
...

Because we have marked qty as sorted, q is able to utilise a binary search instead of a linear search to perform the lookup.

On Disk Databases

So far we have focused on in-memory data. There are three on disk table formats which impact how to query.

  • Flat tables- are saved in a single binary file, and are loaded fully into memory when the database is loaded.
  • Splayed tables- are split and saved in a directory containing one file per column. When the database is loaded, q will only pull the schema into memory. When querying q will bring the relevant column into memory, perform any where conditions, and then bring the requested data into memory. Whilst these are slightly slower than in-memory tables, the syntax will be the same. However, caution should be taken when performing large queries as it is possible to exceed the memory limit of the process.
  • Partitioned tables- most q historical databases will be stored in a partitioned format. This is where data is split into subdirectories over a field (usually date). Each subdirectory will contain a splayed table containing the data from that date. q will query each one of these subdirectories individually to prevent the need to load large amounts of data. Partitions must be respected in order to efficiently query large databases.

To check if a table is partitioned we can use the .Q.qp command

q).Q.qp quotes
1b

Then we read the first row of meta to get the partition domain – date.

q)meta quotes
c    | t f a
-----| -----
date | d
sym  | s   p
time | p
src  | s
bid  | f
ask  | f
bsize| i
asize| i

Important: All queries to a historical partitioned database must filter on the partitioned domain and this must always be the first condition of the where clause. Failure to do this will result in q looping over every available subdirectory causing a significant performance hit and a large memory overhead.

/Good
q)select from quotes where date=2019.08.01,bsize>1000
...

/Very Bad
q)select from quotes where bsize>1000,date=2018.08.01
...

In the positive example q queries only the 2019.08.01 directory, filters on bsize for that date, and pulls in the relevant rows from disk into memory.

In the counter example, q filters on bsize for every date directory, and then applies the date clause before loading in the relevant rows.

Whilst both queries return the same output, the counter example will loop through all date directories instead of just accessing the one we need. This makes the execution time n times longer, where n is the total number of partitions in the database.

Attributes still apply to a partitioned table, however they must be placed as the second condition. The meta of quote shows that sym has the `p attribute, so when filtering we will list this field second in the where clause.

/Good
q)select from quotes where date=2018.08.01,sym=`GOOG,bsize>1000
...

/Bad
q)select from quotes where date=2018.08.01,bsize>1000,sym=`GOOG
...

/Very Bad
q)select from quotes where bsize>1000,sym=`GOOG,date=2018.08.01
...

Thanks for reading! We will follow this up soon with how to apply basic aggregations.

Share this:

LET'S CHAT ABOUT YOUR PROJECT.

GET IN TOUCH