In November 2017, Kx acquired the rights to PyQ from Enlightenment Research and made it freely available under the Apache 2.0 licence. PyQ enables the seamless integration of a Python interpreter and a kdb+ server in a single process, and works neatly with Jupyter notebooks, allowing users to benefit from the best of both technologies simultaneously.

We have combined PyQ with our TorQ-FX package to illustrate the power of these two technologies by analysing large quantities of foreign exchange, commodities and index data. This post is a Jupyter notebook, so you can see exactly what we've done, and try it for yourself.

NumPy and Pandas

In Python everything is an object and typing is dynamic. This increases ease of use, but comes at the cost of speed, as variables are stored with enough overhead to hold all of the Python object information. A single integer will contain information specifying its reference count, type, size and digit value.

For large arrays or tables of data this kind of structure is very inefficient, as all values in a table column tend to be of the same type. Enter numpy. numpy arrays are densely packed arrays of fixed type, so you get the benefits of locality of reference. Additionally, many numpy operations are implemented in C, avoiding the cost of Python loops, pointer indirection and per-element dynamic type checking.
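
As a rough illustration of that overhead (a minimal sketch; exact sizes depend on your Python build), compare the memory footprint of a plain list of integers with the equivalent numpy array:

import sys
import numpy as np

n = 1000000
py_list = list(range(n))   # 1m full Python int objects, referenced by pointer
np_array = np.arange(n)    # 1m packed machine integers (int64 on most 64-bit builds)

# getsizeof on the list counts only the pointer array, not the int objects it points to
print(sys.getsizeof(py_list))                                           # ~8MB of pointers alone
print(sys.getsizeof(py_list) + sum(sys.getsizeof(i) for i in py_list))  # ~36MB including the int objects
print(np_array.nbytes)                                                  # ~8MB of raw integer data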

The other key piece of the data analysis toolkit in modern Python is pandas. pandas combines fast fixed-type arrays into high-level relational data structures (tables!), and provides the syntax and tools for doing practical, real-world data analysis on them.
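
As a small made-up illustration (the table and column names below are invented, not part of the FX data), pandas lets us treat a handful of typed columns as a relational table and query it much like a database:

import pandas as pd

# a toy table of quotes: each column is a fixed-type numpy array under the hood
quotes = pd.DataFrame({
    'CurrencyPair': ['EURUSD', 'GBPUSD', 'EURUSD', 'GBPUSD'],
    'RateBid':      [1.0712,   1.2401,   1.0725,   1.2398],
})

# SQL-style "last bid per pair", computed with vectorized operations
print(quotes.groupby('CurrencyPair')['RateBid'].last())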

kdb+ is designed from the ground up to cope with large amounts of time-series data. It makes very similar design choices to numpy: data is stored in fixed-type arrays by default with very little overhead, and operations on them are fast and vectorized. With kdb+ we also get much more database functionality, including the ability to query data on disk as well as in memory using SQL-like syntax.

To quickly demonstrate the significance of these different data structures, compare the speed of a simple summation of 20,000,000 random numbers:

In [1]:
import random

# create a Python list of 20m random numbers and sum it
large_array = [random.random() for _ in range(20000000)]
%timeit sum(large_array)
1 loop, best of 3: 529 ms per loop
In [2]:
import numpy as np

# create a numpy array of 20m random numbers and sum it
large_array = np.random.rand(20000000)
%timeit np.sum(large_array)
10 loops, best of 3: 22.2 ms per loop
In [3]:
from pyq import q
# create an array of 20m random numbers in q and sum it
%q large_array:20000000?1.0
%q \t sum(large_array)
Out[3]:
24

The speed advantage of the vectorized array approach used by both numpy and kdb+ is obvious: the summation takes roughly 20-25 ms in either, compared with over 500 ms for the plain Python list.

The Best of Both Worlds with PyQ

pandas and numpy are fantastic data analysis tools, and the ecosystem of tools in Python for machine learning, statistics and data visualisation is second to none. However, these tools are limited to in-memory processing: once a dataset is too big to fit in memory you're out of luck! This is where pyq comes in. It allows us to combine kdb+ and Python seamlessly in a single process, taking full advantage of the strengths of both. pyq queries on our data are executed by kdb+, and the results can be stored as shared memory objects that either Python or kdb+ can use; this avoids an expensive serialization step to transfer data between processes.
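
As a rough sketch of what this looks like in practice (the vector below is generated on the spot rather than taken from the FX database), a q result comes back to Python as a K object wrapping q's own memory, and is only converted to a numpy array when we ask for one:

from pyq import q
import numpy as np

# build a float vector on the q side; the result is a K object --
# a handle onto q's memory rather than a copied Python list
prices = q('5?1.0')

# K objects behave like Python sequences: they have a length and can be indexed
print(len(prices), prices[0])

# and they can be converted to a numpy array for further analysis
arr = np.array(prices)
print(arr.dtype, arr.mean())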

For this post we've prepared a large historical FX database using our TorQ-FX package; if you want to try it out yourself you'll need to download a database first. The full history for this dataset runs to hundreds of gigabytes, so in-memory processing is out of the question unless you have some very expensive hardware.

First we load the database into our session (adjust the path to point at your own copy of the database):

In [4]:
%%q 
system"l /home/username/Gain-FX/fxhdb/";

With a well-structured database mapped into memory it's very easy and fast to run simple queries and pull only the necessary information into memory. As a very simple example we can pull out the closing prices of every USD pair in 2017:

In [5]:
%%q -o fxdata
0!select 
        ClosingPrice:last RateBid 
    by 
        date,CurrencyPair 
    from 
        gainfx
    where 
        date within (2017.01.01;2018.01.01),
        CurrencyPair like "*USD"

The result is now saved as fxdata, and can be manipulated by Python as well as kdb+:

In [6]:
import datetime

# index the date column and add 1 day (in python!!)
fxdata['date'] + datetime.timedelta(days=1)
Out[6]:
2017.01.03D00:00:00.000000000 2017.01.03D00:00:00.000000000 2017.01.03D00:00:..

We can easily visualize the data using any of the many fantastic libraries available (here we are using plotly):

In [7]:
import pandas as pd
import cufflinks as cf 
import plotly.offline as py
cf.go_offline()

# pivot to the necessary data shape
data = pd.DataFrame(dict(fxdata.flip)).pivot(index='date', 
                                             columns='CurrencyPair', 
                                             values='ClosingPrice')
# and plot a simple line chart
data.iplot(kind='scatter',
           title='2017 USD Rates',
           yrange=[0.5, 2],
           showlegend=False)