Data Intellect
Accessibility is important for all software, it provides a pleasant user experience as well as a seamless interaction with non-native processes. The Data Access API provides TorQ with a sleek user interface as well as accessible and extensive functionality to both the inexperienced kdb+ user and non kdb+ processes. Whilst experienced kdb+ developers will likely prefer to write queries directly, the Data Access API allows the same request to be run across multiple different end points (including non kdb+) and more complex joins automatically across different data sources.
This blog highlights how Data Access API has achieved its goals through:
Download and installation can be found at our github repo and full documentation here.
The getdata
function provides a dynamic lightweight access point to a process. getdata
takes in a uniform dictionary type (see table below) to build a bespoke query. Input consistency permits getdata
to be called either directly from within a process or at a gateway.
Parameter | Example | Description | Required |
---|---|---|---|
tablename | `quote | Table to query | Yes |
starttime | 2020.12.18D12:00 | Start time | Yes |
endtime | 2020.12.20D12:00 | End time | Yes |
timecolumn | `time | Column to apply time filter to | No |
instruments | `AAPL`GOOG | Instruments To filter on | No |
columns | `sym`bid`ask`bsize`asize | Table columns to return | No |
grouping | `sym | Columns to group by | No |
aggregations | `last`max!(`time;`bidprice`askprice) | Dictionary of aggregations | No |
timebar | (`time;10;`minute) | Time grouping | No |
filters | `bid`bsize!(((<;85);(>;83.5));enlist(not;within;5 43)) | Filters | No |
freeformwhere | “sym=`AAPL, src=`BARX, price within 60 85” | kdb where clause | No |
freeformby | “sym:sym, source:src” | kdb by clause | No |
freeformcolumn | “time, sym,mid:0.5*bid+ask” | kdb select clause | No |
ordering | enlist(`desc`bidprice) | column to sort result by | No |
optimisation | 0b | Toggle getdata’s built in optimiser | No |
renamecoloumn | `old1`old2`old3!`new1`new2`new3 | Column renaming dictionary | No |
sublist | 6 | Take the top n rows | No |
Above all, the variety of arguments strike an important balance between accessibility and simplicity. Allowing, yet not enforcing, dictionary manipulation provides an access point to all users. The following two code snippets reinforce this point:
q)getdata`tablename`starttime`endtime`instruments`columns!(`quote;2000.01.01D00:00;2000.01.06D10:00;`GOOG;`sym`time`bidprice`bidsize`askprice`asksize)
sym time bidprice bidsize askprice asksize
---------------------------------------------------------------------
GOOG 2000.01.01D00:00:00.000000000 97.2 959.4 118.8 1172.6
GOOG 2000.01.01D02:24:00.000000000 90.9 932.4 111.1 1139.6
GOOG 2000.01.01D04:48:00.000000000 98.1 933.3 119.9 1140.7
GOOG 2000.01.01D07:12:00.000000000 94.5 939.6 115.5 1148.4
GOOG 2000.01.01D09:36:00.000000000 93.6 925.2 114.4 1130.8
q)getdata`tablename`starttime`endtime`freeformcolumn`freeformby`timebar!(`quote;.z.d+00:00;.z.p;"avgmid:avg 0.5*bid+ask";"sym,src";(`time;6;`hour))
sym time src | avgmid
----------------------------------------| --------
AAPL 2021.01.19D12:00:00.000000000 BARX | 84.15385
AAPL 2021.01.19D12:00:00.000000000 DB | 84.18419
AAPL 2021.01.19D12:00:00.000000000 GETGO| 84.15667
AAPL 2021.01.19D12:00:00.000000000 SUN | 84.375
The first snippet shows the API accepting exclusively simple arguments. By contrast, the second snippet demonstrates the API executing a more traditional kdb+ query, both returning the expected results. No timecolumn
argument has been provided, as the API has used a default time column.
The API also provides a function to build, yet not execute, a query. This allows for faster debugging.
q).dataaccess.buildquery `tablename`starttime`endtime`freeformcolumn `freeformby`timebar`instruments!(`quote;.z.d+00:00;.z.p;"avgmid:avg 0.5*bid+ask";"sym:sym,src:src";(`time;6;`hour);`AAPL)
? `quote ((=;`sym;,`AAPL);(within;`time;2021.01.20D00:00:00.000000000 2021.01.20D08:47:55.058354000)) `sym`time`src!(`sym;({[n;x]
typ:type x;
timebucket:n*0D00:00.000000001;
...
Logging functionality has also been added, allowing a developer to quickly unpick a bad query. The below table is populated with queries to the API from both the GW and within the process.
.dataaccess.stats:([querynumber:()]user:();starttime:();endtime:();handle:();request:();success:();error:())
From within the gateway the user sees the greatest extension of functionality. This is because same the uniform dictionary can be sent to multiple processes and the results joined back together. The process is simple (see diagram):
.dataaccess.getdata inputdictionary
getdata(inputdictionary)
in each processThe resulting code can be seen below:
q)g"querydictyesterday"
tablename | `quote
starttime | 2021.02.08D00:00:00.000000000
endtime | 2021.02.09D00:00:00.000000000
aggregations| `max`min!(`ask`bid;`ask`bid)
q)g"querydicttoday"
tablename | `quote
starttime | 2021.02.09D00:00:00.000000000
endtime | 2021.02.09D09:00:00.000000000
aggregations| `max`min!(`ask`bid;`ask`bid)
g"querydict"
tablename | `quote
starttime | 2021.02.08D00:00:00.000000000
endtime | 2021.02.09D09:00:00.000000000
aggregations| `max`min!(`ask`bid;`ask`bid)
q)g".dataaccess.getdata querydictyesterday"
maxAsk maxBid minAsk minBid
---------------------------
214.41 213.49 8.8 7.82
q)g".dataaccess.getdata querydicttoday"
maxAsk maxBid minAsk minBid
---------------------------
94.81 93.82 8.43 7.43
q)g".dataaccess.getdata querydict"
maxAsk maxBid minAsk minBid
---------------------------
214.41 213.49 8.43 7.43
Another extension the API provides is compatibility with q-REST users can send requests in json form to the gateway see the documentation for a complete explanation.
The getdata
function is split into three sub functions: checkinputs
, extractqueryparams
and queryorder
.
Checkinputs
The checkinputs function takes the input dictionary and runs various checks on each of the passed parameters before executing the query. The function scrutinizes queries to catch errors before they happen as well as return useful error messages such as:
q)getdata `tablename`starttime`endtime`freeformby`instruments!(`notAtable;.z.d+00:00;.z.p;"sym:sym";`AAPL)
'table:`notAtable doesn't exist
Queryorder
Queryorder is where the query is rearranged to return a consistent succession of aggregations as well as optimising the performance.
Firstly, the by clause is ordered date, sym, then other aggregations.
q)getdata`tablename`starttime`endtime`freeformcolumn`freeformby`timebar!(`quote;.z.d+00:00;.z.p;"avgmid:avg 0.5*bid+ask";"sym:sym,src:src";(`time;6;`hour))
sym time src | avgmid
----------------------------------------| --------
AAPL 2021.01.19D12:00:00.000000000 BARX | 84.15385
AAPL 2021.01.19D12:00:00.000000000 DB | 84.18419
AAPL 2021.01.19D12:00:00.000000000 GETGO| 84.15667
AAPL 2021.01.19D12:00:00.000000000 SUN | 84.375
The returned columns have been reordered to create a more intuitive response. Although technically a restriction to the experienced developer, the accessibility far outweighs this. Moreover, queryorder improves the speed of a query (see performance).
q).dataaccess.buildquery `tablename`starttime`endtime`freeformcolumn`freeformby`instruments!(`quote;.z.d+00:00;.z.p;\"mprice:max ask\";\"sym:sym,src:src\";`AAPL)
? `quote ((=;`sym;,`AAPL);(within;`time;2021.01.20D00:00:00.000000000 2021.01.20D09:19:22.689811000)) `sym`src!`sym`src (,`mprice)!,(max;`ask)
Extending the getdata function would require additions to all three sub-functions as well as extensive testing. The data access API has a vast testing library.
The returned query is written in kdb+ hence the performance of getdata and a well written query is similar. By default, queryorder automatically reorders the where clause to ensures a consistently good query speed for all API users. This can be toggled off by setting optimisation to 0b in the input dictionary.
To determine the effectiveness of the API, tests were carried out across three 5Gb HDB with 22 partitions and a RDB. Each setup had a varying number of syms. The following queries were tested:
Queryname | Call |
Optimised1 | `tablename`starttime`endtime`freeformby`aggregations`freeformwhere)!(`quote;00:00+2020.12.17D10;.z.d+12:00;\”sym\”;(`max`min)!((`ask`bid);(`ask`bid));\”sym in `AMD`HPQ`DOW`MSFT`AIG`IBM |
kdb1 | select max ask,min bid,max bid,min ask by sym from quote where sym in `AMD`HPQ`DOW`MSFT`AIG`IBM |
Optimised2 | (`tablename`starttime`endtime`aggregations`timebar)!(`quote;2021.02.23D1;.z.p;(enlist(`max))!enlist(enlist(`ask));(6;`hour;`time)) |
kdb2 | select max ask by 21600000000000 xbar time from quote where time>2021.02.23 |
Optimised3 | (`tablename`starttime`endtime`filters!(`quote;2021.01.20D0;2021.02.25D12;`bsize`sym`bid!(enlist(not;within;5 43);enlist(like;\”*OW\”);((<;85);(>;83.5))))) |
kdb3 | select from quote where bid within(83.5;85),not bsize within(5;43),sym like \”*OW\” |
The unoptimsed columns are the identical query as the retrospective optimised ones however with optimisation
set to 0b
.
The results and full methodology can be seen in the documentation. Each section of the graph highlights key functionality of the API:
getdata
function as fast and lightweight.In conclusion, there is no doubt the API has provided a simple, yet unrestrictive, access point to TorQ. This has been achieved by:
getdata
input arguments maintain freedom for the developer whilst creating a better user experience.The API’s interaction with Google BigQuery is complete, further information will be provided in a future blog post.
Share this: