When you look at a table within a database you usually assume that the data being presented is up-to-date and correct. But what if a particular record was updated or corrected for in-accuracy? Usually the record would be modified directly to reflect this. But what if you wanted to know when this modification occurred, what did it replace, or when did we know about the change?
By making small changes to the schema of a table we can obtain a solution in the form of bitemporal data. Bitemporal data is a specific method of structuring data in such a way that we can track all updates and corrections that occur to records over their lifetime.
In these posts, I will be explaining, through the use of examples, the concept of bitemporal data, and how it can be implemented within a kdb+ environment. To accurately describe bitemporal data, you must first understand what we mean by temporal data. Temporal data refers to any table record that provides information on the past, present or future of an object. In most cases, this is represented by using a temporal datatype, such as a date or time. When a temporal field/column is added to a conventional table (non-temporal) in which records are uniquely identified using an ID, it becomes a uni-temporal Table. I will be using the example of a simple client table consisting of a client id and a client name.
Bitemporal data can be thought of as a special kind of temporal data where a record is uniquely identified using two time fields/periods, in addition to an ID. With the addition of this second time field, each table record now represents two timelines- the valid time (when the value was in force) and the assertion time (when the database knew about it).
A conventional or non-temporal table is one that contains no temporal data. What that means is, each record only describes the current state of the object that it represents. If the state of that object changes, the record is updated, but as a result the original version is overwritten. So you could say that a non-temporal table is one without a history.
So for this initial version, the client table consists of two columns, the client ID (Primary Key) and the client name. The client ID will be used to uniquely identify each client record.
cltid| name -----| --------- C089 | "Johnson" C112 | "Leonard" C113 | "Coyle"
Now, let’s say client C112 ‘s name has changed from “Leonard” to “Smith” (marriage maybe?). To reflect this change, we’ll perform an upsert on the client table for the current record. This results in the following table:
`clientTab upsert (`C112;"Smith")
cltid| name -----| --------- C089 | "Johnson" C112 | "Smith" C113 | "Coyle"
As you can see, the client’s name has been updated, but what if we required the original name at some point in the future? We have no way of knowing that client C112 ever had different surname, or how often it has been modified. Continuing on from this, let’s say we dropped C112 as a client by performing a delete operation on the record.
delete from `clientTab where cltid = `C112
cltid| name -----| --------- C089 | "Johnson" C113 | "Coyle"
Since there is no concept of time within this table, there is no way we can preserve the existence of this client after a deletion. Without a time-based column, this table can only present client records that are still factually clients. So we lose information that could be useful for both auditing and analytical purposes.
Uni-temporal data seeks to extend the amount of information that can be derived from the table, with the addition of a temporal field/column. This temporal field is known as the “Valid Time” which represents the time at which the referenced object existed, in terms of the real-world model. In the example table, this would be the date the specific client first became a client of the company.
Let us take a look at the next iteration of the client table. As you can see, the table now contains a new column labelled “vdate”, which is the valid time column. Already you should see that we can determine more about the records, specifically when each client was created.
cltid vdate | name ----------------| --------- C089 2015.03.10| "Johnson" C112 2015.05.04| "Leonard" C113 2015.08.18| "Coyle"
The valid time can also refer to changes that have occurred to the object. Lets take changing a client’s surname as an example. We will upsert the new change to the table, but unlike the previous example we must include the date at which the change occurred. By including the date, we have now created a reference to a modified version of the object (the client). This results in the following table:
`clientTab upsert (`C112;.z.d;"Smith")
cltid vdate | name ----------------| --------- C089 2015.03.10| "Johnson" C112 2015.05.04| "Leonard" C113 2015.08.18| "Coyle" C112 2015.10.14| "Smith"
As you can see, there are now two records for client C112 both consisting of separate valid times. We now have complete history of each client, from the moment they first entered the table till the present day. With this history we can now obtain information such as the number of times a specific client has been updated, a list of the distinct surnames a client has had or as simply as it may sound, the current details for each client. The latest details can be retrieved like this:
select by cltid from clientTab
cltid| vdate name -----| -------------------- C089 | 2015.03.10 "Johnson" C112 | 2015.10.14 "Smith" C113 | 2015.08.18 "Coyle"
How do we know whether a client is still a client? In our current schema, we only have a single valid date column which works fine with client updates, as logically older updates will occur earlier in the time-line, but it is not possible to infer the current existence of a client.
You could take the approach that clients no longer part of the company should be physically removed from the database. The problem with this approach is that it goes against the whole purpose of using temporal data, which is to provide a detailed record of every change & update that has occurred to an object at any point in time. A deletion would also be classified as an update, and so it is important that we record it as such.
In order to represent whether a particular client still exists, there are two approaches we can use. The first would to add an end valid date in addition to the start date, to create a valid time range. The second and probably simpler approach is to add a flag which indicates whether a client has been deleted logically, commonly known as a delete flag. In this case, when a client leaves the company we would upsert a new record with the date the client left and the delete flag marked as true.
For this example using a delete flag would be the most appropriate route to take to alleviate confusion, below is this approach in action:
cltid vdate name dlt_flg ---------------------------------- C089 2015.03.10 "Johnson" 0 C112 2015.05.04 "Leonard" 0 C113 2015.08.18 "Coyle" 0 C113 2015.12.01 "Davies" 0 C112 2015.12.23 "Leonard" 1
The dlt_flg column can show which clients have been deleted, and when exactly the deletion occurred. We can get the current state of each client with a query like this:
select by cltid from clientTab
cltid| vdate name dlt_flg -----| ---------------------------- C089 | 2015.03.10 "Johnson" 0 C112 | 2015.12.23 "Leonard" 1 C113 | 2015.12.01 "Davies" 0
In this example you can see that on 2015.12.23 the client C112 was logically deleted, whereas the other clients are still recorded as existing clients. If we wish to see the entire state history for a particular client, all we have to do is perform a select on the table for the id of the client. This results in the following table:
select from clientTab where cltid = `C113 cltid vdate name dlt_flg --------------------------------- C113 2015.08.18 "Coyle" 0 C113 2015.12.01 "Davies" 0
Querying to take into account deletions takes a bit more work. What we have to be aware of is the fact that a client can be deleted multiple times, in cases were they come back as a client, so we cannot simply query for records were the delete flag is not true. If such a query was used it still wouldn’t confirm whether the client currently exists, only that the client existed at some point.
To return all of the clients that do still exist, we firstly have to check that the last state of each respective client has a false delete flag set. To do this we can perform an fby (function by) operation on the table, grouping by the client id and aggregating for the last dlt_flag. Once this aggregation is completed, the last value of the flag will rolled out across all records for each client. The result of this operation can then be used in the where clause to return only the clients that haven’t been deleted.
Below is a working example of this approach, which only returns the latest state of clients that do still exist. As long as the fby operation is used in the first condition of the where clause, further conditions and aggregations can be applied with the knowledge that you are working with existing clients. To return only clients which still exist:
select from clientTab where 0 = (last;dlt_flg) fby cltid
cltid vdate name dlt_flg ---------------------------------- C089 2015.03.10 "Johnson" 0 C113 2015.08.18 "Coyle" 0 C113 2015.12.01 "Davies" 0
The previous queries assume that the data is ordered by vdate. This is quite common when working with these data sets but is not necessarily always the case. If the order cannot be relied upon we need to lean on fby a bit more. For example, to extract the current/latest value from a reverse ordered table based on vdate and cltid:
select last name by cltid from reverse[clientTab] where vdate=(max;vdate)fby cltid
cltid| name -----| --------- C089 | "Johnson" C112 | "Leonard" C113 | "Davies"
Or to extract the last current state of every client which hasn’t been deleted:
select from reverse[clientTab] where vdate = (max;vdate) fby cltid, dlt_flg = 0
cltid vdate name dlt_flg ---------------------------------- C113 2015.12.01 "Davies" 0 C089 2015.03.10 "Johnson" 0
The next section of this blog will extend these examples to create bitemporal tables. Bitemporal tables will allow the database to track the full history of the client data, as well as the time that the database knew about the updates, which allows for extended audit capabilities. It will be published shortly.