Version 1.0
Publication March 1 2012
Last Edit March 1 2012
Copyright DiscreteLogics 2012.
Permission granted for free use and distribution, conditioned upon inclusion of the above attribution and copyright notice.
The TeaFile format provides a very efficient and simple way to persist time series data in flat files. Items of homogeneous time series are stored in raw binary format. The header holds optional descriptions of the binary layout of the items or the file content. This removes the opaqueness from the binary data stored, making TeaFiles a self contained transparent storage method that also serves as the data centric interface between applications.
TeaFiles can easily be read using the canonical file I/O functions. APIs provide a more convenient and safer way to access such files and will be preferred in most cases. This document targets API writers and those accessing or analyzing TeaFiles directly. It assumes a basic background in data representations and programming.
To achieve the best possible performance, time series data is stored in TeaFiles such that it can be mapped directly into memory. Data might also be written by traditional read/write I/O functions, but ensuring the possibility to memory map the data puts a stronger constraint upon the file format, as it requires the data to be written in the form of the data item's binary memory footprint. The platform neutrality requirement can therefore only be achieved if the file has no fixed endianness specification. On a little endian machine, a TeaFile must be in little endian format to allow memory mapping, while on a big endian machine it must be in big endian format.
Platform compatibility (operating system, CPU type or application) is approached by restriction to ubiquitous number formats, in particular the IEEE 754 floating point numbers. Other platform specific data types are still allowed. If used, they will affect portability.
To ease data exchange via TeaFiles, simple access to their data via primitive I/O functions is provided. Raw access requires no more than reading the first 4 mandatory values in the file. A useful API can therefore be written within minutes and then be incrementally extended to also read the optional sections inside the file that describe the item layout and content in the file.
Items must be primitive types like integers or doubles, or structures of such primitive types. Pointers to memory or strings cannot be stored. While the design targeted time series storage, in which case each item has one field that holds a time value, the file format is also useful to generally store homogeneous collections of items (arrays) without time stamp.
This file format specification is expected to change rarely. Upcoming needs for changes are supposed to be handled by adding new sections, possibly accompanied by dropping others. If, for instance, the Item Layout Section should be changed, a new section (Item-Layout-2) with a different id would be proposed that has the new format.
Apart from the pure file format description, this document occasionally holds annotations about the implementation of APIs.
A sample TeaFile holding Time/Price/Volume data, a content description, name value description and a description of the time format used might have the following binary layout:
struct Tick
{
public Time Time;
public double Price;
public long Volume;
}
TeaFile<Tick>.Create("lab.tea", "ACME prices", NameValueCollection.From("decimals", 2));
The C# code above creates this binary file:
Offset | Field | Value | Comment
0 | Magic Value | 0x0d0e0a0402080500 | byte order, file type identifier
8 | ItemStart | 200 | binary data area starts at absolute position 200
16 | ItemEnd | 0 | binary data area ends at the physical end of file
24 | Section Count | 4 | header holds 4 sections
32 | Section Id | 10 | ItemSection describes the item stored in the file
36 | Next Section Offset | 67 | next section begins at 36 + 4 + 67 = 107
40 | ItemSize | 24 | items have a size of 24 bytes
Item Type Name 'Tick' | the name of the item is "Tick"
44 | bytes length | 4 | length of "Tick"
48 | bytes | 84,105,99,107 | UTF8 bytes of "Tick"
52 | Field Count | 3 | the item holds 3 fields
56 | Field Type | 4 | first field is of field type 4 (long)
60 | Field Offset | 0 | it has an offset of 0 inside the item
Field Name 'Time' | its name is "Time"
64 | bytes length | 4 | length of "Time"
68 | bytes | 84,105,109,101 | UTF8 bytes of "Time"
72 | Field Type | 10 | second field is of field type 10 (double)
76 | Field Offset | 8 | byte offset = 8
Field Name 'Price' | name = "Price"
80 | bytes length | 5 | length of "Price"
84 | bytes | 80,114,105,99,101 | UTF8 bytes of "Price"
89 | Field Type | 4 | third field is of type 4 (long)
93 | Field Offset | 16 | offset 16
Field Name 'Volume' | name = "Volume"
97 | bytes length | 6 | length of "Volume"
101 | bytes | 86,111,108,117,109,101 | UTF8 bytes of "Volume"
107 | Section Id | 128 | Content Description Section
111 | Next Section Offset | 15 | next section begins at 111 + 4 + 15 = 130
115 | bytes length | 11 | length of "ACME prices"
119 | bytes | 65,67,77,69,32,112,114,105,99,101,115 | UTF8 bytes of "ACME prices"
130 | Section Id | 129 | NameValue Section
134 | Next Section Offset | 24 | next section begins at 134 + 4 + 24 = 162
138 | Name Value Count | 1 | 1 name/value pair follows
142 | bytes length | 8 | length of "decimals"
146 | bytes | 100,101,99,105,109,97,108,115 | UTF8 bytes of "decimals"
154 | NameValue.Kind | 1 | 1 (int)
158 | NameValue.Value | 2 | value = 2
162 | Section Id | 64 | TimeSection
166 | Next Section Offset | 24 | next section would be at 166 + 4 + 24 = 194
170 | epoch | 719162 | epoch is 719162; time is measured from 1.1.1970
178 | ticks per day | 86400000 | 86400000 ticks are counted per day, so here we count milliseconds
186 | time fields count | 1 | the item holds 1 time field
190 | time field offset | 0 | the first (and last) time field is the field at offset 0
194 | Padding Bytes | 0,0,0,0,0,0 | this file aligns the data area at the next 8 byte boundary
200 | item area ...
TeaFiles start with a header followed by the item area holding the data in binary format.
The first 8 bytes of a TeaFile are always the magic number 0x0d0e0a0402080500, which can be memorized as follows:
0d 0e 0a 04  02  08 05 00
T  e  a  for two at 5  :00
If an application does not find this sequence at the beginning of the file, it should raise an exception. In such a case, the file might have been written by an application of different endianness. TeaFiles must have the endianness of the local machine so they can be mapped directly into memory. If files are transferred between systems, the file might need endian conversion before it can be read correctly or mapped into memory.
This magic number allows a reader to identify the file as a TeaFile and to detect its byte order.
An API can translate files of a different endianness on the fly for read and write operations. For memory mapping purposes, the file would need to be converted.
This specification treats these 8 bytes as a byte sequence. API writers may use an unsigned 8 byte integer type like unsigned long. It is, however, also safe to use a signed 64 bit integer without causing overflow, since whatever the byte order on a machine is, all byte values within 0x0d0e0a0402080500 are < 0x10. A signed 64 bit integer is actually recommended: as all other 64 bit wide numbers inside the header are signed, it reduces the number of types to be read from the file.
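The check of the magic value can be sketched in a few lines of Python. This is only an illustration, not part of the specification: the function name `detect_byte_order` is hypothetical, and returning the detected byte order (instead of raising for a non-native file) is a design choice of this sketch.

```python
import struct

MAGIC = 0x0D0E0A0402080500


def detect_byte_order(first8: bytes) -> str:
    """Return '<' (little endian) or '>' (big endian); raise if the magic value is absent."""
    if len(first8) != 8:
        raise ValueError("exactly 8 bytes are required")
    for order in ("<", ">"):
        # The magic value fits into a signed 64 bit integer in either byte order.
        (value,) = struct.unpack(order + "q", first8)
        if value == MAGIC:
            return order
    raise ValueError("magic value not found: not a TeaFile")


print(detect_byte_order(struct.pack("<q", MAGIC)))  # <
print(detect_byte_order(struct.pack(">q", MAGIC)))  # >
```

An API that insists on native byte order could compare the result against `sys.byteorder` and raise on a mismatch.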
ItemStart holds the absolute position of the item area. Setting the file pointer to this position allows reading of the first item in the file. If the first item is placed directly after the header, this value equals the size of the header. Often, however, it will be larger. First, the data area should be aligned at an 8 byte boundary to avoid considerable performance penalties from unaligned double or long values. Second, if the description of the file is modified at a time when it already contains data, the header might increase in size and the change would then require the item area to be moved. To avoid costly data moving, the file can be created with some headroom behind the header.
If ItemEnd is 0, then the end of the physical file equals the end of the item area. Otherwise it specifies the absolute byte offset of the end of the item area from the beginning of the file.
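The alignment rule can be sketched as a small helper. This is a minimal illustration in Python; the function name `aligned_item_start` and its `headroom` parameter are hypothetical and not defined by this specification.

```python
def aligned_item_start(header_size, headroom=0, alignment=8):
    """Return the smallest multiple of `alignment` that is >= header_size + headroom."""
    if header_size < 32:
        raise ValueError("a TeaFile header holds at least 32 bytes")
    minimum = header_size + headroom
    return (minimum + alignment - 1) // alignment * alignment


print(aligned_item_start(194))                # 200, as in the sample file above
print(aligned_item_start(32))                 # 32, a 32 byte header is already aligned
print(aligned_item_start(194, headroom=100))  # 296
```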
Leaving ItemEnd at 0 makes appending new items simple: besides writing the new items to the end of the file, no further updates are required. To avoid fragmentation, files might initially be preallocated at a larger size. To allow this, the ItemEnd value can be set to mark the logical end of the file. A non-zero ItemEnd will always be in the range ItemStart <= ItemEnd <= FileSize.
The TeaFile header can optionally hold an arbitrary number of sections. The Section Count value is the number of such sections.
The 4 values above are the only mandatory values of a TeaFile. This means that the shortest possible TeaFile has 4 * 8 = 32 bytes. Such a file will have these values:
0 | 0x0d0e0a0402080500 | Magic Value | magic number |
8 | 32 | ItemStart | item area directly follows header |
16 | 0 | ItemEnd | logical end of file is not used |
24 | 0 | Section Count | no sections |
32 | file end | item area |
Theoretically there is another 32 byte long TeaFile that is also valid:
0 | 0x0d0e0a0402080500 | Magic Value | magic number |
8 | 32 | ItemStart | item area directly follows header |
16 | 32 | ItemEnd | logical end of file is used |
24 | 0 | Section Count | no sections |
32 | file end | item area |
In the latter sample, the ItemEnd value is set although no preallocation is done, as the file is by definition 32 bytes long. This hardly makes sense, but the file is still valid.
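The two minimal files can be produced and read back with a few lines of Python. This is a sketch under assumptions: the file is taken to be little endian, all four values are read as signed 64 bit integers, and the names `write_minimal_header` and `read_header` are hypothetical.

```python
import io
import struct

MAGIC = 0x0D0E0A0402080500
HEADER = struct.Struct("<qqqq")  # assumption: little endian file, signed 64 bit values


def write_minimal_header(stream, item_start=32, item_end=0, section_count=0):
    """Write the 4 mandatory values: magic, ItemStart, ItemEnd, Section Count."""
    stream.write(HEADER.pack(MAGIC, item_start, item_end, section_count))


def read_header(stream):
    """Read the 4 mandatory values and return (ItemStart, ItemEnd, Section Count)."""
    raw = stream.read(HEADER.size)
    if len(raw) != HEADER.size:
        raise ValueError("file is shorter than 32 bytes")
    magic, item_start, item_end, section_count = HEADER.unpack(raw)
    if magic != MAGIC:
        raise ValueError("magic value mismatch")
    return item_start, item_end, section_count


buf = io.BytesIO()
write_minimal_header(buf)
buf.seek(0)
print(len(buf.getvalue()), read_header(buf))  # 32 (32, 0, 0)
```

Passing `item_end=32` yields the second variant shown above.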
Computing the number of items requires knowledge of the size of the items in the file. This size is either known up front or it is read from the Item Layout section (see below). Given the item size, the number of items in the file is computed by
N = (FileSize - ItemStart) / ItemSize if ItemEnd is 0, and by
N = (ItemEnd - ItemStart) / ItemSize otherwise.
For the 32 byte long TeaFile samples above we get:
N = (32 - 32) / ItemSize, which is 0 regardless of the item size, and
N = (32 - 32) / ItemSize, which is also 0, the only difference being that the first number 32 is now ItemEnd while it was the FileSize before.
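The computation can be written as a short Python function. This is a sketch; the name `item_count` is hypothetical, and rounding down when the item area is not a whole multiple of the item size is an assumption of this example rather than something this document prescribes.

```python
def item_count(file_size, item_start, item_end, item_size):
    """Number of items in the item area; ItemEnd == 0 means the area ends at the file end."""
    if item_size <= 0:
        raise ValueError("item size must be positive")
    area_end = item_end if item_end != 0 else file_size
    if area_end < item_start:
        raise ValueError("item area ends before it starts")
    return (area_end - item_start) // item_size


print(item_count(32, 32, 0, 24))    # 0, first minimal file
print(item_count(32, 32, 32, 24))   # 0, second minimal file
print(item_count(272, 200, 0, 24))  # 3, sample layout with three 24 byte items
```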
TeaFiles may contain sections describing their content. Each section is optional; if it occurs, it must occur only once. This document defines several sections and leaves room for custom sections that allow storage of additional features. For each section (official or custom), a section id and a next section offset are written, followed by the content of the section.
Each section has a unique ID. The id of the first section is always at position 32 in the file. Its value is one of the section id values specified in this document or a custom section id above 0xffff. The IDs specified in this document are:
0x0a | Item Section |
0x40 | Time Section |
0x80 | Content Section |
0x81 | NameValue Section |
Sections are written into the file one after another. Users interested in specific sections only can read the Section Id, decide whether to read the section and optionally skip it by reading the next section offset and adding it to the file pointer to reach the next section id.
The next section offset might equal the length of the section, but it might also be higher due to padding or because some headroom was reserved for the section, such that modifications of the section can be done without affecting other sections or the file's item area.
For the last section, the next section offset shall be ignored by file readers.
Ignoring the next section offset for the last section is easy for readers, as they have read the section count before and know when the last section is read. On the other hand, imposing that the last section must have a section offset of 0, which would make good sense, makes API writing considerably harder.
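Walking the sections without interpreting their content can be sketched as follows. The sketch assumes a little endian file and reads both the section id and the next section offset as 4 byte integers, as the offsets in the sample layout suggest; the function name `walk_sections` and the helper `text` are hypothetical.

```python
import io
import struct


def walk_sections(stream, section_count, order="<"):
    """Return (section id, content position, next section offset) for each section."""
    sections = []
    for index in range(section_count):
        raw = stream.read(8)
        if len(raw) != 8:
            raise ValueError("unexpected end of header")
        section_id, next_offset = struct.unpack(order + "ii", raw)
        content_position = stream.tell()
        sections.append((section_id, content_position, next_offset))
        if index < section_count - 1:
            # The offset counts from the end of the offset field itself.
            stream.seek(content_position + next_offset)
        # For the last section the offset is ignored, as required above.
    return sections


def text(value):
    raw = value.encode("utf-8")
    return struct.pack("<i", len(raw)) + raw


content = text("ACME prices")                                                      # 15 bytes
name_values = struct.pack("<i", 1) + text("decimals") + struct.pack("<ii", 1, 2)  # 24 bytes
header = (bytes(32)
          + struct.pack("<ii", 0x80, len(content)) + content
          + struct.pack("<ii", 0x81, len(name_values)) + name_values)

stream = io.BytesIO(header)
stream.seek(32)  # the first section id is always at position 32
print(walk_sections(stream, 2))  # [(128, 40, 15), (129, 63, 24)]
```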
The Item Section describes the layout of the items stored in the file, in case of structured items the name, type and offset of its fields. This is the most important section as it allows reading the data without any further knowledge about the file.
All fields below are written without any padding in between.
ItemSize: the size of the item. This is the space occupied by each item in the file, including any padding bytes.
Item Type Name: the name of the type, like "OHLCV", as UTF8 string. Strings are written length prefixed (int32), followed by the text data in UTF8.
Field Count: the number of fields in the item. This value must be >= 1. If the item type of a TeaFile is a plain primitive value, the item type name might be "Int" while the name of its single field might be "Value". In other words, primitives are treated as structs with one field having the type of the primitive.
For each field, a Field description is written holding:
The types of fields inside a TeaFile should be available on as many platforms as possible to support versatile data access. At the same time, TeaFiles might be used in focused scenarios where such ubiquitous data access is not required and where other formats are desirable. The file format foresees this possibility.
FieldType is either a value from 1 to 10, indicating a platform agnostic, ubiquitous type, or a platform specific type, which currently encompasses the .Net decimal. For fields holding other types, one can use values >= 0x1000.
// platform agnostic
Int8 = 1,
Int16 = 2,
Int32 = 3,
Int64 = 4,
UInt8 = 5,
UInt16 = 6,
UInt32 = 7,
UInt64 = 8,
Float = 9, // IEEE 754
Double = 10, // IEEE 754
// platform specific
NetDecimal = 0x200,
// private extensions must have integer identifiers above 0x1000.
Custom = 0x1000
Field Offset: the field's byte offset inside the item.
Field Name: the name of the field as UTF8 string. Strings are written length prefixed (int32), followed by the text data in UTF8.
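Reading the content of an Item Section can be sketched as below. The example rebuilds the 67 content bytes of the sample file's Item Section and parses them again. It assumes a little endian file and 4 byte integers for sizes, counts, types and offsets, as the sample offsets suggest; all function names are hypothetical.

```python
import io
import struct

FIELD_TYPE_NAMES = {
    1: "Int8", 2: "Int16", 3: "Int32", 4: "Int64",
    5: "UInt8", 6: "UInt16", 7: "UInt32", 8: "UInt64",
    9: "Float", 10: "Double", 0x200: "NetDecimal",
}


def read_int32(stream, order="<"):
    return struct.unpack(order + "i", stream.read(4))[0]


def read_text(stream, order="<"):
    """Read a length prefixed (int32) UTF8 string."""
    length = read_int32(stream, order)
    return stream.read(length).decode("utf-8")


def read_item_section(stream, order="<"):
    """Return (item size, type name, [(field name, field type, field offset), ...])."""
    item_size = read_int32(stream, order)
    type_name = read_text(stream, order)
    fields = []
    for _ in range(read_int32(stream, order)):
        field_type = read_int32(stream, order)
        field_offset = read_int32(stream, order)
        field_name = read_text(stream, order)
        type_label = FIELD_TYPE_NAMES.get(field_type, hex(field_type))
        fields.append((field_name, type_label, field_offset))
    return item_size, type_name, fields


def text(value):
    raw = value.encode("utf-8")
    return struct.pack("<i", len(raw)) + raw


payload = (struct.pack("<i", 24) + text("Tick") + struct.pack("<i", 3)
           + struct.pack("<ii", 4, 0) + text("Time")
           + struct.pack("<ii", 10, 8) + text("Price")
           + struct.pack("<ii", 4, 16) + text("Volume"))

print(len(payload))  # 67
print(read_item_section(io.BytesIO(payload)))
# (24, 'Tick', [('Time', 'Int64', 0), ('Price', 'Double', 8), ('Volume', 'Int64', 16)])
```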
The Content Section holds a string that describes the content of the file. Examples: "Daily average temperature Easy Village", "Prices of ACME at NYSE".
The NameValue Section starts with the number of name/value pairs to follow.
For each name/value pair, the following values are written:
The name.
One of the following integer values, indicating the type of the value to follow:
Depending on the preceding value of Kind, the value will hold an int32, double, string or uuid value. A uuid value is a sequence of 16 bytes considered to be unique.
This section describes how a number is interpreted as time and which fields in the item are to be interpreted as time.
Time is specified as the number of "tick" intervals that elapsed since some origin. The epoch value gives this origin as the number of days that passed between 1.1.0001 and the origin. For the .Net System.DateTime type, which counts ticks since 1.1.0001, this value would be 0. For time systems counting from 1.1.1970, the epoch is 719162.
This value specifies how many ticks are counted per day. Conversely this number specifies the length of the tick interval. A value of 1 would mean that days are counted, 86400 would mean seconds and 86400000 milliseconds.
For TeaFiles that shall be accessed from various platforms, the Java time format is recommended: its millisecond resolution is often sufficient and, due to its origin at 1.1.1970, its values remain much smaller than those of .Net System.DateTime, which are hard to deal with in applications that cannot handle 64 bit integers, like R.
For time series, the item will hold one field that is the event time, the time at which the observation or the event occurred. If the file holds a plain collection of values and is not a time series, then no time field might exist. On the other hand, items might have an event time and, in addition, other time fields. In summary, the number of time fields is between 0 and the number of fields in the item.
If an item has more than one time field, the first one is considered the event time. The values of this field must then be non-decreasing. In other words, each event must have a time >= time of the previous event.
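How epoch and ticks per day turn a stored number into a calendar time can be sketched in Python. The sketch relies on the epoch 719162 being the number of days between 1.1.0001 and 1.1.1970, which matches Python's `datetime(1, 1, 1)` origin. The function name `ticks_to_datetime` is hypothetical, and the value 864000000000 ticks per day for .Net System.DateTime (100 nanosecond ticks) is taken from general knowledge of that type, not from this document.

```python
from datetime import datetime, timedelta

MICROSECONDS_PER_DAY = 86_400_000_000


def ticks_to_datetime(ticks, epoch_days, ticks_per_day):
    """Convert a tick count into a datetime, truncated to microsecond resolution."""
    days, remainder = divmod(ticks, ticks_per_day)
    microseconds = remainder * MICROSECONDS_PER_DAY // ticks_per_day
    return datetime(1, 1, 1) + timedelta(days=epoch_days + days, microseconds=microseconds)


# Java style time: milliseconds since 1.1.1970
print(ticks_to_datetime(0, 719162, 86_400_000))                  # 1970-01-01 00:00:00
print(ticks_to_datetime(1_330_560_000_000, 719162, 86_400_000))  # 2012-03-01 00:00:00
# .Net style time: 100 ns ticks since 1.1.0001
print(ticks_to_datetime(621_355_968_000_000_000, 0, 864_000_000_000))  # 1970-01-01 00:00:00
```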
As many int32 values follow as [Time Fields Count] indicates. Each of these values holds the offset of the corresponding field.
The reason to use the offset instead of the field index is that an application that shall select an interval of values from a TeaFile then needs to read only the Time Section in order to deliver all values from t1 to t2.
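Selecting an interval with nothing but the item size and the time field offset can be sketched as a binary search, which is valid because event times are non-decreasing. The sketch assumes a little endian file and an Int64 time field, as in the sample layout; the function names are hypothetical and the item area is held in memory for brevity.

```python
import struct

TICK = struct.Struct("<qdq")  # Time (Int64), Price (Double), Volume (Int64), 24 bytes


def time_at(data, item_start, item_size, time_offset, index):
    """Read only the time value of the item at `index`."""
    position = item_start + index * item_size + time_offset
    return struct.unpack_from("<q", data, position)[0]


def lower_bound(data, item_start, item_size, time_offset, count, t):
    """Index of the first item whose time is >= t."""
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        if time_at(data, item_start, item_size, time_offset, mid) < t:
            lo = mid + 1
        else:
            hi = mid
    return lo


def select_interval(data, item_start, item_size, time_offset, count, t1, t2):
    """Return (first, end) so that items first .. end - 1 have t1 <= time <= t2."""
    first = lower_bound(data, item_start, item_size, time_offset, count, t1)
    end = lower_bound(data, item_start, item_size, time_offset, count, t2 + 1)
    return first, end


times = [10, 20, 20, 30, 40]
data = b"".join(TICK.pack(t, 1.5, 100) for t in times)
print(select_interval(data, 0, TICK.size, 0, len(times), 20, 30))  # (1, 4)
```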