TeaFile Specification

Version 1.0

Publication March 1 2012

Last Edit March 1 2012

Copyright DiscreteLogics 2012.

 

Permission granted for free use and distribution, conditioned upon inclusion of the above attribution and copyright notice.

Purpose

The TeaFile format provides a simple and very efficient way to persist time series data in flat files. Items of homogeneous time series are stored in raw binary format. The header holds optional descriptions of the binary layout of the items and of the file content. This removes the opaqueness from the stored binary data, making TeaFiles a self-contained, transparent storage method that also serves as the data-centric interface between applications.

Audience

TeaFiles can easily be read using the canonical file I/O functions. APIs will provide a more convenient and safer way to access such files and will be preferred most of the time. This document targets API writers and those accessing or analyzing TeaFiles directly. It assumes a basic background in data representations and programming.

Goals and Forces

Design

To achieve the best possible performance, time series data is stored in TeaFiles such that it can be mapped directly into memory. Data might also be written by traditional read/write IO functions, but ensuring the possibility to memory map the data puts a stronger constraint upon the file format, as it requires the data to be written in the form of the data item's binary memory footprint. The platform neutrality requirement therefore can only be achieved if the file has no fixed endianness specification. On a little endian machine, a TeaFile must be in little endian format to allow memory mapping, while on a big endian machine it must be in big endian format.

Platform compatibility (operating system, CPU type or application) is approached by restriction to ubiquitous number formats, in particular the IEEE 754 floating point numbers. Other platform specific data types are still allowed. If used, they will affect portability.

To ease data exchange via TeaFiles, simple access to their data via primitive IO functions is provided. Raw access requires no more than reading the first 4 mandatory values in the file. A useful API can therefore be written within minutes and can then be incrementally extended to also read the optional sections inside the file that describe the item layout and the content of the file.

Items

Items must be primitive types like integers or doubles, or structures of such primitive types. Pointers to memory or strings cannot be stored. While the design targets time series storage, in which case each item has one field that holds a time value, the file format is also useful for storing general homogeneous collections of items (arrays) without time stamps.

File Format Progression

This file format specification is expected to change rarely. Upcoming needs for changes are supposed to be handled by adding new sections, possibly accompanied by dropping others. If for instance the Item Layout Section should be changed, a new section (Item-Layout-2) with a different id would be proposed that has the new format.

API implementation annotations

Apart from the pure file format description, this document occasionally holds annotations about the implementation of APIs.


Sample File

A sample TeaFile holds Time/Price/Volume data, a content description, a name/value description and a description of the time format used. Such a file can be created as follows:

struct Tick
{
    public Time Time;
    public double Price;
    public long Volume;
}
TeaFile<Tick>.Create("lab.tea", "ACME prices", NameValueCollection.From("decimals", 2));

The C# code above creates this binary file:

Offset  Field                  Value                                  Comment
0       Magic Value            0x0d0e0a0402080500                     byte order, file type identifier
8       ItemStart              200                                    binary data area starts at absolute position 200
16      ItemEnd                0                                      binary data area ends at the physical end of file
24      Section Count          4                                      header holds 4 sections
32      Section Id             10                                     Item Section, describes the item stored in the file
36      Next Section Offset    67                                     next section begins at 36 + 4 + 67 = 107
40      ItemSize               24                                     items have a size of 24 bytes
        Item Type Name 'Tick'                                         the name of the item is "Tick"
44      bytes length           4                                      length of "Tick"
48      bytes                  84,105,99,107                          UTF8 bytes of "Tick"
52      Field Count            3                                      the item holds 3 fields
56      Field Type             4                                      first field is of field type 4 (long)
60      Field Offset           0                                      it has an offset of 0 inside the item
        Field Name 'Time'                                             its name is "Time"
64      bytes length           4                                      length of "Time"
68      bytes                  84,105,109,101                         UTF8 bytes of "Time"
72      Field Type             10                                     second field is of field type 10 (double)
76      Field Offset           8                                      byte offset = 8
        Field Name 'Price'                                            name = "Price"
80      bytes length           5
84      bytes                  80,114,105,99,101
89      Field Type             4                                      third field is of type 4 (long)
93      Field Offset           16                                     offset 16
        Field Name 'Volume'                                           name = "Volume"
97      bytes length           6
101     bytes                  86,111,108,117,109,101
107     Section Id             128                                    Content Description Section
111     Next Section Offset    15                                     next section begins at 111 + 4 + 15 = 130
115     bytes length           11                                     length of "ACME prices"
119     bytes                  65,67,77,69,32,112,114,105,99,101,115  UTF8 bytes of "ACME prices"
130     Section Id             129                                    NameValue Section
134     Next Section Offset    24                                     next section begins at 134 + 4 + 24 = 162
138     Name Value Count       1                                      1 name/value pair follows
142     bytes length           8                                      length of "decimals"
146     bytes                  100,101,99,105,109,97,108,115          UTF8 bytes of "decimals"
154     NameValue.Kind         1                                      1 (int)
158     NameValue.Value        2                                      value = 2
162     Section Id             64                                     Time Section
166     Next Section Offset    24                                     next section would be at 166 + 4 + 24 = 194
170     epoch                  719162                                 epoch is 719162, time is measured from 1.1.1970
178     ticks per day          86400000                               86400000 ticks are counted per day, so here we count milliseconds
186     time fields count      1                                      the item holds 1 time field
190     time field offset      0                                      the first (and last) time field is the field at offset 0
194     Padding Bytes          0,0,0,0,0,0                            this file aligns the data area at the next 8 byte boundary
200     item area ...
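The offsets in this layout can be checked by assembling the same header programmatically. The following is a minimal Python sketch, not part of the specification: the helper names `tea_string` and `section` are hypothetical, and little endian byte order is an assumption made for illustration (a real file uses the byte order of the machine that wrote it).

```python
import struct

MAGIC = 0x0D0E0A0402080500
ORDER = "<"  # assumption: the writer is a little endian machine


def tea_string(text):
    """Length-prefixed (int32) UTF8 string, as used throughout the header."""
    raw = text.encode("utf-8")
    return struct.pack(ORDER + "i", len(raw)) + raw


def section(section_id, body):
    """Section Id and Next Section Offset (both int32) followed by the body."""
    return struct.pack(ORDER + "ii", section_id, len(body)) + body


# Item Section body: item size, item name, field count, then the fields.
item_body = struct.pack(ORDER + "i", 24) + tea_string("Tick") + struct.pack(ORDER + "i", 3)
for field_type, offset, name in [(4, 0, "Time"), (10, 8, "Price"), (4, 16, "Volume")]:
    item_body += struct.pack(ORDER + "ii", field_type, offset) + tea_string(name)

content_body = tea_string("ACME prices")
name_value_body = (struct.pack(ORDER + "i", 1) + tea_string("decimals")
                   + struct.pack(ORDER + "ii", 1, 2))          # kind 1 (int), value 2
time_body = struct.pack(ORDER + "qqii", 719162, 86400000, 1, 0)

sections = (section(0x0A, item_body) + section(0x80, content_body)
            + section(0x81, name_value_body) + section(0x40, time_body))

unpadded_size = 32 + len(sections)           # 4 mandatory int64 values + sections
item_start = (unpadded_size + 7) // 8 * 8    # align the item area at 8 bytes
header = (struct.pack(ORDER + "4q", MAGIC, item_start, 0, 4)
          + sections + bytes(item_start - unpadded_size))

print(len(item_body), len(content_body), len(name_value_body), len(time_body))
# prints: 67 15 24 24
print(unpadded_size, item_start, len(header))
# prints: 194 200 200
```

The four section lengths match the Next Section Offset values in the layout, and the 6 padding bytes move the item area from 194 to 200.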

File Format

Header + Data

TeaFiles start with a header followed by the item area holding the data in binary format.

Mandatory Header Fields

Magic Value

0 byte[8] mandatory

Magic value for byte order checking and file detections.

The first 8 bytes of a TeaFile are always the magic number 0x0d0e0a0402080500, which can be memorized as follows:

0d  0e  0a  04  02  08  05  00
 T   e   a for two  at   5 :00

If an application does not find this sequence at the beginning of the file, it should raise an exception. In such a case, the file might have been written by an application running on a machine of different endianness. TeaFiles must have the endianness of the local machine so they can be mapped directly into memory. If files are transferred between systems, the file might need endian conversion before it can be read correctly or mapped into memory.

This magic number allows readers to check the byte order of a file and to detect whether it is a TeaFile at all.

An API can translate different endian-formatted files on the fly for read and write operations. For memory mapping purposes, the file would need to be converted.

This specification treats these 8 bytes as a byte sequence. API writers may use an unsigned 8 byte integer type like unsigned long. It is however also safe to use a signed 64 bit wide integer without causing overflow, since whatever the byte order on a machine is, all byte values within 0x0d0e0a0402080500 are < 0x10, so the sign bit is never set. A signed 64 bit integer is actually recommended, as all other 64 bit wide numbers inside the header are signed, so it reduces the number of types to be read from the file.
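As an illustration of the check described above, the following Python sketch reads the first 8 bytes as a signed 64 bit integer in both byte orders and reports which one yields the magic value. The function name `detect_byte_order` is hypothetical, and the sketch assumes the magic value was written as a 64 bit integer in the writer's native byte order.

```python
import struct

MAGIC = 0x0D0E0A0402080500


def detect_byte_order(first8):
    """Return '<' (little endian) or '>' (big endian) for a TeaFile header.

    Raises ValueError if the bytes are not the magic value in either order.
    """
    if len(first8) != 8:
        raise ValueError("expected exactly 8 bytes")
    for order in ("<", ">"):
        # 'q' is a signed 64 bit integer; safe because every byte is < 0x10.
        if struct.unpack(order + "q", first8)[0] == MAGIC:
            return order
    raise ValueError("not a TeaFile: magic value not found")


print(detect_byte_order(struct.pack("<q", MAGIC)))  # prints: <
print(detect_byte_order(struct.pack(">q", MAGIC)))  # prints: >
```

A reader that gets a byte order different from its own can either convert values on the fly or refuse to memory map the file.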

Item Area Start (ItemStart)

8 int 64 mandatory

The absolute byte offset of the item area.

Setting the file pointer to this position allows reading of the first item in the file. If the first value is placed directly after the header, this value equals the size of the header. Often however, it will be larger: First, the data area should be aligned at the 8 byte boundary to avoid considerable performance penalties from unaligned double or long values. Second, if the description of the file is modified at a time it contains data, the header might increase in size and the change will thus require the item area to be moved. To avoid costly data moving, the file can be created with some headroom behind the header.

Item Area End (ItemEnd)

16 int 64 mandatory

The absolute byte offset of the item area end.

If this value is 0, then the end of the physical file equals the end of the item area. Otherwise it specifies the absolute byte offset of the end of the item area from the beginning of the file.

Leaving this value at 0 provides simple addition of new items to the file. Besides writing new items to the end of the file, no further updates are required. To avoid fragmentation, files might initially be preallocated at a larger size. To allow this, the ItemEnd value can be set to determine the logical end of file. If ItemEnd is not 0, its value will always be in the range

ItemStart <= ItemEnd <= physical file size.

Usage of preallocation requires attention to fsync calls: To be safe in case of failures, items should be fsync'ed before the ItemEnd value is increased. Conversely, in case of deletion, ItemEnd should be decreased on disk first before deletion.

Sections Count

24 int 64 mandatory

The TeaFile header can optionally hold an arbitrary number of sections. This is the number of such sections.

Mandatory header fields and the shortest valid TeaFile

The 4 values above are the only mandatory values of a TeaFile. This means that the shortest possible TeaFile has 4 * 8 = 32 bytes. Such a file will have these values:

Offset  Value               Field          Comment
0       0x0d0e0a0402080500  Magic Value    magic number
8       32                  ItemStart      item area directly follows header
16      0                   ItemEnd        logical end of file is not used
24      0                   Section Count  no sections
32                          file end       item area

Theoretically there is another 32 byte long TeaFile that is also valid:

Offset  Value               Field          Comment
0       0x0d0e0a0402080500  Magic Value    magic number
8       32                  ItemStart      item area directly follows header
16      32                  ItemEnd        logical end of file is used
24      0                   Section Count  no sections
32                          file end       item area

In the latter sample, the ItemEnd value is set although no preallocation is done as the file is by definition 32 bytes long. This will hardly make good sense but the file is still valid.



Computing the number of items in the file

Computing the number of items requires knowledge of the size of the items in the file. This size is either known up front or it is read from the Item Section (see below). Given the item size, the number of items in the file is computed by

N = (physical file size - ItemStart) / Item Size if ItemEnd == 0 or
N = (ItemEnd - ItemStart) / Item Size if ItemEnd != 0

For the 32 byte long TeaFile samples above we get:
N = (32 - 32) / ItemSize which is 0 regardless of the item size and
N = (32 - 32) / ItemSize which is also 0, the only difference being that the first number 32 is now ItemEnd while it was the file size before.
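The computation can be sketched in a few lines of Python. The function name `item_count` is hypothetical, and the default little endian byte order is an assumption; a real reader would use the byte order of the file. The sketch works on any seekable binary file object.

```python
import io
import struct

MAGIC = 0x0D0E0A0402080500


def item_count(f, item_size, byte_order="<"):
    """Number of items in a TeaFile, given the size of one item in bytes."""
    f.seek(0)
    magic, item_start, item_end, _section_count = struct.unpack(
        byte_order + "4q", f.read(32))
    if magic != MAGIC:
        raise ValueError("not a TeaFile, or written in a different byte order")
    if item_end == 0:
        # ItemEnd unused: the item area runs to the physical end of file.
        area_end = f.seek(0, io.SEEK_END)
    else:
        area_end = item_end
    return (area_end - item_start) // item_size


# The shortest valid TeaFile from above, then the same header with 72 data bytes.
minimal = struct.pack("<4q", MAGIC, 32, 0, 0)
print(item_count(io.BytesIO(minimal), 24))              # prints: 0
print(item_count(io.BytesIO(minimal + bytes(72)), 24))  # prints: 3
```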

Optional Header Fields - Sections

TeaFiles may contain sections describing their content. Each section is optional; if it occurs, it must occur only once. This document defines several sections and leaves room for custom sections that allow storage of additional features. For each section (official or custom) the following fields are written:

Section Id

32, ... int 32

Each section has a unique ID. The id of the first section is always at position 32 in the file. Its value is one of the section id values specified in this document or a custom section id above 0xffff. The IDs specified in this document are:

0x0a  Item Section
0x40  Time Section
0x80  Content Section
0x81  NameValue Section

Next Section Offset

36, ... int 32

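Sections are written into the file one after another. Users interested in specific sections only can read the Section Id, decide whether to read the section and optionally skip it by reading the next section offset and adding it to the file pointer to reach the next section id. The offset is counted from the position directly after the Next Section Offset field itself. In the sample file above, the offset field at position 36 holds 67, so the next section id is found at 36 + 4 + 67 = 107.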

This value might equal the length of the section but might also be higher due to padding or because some headroom was reserved for the section such that modifications of the section can be done without affecting other sections or the file's item area.

For the last section, the next section offset shall be ignored by file readers.

Ignoring the next section offset for the last section is easy for readers, as they read the section count beforehand and know when the last section is read. On the other hand, requiring that the last section have a section offset of 0, which would make good sense, makes API writing considerably harder.
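The skipping logic might look like the following Python sketch. The function name `list_sections` is hypothetical and little endian byte order is assumed by default. The sketch returns the id and body position of each section and never uses the next section offset of the last section.

```python
import io
import struct


def list_sections(f, byte_order="<"):
    """Return (section id, absolute body position) for each header section."""
    f.seek(24)  # Section Count is the fourth mandatory int64 value
    (section_count,) = struct.unpack(byte_order + "q", f.read(8))
    sections = []
    for index in range(section_count):
        section_id, next_offset = struct.unpack(byte_order + "ii", f.read(8))
        body_start = f.tell()
        sections.append((section_id, body_start))
        if index < section_count - 1:
            # Skip the body plus any padding or headroom behind it.
            f.seek(body_start + next_offset)
    return sections


# Two sections: a Content Section with 5 bytes of headroom, then a
# NameValue Section whose (ignored) offset holds an arbitrary value.
demo = (struct.pack("<4q", 0x0D0E0A0402080500, 0, 0, 2)
        + struct.pack("<ii", 0x80, 20) + bytes(20)
        + struct.pack("<ii", 0x81, 999) + bytes(4))
print(list_sections(io.BytesIO(demo)))  # prints: [(128, 40), (129, 68)]
```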

Item Section

The Item Section describes the layout of the items stored in the file, in case of structured items the name, type and offset of its fields. This is the most important section as it allows reading the data without any further knowledge about the file.

All fields below are written without any padding in between.

Item Size

int 32

The size of the item. This is the space occupied by each item in the file, including any padding bytes.

Item Name

string

The name of the type, like "OHLCV", as UTF8 string. Strings are written length prefixed (int32) followed by the text data in UTF8.

Field Count

int 32

The number of fields in the item. This value must be >= 1. If the Item Type of a TeaFile is a plain primitive value, the Item Name might be "Int" while the single field value might be "Value". In other words primitives are treated as structs with one field having the type of the primitive.

Fields

struct - Field

For each field, a Field description is written holding:

Field.Type

int32

The types of fields inside a TeaFile should be available on as many platforms as possible to support versatile data access. At the same time, TeaFiles might be used in focused scenarios where such ubiquitous data access is not required and where other formats are desirable. The file format foresees this possibility.

FieldType is either a value from 1 to 10, indicating a platform agnostic, ubiquitous type, or a platform specific type, which currently encompasses the .Net decimal. For fields holding other types, one can use values >= 0x1000.

    // platform agnostic
    Int8 = 1,
    Int16 = 2,
    Int32 = 3,
    Int64 = 4,

    UInt8 = 5,
    UInt16 = 6,
    UInt32 = 7,
    UInt64 = 8,

    Float = 9, // IEEE 754
    Double = 10, // IEEE 754

    // platform specific
    NetDecimal = 0x200,

    // private extensions must have integer identifiers of 0x1000 or above.
    Custom = 0x1000

Field.Offset

int32

The field's byte-offset inside the item.

Field.Name

string

The name of the field as UTF8 string. Strings are written length prefixed (int32) followed by the text data in UTF8.
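Reading an Item Section body then amounts to a sequence of int32 and string reads. The following Python sketch illustrates this under stated assumptions: the function names are hypothetical, little endian byte order is assumed by default, and the file pointer is expected to be positioned directly after the Next Section Offset field of the Item Section.

```python
import io
import struct

FIELD_TYPE_NAMES = {
    1: "Int8", 2: "Int16", 3: "Int32", 4: "Int64",
    5: "UInt8", 6: "UInt16", 7: "UInt32", 8: "UInt64",
    9: "Float", 10: "Double", 0x200: "NetDecimal",
}


def read_int32(f, byte_order):
    return struct.unpack(byte_order + "i", f.read(4))[0]


def read_string(f, byte_order):
    """Length-prefixed (int32) UTF8 string."""
    length = read_int32(f, byte_order)
    return f.read(length).decode("utf-8")


def read_item_section(f, byte_order="<"):
    """Return (item size, item name, [(type name, offset, field name), ...])."""
    item_size = read_int32(f, byte_order)
    item_name = read_string(f, byte_order)
    field_count = read_int32(f, byte_order)
    fields = []
    for _ in range(field_count):
        field_type = read_int32(f, byte_order)
        offset = read_int32(f, byte_order)
        name = read_string(f, byte_order)
        # Unknown (custom) types are reported by their numeric id.
        fields.append((FIELD_TYPE_NAMES.get(field_type, hex(field_type)), offset, name))
    return item_size, item_name, fields


# An item holding a single double value, written as a struct with one field.
body = (struct.pack("<i", 8) + struct.pack("<i", 6) + b"Double"
        + struct.pack("<i", 1)
        + struct.pack("<ii", 10, 0) + struct.pack("<i", 5) + b"Value")
print(read_item_section(io.BytesIO(body)))
# prints: (8, 'Double', [('Double', 0, 'Value')])
```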

Content Description Section

Description

string

A string that describes the content of the file. Examples: "Daily average temperature Easy Village", "Prices of ACME at NYSE".

Name/Value Section

Name/Value Count

int32

The number of name/value pairs to follow.

Name/Value Pairs

struct - NameValue

For each Name/Value pair, the following values are written

NameValue.Name

string

The name.

NameValue.Kind

int32

An integer value indicating the type of the value to follow. The sample file above uses Kind 1 for an int32 value.

NameValue.Value

int32/double/string/uuid

Depending on the preceding value of Kind, the value will hold an int32, double, string or uuid value. A uuid value is a sequence of 16 bytes considered to be unique.

Time Section

This section describes how a number is interpreted as time and which fields in the item are to be interpreted as time.

Epoch

int64

Time is specified as the number of "tick" intervals that elapsed since some origin. The epoch value gives this origin as the number of days that passed between 1.1.0001 and the origin. For the .Net System.DateTime type, which counts ticks since 1.1.0001, this value would be 0. For time systems counting from 1.1.1970, the epoch is 719162.

Ticks Per Day

int64

This value specifies how many ticks are counted per day. Conversely this number specifies the length of the tick interval. A value of 1 would mean that days are counted, 86400 would mean seconds and 86400000 milliseconds.

For TeaFiles that shall be accessed from various platforms, the Java time format is recommended: its millisecond resolution is often sufficient, and due to its origin at 1.1.1970, its values remain much smaller than those of .Net System.DateTime, which yields values that are hard to deal with in applications that cannot handle 64 bit integers, like R.
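How Epoch and Ticks Per Day combine can be illustrated with a short Python sketch. The function name `ticks_to_datetime` is hypothetical. The sketch assumes that an epoch of 0 corresponds to Python's `datetime(1, 1, 1)`, which is the choice that maps the epoch 719162 to 1.1.1970, and it truncates resolutions finer than a microsecond.

```python
from datetime import datetime, timedelta

MICROSECONDS_PER_DAY = 86_400_000_000


def ticks_to_datetime(ticks, epoch_days, ticks_per_day):
    """Convert a stored tick count to a datetime using the Time Section values."""
    days, remainder = divmod(ticks, ticks_per_day)
    # Integer arithmetic avoids floating point rounding of the sub-day part.
    microseconds = remainder * MICROSECONDS_PER_DAY // ticks_per_day
    return datetime(1, 1, 1) + timedelta(days=epoch_days + days,
                                         microseconds=microseconds)


# Values from the sample file: epoch 719162, 86400000 ticks per day (milliseconds).
print(ticks_to_datetime(0, 719162, 86400000))     # prints: 1970-01-01 00:00:00
print(ticks_to_datetime(1500, 719162, 86400000))  # prints: 1970-01-01 00:00:01.500000
```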

Time Fields Count

int32

For time series, the item will hold one field that is the event time, the time at which the observation or the event occurred. If the file holds a plain collection of values and is not a time series, then no time field might exist. On the other hand, items might have an event time and, in addition, other time fields. In summary, the number of time fields is between 0 and the number of fields in the item.

If an item has more than one time field, the first one is considered the event time. The values of this field must then be non-decreasing. In other words, each event must have a time >= the time of the previous event.

Time Fields

int32[]

As many int32 values follow as [Time Fields Count] indicates. Each of these values holds the offset of the corresponding field.

The reason to use the offset instead of the field index is that an application that shall select an interval of values from a TeaFile needs to read only the Time Section in order to deliver all values from t1 to t2.
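A selection of this kind can be sketched in Python as a binary search over the item area, using nothing but the item size and the time field offset. The function name `select_range` is hypothetical. The sketch assumes the event time is stored as a signed 64 bit integer (as in the sample file), that times are non-decreasing, and that the byte order is little endian by default.

```python
import struct


def select_range(item_area, item_size, time_offset, t1, t2, byte_order="<"):
    """Return the bytes of all items whose event time lies in [t1, t2]."""
    count = len(item_area) // item_size

    def time_at(index):
        return struct.unpack_from(byte_order + "q", item_area,
                                  index * item_size + time_offset)[0]

    def lower_bound(t):
        """Index of the first item with time >= t."""
        lo, hi = 0, count
        while lo < hi:
            mid = (lo + hi) // 2
            if time_at(mid) < t:
                lo = mid + 1
            else:
                hi = mid
        return lo

    first = lower_bound(t1)
    last = lower_bound(t2 + 1)  # integer ticks: first item with time > t2
    return item_area[first * item_size:last * item_size]


# Five 24 byte items (time, price, volume) with times 10, 20, 20, 30, 40.
items = b"".join(struct.pack("<qdq", t, 1.0, 100) for t in (10, 20, 20, 30, 40))
selected = select_range(items, 24, 0, 20, 30)
print(len(selected) // 24)  # prints: 3
```

Because only raw bytes at a known offset are inspected, the search needs neither the field names nor the field types from the Item Section.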