TSHISTORY =========== This is a library to store/retrieve pandas timeseries to/from a postgres database, tracking their successive versions. [TOC] # Introduction ## Purpose `tshistory` is targetted at applications using time series where [backtesting][backtesting] and [cross-validation][cross-validation] are an essential feature. It provides exhaustivity and efficiency of the storage, with a simple Python api. It can be used as a building block for machine learning, model optimization and validation, both for inputs and outputs. ## Principles There are many ways to represent timeseries in a relational database, and `tshistory` provides two things: * a base python API which abstracts away the underlying storage * a postgres model, which emphasizes the compact storage of successive states of series The core idea of tshistory is to handle successive versions of timeseries as they grow in time, allowing to get older states of any series. # Basic usage ## Starting with a fresh database You need a postgresql database. You can create one like this: ```shell createdb mydb ``` Then, initialize the `tshistory` tables, like this: ```python tsh init-db postgresql://me:password@localhost/mydb ``` From this you're ready to go ! ## Creating a series However here's a simple example: ```python >>> from sqlalchemy import create_engine >>> import pandas as pd >>> from tshistory.tsio import timeseries >>> >>> engine = create_engine('postgres://me:password@localhost/mydb') >>> tsh = timeseries() >>> >>> series = pd.Series([1, 2, 3], ... pd.date_range(start=pd.Timestamp(2017, 1, 1), ... freq='D', periods=3)) # db insertion >>> with engine.begin() as cn: >>> tsh.insert(cn, series, 'my_series', 'babar@pythonian.fr') ... 2017-01-01 1.0 2017-01-02 2.0 2017-01-03 3.0 Freq: D, Name: my_series, dtype: float64 # note how our integers got turned into floats # (there are no provisions to handle integer series as of today) # retrieval >>> with engine.begin() as cn: >>> tsh.get(cn, 'my_series') ... 2017-01-01 1.0 2017-01-02 2.0 2017-01-03 3.0 Name: my_series, dtype: float64 ``` ## Updating a series This is good. Now, let's insert more: ```python >>> series = pd.Series([2, 7, 8, 9], ... pd.date_range(start=pd.Timestamp(2017, 1, 2), ... freq='D', periods=4)) # db insertion >>> with engine.begin() as cn: >>> tsh.insert(cn, series, 'my_series', 'babar@pythonian.fr') ... 2017-01-03 7.0 2017-01-04 8.0 2017-01-05 9.0 Name: my_series, dtype: float64 # you get back the *new information* you put inside # and this is why the `2` doesn't appear (it was already put # there in the first step) # db retrieval >>> with engine.begin() as cn: >>> tsh.get(engine, 'my_series') ... 2017-01-01 1.0 2017-01-02 2.0 2017-01-03 7.0 2017-01-04 8.0 2017-01-05 9.0 Name: my_series, dtype: float64 ``` It is important to note that the third value was *replaced*, and the two last values were just *appended*. As noted the point at `2017-1-2` wasn't a new information so it was just ignored. ## Retrieving history We can access the whole history (or parts of it) in one call: ```python >>> with engine.begin() as cn: >>> history = tsh.history(engine, 'my_series') ... >>> >>> for idate, series in history.items(): # it's a dict ... print('insertion date:', idate) ... print(series) ... insertion date: 2018-09-26 17:10:36.988920+02:00 2017-01-01 1.0 2017-01-02 2.0 2017-01-03 3.0 Name: my_series, dtype: float64 insertion date: 2018-09-26 17:12:54.508252+02:00 2017-01-01 1.0 2017-01-02 2.0 2017-01-03 7.0 2017-01-04 8.0 2017-01-05 9.0 Name: my_series, dtype: float64 ``` Note how this shows the full serie state for each insertion date. Also the insertion date is timzeone aware. It is possible to show the differences only: ```python >>> with engine.begin() as cn: >>> diffs = tsh.history(engine, 'my_series', diffmode=True) ... >>> for idate, series in tsh.history(engine, 'ts', diffmode=True).items(): ... print('insertion date:', idate) ... print(series) ... insertion date: 2018-09-26 17:10:36.988920+02:00 2017-01-01 1.0 2017-01-02 2.0 2017-01-03 3.0 Name: my_series, dtype: float64 insertion date: 2018-09-26 17:12:54.508252+02:00 2017-01-03 7.0 2017-01-04 8.0 2017-01-05 9.0 Name: my_series, dtype: float64 ``` You can see a series metadata: ```python >>> tsh.metadata(engine, 'my_series') {'tzaware': False, 'index_type': 'datetime64[ns]', 'value_type': 'float64', 'index_dtype': '