    snapshot/storage: low-level optimisation · dbce79810ccf
    Aurélien Campéas authored
    JSON serialization is replaced with a lower-level scheme,
    affecting both string and numeric series.
    The purpose is to reduce the cost of deserialization, which is
    currently quite high.
    For numerical values, we serialize the underlying C array
    (while recording its in-memory layout/dtype).
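    The numeric scheme can be sketched as follows. This is a minimal, hypothetical illustration of the idea (the names `serialize`/`deserialize` are not the actual tshistory API): keep the dtype string next to the raw bytes, and rebuild the array with `np.frombuffer`.

    ```python
    import numpy as np

    def serialize(arr):
        # record the dtype alongside the raw C buffer so the
        # array can be reconstructed on the way out
        return arr.dtype.str, arr.tobytes()

    def deserialize(dtypestr, raw):
        # np.frombuffer yields a read-only view over the bytes;
        # copying into a fresh array gives a writable result,
        # which sidesteps consumers that try to mutate their input
        return np.array(np.frombuffer(raw, dtype=np.dtype(dtypestr)))

    a = np.array([1.5, 2.0, np.nan])
    meta, raw = serialize(a)
    b = deserialize(meta, raw)
    ```

    Deserialization this way is a single memory copy, with no per-value parsing, which is where the read-path speedup comes from.
    
    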
    The performance improvement on the read path is quite worthwhile:
    Before:
    TSH GET 0.005136966705322266
    TSH HIST 0.5647647380828857
    DELTA all value dates 2.0582079887390137
    DELTA 1 day  0.20743083953857422
           class                      test      time
    0  TimeSerie            bigdata_insert  1.332391
    1  TimeSerie       bigdata_history_all  1.718589
    2  TimeSerie    bigdata_history_chunks  1.613754
    3  TimeSerie          manydiffs_insert  0.940170
    4  TimeSerie     manydiffs_history_all  0.996268
    5  TimeSerie  manydiffs_history_chunks  2.115351
    After:
    TSH GET 0.004252910614013672
    TSH HIST 0.11956286430358887
    DELTA all value dates 1.7346818447113037
    DELTA 1 day  0.16817998886108398
           class                      test      time
    0  TimeSerie            bigdata_insert  1.297348
    1  TimeSerie       bigdata_history_all  0.173700
    2  TimeSerie    bigdata_history_chunks  0.181005
    3  TimeSerie          manydiffs_insert  0.846298
    4  TimeSerie     manydiffs_history_all  0.084483
    5  TimeSerie  manydiffs_history_chunks  0.216825
    A few notes:
    * serialization of strings is a bit tricky, since we need to
      encode None/NaN values and use a separator for their
      concatenation (ASCII control characters 0 and 3 are
      forbidden from ever appearing in the data)
    * we have to wrap the `index` low-level byte string in a
      Python array to work around an obscure pandas bug in the
      `index.isin` computation (`isin` attempts a mutation!)
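    The string scheme described in the first note might look like this. A minimal sketch, not the actual tshistory code; in particular, the assignment of chr(0) as the None marker and chr(3) as the separator is an assumption — the commit only says both characters are forbidden in the data.

    ```python
    SEP = b'\x03'   # assumed separator between concatenated entries
    NONE = b'\x00'  # assumed marker encoding a None/NaN slot

    def serialize_strings(values):
        chunks = []
        for v in values:
            if v is None:
                chunks.append(NONE)
            else:
                raw = v.encode('utf-8')
                # the scheme only works if the payload can never
                # contain the two reserved control characters
                assert SEP not in raw and NONE not in raw
                chunks.append(raw)
        return SEP.join(chunks)

    def deserialize_strings(raw):
        return [None if chunk == NONE else chunk.decode('utf-8')
                for chunk in raw.split(SEP)]
    ```

    Reserving two control characters keeps the encoding compact: one flat byte string per series, split on a single byte, instead of a parsed JSON document.
    
    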
    Thanks to Alain Leufroy for the proposal!
    Resolves #49.