Data Encoding

This notebook provides a short walkthrough of some of the data encoding features of the sharrow library.

import numpy as np
import pandas as pd
import xarray as xr

import sharrow as sh

sh.__version__
'2.8.2'

Example Data

We’ll begin by importing some example data to work with. We’ll be using some test data taken from the MTC example in the ActivitySim project. For this data encoding walkthrough, we’ll focus on the skims containing transportation level of service information for travel around a tiny slice of San Francisco.

We’ll load them as a multi-dimensional xarray.Dataset, or more precisely a sharrow.Dataset, which is a subclass of the xarray version that adds some useful features, including compatibility with automatic tools for recoding data.

skims = sh.example_data.get_skims()
skims
<xarray.Dataset> Size: 2MB
Dimensions:               (otaz: 25, dtaz: 25, time_period: 5)
Coordinates:
  * dtaz                  (dtaz) int64 200B 1 2 3 4 5 6 7 ... 20 21 22 23 24 25
  * otaz                  (otaz) int64 200B 1 2 3 4 5 6 7 ... 20 21 22 23 24 25
  * time_period           (time_period) <U2 40B 'EA' 'AM' 'MD' 'PM' 'EV'
Data variables: (12/170)
    DIST                  (otaz, dtaz) float32 2kB dask.array<chunksize=(25, 25), meta=np.ndarray>
    DISTBIKE              (otaz, dtaz) float32 2kB dask.array<chunksize=(25, 25), meta=np.ndarray>
    DISTWALK              (otaz, dtaz) float32 2kB dask.array<chunksize=(25, 25), meta=np.ndarray>
    DRV_COM_WLK_BOARDS    (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    DRV_COM_WLK_DDIST     (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    DRV_COM_WLK_DTIM      (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    ...                    ...
    WLK_TRN_WLK_IVT       (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    WLK_TRN_WLK_IWAIT     (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    WLK_TRN_WLK_WACC      (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    WLK_TRN_WLK_WAUX      (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    WLK_TRN_WLK_WEGR      (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    WLK_TRN_WLK_XWAIT     (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>

Because sharrow uses the xarray.Dataset format to work with data, individual variables in each Dataset can be encoded in different data types. For example, automobile travel times can be stored with high(er) precision floating point numbers, while transit fares, which vary less and have a narrower range, can be stored with lower precision. This allows a user to choose the most efficient encoding for each variable, if desired.
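
For example, nothing prevents one Dataset from mixing dtypes. Here is a small illustration we've added (the DIST_F16 name is ours, not part of the example data):

# Add a half-precision copy of DIST alongside the float32 original.
mixed = skims.assign({"DIST_F16": skims.DIST.astype(np.float16)})
mixed.DIST.dtype, mixed.DIST_F16.dtype  # (dtype('float32'), dtype('float16'))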

Fixed Point Encoding

Very often, data (especially skim matrices like these) can be expressed adequately with far less precision than a standard 32-bit floating point representation allows. In these cases, it may be beneficial to store the data with “fixed point” encoding, also sometimes called scaled integers.

Instead of storing values as 32-bit floating point numbers, they can be multiplied by a scale factor (e.g., 100) and then converted to 16-bit integers. This uses half the RAM and can still express any value (at two decimal point precision) from -327.68 up to 327.67. If the lowest values in that range are never needed, the range can also be shifted, moving both the bottom and top limits by a fixed amount. Then, for a particular scale \(\mu\) and shift \(\xi\) (stored in metadata), from any encoded array element \(i\) the implied (original) value \(x\) can quickly be recovered by evaluating \(x = (i / \mu) - \xi\).
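
To make this concrete, here is a minimal sketch of the round trip in plain NumPy. This is an illustration rather than sharrow's actual implementation; it follows the convention used by array_encode below, where scale is the size of one encoded step (i.e. \(1/\mu\)) and offset shifts the range.

scale, offset = 0.01, 0.0
x = np.array([0.12, 0.24, 300.0], dtype=np.float32)
encoded = np.round((x - offset) / scale).astype(np.int16)  # e.g. 0.12 becomes 12
decoded = encoded * scale + offset  # recovers 0.12, 0.24, 300.0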

Sharrow includes a pair of functions to encode and decode arrays in this manner. These functions also attach the necessary metadata to the Dataset objects, so that later when we construct sharrow.Flow instances, they can decode arrays automatically.

from sharrow.digital_encoding import array_decode, array_encode

The distance data in the skims is a great candidate for fixed point encoding. We can peek at the top corner of this array:

skims.DIST.values[:2, :3]
array([[0.12, 0.24, 0.44],
       [0.37, 0.14, 0.28]], dtype=float32)

The data are all small(ish) values with two decimal point fixed precision, so we can probably efficiently encode this data by scaling by 100. If we’re not sure, we can confirm by checking the range of values, to make sure it fits inside the 16-bit integers we’re hoping to use.

skims.DIST.values.min(), skims.DIST.values.max()
(0.1, 2.7)

That’s a really small range because this is only test data. But even the full-scale MTC skims spanning the entire region don’t contain distances over 300 miles.
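
As a quick arithmetic check (added here), the top of the int16 range at a scale of 0.01 comfortably covers that:

np.iinfo(np.int16).max * 0.01  # 327.67 is the largest encodable value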

We can create a new DataArray and apply fixed point encoding using the array_encode function.

distance_encoded = array_encode(skims.DIST, scale=0.01, offset=0)
distance_encoded.values[:2, :3]
array([[12, 24, 44],
       [37, 14, 28]], dtype=int16)

We can apply that function for any number of variables in the skims, and create a new Dataset that includes the encoded arrays.

skims_encoded = skims.assign({"DIST": array_encode(skims.DIST, scale=0.01, offset=0)})

To manage the digital encodings across an entire dataset, sharrow implements a digital_encoding accessor. You can use it to apply encodings to one or more variables in a simple fashion.

skims_encoded = skims_encoded.digital_encoding.set(
    ["DISTWALK", "DISTBIKE"], scale=0.01, offset=0
)

And you can review the encodings for every variable in the dataset like this:

skims_encoded.digital_encoding.info()
{'DIST': {'scale': 0.01, 'offset': 0, 'missing_value': None},
 'DISTBIKE': {'scale': 0.01, 'offset': 0, 'missing_value': None},
 'DISTWALK': {'scale': 0.01, 'offset': 0, 'missing_value': None}}
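
As a sanity check (an extra step beyond the original walkthrough), we can decode one of these variables and confirm the values survive the round trip to within the 0.01 quantization step:

# Decoded distances should match the originals within half a step.
np.allclose(array_decode(skims_encoded.DIST), skims.DIST, atol=0.005)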

To demonstrate that the encoding works transparently with a Flow, we can construct a simple flow that extracts the distance and square of distance for the top corner of values we looked at above.

First we’ll do so for a flow with the original float32 encoded skims.

pairs = pd.DataFrame({"orig": [0, 0, 0, 1, 1, 1], "dest": [0, 1, 2, 0, 1, 2]})
tree = sh.DataTree(
    base=pairs,
    skims=skims.drop_dims("time_period"),
    relationships=(
        "base.orig -> skims.otaz",
        "base.dest -> skims.dtaz",
    ),
)
flow = tree.setup_flow({"d1": "DIST", "d2": "DIST**2"})
arr = flow.load()
arr
array([[0.12      , 0.0144    ],
       [0.24      , 0.0576    ],
       [0.44      , 0.1936    ],
       [0.37      , 0.13690001],
       [0.14      , 0.0196    ],
       [0.28      , 0.0784    ]], dtype=float32)

We can do the same for the encoded skims, and we get effectively the same result (identical up to float32 rounding in the last digit of the squared term), even though the encoded skims use less RAM.

tree_enc = sh.DataTree(
    base=pairs,
    skims=skims_encoded.drop_dims("time_period"),
    relationships=(
        "base.orig -> skims.otaz",
        "base.dest -> skims.dtaz",
    ),
)
flow_enc = tree_enc.setup_flow({"d1": "DIST", "d2": "DIST**2"})
arr_enc = flow_enc.load()
arr_enc
array([[0.12  , 0.0144],
       [0.24  , 0.0576],
       [0.44  , 0.1936],
       [0.37  , 0.1369],
       [0.14  , 0.0196],
       [0.28  , 0.0784]], dtype=float32)
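
We can confirm the agreement directly (a check added here):

np.allclose(arr, arr_enc)  # agrees despite the last-digit difference in row 3 of d2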

Dictionary Encoding

For skim matrices where the universe of all possible cell values can be adequately represented by just 255 unique values, we can use an explicit mapping process called “dictionary encoding”. It works by storing those unique values in a tiny base array; in the main body of the skim data, we then store only offsets that point to positions in that base array. This reduces the marginal memory footprint of each array cell to just an 8-bit integer, cutting memory requirements by up to 75% compared to float32 storage. The approach is particularly appropriate for many transit skims, as fares, wait times, and transfers can almost always be reduced to a dictionary encoding with no meaningful information loss.
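
The mechanics are easy to see in plain NumPy (an illustrative sketch, not sharrow's implementation):

fares = np.array([0.0, 152.0, 474.0, 152.0, 0.0, 626.0], dtype=np.float32)
dictionary, offsets = np.unique(fares, return_inverse=True)
offsets = offsets.astype(np.uint8)  # one byte per cell instead of four
recovered = dictionary[offsets]     # decoding is just an array lookup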

For example, the 'WLK_LOC_WLK_FAR' array containing fares only has four unique values:

np.unique(skims.WLK_LOC_WLK_FAR)
array([  0., 152., 474., 626.], dtype=float32)

We can see various fares applied at different time periods if we look at the top corner of the array:

skims.WLK_LOC_WLK_FAR.values[:2, :3, :]
array([[[  0.,   0.,   0.,   0.,   0.],
        [152., 474., 474., 152., 474.],
        [152., 474., 474., 152., 474.]],

       [[152., 152., 474., 474., 152.],
        [  0.,   0.,   0.,   0.,   0.],
        [152., 474., 474., 152., 474.]]], dtype=float32)

Once encoded, the array itself only contains offset pointers (small integers), plus the original values stored in metadata.

wlwfare_enc = array_encode(skims.WLK_LOC_WLK_FAR, bitwidth=8, by_dict=True)
wlwfare_enc.values[:2, :3, :]
array([[[0, 0, 0, 0, 0],
        [1, 2, 2, 1, 2],
        [1, 2, 2, 1, 2]],

       [[1, 1, 2, 2, 1],
        [0, 0, 0, 0, 0],
        [1, 2, 2, 1, 2]]], dtype=uint8)
wlwfare_enc.attrs["digital_encoding"]["dictionary"]
array([  0., 152., 474., 626.], dtype=float32)
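
Decoding is then just an indexed lookup into that dictionary; for example, this manual check (added for illustration) reproduces the original top-corner fares:

wlwfare_enc.attrs["digital_encoding"]["dictionary"][wlwfare_enc.values[:2, :3, :]]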

If we want to recover the original data for analysis (other than in a Flow, which can decode it automatically), we can use the array_decode function.

array_decode(wlwfare_enc)
<xarray.DataArray 'WLK_LOC_WLK_FAR' (otaz: 25, dtaz: 25, time_period: 5)> Size: 12kB
array([[[  0.,   0.,   0.,   0.,   0.],
        [152., 474., 474., 152., 474.],
        [152., 474., 474., 152., 474.],
        ...,
        [152., 152., 152., 152., 152.],
        [152., 152., 152., 152., 474.],
        [152., 152., 152., 152., 474.]],

       [[152., 152., 474., 474., 152.],
        [  0.,   0.,   0.,   0.,   0.],
        [152., 474., 474., 152., 474.],
        ...,
        [152., 152., 152., 152., 152.],
        [152., 152., 152., 152., 152.],
        [152., 152., 152., 152., 474.]],

       [[152., 474., 474., 474., 474.],
        [152., 152., 152., 152., 474.],
        [  0.,   0.,   0.,   0.,   0.],
        ...,
...
        ...,
        [  0.,   0.,   0.,   0.,   0.],
        [152., 152., 152., 152., 152.],
        [152., 152., 152., 152., 152.]],

       [[152., 152., 152., 152., 152.],
        [152., 152., 152., 152., 474.],
        [152., 152., 152., 152., 474.],
        ...,
        [152., 152., 152., 152., 152.],
        [  0.,   0.,   0.,   0.,   0.],
        [152., 152., 152., 152., 152.]],

       [[152., 152., 152., 152., 152.],
        [152., 152., 152., 152., 152.],
        [152., 152., 152., 152., 152.],
        ...,
        [152., 152., 152., 152., 152.],
        [152., 152., 152., 152., 152.],
        [  0.,   0.,   0.,   0.,   0.]]], dtype=float32)
Coordinates:
  * dtaz         (dtaz) int64 200B 1 2 3 4 5 6 7 8 9 ... 18 19 20 21 22 23 24 25
  * otaz         (otaz) int64 200B 1 2 3 4 5 6 7 8 9 ... 18 19 20 21 22 23 24 25
  * time_period  (time_period) <U2 40B 'EA' 'AM' 'MD' 'PM' 'EV'

Joint Dict Encoding

Dictionary encoding can be expanded to map multiple different variables using the same underlying offsets array. For large datasets with several dimension lengths in the thousands, the offset array may constitute the vast majority of the memory usage, so sharing the same offsets for several variables can result in huge reductions in the memory footprint.
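
The underlying idea can be sketched in plain NumPy (illustrative only, not sharrow's implementation): treat the values of the grouped variables at each cell as one row, find the unique rows, and keep a single shared offsets array.

a = np.array([0.50, 2.25, 0.50, 0.75], dtype=np.float32)
b = np.array([0.50, 2.00, 0.50, 0.75], dtype=np.float32)
stacked = np.stack([a, b], axis=1)  # one row of values per cell
rows, offsets = np.unique(stacked, axis=0, return_inverse=True)
a_decoded = rows[offsets, 0]  # both variables decode from the same
b_decoded = rows[offsets, 1]  # shared offsets array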

The joint dictionary can be applied using the set method of the digital_encoding accessor, giving a list of the variables to jointly encode:

skims1 = skims.digital_encoding.set(
    [
        "WLK_LOC_WLK_FAR",
        "WLK_EXP_WLK_FAR",
        "WLK_HVY_WLK_FAR",
        "DRV_LOC_WLK_FAR",
        "DRV_HVY_WLK_FAR",
        "DRV_EXP_WLK_FAR",
    ],
    joint_dict=True,
)

A unique name is automatically generated for the group when joint_dict is set to True. Alternatively, the user can specify a name by passing a string as the joint_dict argument. Different sets of variables in the same Dataset can be grouped and encoded jointly, but each group must have a unique name.

skims1 = skims1.digital_encoding.set(
    ["DISTBIKE", "DISTWALK"],
    joint_dict="jointWB",
)

The resulting dataset adds an offsets variable for each created group, and each named variable in the group is replaced by a new one-dimensional array of unique values, indexed along the group’s new dimension.

skims1
<xarray.Dataset> Size: 2MB
Dimensions:               (otaz: 25, dtaz: 25, time_period: 5, joined_0: 40,
                           jointWB: 192)
Coordinates:
  * dtaz                  (dtaz) int64 200B 1 2 3 4 5 6 7 ... 20 21 22 23 24 25
  * otaz                  (otaz) int64 200B 1 2 3 4 5 6 7 ... 20 21 22 23 24 25
  * time_period           (time_period) <U2 40B 'EA' 'AM' 'MD' 'PM' 'EV'
Dimensions without coordinates: joined_0, jointWB
Data variables: (12/172)
    DIST                  (otaz, dtaz) float32 2kB dask.array<chunksize=(25, 25), meta=np.ndarray>
    DRV_COM_WLK_BOARDS    (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    DRV_COM_WLK_DDIST     (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    DRV_COM_WLK_DTIM      (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    DRV_COM_WLK_FAR       (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    DRV_COM_WLK_IWAIT     (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    ...                    ...
    DRV_LOC_WLK_FAR       (joined_0) float32 160B 0.0 152.0 ... 152.0 474.0
    DRV_HVY_WLK_FAR       (joined_0) float32 160B 0.0 206.0 ... 220.0 220.0
    DRV_EXP_WLK_FAR       (joined_0) float32 160B 0.0 139.0 ... 244.0 244.0
    jointWB_offsets       (otaz, dtaz) uint8 625B 112 107 135 103 ... 68 113 137
    DISTBIKE              (jointWB) float32 768B 0.5 2.25 0.75 ... 1.72 1.97
    DISTWALK              (jointWB) float32 768B 0.5 2.25 0.75 ... 1.72 1.97

Skims encoded in this manner can be fed into sharrow, and flows will compile and return the same results as if the data were not encoded. If you are mixing compiled flows between encoded and unencoded Datasets (which should be unusual, but we’ve done it for the examples in this notebook), you’ll need to set the hashing_level to at least 2 to ensure the correct numba code is matched with the encodings used in the data.

tree1 = sh.DataTree(
    base=pairs,
    skims=skims1,
    rskims=skims1,
    relationships=(
        "base.orig -> skims.otaz",
        "base.dest -> skims.dtaz",
        "base.orig -> rskims.dtaz",
        "base.dest -> rskims.otaz",
    ),
)
flow1 = tree1.setup_flow(
    {
        "d1": 'skims["WLK_LOC_WLK_FAR", "AM"]',
        "d2": 'skims["WLK_LOC_WLK_FAR", "AM"]**2',
        "w1": "skims.DISTWALK",
        "w2": 'skims.reverse("DISTWALK")',
        "w3": "rskims.DISTWALK",
        "x1": "skims.DIST",
        "x2": 'skims.reverse("DIST")',
    },
    hashing_level=2,
)
arr1 = flow1.load_dataframe()
arr1
      d1        d2    w1    w2    w3    x1    x2
0    0.0       0.0  0.12  0.12  0.12  0.12  0.12
1  474.0  224676.0  0.24  0.37  0.37  0.24  0.37
2  474.0  224676.0  0.44  0.57  0.57  0.44  0.57
3  152.0   23104.0  0.37  0.24  0.24  0.37  0.24
4    0.0       0.0  0.14  0.14  0.14  0.14  0.14
5  474.0  224676.0  0.28  0.28  0.28  0.28  0.28

Pandas Categorical Dtype

Dictionary encoding is very similar to the approach used for the pandas Categorical dtype, and can be used to achieve some of the efficiencies of categorical data, even though xarray lacks a formal native categorical data representation. Sharrow’s construct function for creating Dataset objects will automatically use dictionary encoding for “category” data.

To demonstrate, we’ll load some household data and create a categorical data column.

hh = sh.example_data.get_households()
hh["income_grp"] = pd.cut(
    hh.income, bins=[-np.inf, 30000, 60000, np.inf], labels=["Low", "Mid", "High"]
)
hh = hh[["income", "income_grp"]]
hh.head()
         income income_grp
HHID
2717868  361000       High
763899    59220        Mid
2222791  197000       High
112477     2200        Low
370491    16500        Low
hh.info()
<class 'pandas.core.frame.DataFrame'>
Index: 5000 entries, 2717868 to 702559
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   income      5000 non-null   int64   
 1   income_grp  5000 non-null   category
dtypes: category(1), int64(1)
memory usage: 83.1 KB

We’ll then create a Dataset using construct.

hh_dataset = sh.dataset.construct(hh[["income", "income_grp"]])
hh_dataset
<xarray.Dataset> Size: 85kB
Dimensions:     (HHID: 5000)
Coordinates:
  * HHID        (HHID) int64 40kB 2717868 763899 2222791 ... 2049372 702559
Data variables:
    income      (HHID) int64 40kB 361000 59220 197000 2200 ... 0 103000 14800
    income_grp  (HHID) int8 5kB 2 1 2 0 0 0 0 0 0 2 0 ... 0 0 0 1 1 0 0 0 0 2 0

Note that the “income” variable remains an integer as expected, but the “income_grp” variable, which had been a “category” dtype in pandas, is now stored as an int8, giving the category index of each element (it would be an int16 or larger if needed, but that’s not necessary with only 3 categories). The information about the labels for the categories is retained not in the data itself but in the digital_encoding:

hh_dataset["income_grp"].digital_encoding
{'dictionary': array(['Low', 'Mid', 'High'], dtype='<U4'), 'ordered': True}
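
That stored dictionary makes decoding a simple lookup; for example (an illustration added here):

# Map the int8 codes of the first five households back to their labels.
hh_dataset["income_grp"].digital_encoding["dictionary"][hh_dataset["income_grp"].values[:5]]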

If you try to make the return trip to a pandas DataFrame using the regular xarray.Dataset.to_pandas() method, the details of the categorical nature of this variable are lost, and only the int8 index is available.

hh_dataset.to_pandas()
         income  income_grp
HHID
2717868  361000           2
763899    59220           1
2222791  197000           2
112477     2200           0
370491    16500           0
...         ...         ...
109218    15000           0
570708    13100           0
2762199       0           0
2049372  103000           2
702559    14800           0

5000 rows × 2 columns
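
If needed, the categorical dtype could be rebuilt by hand from those codes and the stored dictionary, using standard pandas (a sketch added here, not part of the original walkthrough):

enc = hh_dataset["income_grp"].digital_encoding
pd.Categorical.from_codes(
    hh_dataset["income_grp"].values,
    categories=enc["dictionary"],
    ordered=enc["ordered"],
)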

More conveniently, if you use the single_dim accessor on the dataset provided by sharrow, the categorical dtype is restored automatically.

hh_dataset.single_dim.to_pandas()
         income income_grp
HHID
2717868  361000       High
763899    59220        Mid
2222791  197000       High
112477     2200        Low
370491    16500        Low
...         ...        ...
109218    15000        Low
570708    13100        Low
2762199       0        Low
2049372  103000       High
702559    14800        Low

5000 rows × 2 columns

Note that this automatic handling of categorical data only applies when constructing or deconstructing a dataset with a single dimension (i.e. the index is not a MultiIndex). Multidimensional datasets use the normal xarray processing, which dumps string categoricals back into Python objects, a costly representation for high performance applications.

sh.dataset.construct(
    hh[["income", "income_grp"]].reset_index().set_index(["HHID", "income"])
)
<xarray.Dataset> Size: 38MB
Dimensions:     (HHID: 5000, income: 951)
Coordinates:
  * HHID        (HHID) int64 40kB 25671 25675 25678 ... 2863524 2863552 2863568
  * income      (income) int64 8kB 0 30 200 230 ... 650000 652070 660000 733000
Data variables:
    income_grp  (HHID, income) object 38MB nan nan nan nan ... nan nan nan nan