Data Encoding

This notebook provides a short walkthrough of some of the data encoding features of the sharrow library.

import numpy as np
import pandas as pd
import xarray as xr

import sharrow as sh

sh.__version__
'2.8.2'

Example Data

We’ll begin by importing some example data to work with. We’ll be using some test data taken from the MTC example in the ActivitySim project. For this data encoding walkthrough, we’ll focus on the skims containing transportation level of service information for travel around a tiny slice of San Francisco.

We’ll load them as a multi-dimensional xarray.Dataset, or more precisely a sharrow.Dataset, which is a subclass of the xarray version that adds some useful features, including compatibility with automatic tools for recoding data.

skims = sh.example_data.get_skims()
skims
<xarray.Dataset> Size: 2MB
Dimensions:               (otaz: 25, dtaz: 25, time_period: 5)
Coordinates:
  * dtaz                  (dtaz) int64 200B 1 2 3 4 5 6 7 ... 20 21 22 23 24 25
  * otaz                  (otaz) int64 200B 1 2 3 4 5 6 7 ... 20 21 22 23 24 25
  * time_period           (time_period) <U2 40B 'EA' 'AM' 'MD' 'PM' 'EV'
Data variables: (12/170)
    DIST                  (otaz, dtaz) float32 2kB dask.array<chunksize=(25, 25), meta=np.ndarray>
    DISTBIKE              (otaz, dtaz) float32 2kB dask.array<chunksize=(25, 25), meta=np.ndarray>
    DISTWALK              (otaz, dtaz) float32 2kB dask.array<chunksize=(25, 25), meta=np.ndarray>
    DRV_COM_WLK_BOARDS    (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    DRV_COM_WLK_DDIST     (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    DRV_COM_WLK_DTIM      (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    ...                    ...
    WLK_TRN_WLK_IVT       (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    WLK_TRN_WLK_IWAIT     (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    WLK_TRN_WLK_WACC      (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    WLK_TRN_WLK_WAUX      (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    WLK_TRN_WLK_WEGR      (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    WLK_TRN_WLK_XWAIT     (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>

Because sharrow uses the xarray.Dataset format to work with data, individual variables in each Dataset can be encoded in different data types. For example, automobile travel times can be stored with high(er) precision floating point numbers, while transit fares, which vary less and have a narrower range, can be stored with lower precision. This allows a user to choose the most efficient encoding for each variable, if desired.
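
For example, nothing prevents one Dataset from mixing dtypes. Here is a small illustration we've added (the DIST_F16 name is ours, not part of the example data):

# Add a half-precision copy of DIST alongside the float32 original.
mixed = skims.assign({"DIST_F16": skims.DIST.astype(np.float16)})
mixed.DIST.dtype, mixed.DIST_F16.dtype  # (dtype('float32'), dtype('float16'))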

Fixed Point Encoding

Very often, data (especially skim matrices like these) can be expressed adequately with far less precision than a standard 32-bit floating point representation allows. In these cases, it may be beneficial to store the data with “fixed point” encoding, also sometimes called scaled integers.

Instead of storing values as 32-bit floating point numbers, they can be multiplied by a scale factor (e.g., 100) and then converted to 16-bit integers. This uses half the RAM and can still express any value (at two decimal point precision) from -327.68 up to 327.67. If the lowest values in that range are never needed, the range can also be shifted, moving both the bottom and top limits by a fixed amount. Then, for a particular scale \(\mu\) and shift \(\xi\) (stored in metadata), from any encoded array element \(i\) the implied (original) value \(x\) can quickly be recovered by evaluating \(x = (i / \mu) - \xi\).
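
To make this concrete, here is a minimal sketch of the round trip in plain NumPy. This is an illustration rather than sharrow's actual implementation; it follows the convention used by array_encode below, where scale is the size of one encoded step (i.e. \(1/\mu\)) and offset shifts the range.

scale, offset = 0.01, 0.0
x = np.array([0.12, 0.24, 300.0], dtype=np.float32)
encoded = np.round((x - offset) / scale).astype(np.int16)  # e.g. 0.12 becomes 12
decoded = encoded * scale + offset  # recovers 0.12, 0.24, 300.0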

Sharrow includes a pair of functions to encode and decode arrays in this manner. These functions also attach the necessary metadata to the Dataset objects, so that later when we construct sharrow.Flow instances, they can decode arrays automatically.

from sharrow.digital_encoding import array_decode, array_encode

The distance data in the skims is a great candidate for fixed point encoding. We can peek at the top corner of this array:

skims.DIST.values[:2, :3]
array([[0.12, 0.24, 0.44],
       [0.37, 0.14, 0.28]], dtype=float32)

The data are all small(ish) values with two decimal point fixed precision, so we can probably efficiently encode this data by scaling by 100. If we’re not sure, we can confirm by checking the range of values, to make sure it fits inside the 16-bit integers we’re hoping to use.

skims.DIST.values.min(), skims.DIST.values.max()
(0.1, 2.7)

That’s a really small range because this is only test data. But even the full-scale MTC skims spanning the entire region don’t contain distances over 300 miles.
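
As a quick arithmetic check (added here), the top of the int16 range at a scale of 0.01 comfortably covers that:

np.iinfo(np.int16).max * 0.01  # 327.67 is the largest encodable value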

We can create a new DataArray and apply fixed point encoding using the array_encode function.

distance_encoded = array_encode(skims.DIST, scale=0.01, offset=0)
distance_encoded.values[:2, :3]
array([[12, 24, 44],
       [37, 14, 28]], dtype=int16)

We can apply that function for any number of variables in the skims, and create a new Dataset that includes the encoded arrays.

skims_encoded = skims.assign({"DIST": array_encode(skims.DIST, scale=0.01, offset=0)})

To manage the digital encodings across an entire dataset, sharrow implements a digital_encoding accessor. You can use it to apply encodings to one or more variables in a simple fashion.

skims_encoded = skims_encoded.digital_encoding.set(
    ["DISTWALK", "DISTBIKE"], scale=0.01, offset=0
)

And you can review the encodings for every variable in the dataset like this:

skims_encoded.digital_encoding.info()
{'DIST': {'scale': 0.01, 'offset': 0, 'missing_value': None},
 'DISTBIKE': {'scale': 0.01, 'offset': 0, 'missing_value': None},
 'DISTWALK': {'scale': 0.01, 'offset': 0, 'missing_value': None}}
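
As a sanity check (an extra step beyond the original walkthrough), we can decode one of these variables and confirm the values survive the round trip to within the 0.01 quantization step:

# Decoded distances should match the originals within half a step.
np.allclose(array_decode(skims_encoded.DIST), skims.DIST, atol=0.005)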

To demonstrate that the encoding works transparently with a Flow, we can construct a simple flow that extracts the distance and square of distance for the top corner of values we looked at above.

First we’ll do so for a flow with the original float32 encoded skims.

pairs = pd.DataFrame({"orig": [0, 0, 0, 1, 1, 1], "dest": [0, 1, 2, 0, 1, 2]})
tree = sh.DataTree(
    base=pairs,
    skims=skims.drop_dims("time_period"),
    relationships=(
        "base.orig -> skims.otaz",
        "base.dest -> skims.dtaz",
    ),
)
flow = tree.setup_flow({"d1": "DIST", "d2": "DIST**2"})
arr = flow.load()
arr
array([[0.12      , 0.0144    ],
       [0.24      , 0.0576    ],
       [0.44      , 0.1936    ],
       [0.37      , 0.13690001],
       [0.14      , 0.0196    ],
       [0.28      , 0.0784    ]], dtype=float32)

We can do the same for the encoded skims, and we get effectively the same result (identical up to float32 rounding in the last digit of the squared term), even though the encoded skims use less RAM.

tree_enc = sh.DataTree(
    base=pairs,
    skims=skims_encoded.drop_dims("time_period"),
    relationships=(
        "base.orig -> skims.otaz",
        "base.dest -> skims.dtaz",
    ),
)
flow_enc = tree_enc.setup_flow({"d1": "DIST", "d2": "DIST**2"})
arr_enc = flow_enc.load()
arr_enc
array([[0.12  , 0.0144],
       [0.24  , 0.0576],
       [0.44  , 0.1936],
       [0.37  , 0.1369],
       [0.14  , 0.0196],
       [0.28  , 0.0784]], dtype=float32)
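
We can confirm the agreement directly (a check added here):

np.allclose(arr, arr_enc)  # agrees despite the last-digit difference in row 3 of d2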

Dictionary Encoding

For skim matrices where the universe of all possible cell values can be adequately represented by just 255 unique values, we can use an explicit mapping process called “dictionary encoding”. It works by storing those unique values in a tiny base array; in the main body of the skim data, we then store only offsets that point to positions in that base array. This reduces the marginal memory footprint of each array cell to just an 8-bit integer, cutting memory requirements by up to 75% compared to float32 storage. The approach is particularly appropriate for many transit skims, as fares, wait times, and transfers can almost always be reduced to a dictionary encoding with no meaningful information loss.
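
The mechanics are easy to see in plain NumPy (an illustrative sketch, not sharrow's implementation):

fares = np.array([0.0, 152.0, 474.0, 152.0, 0.0, 626.0], dtype=np.float32)
dictionary, offsets = np.unique(fares, return_inverse=True)
offsets = offsets.astype(np.uint8)  # one byte per cell instead of four
recovered = dictionary[offsets]     # decoding is just an array lookup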

For example, the 'WLK_LOC_WLK_FAR' array containing fares only has four unique values:

np.unique(skims.WLK_LOC_WLK_FAR)
array([  0., 152., 474., 626.], dtype=float32)

We can see various fares applied at different time periods if we look at the top corner of the array:

skims.WLK_LOC_WLK_FAR.values[:2, :3, :]
array([[[  0.,   0.,   0.,   0.,   0.],
        [152., 474., 474., 152., 474.],
        [152., 474., 474., 152., 474.]],

       [[152., 152., 474., 474., 152.],
        [  0.,   0.,   0.,   0.,   0.],
        [152., 474., 474., 152., 474.]]], dtype=float32)

Once encoded, the array itself only contains offset pointers (small integers), plus the original values stored in metadata.

wlwfare_enc = array_encode(skims.WLK_LOC_WLK_FAR, bitwidth=8, by_dict=True)
wlwfare_enc.values[:2, :3, :]
array([[[0, 0, 0, 0, 0],
        [1, 2, 2, 1, 2],
        [1, 2, 2, 1, 2]],

       [[1, 1, 2, 2, 1],
        [0, 0, 0, 0, 0],
        [1, 2, 2, 1, 2]]], dtype=uint8)
wlwfare_enc.attrs["digital_encoding"]["dictionary"]
array([  0., 152., 474., 626.], dtype=float32)
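
Decoding is then just an indexed lookup into that dictionary; for example, this manual check (added for illustration) reproduces the original top-corner fares:

wlwfare_enc.attrs["digital_encoding"]["dictionary"][wlwfare_enc.values[:2, :3, :]]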

If we want to recover the original data for analysis (other than in a Flow, which can decode it automatically), we can use the array_decode function.

array_decode(wlwfare_enc)
<xarray.DataArray 'WLK_LOC_WLK_FAR' (otaz: 25, dtaz: 25, time_period: 5)> Size: 12kB
array([[[  0.,   0.,   0.,   0.,   0.],
        [152., 474., 474., 152., 474.],
        [152., 474., 474., 152., 474.],
        ...,
        [152., 152., 152., 152., 152.],
        [152., 152., 152., 152., 474.],
        [152., 152., 152., 152., 474.]],

       [[152., 152., 474., 474., 152.],
        [  0.,   0.,   0.,   0.,   0.],
        [152., 474., 474., 152., 474.],
        ...,
        [152., 152., 152., 152., 152.],
        [152., 152., 152., 152., 152.],
        [152., 152., 152., 152., 474.]],

       [[152., 474., 474., 474., 474.],
        [152., 152., 152., 152., 474.],
        [  0.,   0.,   0.,   0.,   0.],
        ...,
...
        ...,
        [  0.,   0.,   0.,   0.,   0.],
        [152., 152., 152., 152., 152.],
        [152., 152., 152., 152., 152.]],

       [[152., 152., 152., 152., 152.],
        [152., 152., 152., 152., 474.],
        [152., 152., 152., 152., 474.],
        ...,
        [152., 152., 152., 152., 152.],
        [  0.,   0.,   0.,   0.,   0.],
        [152., 152., 152., 152., 152.]],

       [[152., 152., 152., 152., 152.],
        [152., 152., 152., 152., 152.],
        [152., 152., 152., 152., 152.],
        ...,
        [152., 152., 152., 152., 152.],
        [152., 152., 152., 152., 152.],
        [  0.,   0.,   0.,   0.,   0.]]], dtype=float32)
Coordinates:
  * dtaz         (dtaz) int64 200B 1 2 3 4 5 6 7 8 9 ... 18 19 20 21 22 23 24 25
  * otaz         (otaz) int64 200B 1 2 3 4 5 6 7 8 9 ... 18 19 20 21 22 23 24 25
  * time_period  (time_period) <U2 40B 'EA' 'AM' 'MD' 'PM' 'EV'

Joint Dict Encoding

Dictionary encoding can be expanded to map multiple different variables using the same underlying offsets array. For large datasets with several dimension lengths in the thousands, the offset array may constitute the vast majority of the memory usage, so sharing the same offsets for several variables can result in huge reductions in the memory footprint.
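
The underlying idea can be sketched in plain NumPy (illustrative only, not sharrow's implementation): treat the values of the grouped variables at each cell as one row, find the unique rows, and keep a single shared offsets array.

a = np.array([0.50, 2.25, 0.50, 0.75], dtype=np.float32)
b = np.array([0.50, 2.00, 0.50, 0.75], dtype=np.float32)
stacked = np.stack([a, b], axis=1)  # one row of values per cell
rows, offsets = np.unique(stacked, axis=0, return_inverse=True)
a_decoded = rows[offsets, 0]  # both variables decode from the same
b_decoded = rows[offsets, 1]  # shared offsets array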

The joint dictionary can be applied using the set method of the digital_encoding accessor, giving a list of the variables to jointly encode:

skims1 = skims.digital_encoding.set(
    [
        "WLK_LOC_WLK_FAR",
        "WLK_EXP_WLK_FAR",
        "WLK_HVY_WLK_FAR",
        "DRV_LOC_WLK_FAR",
        "DRV_HVY_WLK_FAR",
        "DRV_EXP_WLK_FAR",
    ],
    joint_dict=True,
)

A unique name is automatically generated for the group when joint_dict is set to True. Alternatively, the user can specify a name by passing a string as the joint_dict argument. Different sets of variables in the same Dataset can be grouped and encoded jointly, but each group must have a unique name.

skims1 = skims1.digital_encoding.set(
    ["DISTBIKE", "DISTWALK"],
    joint_dict="jointWB",
)

The resulting dataset adds an offsets variable for each created group, and each named variable in the group is replaced by a new one-dimensional array of unique values, indexed along the group’s new dimension.

skims1
<xarray.Dataset> Size: 2MB
Dimensions:               (otaz: 25, dtaz: 25, time_period: 5, joined_0: 40,
                           jointWB: 192)
Coordinates:
  * dtaz                  (dtaz) int64 200B 1 2 3 4 5 6 7 ... 20 21 22 23 24 25
  * otaz                  (otaz) int64 200B 1 2 3 4 5 6 7 ... 20 21 22 23 24 25
  * time_period           (time_period) <U2 40B 'EA' 'AM' 'MD' 'PM' 'EV'
Dimensions without coordinates: joined_0, jointWB
Data variables: (12/172)
    DIST                  (otaz, dtaz) float32 2kB dask.array<chunksize=(25, 25), meta=np.ndarray>
    DRV_COM_WLK_BOARDS    (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    DRV_COM_WLK_DDIST     (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    DRV_COM_WLK_DTIM      (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    DRV_COM_WLK_FAR       (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    DRV_COM_WLK_IWAIT     (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    ...                    ...
    DRV_LOC_WLK_FAR       (joined_0) float32 160B 0.0 152.0 ... 152.0 474.0
    DRV_HVY_WLK_FAR       (joined_0) float32 160B 0.0 206.0 ... 220.0 220.0
    DRV_EXP_WLK_FAR       (joined_0) float32 160B 0.0 139.0 ... 244.0 244.0
    jointWB_offsets       (otaz, dtaz) uint8 625B 112 107 135 103 ... 68 113 137
    DISTBIKE              (jointWB) float32 768B 0.5 2.25 0.75 ... 1.72 1.97
    DISTWALK              (jointWB) float32 768B 0.5 2.25 0.75 ... 1.72 1.97

Skims encoded in this manner can be fed into sharrow, and flows will compile and return the same results as if the data were not encoded. If you are mixing compiled flows between encoded and unencoded Datasets (which should be unusual, but we’ve done it for the examples in this notebook), you’ll need to set the hashing_level to at least 2 to ensure the correct numba code is matched with the encodings used in the data.

tree1 = sh.DataTree(
    base=pairs,
    skims=skims1,
    rskims=skims1,
    relationships=(
        "base.orig -> skims.otaz",
        "base.dest -> skims.dtaz",
        "base.orig -> rskims.dtaz",
        "base.dest -> rskims.otaz",
    ),
)
flow1 = tree1.setup_flow(
    {
        "d1": 'skims["WLK_LOC_WLK_FAR", "AM"]',
        "d2": 'skims["WLK_LOC_WLK_FAR", "AM"]**2',
        "w1": "skims.DISTWALK",
        "w2": 'skims.reverse("DISTWALK")',
        "w3": "rskims.DISTWALK",
        "x1": "skims.DIST",
        "x2": 'skims.reverse("DIST")',
    },
    hashing_level=2,
)
arr1 = flow1.load_dataframe()
arr1
      d1        d2    w1    w2    w3    x1    x2
0    0.0       0.0  0.12  0.12  0.12  0.12  0.12
1  474.0  224676.0  0.24  0.37  0.37  0.24  0.37
2  474.0  224676.0  0.44  0.57  0.57  0.44  0.57
3  152.0   23104.0  0.37  0.24  0.24  0.37  0.24
4    0.0       0.0  0.14  0.14  0.14  0.14  0.14
5  474.0  224676.0  0.28  0.28  0.28  0.28  0.28

Pandas Categorical Dtype

Dictionary encoding is very similar to the approach used for the pandas Categorical dtype, and can be used to achieve some of the efficiencies of categorical data, even though xarray lacks a formal native categorical data representation. Sharrow’s construct function for creating Dataset objects will automatically use dictionary encoding for “category” data.

To demonstrate, we’ll load some household data and create a categorical data column.

hh = sh.example_data.get_households()
hh["income_grp"] = pd.cut(
    hh.income, bins=[-np.inf, 30000, 60000, np.inf], labels=["Low", "Mid", "High"]
)
hh = hh[["income", "income_grp"]]
hh.head()
         income income_grp
HHID
2717868  361000       High
763899    59220        Mid
2222791  197000       High
112477     2200        Low
370491    16500        Low
hh.info()
<class 'pandas.core.frame.DataFrame'>
Index: 5000 entries, 2717868 to 702559
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   income      5000 non-null   int64   
 1   income_grp  5000 non-null   category
dtypes: category(1), int64(1)
memory usage: 83.1 KB

We’ll then create a Dataset using construct.

hh_dataset = sh.dataset.construct(hh[["income", "income_grp"]])
hh_dataset
<xarray.Dataset> Size: 85kB
Dimensions:     (HHID: 5000)
Coordinates:
  * HHID        (HHID) int64 40kB 2717868 763899 2222791 ... 2049372 702559
Data variables:
    income      (HHID) int64 40kB 361000 59220 197000 2200 ... 0 103000 14800
    income_grp  (HHID) int8 5kB 2 1 2 0 0 0 0 0 0 2 0 ... 0 0 0 1 1 0 0 0 0 2 0

Note that the “income” variable remains an integer as expected, but the “income_grp” variable, which had been a “category” dtype in pandas, is now stored as an int8, giving the category index of each element (it would be an int16 or larger if needed, but that’s not necessary with only 3 categories). The information about the labels for the categories is retained not in the data itself but in the digital_encoding:

hh_dataset["income_grp"].digital_encoding
{'dictionary': array(['Low', 'Mid', 'High'], dtype='<U4'), 'ordered': True}
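
That stored dictionary makes decoding a simple lookup; for example (an illustration added here):

# Map the int8 codes of the first five households back to their labels.
hh_dataset["income_grp"].digital_encoding["dictionary"][hh_dataset["income_grp"].values[:5]]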

If you try to make the return trip to a pandas DataFrame using the regular xarray.Dataset.to_pandas() method, the details of the categorical nature of this variable are lost, and only the int8 index is available.

hh_dataset.to_pandas()
         income  income_grp
HHID
2717868  361000           2
763899    59220           1
2222791  197000           2
112477     2200           0
370491    16500           0
...         ...         ...
109218    15000           0
570708    13100           0
2762199       0           0
2049372  103000           2
702559    14800           0

5000 rows × 2 columns
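
If needed, the categorical dtype could be rebuilt by hand from those codes and the stored dictionary, using standard pandas (a sketch added here, not part of the original walkthrough):

enc = hh_dataset["income_grp"].digital_encoding
pd.Categorical.from_codes(
    hh_dataset["income_grp"].values,
    categories=enc["dictionary"],
    ordered=enc["ordered"],
)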

More conveniently, if you use the single_dim accessor on the dataset provided by sharrow, the categorical dtype is restored automatically.

hh_dataset.single_dim.to_pandas()
         income income_grp
HHID
2717868  361000       High
763899    59220        Mid
2222791  197000       High
112477     2200        Low
370491    16500        Low
...         ...        ...
109218    15000        Low
570708    13100        Low
2762199       0        Low
2049372  103000       High
702559    14800        Low

5000 rows × 2 columns

Note that this automatic handling of categorical data only applies when constructing or deconstructing a dataset with a single dimension (i.e. the index is not a MultiIndex). Multidimensional datasets use the normal xarray processing, which dumps string categoricals back into Python objects, a costly representation for high performance applications.

sh.dataset.construct(
    hh[["income", "income_grp"]].reset_index().set_index(["HHID", "income"])
)
<xarray.Dataset> Size: 38MB
Dimensions:     (HHID: 5000, income: 951)
Coordinates:
  * HHID        (HHID) int64 40kB 25671 25675 25678 ... 2863524 2863552 2863568
  * income      (income) int64 8kB 0 30 200 230 ... 650000 652070 660000 733000
Data variables:
    income_grp  (HHID, income) object 38MB nan nan nan nan ... nan nan nan nan