Multi-Dimensional Analysis#

This notebook provides a walkthrough of some of the multi-dimensional analysis capabilities of the sharrow library.

import numpy as np
import xarray as xr

import sharrow as sh

sh.__version__
'2.8.2'

Example Data#

We’ll begin by again importing some example data to work with. We’ll be using some test data taken from the MTC example in the ActivitySim project, including tables of data for households and persons, as well as a set of skims containing transportation level of service information for travel around a tiny slice of San Francisco.

The households and persons are typical tabular data, and each can be read in and stored as a pandas.DataFrame.

households = sh.example_data.get_households()
households.head()
TAZ SERIALNO PUMA5 income PERSONS HHT UNITTYPE NOC BLDGSZ TENURE ... hschpred hschdriv htypdwel hownrent hadnwst hadwpst hadkids bucketBin originalPUMA hmultiunit
HHID
2717868 25 2715386 2202 361000 2 1 0 0 9 1 ... 0 0 2 1 0 0 0 3 2202 1
763899 6 5360279 2203 59220 1 4 0 0 9 3 ... 0 0 2 2 0 0 0 4 2203 1
2222791 9 77132 2203 197000 2 2 0 0 9 1 ... 0 0 2 1 0 0 1 5 2203 1
112477 17 3286812 2203 2200 1 6 0 0 8 3 ... 0 0 2 2 0 0 0 7 2203 1
370491 21 6887183 2203 16500 3 1 0 1 8 3 ... 1 0 2 2 0 0 0 7 2203 1

5 rows × 46 columns

persons = sh.example_data.get_persons()
persons.head()
household_id age RELATE ESR GRADE PNUM PAUG DDP sex WEEKS HOURS MSP POVERTY EARNS pagecat pemploy pstudent ptype padkid
PERID
25671 25671 47 1 6 0 1 0 0 1 0 0 6 39 0 6 3 3 4 2
25675 25675 27 1 6 7 1 0 0 2 52 40 2 84 7200 5 3 2 3 2
25678 25678 30 1 6 0 1 0 0 2 0 0 6 84 0 5 3 3 4 2
25683 25683 23 1 6 0 1 0 0 1 0 0 6 1 0 4 3 3 4 2
25684 25684 52 1 6 0 1 0 0 1 0 0 6 94 0 7 3 3 4 2

The skims, on the other hand, are not just simple tabular data, but rather a multi-dimensional representation of the transportation system, indexed by origin, destination, and time of day. Rather than using a single DataFrame for this data, we store it as a multi-dimensional xarray.Dataset.

skims = sh.example_data.get_skims()
skims
<xarray.Dataset> Size: 2MB
Dimensions:               (otaz: 25, dtaz: 25, time_period: 5)
Coordinates:
  * dtaz                  (dtaz) int64 200B 1 2 3 4 5 6 7 ... 20 21 22 23 24 25
  * otaz                  (otaz) int64 200B 1 2 3 4 5 6 7 ... 20 21 22 23 24 25
  * time_period           (time_period) <U2 40B 'EA' 'AM' 'MD' 'PM' 'EV'
Data variables: (12/170)
    DIST                  (otaz, dtaz) float32 2kB dask.array<chunksize=(25, 25), meta=np.ndarray>
    DISTBIKE              (otaz, dtaz) float32 2kB dask.array<chunksize=(25, 25), meta=np.ndarray>
    DISTWALK              (otaz, dtaz) float32 2kB dask.array<chunksize=(25, 25), meta=np.ndarray>
    DRV_COM_WLK_BOARDS    (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    DRV_COM_WLK_DDIST     (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    DRV_COM_WLK_DTIM      (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    ...                    ...
    WLK_TRN_WLK_IVT       (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    WLK_TRN_WLK_IWAIT     (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    WLK_TRN_WLK_WACC      (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    WLK_TRN_WLK_WAUX      (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    WLK_TRN_WLK_WEGR      (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>
    WLK_TRN_WLK_XWAIT     (otaz, dtaz, time_period) float32 12kB dask.array<chunksize=(25, 25, 5), meta=np.ndarray>

For this example, we’ll also load a land use table, which contains some attributes of the alternatives.

landuse = sh.example_data.get_land_use()
landuse.head()
DISTRICT SD COUNTY TOTHH HHPOP TOTPOP EMPRES SFDU MFDU HHINCQ1 ... area_type HSENROLL COLLFTE COLLPTE TOPOLOGY TERMINAL ZERO hhlds sftaz gqpop
TAZ
1 1 1 1 46 74 82 37 1 60 15 ... 0 0.0 0.00000 0.0 3 5.89564 0 46 1 8
2 1 1 1 134 214 240 107 5 147 57 ... 0 0.0 0.00000 0.0 1 5.84871 0 134 2 26
3 1 1 1 267 427 476 214 9 285 101 ... 0 0.0 0.00000 0.0 1 5.53231 0 267 3 49
4 1 1 1 151 239 253 117 6 210 52 ... 0 0.0 0.00000 0.0 2 5.64330 0 151 4 14
5 1 1 1 611 974 1069 476 22 671 223 ... 0 0.0 72.14684 0.0 1 5.52555 0 611 5 95

5 rows × 41 columns

Multi-Dimensional Analysis#

Now that we’ve loaded our inputs, let’s take a look at preparing some data for a workplace location choice simulation model. This is a different kind of model, and it will use differently shaped data: the decision makers (or “choosers”) in this model will be the workers, and the alternatives will be the various zones included in the land use table.

The workers are only a subset of the persons data we looked at before. We can identify workers from values 1 and 2 (full-time employed and part-time employed) in the 'pemploy' attribute of the persons table.

workers = persons.query("pemploy in [1,2]").rename_axis(index="WORKERID")
workers
household_id age RELATE ESR GRADE PNUM PAUG DDP sex WEEKS HOURS MSP POVERTY EARNS pagecat pemploy pstudent ptype padkid
WORKERID
72220 72220 35 1 1 0 1 0 0 2 40 30 6 139 6000 6 2 3 2 2
72229 72229 62 1 1 0 1 0 0 1 35 40 2 153 13200 7 1 3 1 2
72235 72235 35 1 1 0 1 0 0 2 40 30 6 139 6000 6 2 3 2 2
107594 107594 55 1 1 0 1 0 0 1 48 16 6 144 5600 7 2 3 2 2
107597 107597 43 1 1 0 1 0 0 1 0 0 6 51 0 6 1 3 1 2
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7513986 2822651 42 23 1 0 1 0 0 1 47 20 4 54 4600 6 2 3 2 2
7513995 2822660 28 23 1 0 1 0 0 1 3 40 6 289 25000 5 2 3 2 2
7514003 2822668 19 23 4 0 1 0 0 1 47 50 6 0 13000 4 1 3 1 2
7514037 2822702 33 23 1 0 1 0 0 2 49 48 2 277 24000 5 1 3 1 2
7514060 2822725 35 23 1 0 1 0 0 1 0 0 6 55 0 6 1 3 1 2

4361 rows × 19 columns

As we filter the persons table down to just the workers, we also rename the index from “PERID” to “WORKERID”. This renaming is important for sharrow, as it expects dimensions that share a name to match, but the filtered workers no longer align directly with the persons.
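The effect of that filter-and-rename step can be illustrated with a tiny stand-in table (hypothetical data, not the example data above):

```python
import pandas as pd

# Toy stand-in for the persons table, indexed by PERID.
persons_toy = pd.DataFrame(
    {"pemploy": [1, 3, 2]},
    index=pd.Index([101, 102, 103], name="PERID"),
)

# Filter to workers and rename the index so it no longer
# shares a dimension name with the persons table.
workers_toy = persons_toy.query("pemploy in [1,2]").rename_axis(index="WORKERID")
print(workers_toy.index.name)   # WORKERID
print(list(workers_toy.index))  # [101, 103]
```

The labels are unchanged; only the dimension name differs, which is what tells sharrow the worker dimension is distinct from the person dimension.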

For our workplace location choice model, we will want to link in data from our skims, which can tell us about travel times and costs. Since we have not yet determined a time of day for each worker’s work tours, we’ll just use the 'AM' skims for the outbound leg of a hypothetical work tour, and the 'PM' skims for the return leg. Instead of trying to select constant skims using the dynamic lookups that sharrow can compile, we can just filter the skims down in a static manner before placing them into the data tree.

skims_am = skims.sel(time_period="AM")
skims_pm = skims.sel(time_period="PM")

Creating a DataTree Iteratively#

The last step in getting ready for this model is building out the relationships between all this data we’ve prepared. We’ll again use the DataTree class to do that, but this time we’ll demonstrate building the tree in stages. First, we’ll create a base Dataset to be the root data for the tree. We can start by creating an otherwise empty Dataset indexed on the two dimensions we want to end up with for analysis, workers and zones.

base = sh.dataset.from_named_objects(
    workers.index,
    landuse.index,
)
base
<xarray.Dataset> Size: 35kB
Dimensions:   (WORKERID: 4361, TAZ: 25)
Coordinates:
  * WORKERID  (WORKERID) int64 35kB 72220 72229 72235 ... 7514037 7514060
  * TAZ       (TAZ) int64 200B 1 2 3 4 5 6 7 8 9 ... 17 18 19 20 21 22 23 24 25
Data variables:
    *empty*

Since our base dataset has two dimensions, we can specify a dimension order when writing into a DataTree (the default is alphabetical order). This ordering will be applied to outputs from the flows later.

tree = sh.DataTree(base=base, dim_order=("WORKERID", "TAZ"))

Then, we can progressively build our DataTree by adding additional data. Each new branch of the tree we want to add using the add_dataset command should have a unique name, a dataset being attached, and one or more relationship declarations that describe how the new data attaches. For example, we can attach the persons data like this:

tree.add_dataset("person", persons, "base.WORKERID @ person.PERID")

The relationship definition here starts with a dotted name of a dimension already in the tree (base.WORKERID), followed by the @ operator, which indicates matching by label in that dimension, and finally a dotted name of the matching dimension in the newly attached dataset (person.PERID).

tree.add_dataset("landuse", landuse, "base.TAZ @ landuse.TAZ")
tree.add_dataset("hh", households, "person.household_id @ hh.HHID")

Unlike in the mode choice work in the previous example, we’ve already filtered the time period dimensions of the skims to be morning and afternoon peak, so we simply attach the two different parts, linking relationships only for the remaining dimensions.

tree.add_dataset(
    "odskims",
    skims_am,
    relationships=(
        "hh.TAZ @ odskims.otaz",
        "base.TAZ @ odskims.dtaz",
    ),
)

tree.add_dataset(
    "doskims",
    skims_pm,
    relationships=(
        "base.TAZ @ doskims.otaz",
        "hh.TAZ @ doskims.dtaz",
    ),
)

Dynamically Defined Flows#

Although it is convenient to write expressions into a separately configured “spec” file, especially when working with ActivitySim, it’s not strictly necessary to use such a file in CSV format; a simple Python dictionary can also be used to set up a flow.

definition = {
    "round_trip_dist": "odskims.DIST + doskims.DIST",
    "round_trip_dist_first_mile": "clip(odskims.DIST, 0, 1) + clip(doskims.DIST, 0, 1)",
    "round_trip_dist_addl_miles": "clip(odskims.DIST-1, 0, None) + clip(doskims.DIST-1, 0, None)",
    "size_term": "log(TOTPOP + 0.5*EMPRES)",
}
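The clip and log functions in these expressions behave like their numpy equivalents. A quick numpy check of the distance-splitting logic (with made-up distances) shows that the first-mile and additional-miles terms always sum back to the total:

```python
import numpy as np

dist = np.array([0.3, 1.0, 2.6])          # hypothetical one-way distances
first_mile = np.clip(dist, 0, 1)          # portion of each trip within the first mile
addl_miles = np.clip(dist - 1, 0, None)   # portion beyond the first mile
print(first_mile + addl_miles)            # recovers the original distances
```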

flow = tree.setup_flow(definition)

Loading from this flow is done the same as before.

arr = flow.load()
arr
array([[[0.61      , 0.61      , 0.        , 4.6101575 ],
        [0.28      , 0.28      , 0.        , 5.6818776 ],
        [0.56      , 0.56      , 0.        , 6.368187  ],
        ...,
        [1.9200001 , 1.9200001 , 0.        , 7.071149  ],
        [0.77      , 0.77      , 0.        , 7.0531535 ],
        [1.04      , 1.04      , 0.        , 8.302885  ]],

       [[1.19      , 1.19      , 0.        , 4.6101575 ],
        [1.49      , 1.49      , 0.        , 5.6818776 ],
        [1.88      , 1.85      , 0.02999997, 6.368187  ],
        ...,
        [2.58      , 2.        , 0.58000004, 7.071149  ],
        [2.04      , 1.9       , 0.13999999, 7.0531535 ],
        [2.47      , 2.        , 0.47000003, 8.302885  ]],

       [[1.19      , 1.19      , 0.        , 4.6101575 ],
        [1.49      , 1.49      , 0.        , 5.6818776 ],
        [1.88      , 1.85      , 0.02999997, 6.368187  ],
        ...,
        [2.58      , 2.        , 0.58000004, 7.071149  ],
        [2.04      , 1.9       , 0.13999999, 7.0531535 ],
        [2.47      , 2.        , 0.47000003, 8.302885  ]],

       ...,

       [[2.31      , 2.        , 0.30999994, 4.6101575 ],
        [2.15      , 2.        , 0.1500001 , 5.6818776 ],
        [2.06      , 2.        , 0.05999994, 6.368187  ],
        ...,
        [3.79      , 2.        , 1.79      , 7.071149  ],
        [2.7       , 2.        , 0.70000005, 7.0531535 ],
        [2.66      , 2.        , 0.6600001 , 8.302885  ]],

       [[2.31      , 2.        , 0.30999994, 4.6101575 ],
        [2.15      , 2.        , 0.1500001 , 5.6818776 ],
        [2.06      , 2.        , 0.05999994, 6.368187  ],
        ...,
        [3.79      , 2.        , 1.79      , 7.071149  ],
        [2.7       , 2.        , 0.70000005, 7.0531535 ],
        [2.66      , 2.        , 0.6600001 , 8.302885  ]],

       [[2.31      , 2.        , 0.30999994, 4.6101575 ],
        [2.15      , 2.        , 0.1500001 , 5.6818776 ],
        [2.06      , 2.        , 0.05999994, 6.368187  ],
        ...,
        [3.79      , 2.        , 1.79      , 7.071149  ],
        [2.7       , 2.        , 0.70000005, 7.0531535 ],
        [2.66      , 2.        , 0.6600001 , 8.302885  ]]], dtype=float32)

In the tour mode choice example from the previous notebook, the tours dataset had only one dimension (TOURIDX), and so the output of the load function had two dimensions (TOURIDX and expressions). In this example, the base dataset in the tree has two dimensions (workers and zones), and so the result from the basic load function has three dimensions (workers, zones, and expressions).

arr.shape
(4361, 25, 4)

Just as the two-dimensional output could be neatly formatted as a pandas.DataFrame, so too can we neatly format this three-dimensional output, as an xarray.DataArray.

arr_pretty = flow.load_dataarray()
arr_pretty
<xarray.DataArray (WORKERID: 4361, TAZ: 25, expressions: 4)> Size: 2MB
array([[[0.61      , 0.61      , 0.        , 4.6101575 ],
        [0.28      , 0.28      , 0.        , 5.6818776 ],
        [0.56      , 0.56      , 0.        , 6.368187  ],
        ...,
        [1.9200001 , 1.9200001 , 0.        , 7.071149  ],
        [0.77      , 0.77      , 0.        , 7.0531535 ],
        [1.04      , 1.04      , 0.        , 8.302885  ]],

       [[1.19      , 1.19      , 0.        , 4.6101575 ],
        [1.49      , 1.49      , 0.        , 5.6818776 ],
        [1.88      , 1.85      , 0.02999997, 6.368187  ],
        ...,
        [2.58      , 2.        , 0.58000004, 7.071149  ],
        [2.04      , 1.9       , 0.13999999, 7.0531535 ],
        [2.47      , 2.        , 0.47000003, 8.302885  ]],

       [[1.19      , 1.19      , 0.        , 4.6101575 ],
        [1.49      , 1.49      , 0.        , 5.6818776 ],
        [1.88      , 1.85      , 0.02999997, 6.368187  ],
        ...,
...
        ...,
        [3.79      , 2.        , 1.79      , 7.071149  ],
        [2.7       , 2.        , 0.70000005, 7.0531535 ],
        [2.66      , 2.        , 0.6600001 , 8.302885  ]],

       [[2.31      , 2.        , 0.30999994, 4.6101575 ],
        [2.15      , 2.        , 0.1500001 , 5.6818776 ],
        [2.06      , 2.        , 0.05999994, 6.368187  ],
        ...,
        [3.79      , 2.        , 1.79      , 7.071149  ],
        [2.7       , 2.        , 0.70000005, 7.0531535 ],
        [2.66      , 2.        , 0.6600001 , 8.302885  ]],

       [[2.31      , 2.        , 0.30999994, 4.6101575 ],
        [2.15      , 2.        , 0.1500001 , 5.6818776 ],
        [2.06      , 2.        , 0.05999994, 6.368187  ],
        ...,
        [3.79      , 2.        , 1.79      , 7.071149  ],
        [2.7       , 2.        , 0.70000005, 7.0531535 ],
        [2.66      , 2.        , 0.6600001 , 8.302885  ]]], dtype=float32)
Coordinates:
  * WORKERID     (WORKERID) int64 35kB 72220 72229 72235 ... 7514037 7514060
  * TAZ          (TAZ) int64 200B 1 2 3 4 5 6 7 8 9 ... 18 19 20 21 22 23 24 25
  * expressions  (expressions) <U26 416B 'round_trip_dist' ... 'size_term'

Linear-in-Parameters Functions#

We can also use the dot method here with the two-dimensional base. We’ll apply a one-dimensional coefficients array, with length four to match the four expressions in the spec.

coefs = np.asarray([1.0, 0.1, 0.01, 0.001])
flow.dot(coefs)
array([[0.67561017, 0.31368188, 0.62236819, ..., 2.11907123, 0.85405313,
        1.15230284],
       [1.31361022, 1.64468189, 2.07166818, ..., 2.79287107, 2.23845311,
        2.68300291],
       [1.31361022, 1.64468189, 2.07166818, ..., 2.79287107, 2.23845311,
        2.68300291],
       ...,
       [2.5177101 , 2.35718197, 2.26696813, ..., 4.01497111, 2.9140532 ,
        2.87490297],
       [2.5177101 , 2.35718197, 2.26696813, ..., 4.01497111, 2.9140532 ,
        2.87490297],
       [2.5177101 , 2.35718197, 2.26696813, ..., 4.01497111, 2.9140532 ,
        2.87490297]])
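The result is, up to floating point precision, the same as loading the full array and contracting it with the coefficients over the expressions axis yourself; a minimal numpy sketch with random stand-in data:

```python
import numpy as np

rng = np.random.default_rng(42)
arr = rng.random((6, 5, 4))               # (choosers, alternatives, expressions)
coefs = np.asarray([1.0, 0.1, 0.01, 0.001])

util = arr @ coefs                        # contract over the last (expressions) axis
print(util.shape)                         # (6, 5)
```

The advantage of flow.dot is that it fuses this contraction into the compiled expression evaluation, rather than materializing the full three-dimensional array first.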

The dot_dataarray method does the same underlying computational work, but yields a well-formatted DataArray instead of just a plain numpy array.

flow.dot_dataarray(coefs)
<xarray.DataArray (WORKERID: 4361, TAZ: 25)> Size: 872kB
array([[0.67561017, 0.31368188, 0.62236819, ..., 2.11907123, 0.85405313,
        1.15230284],
       [1.31361022, 1.64468189, 2.07166818, ..., 2.79287107, 2.23845311,
        2.68300291],
       [1.31361022, 1.64468189, 2.07166818, ..., 2.79287107, 2.23845311,
        2.68300291],
       ...,
       [2.5177101 , 2.35718197, 2.26696813, ..., 4.01497111, 2.9140532 ,
        2.87490297],
       [2.5177101 , 2.35718197, 2.26696813, ..., 4.01497111, 2.9140532 ,
        2.87490297],
       [2.5177101 , 2.35718197, 2.26696813, ..., 4.01497111, 2.9140532 ,
        2.87490297]])
Coordinates:
  * WORKERID  (WORKERID) int64 35kB 72220 72229 72235 ... 7514037 7514060
  * TAZ       (TAZ) int64 200B 1 2 3 4 5 6 7 8 9 ... 17 18 19 20 21 22 23 24 25

Multinomial Logit Simulation#

And we can build and simulate an MNL model directly using the logit_draws method.
To do so, we need to provide the “random” draws exogenously. Here, we’ll make 10 draws (with replacement) for each worker from the set of alternative zones.

draws = np.random.default_rng(123).random(size=[4361, 10])
choices, choice_probs = flow.logit_draws(
    coefficients=coefs,
    draws=draws,
)
choices
array([[ 5,  8,  8, ..., 19, 19, 20],
       [ 7,  7,  7, ..., 19, 20, 23],
       [ 1,  6,  6, ..., 17, 18, 22],
       ...,
       [ 5,  6,  7, ..., 22, 23, 23],
       [ 0,  2,  3, ..., 22, 24, 24],
       [ 7,  8,  8, ..., 21, 22, 22]], dtype=int8)
choice_probs
array([[0.02175669, 0.08209198, 0.08209198, ..., 0.13050658, 0.13050658,
        0.03846856],
       [0.06363645, 0.06363645, 0.06363645, ..., 0.06487329, 0.02116722,
        0.0313424 ],
       [0.01730855, 0.05250299, 0.05250299, ..., 0.035372  , 0.10315963,
        0.05456484],
       ...,
       [0.03972635, 0.02711131, 0.01971475, ..., 0.207882  , 0.06913442,
        0.06913442],
       [0.04651197, 0.03619669, 0.02502164, ..., 0.207882  , 0.06648009,
        0.06648009],
       [0.01971475, 0.03896772, 0.03896772, ..., 0.08727995, 0.207882  ,
        0.207882  ]])
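Conceptually, each uniform draw selects the first alternative whose cumulative MNL probability exceeds the draw. A simplified sketch of that mechanism with made-up utilities (an illustration only, not sharrow’s compiled implementation):

```python
import numpy as np

utility = np.array([1.0, 0.5, 0.2])             # utilities for three alternatives
prob = np.exp(utility) / np.exp(utility).sum()  # MNL choice probabilities
cdf = np.cumsum(prob)

draw = 0.9                                      # a uniform random draw in [0, 1)
choice = int(np.searchsorted(cdf, draw))        # first alternative with cdf > draw
print(choice)                                   # 2
```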

For destination choice type models, it’s common to make many repeated draws per chooser (e.g. to sample alternative destinations), so there’s also a “pick count” feature that can summarize the simulation results efficiently.

choices_, choice_probs_, pick_count = flow.logit_draws(
    coefficients=coefs,
    draws=draws,
    pick_counted=True,
)

If you compare against the non-pick-counted results above, you’ll see that we get exactly the same choices out, but when choices are repeated they are aggregated in the resulting arrays.

choices_
array([[ 5,  8,  9, ..., -1, -1, -1],
       [ 7,  9, 10, ..., 23, -1, -1],
       [ 1,  6,  7, ..., 22, -1, -1],
       ...,
       [ 5,  6,  7, ..., -1, -1, -1],
       [ 0,  2,  3, ..., -1, -1, -1],
       [ 7,  8, 11, ..., -1, -1, -1]], dtype=int8)
pick_count
array([[1, 2, 2, ..., 0, 0, 0],
       [3, 1, 1, ..., 1, 0, 0],
       [1, 2, 1, ..., 1, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 2, 1, ..., 0, 0, 0]], dtype=int32)
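The aggregation is equivalent to counting repeats within each row of the non-pick-counted choices; for example, with one hypothetical row of repeated draws:

```python
import numpy as np

row = np.array([5, 8, 8, 9, 19, 19, 19])   # repeated choices from one chooser
uniq, counts = np.unique(row, return_counts=True)
print(uniq)    # [ 5  8  9 19]
print(counts)  # [1 2 1 3]
```

The -1 values in the pick-counted choices array are padding, since after aggregation each row may have fewer distinct choices than draws.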

Accessing Logsums#

If you want to also access the MNL logsum values from the choice model, adding logsums=True will return those values in the fourth position of the returned tuple (the logsum array is always the fourth value, even when pick counting is disabled):

choices, choice_probs, _, logsums = flow.logit_draws(
    coefficients=coefs,
    draws=draws,
    logsums=True,
)
logsums
array([5.61834913, 5.70123664, 5.70123664, ..., 5.58575577, 5.58575577,
       5.58575577])
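The logsum for each chooser is simply the log of the summed exponentiated utilities across the alternatives; a minimal numpy check with made-up utilities, which also shows how the MNL probabilities relate back to the logsum:

```python
import numpy as np

utility = np.array([1.0, 0.5, 0.2])
logsum = np.log(np.exp(utility).sum())

prob = np.exp(utility - logsum)       # MNL probabilities recovered from the logsum
print(np.isclose(prob.sum(), 1.0))    # True
```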

Gotchas#

When working with multi-dimensional outputs, if you don’t specify the dimension ordering explicitly (as done above) then the output dimensions will be in lexicographic order according to the unicode binary representations of the dimension names. This is similar to alphabetical ordering, except all uppercase letters come before lowercase letters.
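This ordering is the same as Python’s default string sort:

```python
print(sorted(["WORKERID", "TAZ"]))  # ['TAZ', 'WORKERID']
print(sorted(["apple", "Banana"]))  # ['Banana', 'apple'] -- uppercase sorts first
```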

tree_unordered = sh.DataTree(
    base=base,
    person=persons,
    landuse=landuse,
    hh=households,
    odskims=skims_am,
    doskims=skims_pm,
    relationships=(
        "base.WORKERID @ person.PERID",
        "base.TAZ @ landuse.TAZ",
        "person.household_id @ hh.HHID",
        "hh.TAZ @ odskims.otaz",
        "base.TAZ @ odskims.dtaz",
        "base.TAZ @ doskims.otaz",
        "hh.TAZ @ doskims.dtaz",
    ),
)
flow_unordered = tree_unordered.setup_flow(definition)
arr_unordered = flow_unordered.load_dataarray()
arr_unordered.dims
('TAZ', 'WORKERID', 'expressions')