Core Components

ActivitySim’s core components provide multiprocessing, data management, utility expressions, choice models, person time window management, and helper functions. They include the multiprocessor, the network LOS (skim) manager, the data pipeline manager, the random number manager, the tracer, sampling methods, simulation methods, model specification readers and expression evaluators, choice models, the timetable, the transit virtual path builder, and helper functions.

Multiprocessing

Parallelization using multiprocessing

API

activitysim.core.mp_tasks.MEM_TRACE_TICKS = 5

mp_tasks - activitysim multiprocessing overview

Activitysim runs a list of models sequentially, performing various computational operations on tables. Model steps can modify values in existing tables, add columns, or create additional tables. Activitysim provides the facility, via expression files, to specify vectorized operations on data tables. The ability to vectorize operations depends upon the independence of the computations performed on the vectorized elements.

Python is agonizingly slow performing scalar operations sequentially on large datasets, so vectorization (using pandas and/or numpy) is essential for good performance.

Fortunately most activity based model simulation steps are row independent at the household, person, tour, or trip level. The decisions for one household are independent of the choices made by other households. Thus it is (generally speaking) possible to run an entire simulation on a household sample with only one household, and get the same result for that household as you would running the simulation on a thousand households. (See the shared data section below for an exception to this highly convenient situation.)

The random number generator supports this goal by providing streams of random numbers for each household and person that are mutually independent and repeatable across model runs and processes.

To the extent that simulation model steps are row independent, we can implement most simulations as a series of vectorized operations on pandas DataFrames and numpy arrays. These vectorized operations are much faster than sequential python because they are implemented by native code (compiled C) and are to some extent multi-threaded. But the benefits of numpy multi-threading are limited because they only apply to atomic numpy or pandas calls, and as soon as control returns to python it is single-threaded and slow.
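For illustration, here is a minimal (hypothetical) comparison of a scalar python loop and the equivalent vectorized pandas expression; the table and column are made up:

import numpy as np
import pandas as pd

# hypothetical persons table
persons = pd.DataFrame({'age': np.random.randint(0, 90, size=1_000_000)})

# slow: scalar python loop, one interpreted operation per row
is_adult_loop = [age >= 18 for age in persons['age']]

# fast: one vectorized pandas/numpy operation executed in native code
is_adult_vec = persons['age'] >= 18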

Multi-threading is not an attractive strategy to get around the python performance problem because of the limitations imposed by python’s global interpreter lock (GIL). Rather than struggling with python multi-threading, this module uses the python multiprocessing module to parallelize certain models.

Because of activitysim’s modular and extensible architecture, we don’t hardwire the multiprocessing architecture. The specification of which models should be run in parallel, how many processors should be used, and the segmentation of the data between processes are all specified in the settings config file. For conceptual simplicity, the single-process model is treated as dominant (because even though in practice multiprocessing may be the norm for production runs, the single-process model will be used in development and debugging; keeping it dominant tends to concentrate the multiprocessing-specific code in one place and prevents multiprocessing considerations from permeating the code base and obscuring the model-specific logic).

The primary function of the multiprocessing settings is to identify distinct stages of computation, to specify how many simultaneous processes should be used to perform them, and to specify how the data should be apportioned between those processes. We assume that the data can be apportioned between subprocesses according to the index of a single primary table (e.g. households), or else belong to derivative or dependent tables that reference that table’s index (primary key) with a ref_col (foreign key) sharing the name of the primary table’s key.

Generally speaking, we assume that any new tables that are created are directly dependent on the previously existing tables, and all rows in new tables are either attributable to previously existing rows in the pipeline tables, or are global utility tables that are identical across sub-processes.

Note: There are a few exceptions to ‘row independence’, such as school and location choice models, where the model behavior is externally constrained or adjusted. For instance, we want school location choice to match known aggregate school enrollments by zone. Similarly, a parking model (not yet implemented) might be constrained by availability. These situations require special handling.

models:
  ### mp_initialize step
  - initialize_landuse
  - compute_accessibility
  - initialize_households
  ### mp_households step
  - school_location
  - workplace_location
  - auto_ownership_simulate
  - free_parking
  ### mp_summarize step
  - write_tables

multiprocess_steps:
  - name: mp_initialize
    begin: initialize_landuse
  - name: mp_households
    begin: school_location
    num_processes: 2
    slice:
      tables:
        - households
        - persons
  - name: mp_summarize
    begin: write_tables

The multiprocess_steps setting above annotates the models list to indicate that the simulation should be broken into three steps.

The first multiprocess_step (mp_initialize) begins with the initialize_landuse step and is implicitly single-process because there is no ‘slice’ key indicating how to apportion the tables. This first step includes all models listed in the ‘models’ setting up to the first model of the next multiprocess_step.

The second multiprocess_step (mp_households) starts with the school location model and continues through auto_ownership_simulate. The ‘slice’ info indicates that the tables should be sliced by households, and that persons is a dependent table: any person with a ref_col (a foreign key column with the same name as the households table index) referencing a household record is taken to ‘belong’ to that household. Similarly, any other table that either shares an index (i.e. has an index with the same name) with the households or persons table, or has a ref_col to either of their indexes, is also considered a dependent table.

The num_processes setting of 2 indicates that the pipeline should be split in two, with half of the households apportioned into each subprocess pipeline, and all dependent tables likewise apportioned accordingly. All other tables (e.g. land_use) that do not share an index (name) or have a ref_col are considered mirrored and are included in their entirety.

The primary table is sliced in num_processes-sized strides (e.g. for num_processes == 2, the sub-processes get every second record, starting at offsets 0 and 1 respectively). All other dependent table slices are based (directly or indirectly) on this stride segmentation of the primary table index.
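A minimal sketch of this stride-based apportionment (the function name is illustrative, not the actual mp_tasks implementation):

import pandas as pd

def stride_slice(primary_df: pd.DataFrame, num_processes: int, offset: int) -> pd.DataFrame:
    # sub-process `offset` gets every num_processes-th row of the primary table,
    # e.g. for num_processes == 2, offsets 0 and 1 get alternating households
    return primary_df.iloc[offset::num_processes]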

Two separate sub-processes are launched (num_processes == 2), each passed the name of its apportioned pipeline file. They execute independently, and if they terminate successfully, their contents are coalesced into a single pipeline file whose tables should be essentially the same as if they had been generated by a single process.

We assume that any new tables that are created by the sub-processes are directly dependent on the previously existing primary tables or are mirrored. Thus we can coalesce the sub-process pipelines by concatenating the primary and dependent tables and simply retaining one copy of the mirrored tables (since they should all be identical).
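A rough sketch of that coalescing logic, assuming a slice_rules dict like the one shown later in build_slice_rules (illustrative only, not the actual coalesce_pipelines code):

import pandas as pd

def coalesce_tables(sub_process_tables, slice_rules):
    # sub_process_tables: list of {table_name: DataFrame}, one dict per sub-process
    omnibus = {}
    for table_name, rule in slice_rules.items():
        copies = [tables[table_name] for tables in sub_process_tables]
        if rule.get('slice_by') is None:
            # mirrored table: identical in every sub-process, keep any one copy
            omnibus[table_name] = copies[0]
        else:
            # sliced (apportioned) table: concatenate rows from all sub-processes
            omnibus[table_name] = pd.concat(copies)
    return omnibus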

The third multiprocess_step (mp_summarize) is then handled in single-process mode and runs the write_tables model, writing the results but also leaving the tables in the pipeline, with essentially the same tables and results as if the whole simulation had been run as a single process.

activitysim.core.mp_tasks.allocate_shared_shadow_pricing_buffers()

This is called by the main process to allocate memory buffer to share with subprocs

Returns
multiprocessing.RawArray
activitysim.core.mp_tasks.allocate_shared_skim_buffers()

This is called by the main process to allocate shared memory buffer to share with subprocs

Note: Buffers must be allocated BEFORE network_los.load_data

Returns
skim_buffers : dict {<skim_tag>: <multiprocessing.RawArray>}
activitysim.core.mp_tasks.apportion_pipeline(sub_proc_names, step_info)

apportion pipeline for multiprocessing step

create pipeline files for sub_procs, apportioning data based on slice_rules

Called at the beginning of a multiprocess step prior to launching the sub-processes. Pipeline files have well-known names (pipeline file name prefixed by subjob name).

Parameters
sub_proc_names : list of str

names of the sub processes to apportion

step_info : dict

step_info from multiprocess_steps for the step we are apportioning pipeline tables for

Returns
creates apportioned pipeline files for each sub job
activitysim.core.mp_tasks.build_slice_rules(slice_info, pipeline_tables)

based on slice_info for current step from run_list, generate a recipe for slicing the tables in the pipeline (passed in tables parameter)

slice_info is a dict with two well-known keys:

‘tables’: required list of table names (order matters!)
‘except’: optional list of tables not to slice even if they have a sliceable index name

Note: tables listed in slice_info must appear in same order and before any others in tables dict

The index of the first table in the ‘tables’ list is the primary_slicer.

Any other tables listed are dependent tables with either ref_cols to the primary_slicer or with the same index (i.e. having an index with the same name). This cascades, so any tables dependent on the primary_table can in turn have dependent tables that will be sliced by index or ref_col.

For instance, if the primary_slicer is households, then persons can be sliced because it has a ref_col to (a column with the same name as) the household table index. And the tours table can be sliced since it has a ref_col to persons. Tables can also be sliced by index. For instance, the person_windows table can be sliced because it has an index with the same name as the persons table.

slice_info from multiprocess_steps

slice:
  tables:
    - households
    - persons

tables from pipeline

Table Name        Index          ref_col
households        household_id
persons           person_id      household_id
person_windows    person_id
accessibility     zone_id

generated slice_rules dict

households:
   slice_by: primary       <- primary table is sliced in num_processors-sized strides
persons:
   source: households
   slice_by: column
   column:  household_id   <- slice by ref_col (foreign key) to households
person_windows:
   source: persons
   slice_by: index         <- slice by index of persons table
accessibility:
   slice_by:               <- mirrored (non-dependent) tables don't get sliced
land_use:
   slice_by:
Parameters
slice_info : dict

‘slice’ info from run_list for this step

pipeline_tables : dict {<table_name>: <pandas.DataFrame>}

dict of all tables from the pipeline keyed by table name

Returns
slice_rules : dict
activitysim.core.mp_tasks.coalesce_pipelines(sub_proc_names, slice_info)

Coalesce the data in the sub_processes apportioned pipelines back into a single pipeline

We use slice_rules to distinguish sliced (apportioned) tables from mirrored tables.

Sliced tables are concatenated to create a single omnibus table with data from all sub_procs but mirrored tables are the same across all sub_procs, so we can grab a copy from any pipeline.

Parameters
sub_proc_names : list of str
slice_info : dict

slice_info from multiprocess_steps

Returns
creates an omnibus pipeline with coalesced data from individual sub_proc pipelines
activitysim.core.mp_tasks.drop_breadcrumb(step_name, crumb, value=True)

Add (crumb: value) to the specified step in breadcrumbs and flush breadcrumbs to file so the run can be resumed with resume_after.

Breadcrumbs provide a record of the steps that have been run, for use when resuming. Basically, we want to know which steps have been run and which phases completed (i.e. apportion, simulate, coalesce). For multi-processed simulate steps, we also want to know which sub-processes completed successfully, because if resume_after is LAST_CHECKPOINT we don’t have to rerun the successful ones.

Parameters
step_name : str
crumb : str
value : yaml-writable value
activitysim.core.mp_tasks.get_breadcrumbs(run_list)

Read, validate, and annotate breadcrumb file from previous run

if resume_after specifies a model name, we need to determine which step it falls within, drop any subsequent steps, and set the ‘simulate’ and ‘coalesce’ flags to None so those phases will be rerun

Extract from breadcrumbs file showing completed mp_households step with 2 processes:

- apportion: true
  completed: [mp_households_0, mp_households_1]
  name: mp_households
  simulate: true
  coalesce: true
Parameters
run_list : dict

validated and annotated run_list from settings

Returns
breadcrumbs : dict

validated and annotated breadcrumbs file from previous run

activitysim.core.mp_tasks.get_run_list()

validate and annotate run_list from settings

Assign defaults to missing settings (e.g. chunk_size). Build individual step model lists based on step starts. If resuming, read the breadcrumbs file for info on the previous run’s execution status.

# annotated run_list with two steps, the second with 2 processors

resume_after: None
multiprocess: True
models:
  -  initialize_landuse
  -  compute_accessibility
  -  initialize_households
  -  school_location
  -  workplace_location

multiprocess_steps:
  step: mp_initialize
    begin: initialize_landuse
    name: mp_initialize
    models:
       - initialize_landuse
       - compute_accessibility
       - initialize_households
    num_processes: 1
    chunk_size: 0
    step_num: 0
  step: mp_households
    begin: school_location
    slice: {'tables': ['households', 'persons']}
    name: mp_households
    models:
       - school_location
       - workplace_location
    num_processes: 2
    chunk_size: 10000
    step_num: 1
Returns
run_list : dict

validated and annotated run_list

activitysim.core.mp_tasks.if_sub_task(if_is, if_isnt)

select one of two values depending on whether the current process is the primary process or a subtask

This is primarily intended for use in yaml files to select between (e.g.) logging levels so main log file can display only warnings and errors from subtasks

In yaml file, it can be used like this:

level: !!python/object/apply:activitysim.core.mp_tasks.if_sub_task [WARNING, NOTSET]

Parameters
if_is : (any type) value to return if process is a subtask
if_isnt : (any type) value to return if process is not a subtask
Returns
(any type) (one of parameters if_is or if_isnt)
activitysim.core.mp_tasks.mp_apportion_pipeline(injectables, sub_proc_names, step_info)

mp entry point for apportion_pipeline

Parameters
injectables : dict

injectables from parent

sub_proc_names : list of str

names of the sub processes to apportion

step_info : dict

step_info for the multiprocess_step we are apportioning

activitysim.core.mp_tasks.mp_coalesce_pipelines(injectables, sub_proc_names, slice_info)

mp entry point for coalesce_pipeline

Parameters
injectables : dict

injectables from parent

sub_proc_names : list of str

names of the sub processes to apportion

slice_info : dict

slice_info from multiprocess_steps

activitysim.core.mp_tasks.mp_run_simulation(locutor, queue, injectables, step_info, resume_after, **kwargs)

mp entry point for run_simulation

Parameters
locutor
queue
injectables
step_info
resume_after : bool
kwargs : dict

shared_data_buffers passed as kwargs to avoid pickling the dict

activitysim.core.mp_tasks.mp_setup_skims(injectables, **kwargs)

Sub process to load skim data into shared_data

There is no particular necessity to perform this in a sub process instead of the parent except to ensure that this heavyweight task has no side-effects (e.g. loading injectables)

Parameters
injectables : dict

injectables from parent

kwargs : dict

shared_data_buffers passed as kwargs to avoid pickling the dict

activitysim.core.mp_tasks.pipeline_table_keys(pipeline_store)

return dict of current (as of last checkpoint) pipeline tables and their checkpoint-specific hdf5_keys

This facilitates reading pipeline tables directly from a ‘raw’ open pandas.HDFStore without opening it as a pipeline (e.g. when apportioning and coalescing pipelines)

We currently only ever need to do this from the last checkpoint, so the ability to specify checkpoint_name is not required, and thus omitted.

Parameters
pipeline_store : open hdf5 pipeline_store
Returns
checkpoint_name : name of the checkpoint
checkpoint_tables : dict {<table_name>: <table_key>}
activitysim.core.mp_tasks.print_run_list(run_list, output_file=None)

Print run_list to stdout or file (informational - not read back in)

Parameters
run_list : dict
output_file : open file
activitysim.core.mp_tasks.read_breadcrumbs()

Read breadcrumbs file from previous run

write_breadcrumbs wrote OrderedDict steps as a list so order is preserved (step names are duplicated in steps)

Returns
breadcrumbs : OrderedDict
activitysim.core.mp_tasks.run_multiprocess(injectables)

run the steps in run_list, possibly resuming after checkpoint specified by resume_after

we never open the pipeline since that is all done within the multi-processing steps - mp_apportion_pipeline, run_sub_simulations, mp_coalesce_pipelines - each of which opens the pipeline/s and closes it/them within the sub-process. This ‘feature’ makes the pipeline state a bit opaque to us, for better or worse…

Steps may be either single or multi process. For multi-process steps, we need to apportion pipelines before running sub processes and coalesce them afterwards

injectables arg allows propagation of setting values that were overridden on the command line (parent process command line arguments are not available to sub-processes in Windows)

  • allocate shared data buffers for skims and shadow_pricing

  • load shared skim data from OMX files

  • run each (single or multiprocess) step in turn

Drop breadcrumbs along the way to facilitate resuming in a later run

Parameters
run_list : dict

annotated run_list (including prior run breadcrumbs if resuming)

injectables : dict

dict of values to inject in sub-processes

activitysim.core.mp_tasks.run_simulation(queue, step_info, resume_after, shared_data_buffer)

run step models as subtask

called once to run each individual sub process in multiprocess step

Unless actually resuming, resume_after will be None for the first step, and then FINAL for subsequent steps, so pipelines are opened to resume where the previous step left off

Parameters
queue : multiprocessing.Queue
step_info : dict

step_info for current step from multiprocess_steps

resume_after : str or None
shared_data_buffer : dict

dict of shared data (e.g. skims and shadow_pricing)

activitysim.core.mp_tasks.run_sub_simulations(injectables, shared_data_buffers, step_info, process_names, resume_after, previously_completed, fail_fast)

Launch sub processes to run models in step according to specification in step_info.

If resume_after is LAST_CHECKPOINT, then pick up where previous run left off, using breadcrumbs from previous run. If some sub-processes completed in the prior run, then skip rerunning them.

If resume_after specifies a checkpoint, skip checkpoints that precede the resume_after

Drop ‘completed’ breadcrumbs for this run as sub-processes terminate

Wait for all sub-processes to terminate and return list of those that completed successfully.

Parameters
injectables : dict

values to inject in subprocesses

shared_data_buffers : dict

dict of shared_data for sub-processes (e.g. skim and shadow pricing data)

step_info : dict

step_info from run_list

process_names : list of str

list of sub process names to run in parallel

resume_after : str or None

name of simulation to resume after, or LAST_CHECKPOINT to resume where previous run left off

previously_completed : list of str

names of processes that successfully completed in previous run

fail_fast : bool

whether to raise error if a sub process terminates with nonzero exitcode

Returns
completed : list of str

names of sub_processes that completed successfully

activitysim.core.mp_tasks.run_sub_task(p)

Run process p synchronously.

Return when sub process terminates, or raise error if exitcode is nonzero

Parameters
p : multiprocessing.Process
activitysim.core.mp_tasks.setup_injectables_and_logging(injectables, locutor=True)

Setup injectables (passed by parent process) within sub process

we sometimes want only one of the sub-processes to perform an action (e.g. write shadow prices); the locutor flag indicates that this sub process is the designated singleton spokesperson

Parameters
injectables : dict {<injectable_name>: <value>}

dict of injectables passed by parent process

locutor : bool

is this sub process the designated spokesperson

Returns
injects injectables
activitysim.core.mp_tasks.write_breadcrumbs(breadcrumbs)

Write breadcrumbs file with execution history of multiprocess run

Write steps as array so order is preserved (step names are duplicated in steps)

Extract from breadcrumbs file showing completed mp_households step with 2 processes:

- apportion: true
  coalesce: true
  completed: [mp_households_0, mp_households_1]
  name: mp_households
  simulate: true
Parameters
breadcrumbs : OrderedDict

Data Management

Input

Input data table functions

API

activitysim.core.input.read_from_table_info(table_info)

Read input text files and return cleaned up DataFrame.

table_info is a dictionary that specifies the following input params.

See input_table_list in settings.yaml in the example folder for a working example

key             description
tablename       name of pipeline table in which to store dataframe
filename        name of csv file to read (in data_dir)
column_map      list of input columns to rename from_name: to_name
index_col       name of column to set as dataframe index column
drop_columns    list of column names of columns to drop
h5_tablename    name of target table in HDF5 file
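For example, a hypothetical input_table_list entry using these keys might look like the following (file and column names are illustrative):

input_table_list:
  - tablename: households
    filename: households.csv
    index_col: household_id
    column_map:
      HHID: household_id
      PERSONS: hhsize
    drop_columns:
      - TEMP_FLAG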

activitysim.core.input.read_input_table(tablename, required=True)

Reads input table name and returns cleaned DataFrame.

Uses settings found in input_table_list in global settings file

Parameters
tablename : string
Returns
pandas DataFrame

LOS

Network Level of Service (LOS) data access

API

class activitysim.core.los.Network_LOS(los_settings_file_name='network_los.yaml')
singleton object to manage skims and skim-related tables

los_settings_file_name: str         # e.g. 'network_los.yaml'
skim_dtype_name:str                 # e.g. 'float32'

dict_factory_name: str              # e.g. 'NumpyArraySkimFactory'
zone_system: str                    # str (ONE_ZONE, TWO_ZONE, or THREE_ZONE)
skim_time_periods = None            # list of str e.g. ['AM', 'MD', 'PM']

skims_info: dict                    # dict of SkimInfo keyed by skim_tag
skim_buffers: dict                  # if multiprocessing, dict of multiprocessing.Array buffers keyed by skim_tag
skim_dicts: dict                    # dict of SkimDict keyed by skim_tag

# TWO_ZONE and THREE_ZONE
maz_taz_df: pandas.DataFrame        # DataFrame with two columns, MAZ and TAZ, mapping MAZ to containing TAZ
maz_to_maz_df: pandas.DataFrame     # maz_to_maz attributes for MazSkimDict sparse skims
                                    # (indexed by synthetic omaz/dmaz index for faster get_mazpairs lookup)
maz_ceiling: int                    # max maz_id + 1 (to compute synthetic omaz/dmaz index by get_mazpairs)
max_blend_distance: dict            # dict of int maz_to_maz max_blend_distance values keyed by skim_tag

# THREE_ZONE only
tap_df: pandas.DataFrame
tap_lines_df: pandas.DataFrame      # if specified in settings, list of transit lines served, indexed by TAP
                                    # use to prune maz_to_tap_dfs to drop more distant TAPS with redundant service
                                    # since a TAP can serve multiple lines, tap_lines_df TAP index is not unique
maz_to_tap_dfs: dict                # dict of maz_to_tap DataFrames indexed by access mode (e.g. 'walk', 'drive')
                                    # maz_to_tap dfs have OMAZ and DMAZ columns plus additional attribute columns
tap_tap_uid: TapTapUidCalculator
allocate_shared_skim_buffers()

Allocate multiprocessing.RawArray shared data buffers sized to hold data for the omx skims. Only called when multiprocessing - BEFORE load_data()

Returns dict of allocated buffers so that mp_tasks can add them to the dict of data to be shared with subprocesses.

Note: we are only allocating storage, but not loading any skim data into it

Returns
dict of multiprocessing.RawArray keyed by skim_tag
create_skim_dict(skim_tag)

Create a new SkimDict of type specified by skim_tag (e.g. ‘taz’, ‘maz’ or ‘tap’)

Parameters
skim_tag: str
Returns
SkimDict or subclass (e.g. MazSkimDict)
get_default_skim_dict()

Get the default (non-transit) skim dict for the (1, 2, or 3) zone_system

Returns
TAZ SkimDict for ONE_ZONE, MazSkimDict for TWO_ZONE and THREE_ZONE
get_mazpairs(omaz, dmaz, attribute)

look up attribute values of maz od pairs in sparse maz_to_maz df

Parameters
omaz: array-like list of omaz zone_ids
dmaz: array-like list of dmaz zone_ids
attribute: str name of attribute column in maz_to_maz_df
Returns
Numpy.ndarray: list of attribute values for od pairs
get_skim_dict(skim_tag)

Get SkimDict for the specified skim_tag (e.g. ‘taz’, ‘maz’, or ‘tap’)

Returns
SkimDict or subclass (e.g. MazSkimDict)
get_tappairs3d(otap, dtap, dim3, key)

TAP skim lookup

FIXME - why do we provide this for taps, but use skim wrappers for TAZ?

Parameters
otap: pandas.Series

origin (boarding tap) zone_ids

dtap: pandas.Series

dest (alighting tap) zone_ids

dim3: pandas.Series or str

dim3 (e.g. tod) str

key

skim key (e.g. ‘IWAIT_SET1’)

Returns
Numpy.ndarray: list of tap skim values for odt tuples
load_data()

Load tables and skims from files specified in network_los settings

load_settings()

Read setting file and initialize object variables (see class docstring for list of object variables)

load_shared_data(shared_data_buffers)

Load omx skim data into shared_data buffers. Only called when multiprocessing - BEFORE any models are run or any call to load_data()

Parameters
shared_data_buffers: dict of multiprocessing.RawArray keyed by skim_tag
load_skim_info()

read skim info from omx files into SkimInfo, and store in self.skims_info dict keyed by skim_tag

ONE_ZONE and TWO_ZONE systems have only TAZ skims. THREE_ZONE systems have both TAZ and TAP skims.

multiprocess()

return True if this is a multiprocessing run (even if it is a main or single-process subprocess)

Returns
bool
omx_file_names(skim_tag)

Return list of omx file names from network_los settings file for the specified skim_tag (e.g. ‘taz’)

Parameters
skim_tag: str (e.g. ‘taz’)
Returns
list of str
skim_time_period_label(time_period)

convert time period times to skim time period labels (e.g. 9 -> ‘AM’)

Parameters
time_period : pandas Series
Returns
numpy.array

string time period labels
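A rough single-process usage sketch based on the methods above (the exact call sequence is normally driven by ActivitySim’s run machinery, so treat this as illustrative):

from activitysim.core.los import Network_LOS

# illustrative sequence only
network_los = Network_LOS(los_settings_file_name='network_los.yaml')
network_los.load_settings()    # read network_los.yaml (may already happen at construction)
network_los.load_skim_info()   # read skim metadata from the omx files
network_los.load_data()        # load skims and skim-related tables

skim_dict = network_los.get_default_skim_dict()   # TAZ SkimDict (or MazSkimDict)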

Skims

Skims data access

API

class activitysim.core.skim_dict_factory.AbstractSkimFactory(network_los)

Provide access to skim data from store.

load_skim_info(skim_tag: str): dict

Read omx files for skim <skim_tag> (e.g. ‘TAZ’) and build skim_info dict

get_skim_data(skim_tag: str, skim_info: dict): SkimData

Read skim data from backing store and return it as a 3D ndarray quack-alike SkimData object

allocate_skim_buffer(skim_info, shared: bool): 1D array buffer sized for 3D SkimData

Allocate a ram skim buffer (ndarray or multiprocessing.Array) to use as frombuffer for SkimData

allocate_skim_buffer(skim_info, shared=False)

For multiprocessing

property supports_shared_data_for_multiprocessing

Does subclass support shareable data for multiprocessing

Returns
boolean
class activitysim.core.skim_dict_factory.JitMemMapSkimData(skim_cache_path, skim_info)

SkimData subclass for just-in-time memmap.

Since opening a memmap is fast, open the memmap, read the data on demand, and immediately close it. This essentially eliminates RAM usage, but it means we are loading the data every time we access the skim, which may be significantly slower, depending on patterns of usage.

property shape
Returns
list-like shape tuple as returned by numpy.shape
class activitysim.core.skim_dict_factory.MemMapSkimFactory(network_los)

The numpy.memmap docs state: “The memmap object can be used anywhere an ndarray is accepted.” You might think that since memmap duck-types ndarray, we could simply wrap it in a SkimData object.

But, as the numpy.memmap docs also say: “Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory.”

The words “small segments” are not accidental, because, as you gradually access all the parts of the memmapped array, memory usage increases as all the memory is loaded into RAM.

Under this scenario, the MemMapSkimFactory operates as a just-in-time loader, with no net savings in RAM footprint (other than potentially avoiding loading any unused skims).

Alternatively, since opening a memmap is fast, you could just open the memmap, read the data on demand, and immediately close it. This essentially eliminates RAM usage, but it means you are loading the data every time you access the skim, which, depending on your patterns of usage, may or may not be acceptable.

get_skim_data(skim_tag, skim_info)

Read skim data from backing store and return it as a 3D ndarray quack-alike SkimData object (either a JitMemMapSkimData or a memmap backed SkimData object)

Parameters
skim_tag: str
skim_info: dict
Returns
SkimData or subclass
class activitysim.core.skim_dict_factory.NumpyArraySkimFactory(network_los)
allocate_skim_buffer(skim_info, shared=False)

Allocate a ram skim buffer to use as frombuffer for SkimData. If shared is True, return a shareable multiprocessing.RawArray, otherwise a numpy.ndarray

Parameters
skim_info: dict
shared: boolean
Returns
multiprocessing.RawArray or numpy.ndarray
get_skim_data(skim_tag, skim_info)

Read skim data from backing store and return it as a 3D ndarray quack-alike SkimData object

Parameters
skim_tag: str
skim_info: dict
Returns
SkimData
load_skims_to_buffer(skim_info, skim_buffer)

Load skims from disk store (omx or cache) into ram skim buffer (multiprocessing.RawArray or numpy.ndarray)

Parameters
skim_info: dict
skim_buffer: 1D buffer sized to hold all skims (multiprocessing.RawArray or numpy.ndarray)
property supports_shared_data_for_multiprocessing

Does subclass support shareable data for multiprocessing

Returns
boolean
class activitysim.core.skim_dict_factory.SkimData(skim_data)

A facade for 3D skim data exposing numpy indexing and shape. The primary purpose is to document and police the api used to access skim data. Subclasses using a different backing store to perform additional/alternative processing only need to implement the methods exposed here.

For instance, to open/close memmapped files just in time, or to access backing data via an alternate api

property shape
Returns
list-like shape tuple as returned by numpy.shape
class activitysim.core.skim_dictionary.DataFrameMatrix(df)

Utility class to allow a pandas dataframe to be treated like a 2-D array, indexed by rowid, colname

For use in vectorized expressions where the desired values depend on both a row and a column selector, e.g. size_terms.get(df.dest_taz, df.purpose)

df = pd.DataFrame({'a': [1,2,3,4,5], 'b': [10,20,30,40,50]}, index=[100,101,102,103,104])

dfm = DataFrameMatrix(df)

dfm.get(row_ids=[100,100,103], col_ids=['a', 'b', 'a'])

returns [1, 10,  4]
get(row_ids, col_ids)
Parameters
row_ids - list of row_ids (df index values)
col_ids - list of column names, one per row_id,

specifying column from which the value for that row should be retrieved

Returns
series with one row per row_id, with the value from the column specified in col_ids
class activitysim.core.skim_dictionary.MazSkimDict(skim_tag, network_los, taz_skim_dict)

MazSkimDict provides a facade that allows skim-like lookup by maz orig,dest zone_id when there are often too many maz zones to create maz skims.

Dependencies: network_los.load_data must have already loaded: taz skim_dict, maz_to_maz_df, and maz_taz_df

It performs lookups from a sparse list of maz-maz od pairs on selected attributes (e.g. WALKDIST) where accuracy for nearby od pairs is critical, and is backed by a fallback taz skim dict to return values for more distant pairs (or for skims that are not attributes in the maz-maz table.)

get_skim_usage()

return set of keys of skims looked up. e.g. {‘DIST’, ‘SOV’}

Returns
set:
lookup(orig, dest, key)

Return list of skim values of skims(s) at orig/dest in skim with the specified key (e.g. ‘DIST’)

Look up in the sparse table (backed by taz skims) if key is a sparse_key, otherwise look up in taz skims. For taz skim lookups, the offset_mapper will convert maz zone_ids directly to taz skim indexes.

Parameters
orig: list of orig zone_ids
dest: list of dest zone_ids
key: str
Returns
Numpy.ndarray: list of skim values for od pairs
sparse_lookup(orig, dest, key)

Get impedance values for a set of origin, destination pairs.

Parameters
orig : 1D array
dest : 1D array
key : str

skim key

Returns
values : numpy 1D array
class activitysim.core.skim_dictionary.OffsetMapper(offset_int=None, offset_list=None, offset_series=None)

Utility to map skim zone ids to ordinal offsets (e.g. numpy array indices)

Can map either by a fixed offset (e.g. -1 to map 1-based to 0-based) or by an explicit mapping of zone id to offset (slower but more flexible)

Internally, there are two representations:

offset_int:

int offset which when added to zone_id yields skim array index (e.g. -1 to map 1-based zones to 0-based index)

offset_series:

pandas series with zone_id index and skim array offset values. Ordinarily, the index is just range(0, omx_size). If the series has duplicate offset values, this can map multiple zone_ids to a single skim array index (e.g. can map maz zone_ids to the corresponding taz skim offset)

map(zone_ids)

map zone_ids to skim indexes

Parameters
zone_ids : list-like (numpy.ndarray, pandas.Int64Index, or pandas.Series)
Returns
offsets : numpy array of int
set_offset_int(offset_int)

specify int offset which when added to zone_id yields skim array index (e.g. -1 to map 1-based to 0-based)

Parameters
offset_int : int
set_offset_list(offset_list)

Convenience method to set offset_series using an integer list the same size as target skim dimension with implicit skim index mapping (e.g. an omx mapping as returned by omx_file.mapentries)

Parameters
offset_list : list of int
set_offset_series(offset_series)
Parameters
offset_series: pandas.Series

series with zone_id index and skim array offset values (can map many zone_ids to skim array index)

class activitysim.core.skim_dictionary.Skim3dWrapper(skim_dict, orig_key, dest_key, dim3_key)

This works the same as a SkimWrapper above, except the third dim3 is also supplied, and a 3D lookup is performed using orig, dest, and dim3.

Parameters
skims: Skims

This is the Skims object to wrap

dim3_key : str

This identifies the column in the dataframe which is used to select among Skim object using the SECOND item in each tuple (see above for a more complete description)

set_df(df)

Set the dataframe

Parameters
df : DataFrame

The dataframe which contains the orig, dest, and dim3 values

Returns
self (to facilitate chaining)
class activitysim.core.skim_dictionary.SkimDict(skim_tag, skim_info, skim_data)

A SkimDict object is a wrapper around a dict of multiple skim objects, where each object is identified by a key.

Note that keys are either strings or tuples of two strings (to support stacking of skims.)

get_skim_usage()

return set of keys of skims looked up. e.g. {‘DIST’, ‘SOV’}

Returns
set:
lookup(orig, dest, key)

Return list of skim values of skims(s) at orig/dest in skim with the specified key (e.g. ‘DIST’)

Parameters
orig: list of orig zone_ids
dest: list of dest zone_ids
key: str
Returns
Numpy.ndarray: list of skim values for od pairs
lookup_3d(orig, dest, dim3, key)

3D lookup of skim values of skims(s) at orig/dest for stacked skims indexed by dim3 selector

The idea is that skims may be stacked in groups with a base key and a dim3 key (usually a time of day key)

On import (from omx) skim stacks are represented by base and dim3 keys separated by a double underscore

e.g. DRV_COM_WLK_BOARDS__AM indicates base skim key DRV_COM_WLK_BOARDS with a time of day (dim3) of ‘AM’

Since all the skims are stored in a single contiguous 3D array, we can use the dim3 key as a third index and thus rapidly get skim values for a list of (orig, dest, tod) tuples using index arrays (‘fancy indexing’)

Parameters
orig: list of orig zone_ids
dest: list of dest zone_ids
block_offsets: list with one dim3 key for each orig/dest pair
Returns
Numpy.ndarray: list of skim values
wrap(orig_key, dest_key)

return a SkimWrapper for self

wrap_3d(orig_key, dest_key, dim3_key)

return a Skim3dWrapper for self

property zone_ids

Return list of zone_ids we grok in skim index order

Returns
ndarray of int domain zone_ids
class activitysim.core.skim_dictionary.SkimWrapper(skim_dict, orig_key, dest_key)

A SkimWrapper object is an access wrapper around a SkimDict of multiple skim objects, where each object is identified by a key.

This is just a way to simplify expression files by hiding the orig and dest arguments when the orig and dest vectors are in a dataframe with known column names (specified at init time). The dataframe is identified by set_df because it may not be available (e.g. due to chunking) at the time the SkimWrapper is instantiated.

When the user calls skims[key], key is an identifier for which skim to use, and the object automatically looks up impedances of that skim using the specified orig_key column in df as the origin and the dest_key column in df as the destination. In this way, the user does not do the O-D lookup by hand and only specifies which skim to use for this lookup. This is the only purpose of this object: to abstract away the O-D lookup and use skims by specifying which skim to use in the expressions.

Note that keys are either strings or tuples of two strings (to support stacking of skims.)

lookup(key, reverse=False)

Generally not called by the user - use __getitem__ instead

Parameters
key : hashable

The key (identifier) for this skim object

od : bool (optional)

od=True means lookup standard origin-destination skim value; od=False means lookup destination-origin skim value

Returns
impedances: pd.Series

A Series of impedances which are elements of the Skim object and with the same index as df

max(key)

return max skim value in either o-d or d-o direction

reverse(key)

return skim value in reverse (d-o) direction

set_df(df)

Set the dataframe

Parameters
df : DataFrame

The dataframe which contains the origin and destination ids

Returns
self (to facilitate chaining)
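A hypothetical sketch of how a SkimWrapper might be used outside an expression file (the column names and the ‘DIST’ key are assumptions):

# skim_dict is a SkimDict obtained from network_los.get_default_skim_dict()
skims = skim_dict.wrap('home_zone_id', 'workplace_zone_id')
skims.set_df(persons_df)               # dataframe containing the origin and destination columns

dist = skims['DIST']                   # O-D impedance for each row of persons_df
dist_reverse = skims.reverse('DIST')   # the same impedance in the D-O direction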

Pipeline

Data pipeline manager, which manages the list of model steps, runs them, reads and writes data tables from/to the pipeline datastore, and supports restarting of the pipeline at any model step.

API

activitysim.core.pipeline.add_checkpoint(checkpoint_name)

Create a new checkpoint with specified name, write all data required to restore the simulation to its current state.

Detect any changed tables, re-wrap them and write the current version to the pipeline store. Write the current state of the random number generator.

Parameters
checkpoint_name : str
activitysim.core.pipeline.checkpointed_tables()

Return a list of the names of all checkpointed tables

activitysim.core.pipeline.cleanup_pipeline()

Cleanup pipeline after successful run

Open main pipeline if not already open (will be closed if multiprocess). Create a single-checkpoint pipeline file with the latest version of all checkpointed tables. Delete the main pipeline and any subprocess pipelines.

Called if cleanup_pipeline_after_run setting is True

Returns
nothing, but with changed state: pipeline file that was open on call is closed and deleted
activitysim.core.pipeline.close_pipeline()

Close any known open files

activitysim.core.pipeline.extend_table(table_name, df, axis=0)

add new table or extend (add rows) to an existing table

Parameters
table_name : str

orca/inject table name

df : pandas DataFrame
activitysim.core.pipeline.get_checkpoints()

Get pandas dataframe of info about all checkpoints stored in pipeline

pipeline doesn’t have to be open

Returns
checkpoints_df : pandas.DataFrame
activitysim.core.pipeline.get_pipeline_store()

Return the open pipeline hdf5 checkpoint store, or None if it has not been opened

activitysim.core.pipeline.get_rn_generator()

Return the singleton random number object

Returns
activitysim.random.Random
activitysim.core.pipeline.get_table(table_name, checkpoint_name=None)

Return pandas dataframe corresponding to table_name

if checkpoint_name is None, return the current (most recent) version of the table. The table can be a checkpointed table or any registered orca table (e.g. function table)

if checkpoint_name is specified, return table as it was at that checkpoint (the most recently checkpointed version of the table at or before checkpoint_name)

Parameters
table_name : str
checkpoint_name : str or None
Returns
df : pandas.DataFrame
activitysim.core.pipeline.last_checkpoint()
Returns
last_checkpoint: str

name of last checkpoint

activitysim.core.pipeline.load_checkpoint(checkpoint_name)

Load dataframes and restore random number channel state from pipeline hdf5 file. This restores the pipeline state that existed at the specified checkpoint in a prior simulation. This allows us to resume the simulation after the specified checkpoint

Parameters
checkpoint_name : str

model_name of checkpoint to load (resume_after argument to open_pipeline)

activitysim.core.pipeline.open_pipeline(resume_after=None)

Start pipeline, either for a new run or, if resume_after, loading checkpoint from pipeline.

If resume_after, then we expect the pipeline hdf5 file to exist and contain checkpoints from a previous run, including a checkpoint with name specified in resume_after

Parameters
resume_after : str or None

name of checkpoint to load from pipeline store

activitysim.core.pipeline.open_pipeline_store(overwrite=False)

Open the pipeline checkpoint store

Parameters
overwrite : bool

delete file before opening (unless resuming)

activitysim.core.pipeline.read_df(table_name, checkpoint_name=None)

Read a pandas dataframe from the pipeline store.

We store multiple versions of all simulation tables, for every checkpoint in which they change, so we need to know both the table_name and the checkpoint_name of the desired table.

The only exception is the checkpoints dataframe, which just has a table_name

An error will be raised by HDFStore if the table is not found

Parameters
table_name : str
checkpoint_name : str
Returns
df : pandas.DataFrame

the dataframe read from the store

activitysim.core.pipeline.registered_tables()

Return a list of the names of all currently registered dataframe tables

activitysim.core.pipeline.replace_table(table_name, df)

Add or replace a orca table, removing any existing added orca columns

The use case for this function is a method that calls to_frame on an orca table, modifies it and then saves the modified table.

orca.to_frame returns a copy, so no changes are saved, and adding multiple columns with add_column adds them in an indeterminate order.

Simply replacing an existing table “behind the pipeline’s back” by calling orca.add_table risks the pipeline failing to detect that it has changed, and thus not checkpointing the changes.

Parameters
table_name : str

orca/pipeline table name

df : pandas DataFrame
activitysim.core.pipeline.rewrap(table_name, df=None)

Add or replace an orca registered table as a unitary DataFrame-backed DataFrameWrapper table

if df is None, then get the dataframe from orca (table_name should be registered, or an error will be thrown) which may involve evaluating added columns, etc.

If the orca table already exists, deregister it along with any associated columns before re-registering it.

The net result is that the dataframe is a registered orca DataFrameWrapper table with no computed or added columns.

Parameters
table_name
df
Returns
the underlying df of the rewrapped table
activitysim.core.pipeline.run(models, resume_after=None)

run the specified list of models, optionally loading checkpoint and resuming after specified checkpoint.

Since we use model_name as checkpoint name, the same model may not be run more than once.

If resume_after checkpoint is specified and a model with that name appears in the models list, then we only run the models after that point in the list. This allows the user always to pass the same list of models, but specify a resume_after point if desired.

Parameters
models : [str]

list of model_names

resume_after : str or None

model_name of checkpoint to load checkpoint and AFTER WHICH to resume model run

returns:

nothing, but with pipeline open

activitysim.core.pipeline.run_model(model_name)

Run the specified model and add checkpoint for model_name

Since we use model_name as checkpoint name, the same model may not be run more than once.

Parameters
model_name : str

model_name is assumed to be the name of a registered orca step

activitysim.core.pipeline.split_arg(s, sep, default='')

split str s in two at first sep, returning empty string as second result if no sep

activitysim.core.pipeline.write_df(df, table_name, checkpoint_name=None)

Write a pandas dataframe to the pipeline store.

We store multiple versions of all simulation tables, for every checkpoint in which they change, so we need to know both the table_name and the checkpoint_name to label the saved table

The only exception is the checkpoints dataframe, which just has a table_name

Parameters
df : pandas.DataFrame

dataframe to store

table_name : str

also conventionally the injected table name

checkpoint_name : str

the checkpoint at which the table was created/modified
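A minimal usage sketch based on the pipeline functions above (model names are illustrative):

from activitysim.core import pipeline

models = [
    'initialize_landuse',
    'initialize_households',
    'school_location',
]

# run the listed model steps, adding a checkpoint after each one
pipeline.run(models)

# a later run could resume after any checkpointed model step, e.g.
# pipeline.run(models, resume_after='initialize_households')

households = pipeline.get_table('households')   # current version of a checkpointed table
pipeline.close_pipeline()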

Random

ActivitySim’s random number generation has a number of important features unique to AB modeling:

  • Regression testing, debugging - run the exact model with the same inputs and get exactly the same results.

  • Debugging models - run the exact model with the same inputs but with changes to expression files and get the same results except where the equations differ.

  • Since runs can take a while, the above cases need to work with a restartable pipeline.

  • Debugging Multithreading - run the exact model with different multithreading configurations and get the same results.

  • Repeatable household-level choices - results for a household are repeatable when run with different sample sizes

  • Repeatable household level results with different scenarios - results for a household are repeatable with different scenario configurations sequentially up to the point at which those differences emerge, and in alternate submodels in which those differences do not apply.

Random number generation is done using the numpy Mersenne Twister PRNG. ActivitySim seeds on-the-fly and uses a stream of random numbers seeded by the household id, person id, tour id, trip id, the model step offset, and the global seed. The logic for calculating the seed is something along the lines of:

chooser_table.index * number_of_models_for_chooser + chooser_model_offset + global_seed_offset

for example
  1425 * 2 + 0 + 1
where:
  1425 = household table index - households.id
  2 = number of household level models - auto ownership and cdap
  0 = first household model - auto ownership
  1 = global seed offset for testing the same model under different random global seeds
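Illustrative only, that arithmetic can be written as follows (the function name is made up; the real logic lives in activitysim.core.random):

def household_stream_seed(household_id, n_household_models, model_offset, global_seed_offset):
    # e.g. 1425 * 2 + 0 + 1 from the worked example above
    return household_id * n_household_models + model_offset + global_seed_offset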

ActivitySim generates a separate, distinct, and stable random number stream for each tour type and tour number in order to maintain as much stability as is possible across alternative scenarios. This is done for trips as well, by direction (inbound versus outbound).

Note

The Random module contains max model steps constants by chooser type - household, person, tour, trip - which need to be equal to the number of chooser sub-models.

API

class activitysim.core.random.SimpleChannel(channel_name, base_seed, domain_df, step_name)

We need to ensure that we generate the same random streams (when re-run, or even across different simulations). We do this by generating a random seed for each domain_df row that is based on the domain_df index (which implies that generated tables like tours and trips are also created with stable, predictable, repeatable row indexes).

Because we need to generate a distinct stream for each step, we can’t just use the domain_df index - we need a strategy for handling multiple steps without generating collisions between streams (i.e. choosing the same seed for more than one stream.)

The easiest way to do this would be to use an array of integers to seed the generator, with a global seed, a channel seed, a row seed, and a step seed. Unfortunately, seeding numpy RandomState with arrays is a LOT slower than with a single integer seed, and speed matters because we reseed on-the-fly for every call because creating a different RandomState object for each row uses too much memory (5K per RandomState object)

numpy random seeds are unsigned int32 so there are 4,294,967,295 available seeds. That is probably just about enough to distribute evenly, for most cities, depending on the number of households, persons, tours, trips, and steps.

So we use (global_seed + channel_seed + step_seed + row_index) % (1 << 32) to get an int32 seed rather than a tuple.

We do read in the whole households and persons tables at start time, so we could note the max index values. But we might then want a way to ensure stability between the test, example, and full datasets. I am punting on this for now.

begin_step(step_name)

Reset channel state for a new step

Parameters
step_name : str

pipeline step name for this step

choice_for_df(df, step_name, a, size, replace)

Apply numpy.random.choice once for each row in df using the appropriate random channel for each row.

Concatenate the choice arrays for every row into a single 1-D ndarray. The resulting array will be of length: size * len(df.index). This method is designed to support creation of an interaction_dataset.

The columns in df are ignored; the index name and values are used to determine which random number sequence to use.

Parameters
df : pandas.DataFrame

df with index name and values corresponding to a registered channel

step_name : str

current step name so we can update row_states seed info

The remaining parameters are passed through as arguments to numpy.random.choice
a : 1-D array-like or int

If an ndarray, a random sample is generated from its elements. If an int, the random sample is generated as if a was np.arange(n)

size : int or tuple of ints

Output shape

replace : boolean

Whether the sample is with or without replacement

Returns
choices : 1-D ndarray of length: size * len(df.index)

The generated random samples for each row concatenated into a single (flat) array

extend_domain(domain_df)

Extend or create row_state df by adding seed info for each row in domain_df

If extending, the index values of new tables must be disjoint so there will be no ambiguity/collisions between rows

Parameters
domain_df : pandas.DataFrame

domain dataframe with index values for which random streams are to be generated and well-known index name corresponding to the channel

init_row_states_for_step(row_states)

initialize row states (in place) for new step

with stable, predictable, repeatable row_seeds for that domain_df index value

See notes on the seed generation strategy in class comment above.

Parameters
row_states
normal_for_df(df, step_name, mu, sigma, lognormal=False)

Return a floating point random number in normal (or lognormal) distribution for each row in df using the appropriate random channel for each row.

Subsequent calls (in the same step) will return the next rand for each df row

The resulting array will be the same length (and order) as df. This method is designed to support alternative selection from a probability array.

The columns in df are ignored; the index name and values are used to determine which random number sequence to use.

If “true pseudo random” behavior is desired (i.e. NOT repeatable) the set_base_seed method (q.v.) may be used to globally reseed all random streams.

Parameters
df : pandas.DataFrame or Series

df or series with index name and values corresponding to a registered channel

mu : float or pd.Series or array of floats with one value per df row
sigma : float or array of floats with one value per df row
Returns
rands : 2-D ndarray

array the same length as df, with n floats in range [0, 1) for each df row

random_for_df(df, step_name, n=1)

Return n floating point random numbers in range [0, 1) for each row in df using the appropriate random channel for each row.

Subsequent calls (in the same step) will return the next rand for each df row

The resulting array will be the same length (and order) as df. This method is designed to support alternative selection from a probability array.

The columns in df are ignored; the index name and values are used to determine which random number sequence to use.

If “true pseudo random” behavior is desired (i.e. NOT repeatable) the set_base_seed method (q.v.) may be used to globally reseed all random streams.

Parameters
df : pandas.DataFrame

df with index name and values corresponding to a registered channel

n : int

number of rands desired per df row

Returns
rands : 2-D ndarray

array the same length as df, with n floats in range [0, 1) for each df row

activitysim.core.random.hash32(s)
Parameters
s: str
Returns
32 bit unsigned hash

Tracing

Household tracer. If a household trace ID is specified, then ActivitySim will output a comprehensive set of trace files for all calculations for all household members:

  • hhtrace.log - household trace log file, which specifies the CSV files traced. The order of output files is consistent with the model sequence.

  • various CSV files - every input, intermediate, and output data table - chooser, expressions/utilities, probabilities, choices, etc. - for the trace household for every sub-model

With the set of output CSV files, the user can trace ActivitySim’s calculations in order to ensure they are correct and/or to help debug data and/or logic errors.
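Household tracing is switched on in the global settings file; a hypothetical example (the household id is illustrative):

# settings.yaml
trace_hh_id: 982875   # write trace files for this household (and its persons, tours, trips)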

API

activitysim.core.tracing.config_logger(basic=False)

Configure logger

look for conf file in configs_dir, if not found use basicConfig

Returns
Nothing
activitysim.core.tracing.delete_output_files(file_type, ignore=None, subdir=None)

Delete files in output directory of specified type

Parameters
output_dir: str

Directory of trace output CSVs

Returns
Nothing
activitysim.core.tracing.delete_trace_files()

Delete CSV files in output_dir

Returns
Nothing
activitysim.core.tracing.deregister_traceable_table(table_name)

un-register traceable table

Parameters
df: pandas.DataFrame

traced dataframe

Returns
Nothing
activitysim.core.tracing.get_trace_target(df, slicer, column=None)

get target ids and column or index to identify target trace rows in df

Parameters
df: pandas.DataFrame

dataframe to slice

slicer: str

name of column or index to use for slicing

Returns
(target, column) tuple
target : int or list of ints

id or ids that identify tracer target rows

column : str

name of column to search for targets or None to search index

activitysim.core.tracing.hh_id_for_chooser(id, choosers)
Parameters
id - scalar id (or list of ids) from chooser index
choosers - pandas dataframe whose index contains ids
Returns
scalar household_id or series of household_ids
activitysim.core.tracing.interaction_trace_rows(interaction_df, choosers, sample_size=None)

Trace model design for interaction_simulate

Parameters
interaction_df: pandas.DataFrame

traced model_design dataframe

choosers: pandas.DataFrame

interaction_simulate choosers (needed to filter the model_design dataframe by traced hh or person id)

sample_size : int or None

int for constant sample size, or None if choosers have different numbers of alternatives

Returns
trace_rows : numpy.ndarray

array of booleans to flag which rows in interaction_df to trace

trace_ids : tuple (str, numpy.ndarray)

column name and array of trace_ids mapping trace_rows to their target_id for use by trace_interaction_eval_results which needs to know target_id so it can create separate tables for each distinct target for readability

activitysim.core.tracing.no_results(trace_label)

standard no-op to write tracing when a model produces no results

activitysim.core.tracing.print_summary(label, df, describe=False, value_counts=False)

Print summary

Parameters
label: str

tracer name

df: pandas.DataFrame

traced dataframe

describe: boolean

print describe?

value_counts: boolean

print value counts?

Returns
Nothing
activitysim.core.tracing.register_traceable_table(table_name, df)

Register traceable table

Parameters
table_name: str

name under which to register the traced table

df: pandas.DataFrame

traced dataframe

Returns
Nothing
activitysim.core.tracing.slice_ids(df, ids, column=None)

slice a dataframe to select only records with the specified ids

Parameters
df: pandas.DataFrame

traced dataframe

ids: int or list of ints

slice ids

column: str

column to slice (slice using index if None)

Returns
df: pandas.DataFrame

sliced dataframe

activitysim.core.tracing.trace_df(df, label, slicer=None, columns=None, index_label=None, column_labels=None, transpose=True, warn_if_empty=False)

Slice dataframe by traced household or person id dataframe and write to CSV

Parameters
df: pandas.DataFrame

traced dataframe

label: str

tracer name

slicer: Object

slicer for subsetting

columns: list

columns to write

index_label: str

index name

column_labels: [str, str]

labels for columns in csv

transpose: boolean

whether to transpose file for legibility

warn_if_empty: boolean

write warning if sliced df is empty

Returns
Nothing
activitysim.core.tracing.trace_id_for_chooser(id, choosers)
Parameters
id - scalar id (or list of ids) from chooser index
choosers - pandas dataframe whose index contains ids
Returns
scalar household_id or series of household_ids
activitysim.core.tracing.trace_interaction_eval_results(trace_results, trace_ids, label)

Trace model design eval results for interaction_simulate

Parameters
trace_results: pandas.DataFrame

traced model_design dataframe

trace_ids: tuple (str, numpy.ndarray)

column name and array of trace_ids from interaction_trace_rows() used to filter the trace_results dataframe by traced hh or person id

label: str

tracer name

Returns
Nothing
activitysim.core.tracing.write_csv(df, file_name, index_label=None, columns=None, column_labels=None, transpose=True)

Write df (DataFrame or Series) to a CSV file

Parameters
df: pandas.DataFrame or pandas.Series

traced dataframe

file_name: str

output file name

index_label: str

index name

columns: list

columns to write

transpose: bool

whether to transpose dataframe (ignored for series)

Returns
Nothing

Utility Expressions

Much of the power of ActivitySim comes from being able to specify Python, pandas, and numpy expressions for calculations. Refer to the pandas help for a general introduction to expressions. ActivitySim provides two ways to evaluate expressions:

  • Simple table expressions are evaluated using DataFrame.eval(). pandas’ eval operates on the current table.

  • Python expressions, denoted by beginning with @, are evaluated with Python’s eval().

Simple table expressions can only refer to columns in the current DataFrame. Python expressions can refer to any Python objects currently in memory.
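
A minimal sketch of the two evaluation styles (illustrative only, not the internal implementation):

import pandas as pd

df = pd.DataFrame({'drivers': [1, 2, 3], 'workers': [0, 2, 5]})

# simple table expression - evaluated against the current table with DataFrame.eval
simple_result = df.eval('drivers == 2')

# python expression - written with a leading @ in the spec file and evaluated with
# python's eval, so it can use functions and objects beyond the current table
python_result = eval('df.workers.clip(upper=3)', {}, {'df': df})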

Conventions

There are a few conventions for writing expressions in ActivitySim:

  • each expression is applied to all rows in the table being operated on

  • expressions must be vectorized expressions and can use most numpy and pandas expressions

  • global constants are specified in the settings file

  • comments are specified with #

  • you can refer to the current table being operated on as df

  • often an object called skims, skims_od, or similar is available and is used to lookup the relevant skim information. See LOS for more information.

  • when editing the CSV files in Excel, use single quote ‘ or space at the start of a cell to get Excel to accept the expression

Example Expressions File

An expressions file has the following basic form:

Label                      Description                     Expression                  cars0   cars1
util_drivers_2             2 Adults (age 16+)              drivers==2                          coef_cars1_drivers_2
util_persons_25_34         Persons age 25-34               num_young_adults                    coef_cars1_persons_25_34
util_num_workers_clip_3    Number of workers, capped at 3  @df.workers.clip(upper=3)           coef_cars1_num_workers_clip_3
util_dist_0_1              Distance, from 0 to 1 miles     @skims['DIST'].clip(1)              coef_dist_0_1

In the Tour Mode Choice model expression file example shown below, the odt_skims['SOV_TIME'] + dot_skims['SOV_TIME'] expression is the travel time from the tour origin to the destination at the tour start time plus the travel time from the tour destination back to the origin at the tour end time. The odt_skims and dot_skims objects are set up ahead of time to refer to the relevant skims for this model. The coef_ivt coefficient comes from the tour mode choice coefficient file. The tour mode choice model is a nested logit (NL) model and the nesting structure (including nesting coefficients) is specified in the YAML settings file.

Label                                                      Description                                             Expression                                      DRIVEALONEFREE   DRIVEALONEPAY
util_DRIVEALONEFREE_Unavailable                            DRIVEALONEFREE - Unavailable                            sov_available == False                          -999
util_DRIVEALONEFREE_In_vehicle_time                        DRIVEALONEFREE - In-vehicle time                        odt_skims['SOV_TIME'] + dot_skims['SOV_TIME']   coef_ivt
util_DRIVEALONEFREE_Unavailable_for_persons_less_than_16   DRIVEALONEFREE - Unavailable for persons less than 16   age < 16                                        -999
util_DRIVEALONEFREE_Unavailable_for_joint_tours            DRIVEALONEFREE - Unavailable for joint tours            is_joint == True                                -999

  • Rows are vectorized expressions that will be calculated for every record in the current table being operated on

  • The Label column is the unique expression name (used for model estimation integration)

  • The Description column describes the expression

  • The Expression column contains a valid vectorized Python/pandas/numpy expression. In the example above, drivers is a column in the current table. Use @ to refer to data outside the current table

  • There is a column for each alternative and its relevant coefficient from the submodel coefficient file

There are some variations on this setup, but the functionality is similar. For example, in the example destination choice model, the size terms expressions file has market segments as rows and employment type coefficients as columns. Broadly speaking, there are currently four types of model expression configurations:

  • Simple Simulate choice model - select from a fixed set of choices defined in the specification file, such as the example above.

  • Simulate with Interaction choice model - combine the choice expressions with the choice alternatives files since the alternatives are not listed in the expressions file. The Non-Mandatory Tour Destination Choice model implements this approach.

  • Combinatorial choice model - first generate a set of alternatives based on a combination of alternatives across choosers, and then make choices. The Coordinated Daily Activity Pattern model implements this approach.

Expressions

The expressions class is often used for pre- and post-processor table annotation, which reads a CSV file of expressions, calculates a number of additional table fields, and joins the fields to the target table. An example table annotation expressions file is found in the example configuration files for households for the CDAP model - annotate_households_cdap.csv.

activitysim.core.expressions.assign_columns(df, model_settings, locals_dict={}, trace_label=None)

Evaluate expressions in context of df and assign resulting target columns to df

Can add new or modify existing columns (if target same as existing df column name)

Parameters - same as for compute_columns except df must not be None
Returns - nothing since we modify df in place

activitysim.core.expressions.compute_columns(df, model_settings, locals_dict={}, trace_label=None)

Evaluate expressions_spec in context of df, with optional additional pipeline tables in locals

Parameters
df: pandas DataFrame

or if None, expect name of pipeline table to be specified by DF in model_settings

model_settings: dict or str

dict with keys:

DF - df_alias and (additionally, if df is None) name of pipeline table to load as df
SPEC - name of expressions file (csv suffix optional) if different from model_settings
TABLES - list of pipeline tables to load and make available as (read only) locals

str:

name of yaml file in configs_dir to load dict from

locals_dict: dict

dict of locals (e.g. utility functions) to add to the execution environment

trace_label
Returns
results: pandas.DataFrame

one column for each expression (except temps with ALL_CAP target names), same index as df
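
A toy, pure-pandas sketch of what a table annotation step does conceptually (the column names and expressions here are made up; this is not the compute_columns implementation):

import pandas as pd

households = pd.DataFrame({'income': [30000, 90000], 'workers': [1, 3]},
                          index=pd.Index([1, 2], name='household_id'))

# an annotation spec maps target column names to expressions
annotation_spec = {
    'low_income': 'income < 60000',
    'multi_worker': 'workers >= 2',
}

results = pd.DataFrame({target: households.eval(expr)
                        for target, expr in annotation_spec.items()},
                       index=households.index)

# the resulting columns would then be joined back onto the households table
households = households.join(results)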

Sampling with Interaction

Methods for expression handling, solving, and sampling (i.e. making multiple choices), with interaction with the chooser table.

Sampling is done with replacement and a sample correction factor is calculated. The factor is calculated as follows:

freq = how often an alternative is sampled (i.e. the pick_count)
prob = probability of the alternative
correction_factor = log(freq/prob)

#for example:

freq              1.00        2.00    3.00    4.00    5.00
prob              0.30        0.30    0.30    0.30    0.30
correction factor 1.20        1.90    2.30    2.59    2.81

As the alternative is oversampled, its utility goes up for final selection. The unique set of alternatives is passed to the final choice model and the correction factor is included in the utility.
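
The correction factors in the table above can be reproduced directly (a quick numpy check of the formula):

import numpy as np

freq = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # pick_count: how often the alternative was sampled
prob = np.full(5, 0.3)                        # probability of the alternative

correction_factor = np.log(freq / prob)       # approx [1.20, 1.90, 2.30, 2.59, 2.81]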

API

activitysim.core.interaction_sample.interaction_sample(choosers, alternatives, spec, sample_size, alt_col_name, allow_zero_probs=False, log_alt_losers=False, skims=None, locals_d=None, chunk_size=0, chunk_tag=None, trace_label=None)

Run a simulation in the situation in which alternatives must be merged with choosers because there are interaction terms or because alternatives are being sampled.

optionally (if chunk_size > 0) iterates over choosers in chunk_size chunks

Parameters
chooserspandas.DataFrame

DataFrame of choosers

alternativespandas.DataFrame

DataFrame of alternatives - will be merged with choosers and sampled

specpandas.DataFrame

A Pandas DataFrame that gives the specification of the variables to compute and the coefficients for each variable. Variable specifications must be in the table index and the table should have only one column of coefficients.

sample_sizeint, optional

Sample alternatives with sample of given size. By default is None, which does not sample alternatives.

alt_col_name: str

name to give the sampled_alternative column

skimsSkims object

The skims object is used to contain multiple matrices of origin-destination impedances. Make sure to also add it to the locals_d below in order to access it in expressions. The only job of this method in regards to skims is to call set_df with the dataframe that comes back from interacting choosers with alternatives. See the skims module for more documentation on how the skims object is intended to be used.

locals_dDict

This is a dictionary of local variables that will be the environment for an evaluation of an expression that begins with @

chunk_sizeint

if chunk_size > 0 iterates over choosers in chunk_size chunks

trace_label: str

This is the label to be used for trace log file entries and dump file names when household tracing enabled. No tracing occurs if label is empty or None.

Returns
choices_df: pandas.DataFrame

A DataFrame whose index should match the index of the choosers DataFrame (except with sample_size rows for each chooser row, one row for each alt sample) and columns alt_col_name, prob, rand, pick_count

<alt_col_name>:

alt identifier from alternatives[<alt_col_name>]

prob: float

the probability of the chosen alternative

pick_count: int

number of duplicate picks for chooser, alt

activitysim.core.interaction_sample.make_sample_choices(choosers, probs, alternatives, sample_size, alternative_count, alt_col_name, allow_zero_probs, trace_label)
Parameters
choosers
probs: pandas DataFrame

one row per chooser and one column per alternative

alternatives

dataframe with index containing alt ids

sample_size: int

number of samples/choices to make

alternative_count
alt_col_name: str
trace_label

Simulate

Methods for expression handling, solving, choosing (i.e. making choices) from a fixed set of choices defined in the specification file.

API

activitysim.core.simulate.compute_base_probabilities(nested_probabilities, nests, spec)

Compute base probabilities for nest leaves. Base probabilities will be the nest-adjusted probabilities of all leaves. This flattens or normalizes all the nested probabilities so that they have the proper global relative values (the leaf probabilities sum to 1 for each row).

Parameters
nested_probabilities: pandas.DataFrame

dataframe with the nested probabilities for nest leaves and nodes

nests: dict

Nest tree dict from the model spec yaml file

spec: pandas.DataFrame

simple simulate spec so we can return columns in appropriate order

Returns
base_probabilities: pandas.DataFrame

Will have the index of nested_probabilities and columns for leaf base probabilities

activitysim.core.simulate.compute_nested_exp_utilities(raw_utilities, nest_spec)

compute exponentiated nest utilities based on nesting coefficients

For nest nodes this is the exponentiated logsum of alternatives adjusted by nesting coefficient

leaf <- exp( raw_utility )
nest <- exp( ln(sum of exponentiated raw_utility of leaves) * nest_coefficient )

Parameters
raw_utilities: pandas.DataFrame

dataframe with the raw alternative utilities of all leaves (what in non-nested logit would be the utilities of all the alternatives)

nest_spec: dict

Nest tree dict from the model spec yaml file

Returns
nested_utilities: pandas.DataFrame

Will have the index of raw_utilities and columns for exponentiated leaf and node utilities
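
A toy numeric sketch of the leaf and nest formulas above, for a single nest with two leaves (illustrative only, not the ActivitySim implementation):

import numpy as np
import pandas as pd

raw_utilities = pd.DataFrame({'drive_alone': [1.0, 0.2], 'shared_ride': [0.5, 0.8]})
nest_coefficient = 0.7

exp_leaves = np.exp(raw_utilities)                     # leaf <- exp(raw_utility)
nest_logsum = np.log(exp_leaves.sum(axis=1))           # ln(sum of exponentiated leaf utilities)
exp_nest = np.exp(nest_coefficient * nest_logsum)      # nest <- exp(logsum * nest_coefficient)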

activitysim.core.simulate.compute_nested_probabilities(nested_exp_utilities, nest_spec, trace_label)

Compute nested probabilities for nest leaves and nodes. The probability for a nest alternative is simply the alternative's local (to the nest) probability, computed in the same way as the probability of non-nested alternatives in multinomial logit, i.e. the fractional share of the sum of the exponentiated utility of itself and its siblings, except in nested logit its sibling group is restricted to the nest.

Parameters
nested_exp_utilities: pandas.DataFrame

dataframe with the exponentiated nested utilities of all leaves and nodes

nest_spec: dict

Nest tree dict from the model spec yaml file

Returns
nested_probabilities: pandas.DataFrame

Will have the index of nested_exp_utilities and columns for leaf and node probabilities

activitysim.core.simulate.dump_mapped_coefficients(model_settings)

dump template_df with coefficient values

activitysim.core.simulate.eval_mnl(choosers, spec, locals_d, custom_chooser, estimator, log_alt_losers=False, want_logsums=False, trace_label=None, trace_choice_name=None, trace_column_names=None)

Run a simulation for when the model spec does not involve alternative specific data, e.g. there are no interactions with alternative properties and no need to sample from alternatives.

Each row in spec computes a partial utility for each alternative, by providing a spec expression (often a boolean 0-1 trigger) and a column of utility coefficients for each alternative.

We compute the utility of each alternative by matrix-multiplication of eval results with the utility coefficients in the spec alternative columns yielding one row per chooser and one column per alternative

Parameters
chooserspandas.DataFrame
specpandas.DataFrame

A table of variable specifications and coefficient values. Variable expressions should be in the table index and the table should have a column for each alternative.

locals_dDict or None

This is a dictionary of local variables that will be the environment for an evaluation of an expression that begins with @

custom_chooserfunction(probs, choosers, spec, trace_label) returns choices, rands

custom alternative to logit.make_choices

estimatorEstimator object

called to report intermediate table results (used for estimation)

trace_label: str

This is the label to be used for trace log file entries and dump file names when household tracing enabled. No tracing occurs if label is empty or None.

trace_choice_name: str

This is the column label to be used in trace file csv dump of choices

trace_column_names: str or list of str

chooser columns to include when tracing expression_values

Returns
choicespandas.Series

Index will be that of choosers, values will match the columns of spec.
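
A toy illustration of that matrix multiplication (made-up expressions and coefficients; not the actual eval_mnl implementation):

import pandas as pd

# spec: expressions in the index, one column of coefficients per alternative
spec = pd.DataFrame({'alt0': [0.0, -0.05], 'alt1': [1.5, -0.10]},
                    index=['drivers == 2', 'income'])
choosers = pd.DataFrame({'drivers': [2, 1], 'income': [50, 80]})

# evaluate each spec expression against the choosers
expression_values = pd.concat([choosers.eval(expr) for expr in spec.index],
                              axis=1).astype(float)
expression_values.columns = spec.index

# utilities: one row per chooser, one column per alternative
utilities = expression_values.dot(spec)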

activitysim.core.simulate.eval_mnl_logsums(choosers, spec, locals_d, trace_label=None)

like eval_mnl except return logsums instead of making choices

Returns
logsumspandas.Series

Index will be that of choosers, values will be logsum across spec column values

activitysim.core.simulate.eval_nl(choosers, spec, nest_spec, locals_d, custom_chooser, estimator, log_alt_losers=False, want_logsums=False, trace_label=None, trace_choice_name=None, trace_column_names=None)

Run a nested-logit simulation for when the model spec does not involve alternative specific data, e.g. there are no interactions with alternative properties and no need to sample from alternatives.

Parameters
chooserspandas.DataFrame
specpandas.DataFrame

A table of variable specifications and coefficient values. Variable expressions should be in the table index and the table should have a column for each alternative.

nest_spec:

dictionary specifying nesting structure and nesting coefficients (from the model spec yaml file)

locals_dDict or None

This is a dictionary of local variables that will be the environment for an evaluation of an expression that begins with @

custom_chooserfunction(probs, choosers, spec, trace_label) returns choices, rands

custom alternative to logit.make_choices

estimatorEstimator object

called to report intermediate table results (used for estimation)

trace_label: str

This is the label to be used for trace log file entries and dump file names when household tracing enabled. No tracing occurs if label is empty or None.

trace_choice_name: str

This is the column label to be used in trace file csv dump of choices

trace_column_names: str or list of str

chooser columns to include when tracing expression_values

Returns
choicespandas.Series

Index will be that of choosers, values will match the columns of spec.

activitysim.core.simulate.eval_nl_logsums(choosers, spec, nest_spec, locals_d, trace_label=None)

like eval_nl except return logsums instead of making choices

Returns
logsumspandas.Series

Index will be that of choosers, values will be nest logsum based on spec column values

activitysim.core.simulate.eval_utilities(spec, choosers, locals_d=None, trace_label=None, have_trace_targets=False, trace_all_rows=False, estimator=None, trace_column_names=None, log_alt_losers=False)
Parameters
specpandas.DataFrame

A table of variable specifications and coefficient values. Variable expressions should be in the table index and the table should have a column for each alternative.

chooserspandas.DataFrame
locals_dDict or None

This is a dictionary of local variables that will be the environment for an evaluation of an expression that begins with @

trace_label: str
have_trace_targets: boolean - choosers has targets to trace
trace_all_rows: boolean - trace all chooser rows, bypassing tracing.trace_targets
estimator :

called to report intermediate table results (used for estimation)

trace_column_names: str or list of str

chooser columns to include when tracing expression_values

activitysim.core.simulate.eval_variables(exprs, df, locals_d=None)

Evaluate a set of variable expressions from a spec in the context of a given data table.

There are two kinds of supported expressions: “simple” expressions are evaluated in the context of the DataFrame using DataFrame.eval. This is the default type of expression.

Python expressions are evaluated in the context of this function using Python’s eval function. Because we use Python’s eval this type of expression supports more complex operations than a simple expression. Python expressions are denoted by beginning with the @ character. Users should take care that these expressions must result in a Pandas Series.

FIXME - for performance, it is essential that spec and expression_values not contain booleans when dotted with spec values, or the arrays will be converted to dtype=object within dot()

Parameters
exprssequence of str
dfpandas.DataFrame
locals_dDict

This is a dictionary of local variables that will be the environment for an evaluation of an expression that begins with @

Returns
variablespandas.DataFrame

Will have the index of df and columns of eval results of exprs.

activitysim.core.simulate.get_segment_coefficients(model_settings, segment_name)

Return a dict mapping generic coefficient names to segment-specific coefficient values

Some specs, such as the mode_choice logsums, have the same expression values with different coefficients for various segments (e.g. eatout, .., atwork) and a template file that maps a flat list of coefficients into segment columns.

This allows us to provide a coefficient file with just the coefficients for a specific segment, that works with generic coefficient names in the spec. For instance coef_ivt can take on the values of segment-specific coefficients coef_ivt_school_univ, coef_ivt_work, coef_ivt_atwork,…

coefficients_df
                              value constrain
coefficient_name
coef_ivt_eatout_escort_...  -0.0175         F
coef_ivt_school_univ        -0.0224         F
coef_ivt_work               -0.0134         F
coef_ivt_atwork             -0.0188         F

template_df

coefficient_name     eatout                       school                 univ                   work
coef_ivt             coef_ivt_eatout_escort_...   coef_ivt_school_univ   coef_ivt_school_univ   coef_ivt_work

For school segment this will return the generic coefficient name with the segment-specific coefficient value
e.g. {'coef_ivt': -0.0224, ...}
...
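
A toy sketch of that generic-to-segment mapping (illustrative only, not the actual implementation):

coefficients = {'coef_ivt_school_univ': -0.0224, 'coef_ivt_work': -0.0134}
template = {'coef_ivt': {'school': 'coef_ivt_school_univ', 'work': 'coef_ivt_work'}}

segment_name = 'school'
segment_coefficients = {generic: coefficients[segment_map[segment_name]]
                        for generic, segment_map in template.items()}
# -> {'coef_ivt': -0.0224}
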
activitysim.core.simulate.read_model_coefficient_template(model_settings)

Read the coefficient template specified by COEFFICIENT_TEMPLATE model setting

activitysim.core.simulate.read_model_coefficients(model_settings=None, file_name=None)

Read the coefficient file specified by COEFFICIENTS model setting

activitysim.core.simulate.read_model_spec(file_name)

Read a CSV model specification into a Pandas DataFrame or Series.

file_path : str absolute or relative path to file

The CSV is expected to have columns for component descriptions and expressions, plus one or more alternatives.

The CSV is required to have a header with column names. For example:

Description,Expression,alt0,alt1,alt2

Parameters
model_settings: dict

name of spec_file is in model_settings['SPEC'] and file is relative to configs

file_name: str

name of spec file in configs folder

description_name: str, optional

Name of the column in fname that contains the component description.

expression_name: str, optional

Name of the column in fname that contains the component expression.

Returns
specpandas.DataFrame

The description column is dropped from the returned data and the expression values are set as the table index.

activitysim.core.simulate.set_skim_wrapper_targets(df, skims)

Add the dataframe to the SkimWrapper object so that it can be dereferenced using the parameters of the skims object.

Parameters
dfpandas.DataFrame

Table to which to add skim data as new columns. df is modified in-place.

skimsSkimWrapper or Skim3dWrapper object, or a list or dict of skims

The skims object is used to contain multiple matrices of origin-destination impedances. Make sure to also add it to the locals_d below in order to access it in expressions. The only job of this method in regards to skims is to call set_df with the dataframe that comes back from interacting choosers with alternatives. See the skims module for more documentation on how the skims object is intended to be used.

activitysim.core.simulate.simple_simulate(choosers, spec, nest_spec, skims=None, locals_d=None, chunk_size=0, custom_chooser=None, log_alt_losers=False, want_logsums=False, estimator=None, trace_label=None, trace_choice_name=None, trace_column_names=None)

Run an MNL or NL simulation for when the model spec does not involve alternative specific data, e.g. there are no interactions with alternative properties and no need to sample from alternatives.

activitysim.core.simulate.simple_simulate_by_chunk_id(choosers, spec, nest_spec, skims=None, locals_d=None, chunk_size=0, custom_chooser=None, log_alt_losers=False, want_logsums=False, estimator=None, trace_label=None, trace_choice_name=None)

chunk_by_chunk_id wrapper for simple_simulate

activitysim.core.simulate.simple_simulate_logsums(choosers, spec, nest_spec, skims=None, locals_d=None, chunk_size=0, trace_label=None, chunk_tag=None)

like simple_simulate except return logsums instead of making choices

Returns
logsumspandas.Series

Index will be that of choosers, values will be nest logsum based on spec column values

activitysim.core.simulate.spec_for_segment(model_settings, spec_id, segment_name, estimator)

Select spec for specified segment from omnibus spec containing columns for each segment

Parameters
model_specpandas.DataFrame

omnibus spec file with expressions in index and one column per segment

segment_namestr

segment_name that is also column name in model_spec

Returns
pandas.dataframe

canonical spec file with expressions in index and single column with utility coefficients

Simulate with Interaction

Methods for expression handling, solving, choosing (i.e. making choices), with interaction with the chooser table.

API

activitysim.core.interaction_simulate.eval_interaction_utilities(spec, df, locals_d, trace_label, trace_rows, estimator=None, log_alt_losers=False)

Compute the utilities for a single-alternative spec evaluated in the context of df

We could compute the utilities for interaction datasets just as we do for simple_simulate specs with multiple alternative columns by calling eval_variables and then computing the utilities by matrix-multiplication of eval results with the utility coefficients in the spec alternative columns.

But interaction simulate computes the utilities of each alternative in the context of a separate row in interaction dataset df, and so there is only one alternative in spec. This turns out to be quite a bit faster (in this special case) than the pandas dot function.

For efficiency, we combine eval_variables and multiplication of coefficients into a single step, so we don’t have to create a separate column for each partial utility. Instead, we simply multiply the eval result by a single alternative coefficient and sum the partial utilities.

Parameters
spec: dataframe

one row per spec expression and one col with utility coefficient

df: dataframe

cross join (cartesian product) of choosers with alternatives; combines columns of choosers and alternatives; len(df) == len(choosers) * len(alternatives); index values (non-unique) are index values from alternatives df

interaction_utilities: dataframe

the utility of each alternative is the sum of the partial utilities determined by the various spec expressions and their corresponding coefficients, yielding a dataframe with len(interaction_df) rows and one utility column having the same index as interaction_df (non-unique values from alternatives df)

Returns
utilities: pandas.DataFrame

Will have the index of df and a single column of utilities

activitysim.core.interaction_simulate.interaction_simulate(choosers, alternatives, spec, log_alt_losers=False, skims=None, locals_d=None, sample_size=None, chunk_size=0, trace_label=None, trace_choice_name=None, estimator=None)

Run a simulation in the situation in which alternatives must be merged with choosers because there are interaction terms or because alternatives are being sampled.

optionally (if chunk_size > 0) iterates over choosers in chunk_size chunks

Parameters
chooserspandas.DataFrame

DataFrame of choosers

alternativespandas.DataFrame

DataFrame of alternatives - will be merged with choosers, currently without sampling

specpandas.DataFrame

A Pandas DataFrame that gives the specification of the variables to compute and the coefficients for each variable. Variable specifications must be in the table index and the table should have only one column of coefficients.

skimsSkims object

The skims object is used to contain multiple matrices of origin-destination impedances. Make sure to also add it to the locals_d below in order to access it in expressions. The only job of this method in regards to skims is to call set_df with the dataframe that comes back from interacting choosers with alternatives. See the skims module for more documentation on how the skims object is intended to be used.

locals_dDict

This is a dictionary of local variables that will be the environment for an evaluation of an expression that begins with @

sample_sizeint, optional

Sample alternatives with sample of given size. By default is None, which does not sample alternatives.

chunk_sizeint

if chunk_size > 0 iterates over choosers in chunk_size chunks

trace_label: str

This is the label to be used for trace log file entries and dump file names when household tracing enabled. No tracing occurs if label is empty or None.

trace_choice_name: str

This is the column label to be used in trace file csv dump of choices

Returns
choicespandas.Series

A series where index should match the index of the choosers DataFrame and values will match the index of the alternatives DataFrame - choices are simulated in the standard Monte Carlo fashion

Simulate with Sampling and Interaction

Methods for expression handling, solving, sampling (i.e. making multiple choices), and choosing (i.e. making choices), with interaction with the chooser table.

API

activitysim.core.interaction_sample_simulate.interaction_sample_simulate(choosers, alternatives, spec, choice_column, allow_zero_probs=False, zero_prob_choice_val=None, log_alt_losers=False, want_logsums=False, skims=None, locals_d=None, chunk_size=0, chunk_tag=None, trace_label=None, trace_choice_name=None, estimator=None)

Run a simulation in the situation in which alternatives must be merged with choosers because there are interaction terms or because alternatives are being sampled.

optionally (if chunk_size > 0) iterates over choosers in chunk_size chunks

Parameters
chooserspandas.DataFrame

DataFrame of choosers

alternativespandas.DataFrame

DataFrame of alternatives - will be merged with choosers; index domain same as choosers, but repeated for each alternative

specpandas.DataFrame

A Pandas DataFrame that gives the specification of the variables to compute and the coefficients for each variable. Variable specifications must be in the table index and the table should have only one column of coefficients.

skimsSkims object

The skims object is used to contain multiple matrices of origin-destination impedances. Make sure to also add it to the locals_d below in order to access it in expressions. The only job of this method in regards to skims is to call set_df with the dataframe that comes back from interacting choosers with alternatives. See the skims module for more documentation on how the skims object is intended to be used.

locals_dDict

This is a dictionary of local variables that will be the environment for an evaluation of an expression that begins with @

chunk_sizeint

if chunk_size > 0 iterates over choosers in chunk_size chunks

trace_label: str

This is the label to be used for trace log file entries and dump file names when household tracing enabled. No tracing occurs if label is empty or None.

trace_choice_name: str

This is the column label to be used in trace file csv dump of choices

Returns
if want_logsums is False:
choicespandas.Series

A series where index should match the index of the choosers DataFrame and values will match the index of the alternatives DataFrame - choices are simulated in the standard Monte Carlo fashion

if want_logsums is True:
choicespandas.DataFrame

choices['choice'] : same as choices series when want_logsums is False
choices['logsum'] : float logsum of choice utilities across alternatives

Assign

Alternative version of the expression evaluators in activitysim.core.simulate that supports temporary variable assignment. Temporary variables are identified in the expressions as starting with “_”, such as “_hh_density_bin”. These fields are not saved to the data pipeline store. This feature is used by the Accessibility model.

API

activitysim.core.assign.assign_variables(assignment_expressions, df, locals_dict, df_alias=None, trace_rows=None, trace_label=None, chunk_log=None)

Evaluate a set of variable expressions from a spec in the context of a given data table.

Expressions are evaluated using Python’s eval function. Python expressions have access to variables in locals_d (and df being accessible as variable df.) They also have access to previously assigned targets as the assigned target name.

lowercase variables starting with underscore are temp variables (e.g. _local_var) and not returned except in trace_results

uppercase variables starting with underscore are temp singular variables (e.g. _LOCAL_SCALAR) and not returned except in trace_assigned_locals. This is useful for defining general purpose local variables that don't vary across choosers or alternatives and therefore don't need to be stored as series/columns in the main choosers dataframe from which utilities are computed.

Users should take care that expressions (other than temp scalar variables) should result in a Pandas Series (scalars will be automatically promoted to series.)

Parameters
assignment_expressions: pandas.DataFrame of target assignment expressions

target: target column names
expression: pandas or python expression to evaluate

df: pandas.DataFrame
locals_d: Dict

This is a dictionary of local variables that will be the environment for an evaluation of “python” expression.

trace_rows: series or array of bools to use as mask to select target rows to trace
Returns
variablespandas.DataFrame

Will have the index of df and columns named by target and containing the result of evaluating expression

trace_dfpandas.DataFrame or None

a dataframe containing the eval result values for each assignment expression

activitysim.core.assign.evaluate_constants(expressions, constants)

Evaluate a list of constant expressions - each one can depend on the one before it. These are usually used for the coefficients which have relationships to each other. So ivt=.7 and then ivt_lr=ivt*.9.
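
A minimal sketch of that sequential evaluation (illustrative only, not the actual implementation):

expressions = {'ivt': '0.7', 'ivt_lr': 'ivt * 0.9'}
constants = {}                            # e.g. values passed in from the settings file

scope = dict(constants)
for name, expr in expressions.items():
    scope[name] = eval(expr, {}, scope)   # each expression can use earlier results

# scope now includes {'ivt': 0.7, 'ivt_lr': approximately 0.63}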

Parameters
expressionsSeries

the index are the names of the expressions which are used in subsequent evals - thus naming the expressions is required.

constantsdict

will be passed as the scope of eval - usually a separate set of constants are passed in here

Returns
ddict
activitysim.core.assign.local_utilities()

Dict of useful modules and functions to provide as locals for use in eval of expressions

Returns
utility_dictdict

name, entity pairs of locals

activitysim.core.assign.read_assignment_spec(file_name, description_name='Description', target_name='Target', expression_name='Expression')

Read a CSV model specification into a Pandas DataFrame or Series.

The CSV is expected to have columns for component descriptions, targets, and expressions.

The CSV is required to have a header with column names. For example:

Description,Target,Expression

Parameters
file_namestr

Name of a CSV spec file.

description_namestr, optional

Name of the column in fname that contains the component description.

target_namestr, optional

Name of the column in fname that contains the component target.

expression_namestr, optional

Name of the column in fname that contains the component expression.

Returns
specpandas.DataFrame

dataframe with three columns: ['description', 'target', 'expression']

activitysim.core.assign.uniquify_key(dict, key, template='{} ({})')

rename key so there are no duplicates with keys in dict

e.g. if there is already a key named “dog”, the second key will be reformatted to “dog (2)”

Choice Models

Logit

Multinomial logit (MNL) or Nested logit (NL) choice model. These choice models depend on the foundational components of ActivitySim, such as the expressions and data handling described in the Execution Flow section.

To specify and solve an MNL model:

  • either specify LOGIT_TYPE: MNL in the model configuration YAML file or omit the setting

  • call either simulate.simple_simulate() or simulate.interaction_simulate(), depending on whether the alternatives need to be interacted with the choosers or sampled

To specify and solve an NL model:

  • specify LOGIT_TYPE: NL in the model configuration YAML file

  • specify the nesting structure via the NESTS setting in the model configuration YAML file. An example nested logit NESTS entry can be found in example/configs/tour_mode_choice.yaml

  • call simulate.simple_simulate(). The simulate.interaction_simulate() functionality is not yet supported for NL.
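
A toy illustration of the general MNL mechanics (exponentiate utilities, normalize to probabilities, draw one uniform random number per chooser); this is illustrative only, not the ActivitySim implementation:

import numpy as np
import pandas as pd

utils = pd.DataFrame({'alt0': [1.0, 0.2], 'alt1': [0.5, 0.8]})

exp_utils = np.exp(utils)
probs = exp_utils.div(exp_utils.sum(axis=1), axis=0)       # each row sums to 1

rands = np.random.default_rng(seed=0).random(len(probs))   # one uniform draw per chooser
choices = (probs.cumsum(axis=1).lt(rands, axis=0)).sum(axis=1)   # column index of the chosen alternative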

API

class activitysim.core.logit.Nest(name=None, level=0)

Data for a nest-logit node or leaf

This object is yielded when iterating over nest nodes (branch or leaf). The nested logit design is stored in a yaml file as a tree of dict objects, but using an object to pass the nest data makes the code a little more readable.

An example nest specification is in the example tour mode choice model yaml configuration file - example/configs/tour_mode_choice.yaml.

activitysim.core.logit.count_nests(nest_spec)

count the nests in nest_spec, return 0 if nest_spec is none

activitysim.core.logit.each_nest(nest_spec, type=None, post_order=False)

Iterate over each nest or leaf node in the tree (or subtree)

Parameters
nest_spec: dict

Nest tree dict from the model spec yaml file

type: str

Nest class type to yield:
None yields all nests
'leaf' yields only leaf nodes
'branch' yields only branch nodes

post_order: Bool

Should we iterate over the nodes of the tree in post-order or pre-order? (post-order means we yield the alternatives sub-tree before current node.)

Yields
nestNest

Nest object with info about the current node (nest or leaf)

activitysim.core.logit.interaction_dataset(choosers, alternatives, sample_size=None, alt_index_id=None, chooser_index_id=None)

Combine choosers and alternatives into one table for the purposes of creating interaction variables and/or sampling alternatives.

Any duplicate column names in choosers table will be renamed with an ‘_chooser’ suffix.

Parameters
chooserspandas.DataFrame
alternativespandas.DataFrame
sample_sizeint, optional

If sampling from alternatives for each chooser, this is how many to sample.

Returns
alts_samplepandas.DataFrame

Merged choosers and alternatives with data repeated either len(alternatives) or sample_size times.
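
A toy illustration of the cross join this produces (illustrative only, not the actual implementation):

import pandas as pd

choosers = pd.DataFrame({'income': [50, 80]},
                        index=pd.Index([1, 2], name='person_id'))
alternatives = pd.DataFrame({'dist': [1.0, 3.5, 7.2]},
                            index=pd.Index([10, 20, 30], name='zone_id'))

interaction_df = pd.merge(choosers.reset_index(), alternatives.reset_index(), how='cross')
# len(interaction_df) == len(choosers) * len(alternatives)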

activitysim.core.logit.make_choices(probs, trace_label=None, trace_choosers=None, allow_bad_probs=False)

Make choices for each chooser from among a set of alternatives.

Parameters
probspandas.DataFrame

Rows for choosers and columns for the alternatives from which they are choosing. Values are expected to be valid probabilities across each row, e.g. they should sum to 1.

trace_chooserspandas.dataframe

the choosers df (for interaction_simulate) to facilitate the reporting of hh_id by report_bad_choices because it can’t deduce hh_id from the interaction_dataset which is indexed on index values from alternatives df

Returns
choicespandas.Series

Maps chooser IDs (from probs index) to a choice, where the choice is an index into the columns of probs.

randspandas.Series

The random numbers used to make the choices (for debugging, tracing)

activitysim.core.logit.report_bad_choices(bad_row_map, df, trace_label, msg, trace_choosers=None, raise_error=True)
Parameters
bad_row_map
dfpandas.DataFrame

utils or probs dataframe

msgstr

message describing the type of bad choice that necessitates error being thrown

trace_chooserspandas.dataframe

the choosers df (for interaction_simulate) to facilitate the reporting of hh_id because we can’t deduce hh_id from the interaction_dataset which is indexed on index values from alternatives df

Returns
raises RuntimeError
activitysim.core.logit.utils_to_logsums(utils, exponentiated=False, allow_zero_probs=False)

Convert a table of utilities to logsum series.

Parameters
utilspandas.DataFrame

Rows should be choosers and columns should be alternatives.

exponentiatedbool

True if utilities have already been exponentiated

Returns
logsumspandas.Series

Will have the same index as utils.

activitysim.core.logit.utils_to_probs(utils, trace_label=None, exponentiated=False, allow_zero_probs=False, trace_choosers=None)

Convert a table of utilities to probabilities.

Parameters
utilspandas.DataFrame

Rows should be choosers and columns should be alternatives.

trace_labelstr

label for tracing bad utility or probability values

exponentiatedbool

True if utilities have already been exponentiated

allow_zero_probsbool

if True, rows in which all utility alts are EXP_UTIL_MIN will result in rows in probs that have all zero probability (and do not sum to 1.0). This is for the benefit of calculating probabilities of nested logit nests.

trace_chooserspandas.dataframe

the choosers df (for interaction_simulate) to facilitate the reporting of hh_id by report_bad_choices because it can’t deduce hh_id from the interaction_dataset which is indexed on index values from alternatives df

Returns
probspandas.DataFrame

Will have the same index and columns as utils.

Person Time Windows

The departure time and duration models require person time windows. Time windows are adjacent time periods that are available for travel. Time windows are stored in a timetable table in which each row is a person and each time period (in the case of MTC TM1, 5 am to midnight in 1-hour increments) is a column. Each column is coded as follows:

  • 0 - unscheduled, available

  • 2 - scheduled, start of a tour, is available as the last period of another tour

  • 4 - scheduled, end of a tour, is available as the first period of another tour

  • 6 - scheduled, end or start of a tour, available for this period only

  • 7 - scheduled, unavailable, middle of a tour

A good example of a time window expression is @tt.previous_tour_ends(df.person_id, df.start). This uses the person id and the tour start period to check if a previous tour ends in the same time period.
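
A toy illustration of the state codes above for one row of the timetable (illustrative only, not the TimeTable implementation):

import numpy as np

# one person's day, using the codes listed above
windows = np.array([0, 2, 7, 7, 4, 0], dtype=np.int8)

# a new tour could begin in any period that is unscheduled (0), the end of a
# previous tour (4), or a single shared start/end period (6)
can_start_here = np.isin(windows, [0, 4, 6])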

API

class activitysim.core.timetable.TimeTable(windows_df, tdd_alts_df, table_name=None)
tdd_alts_df      tdd_footprints_df
start  end      '0' '1' '2' '3' '4'...
5      5    ==>  0   6   0   0   0 ...
5      6    ==>  0   2   4   0   0 ...
5      7    ==>  0   2   7   4   0 ...
adjacent_window_after(window_row_ids, periods)

Return number of adjacent periods after specified period that are available (not in the middle of another tour.)

Implements MTC TM1 macro @@adjWindowAfterThisPeriodAlt. The function name is kind of a misnomer, but parallels that used in MTC TM1 UECs.

Parameters
window_row_idspandas Series int

series of window_row_ids indexed by tour_id

periodspandas series int

series of tdd_alt ids, index irrelevant

Returns
pandas Series int

Number of adjacent windows indexed by window_row_ids.index

adjacent_window_before(window_row_ids, periods)

Return number of adjacent periods before specified period that are available (not in the middle of another tour.)

Implements MTC TM1 macro @@getAdjWindowBeforeThisPeriodAlt. The function name is kind of a misnomer, but parallels that used in MTC TM1 UECs.

Parameters
window_row_idspandas Series int

series of window_row_ids indexed by tour_id

periodspandas series int

series of tdd_alt ids, index irrelevant

Returns
pandas Series int

Number of adjacent windows indexed by window_row_ids.index

adjacent_window_run_length(window_row_ids, periods, before)

Return the number of adjacent periods before or after specified period that are available (not in the middle of another tour.)

Internal DRY method to implement adjacent_window_before and adjacent_window_after

Parameters
window_row_idspandas Series int

series of window_row_ids indexed by tour_id

periodspandas series int

series of tdd_alt ids, index irrelevant

beforebool

Specify desired run length is of adjacent window before (True) or after (False)

assign(window_row_ids, tdds)

Assign tours (represented by tdd alt ids) to persons

Updates self.windows numpy array. Assignments will not ‘take’ outside this object until/unless replace_table called or updated timetable retrieved by get_windows_df

Parameters
window_row_idspandas Series

series of window_row_ids indexed by tour_id

tddspandas series

series of tdd_alt ids, index irrelevant

assign_footprints(window_row_ids, footprints)

assign footprints for specified window_row_ids

This method is used for initialization of joint_tour timetables based on the combined availability of the joint tour participants

Parameters
window_row_idspandas Series

series of window_row_ids index irrelevant, but we want to use map()

footprintsnumpy array

with one row per window_row_id and one column per time period

assign_subtour_mask(window_row_ids, tdds)
index     window_row_ids   tdds
20973389  20973389           26
44612864  44612864            3
48954854  48954854            7

tour footprints
[[0 0 2 7 7 7 7 7 7 4 0 0 0 0 0 0 0 0 0 0 0]
[0 2 7 7 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 2 7 7 7 7 7 7 4 0 0 0 0 0 0 0 0 0 0 0 0]]

subtour_mask
[[7 7 0 0 0 0 0 0 0 0 7 7 7 7 7 7 7 7 7 7 7]
[7 0 0 0 0 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7]
[7 0 0 0 0 0 0 0 0 7 7 7 7 7 7 7 7 7 7 7 7]]
begin_transaction(transaction_loggers)

begin a transaction for an estimator or list of estimators this permits rolling timetable back to the state at the start of the transaction so that timetables can be built for scheduling override choices

max_time_block_available(window_row_ids)

determine the length of the maximum time block available in the person's day

Parameters
window_row_ids: pandas.Series
Returns
pandas.Series with same index as window_row_ids, and integer max_run_length of available periods
previous_tour_begins(window_row_ids, periods)

Does a previously scheduled tour begin in the specified period?

Implements MTC TM1 @@prevTourBeginsThisArrivalPeriodAlt

Parameters
window_row_idspandas Series int

series of window_row_ids indexed by tour_id

periodspandas series int

series of tdd_alt ids, index irrelevant

Returns
pandas Series boolean

indexed by window_row_ids.index

previous_tour_ends(window_row_ids, periods)

Does a previously scheduled tour end in the specified period?

Implements MTC TM1 @@prevTourEndsThisDeparturePeriodAlt

Parameters
window_row_idspandas Series int

series of window_row_ids indexed by tour_id

periodspandas series int

series of tdd_alt ids, index irrelevant (one period per window_row_id)

Returns
pandas Series boolean

indexed by window_row_ids.index

remaining_periods_available(window_row_ids, starts, ends)

Determine number of periods remaining available after the time window from starts to ends is hypothetically scheduled

Implements MTC TM1 @@remainingPeriodsAvailableAlt

The start and end periods will always be available after scheduling, so ignore them. The periods between start and end must be currently unscheduled, so assume they will become unavailable after scheduling this window.

Parameters
window_row_idspandas Series int

series of window_row_ids indexed by tour_id

startspandas series int

series of tdd_alt ids, index irrelevant (one per window_row_id)

endspandas series int

series of tdd_alt ids, index irrelevant (one per window_row_id)

Returns
availablepandas Series int

number periods available indexed by window_row_ids.index

replace_table()

Save or replace windows_df DataFrame to pipeline with saved table name (specified when object instantiated.)

This is a convenience function in case caller instantiates object in one context (e.g. dependency injection) where it knows the pipeline table name, but wants to checkpoint the table in another context where it does not know that name.

slice_windows_by_row_id(window_row_ids)

return windows array slice containing rows for specified window_row_ids (in window_row_ids order)

tour_available(window_row_ids, tdds)

test whether time window allows tour with specific tdd alt’s time window

Parameters
window_row_idspandas Series

series of window_row_ids indexed by tour_id

tddspandas series

series of tdd_alt ids, index irrelevant

Returns
availablepandas Series of bool

with same index as window_row_ids.index (presumably tour_id, but we don’t care)

window_periods_in_states(window_row_ids, periods, states)

Return boolean array indicating whether specified window periods are in list of states.

Internal DRY method to implement previous_tour_ends and previous_tour_begins

Parameters
window_row_idspandas Series int

series of window_row_ids indexed by tour_id

periodspandas series int

series of tdd_alt ids, index irrelevant (one period per window_row_id)

stateslist of int

presumably (e.g. I_EMPTY, I_START…)

Returns
pandas Series boolean

indexed by window_row_ids.index

activitysim.core.timetable.create_timetable_windows(rows, tdd_alts)

create an empty (all available) timetable with one window row per rows.index

Parameters
rows - pd.DataFrame or Series

all we care about is the index

tdd_alts - pd.DataFrame

We expect a start and end column, and create a timetable to accommodate all alts (with one window of padding at each end)

so if start is 5 and end is 23, we return something like this:

4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24

person_id
30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
109 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Returns
pd.DataFrame indexed by rows.index, and one column of int8 for each time window (plus padding)

Transit Virtual Path Builder

Transit virtual path builder (TVPB) for three zone system (see example_multiple_zones) transit path utility calculations. TAP to TAP skims and walk access and egress times between MAZs and TAPs are input to the demand model. ActivitySim then assembles the total transit path utility based on the user specified TVPB expression files for the respective components:

  • from MAZ to first boarding TAP +

  • from first boarding to final alighting TAP +

  • from alighting TAP to destination MAZ

This assembling is done via the TVPB, which considers all the possible combinations of nearby boarding and alighting TAPs for each origin destination MAZ pair and selects the user defined N best paths to represent the transit mode. After selecting N best paths, the logsum across N best paths is calculated and exposed to the mode choice models and a random number is drawn and a path is chosen. The boarding TAP, alighting TAP, and TAP to TAP skim set for the chosen path is saved to the chooser table.
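
A toy sketch of the path assembly and N-best selection described above (the names and numbers are illustrative, not the TVPB API):

import numpy as np

access_util = {'btap1': -1.0, 'btap2': -2.5}                      # origin MAZ to boarding TAPs
egress_util = {'atap8': -0.8, 'atap9': -1.6}                      # alighting TAPs to destination MAZ
tap_tap_util = {('btap1', 'atap8'): -3.0, ('btap1', 'atap9'): -2.0,
                ('btap2', 'atap8'): -1.5, ('btap2', 'atap9'): -2.2}

# total utility for every boarding/alighting TAP combination
paths = {(b, a): access_util[b] + tap_tap_util[(b, a)] + egress_util[a]
         for b in access_util for a in egress_util}

# keep the N best paths and compute the logsum exposed to mode choice
N = 2
best_n = sorted(paths.items(), key=lambda kv: kv[1], reverse=True)[:N]
logsum = np.log(np.sum(np.exp([u for _, u in best_n])))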

The initialize TVPB submodel (see Initialize LOS) pre-computes TAP to TAP total utilities for the user defined attribute_segments, which are typically demographic segment (for example household income bin), time-of-day, and access/egress mode. This submodel can be run in both single process and multiprocess mode, with single process excellent for development/debugging and multiprocess excellent for application. ActivitySim saves the pre-calculated TAP to TAP total utilities to a memory mapped cache file for reuse by downstream models such as tour mode choice. In tour mode choice, the pre-computed TAP to TAP total utilities for the attribute_segment, along with the access and egress impedances, are used to evaluate the best N TAP pairs for each origin MAZ destination MAZ pair being evaluated. Assembling the total transit path impedance and then picking the best N is quick since it is done in a de-duplicated manner within each chunk of multiprocessed choosers.

A model with TVPB can take considerably longer to run than a traditional TAZ based model since it does an order of magnitude more calculations. Thus, it is important to be mindful of your approach to your network model as well, especially the number of TAPs accessible to each MAZ, which is the key determinant of runtime.

API

class activitysim.core.pathbuilder.TransitVirtualPathBuilder(network_los)

Transit virtual path builder for three zone systems

compute_tap_tap_utilities(recipe, access_df, egress_df, chooser_attributes, path_info, trace_label, trace)

create transit_df and compute utilities for all atap-btap pairs between omaz in access_df and dmaz in egress_df; compute the utilities using the tap_tap utility expressions file specified in tap_tap_settings

transit_df contains all possible access omaz/btap to egress dmaz/atap transit path pairs for each chooser

trace should be True as we don’t encourage/support dynamic utility computation except when tracing (precompute being fairly fast)

Parameters
recipe: str

‘recipe’ key in network_los.yaml TVPB_SETTINGS e.g. tour_mode_choice

access_df: pandas.DataFrame

dataframe with ‘idx’ and ‘omaz’ columns

egress_df: pandas.DataFrame

dataframe with ‘idx’ and ‘dmaz’ columns

chooser_attributes: dict
path_info
trace_label: str
trace: boolean
Returns
transit_df: pandas.dataframe
lookup_tap_tap_utilities(recipe, maz_od_df, access_df, egress_df, chooser_attributes, path_info, trace_label)

create transit_df and compute utilities for all atap-btap pairs between omaz in access_df and dmaz in egress_df; look up the utilities in the precomputed tap_cache data (which is indexed by uid_calculator unique_ids; unique_id can be used as a zero-based index into the data array)

transit_df contains all possible access omaz/btap to egress dmaz/atap transit path pairs for each chooser

Parameters
recipe
maz_od_df
access_df
egress_df
chooser_attributes
path_info
trace_label
class activitysim.core.pathbuilder.TransitVirtualPathLogsumWrapper(pathbuilder, orig_key, dest_key, tod_key, segment_key, recipe, cache_choices, trace_label, tag)

Transit virtual path builder logsum wrapper for three zone systems

set_df(df)

Set the dataframe

Parameters
dfDataFrame

The dataframe which contains the origin and destination ids

Returns
self (to facilitate chaining)
activitysim.core.pathbuilder.compute_utilities(network_los, model_settings, choosers, model_constants, trace_label, trace=False, trace_column_names=None)

Compute utilities

Cache API

class activitysim.core.pathbuilder_cache.TVPBCache(network_los, uid_calculator, cache_tag)

Transit virtual path builder cache for three zone systems

allocate_data_buffer(shared=False)

allocate fully_populated_shape data buffer for cached data

if shared, return a multiprocessing.Array that can be shared across subprocesses; if not shared, return a numpy ndarray

Parameters
shared: boolean
Returns
multiprocessing.Array or numpy ndarray sized to hold the fully_populated utility array
cleanup()

Called prior to

close(trace=False)

write any changes, free data, and mark as closed

get_data_and_lock_from_buffers()

return shared data buffer previously allocated by allocate_data_buffer and injected into mp_tasks.run_simulation

Returns
either multiprocessing.Array and lock, or multiprocessing.RawArray and None, according to RAWARRAY

open()

open STATIC cache and populate with cached data

if multiprocessing

always a STATIC cache with fully_populated data preloaded into the shared data buffer

class activitysim.core.pathbuilder_cache.TapTapUidCalculator(network_los)

Transit virtual path builder TAP to TAP unique ID calculator for three zone systems

get_od_dataframe(scalar_attributes)

return tap-tap od dataframe with unique_id index for ‘skim_offset’ for scalar_attributes

i.e. a dataframe which may be used to compute utilities, together with scalar or column attributes

Parameters
scalar_attributes: dict of scalar attribute name:value pairs
Returns
pandas.Dataframe
get_unique_ids(df, scalar_attributes)

compute canonical unique_id for each row in df. btap and atap will be in the dataframe, but the other attributes may be either df columns or scalar_attributes

Parameters
df: pandas DataFrame

with btap, atap, and optionally additional attribute columns

scalar_attributes: dict

dict of scalar attributes e.g. {‘tod’: ‘AM’, ‘demographic_segment’: 0}

Returns
ndarray of integer uids

Helpers

Chunk

Chunking management.

Note

The definition of chunk_size has changed from previous versions of ActivitySim. The revised definition of chunk_size simplifies model setup since it is the approximate amount of RAM available to ActivitySim as opposed to the obscure number of doubles (64-bit numbers) in a chunk of a choosers table.

The chunk_size is the approximate amount of RAM to allocate to ActivitySim for batch processing choosers across all processes. It is specified in bytes, for example chunk_size: 500_000_000_000 is 500 GB. If chunk_training_mode: disabled is set, then no chunking will be performed and ActivitySim will attempt to solve all the choosers at once across all the processes. Chunking is required when all the chooser data required to process all the choosers cannot fit within the available RAM, so ActivitySim must split the choosers into batches and then process the batches in sequence.

Configuration of the chunk_size depends on several parameters:

  • The amount of machine RAM

  • The number of machine processors (CPUs/cores)

  • The number of households (and number of zones for aggregate models)

  • The amount of headroom required for shared data across processes, such as the skims/network LOS data

  • The desired runtimes

An example helps illustrate configuration of the chunk_size. If the example model has 1 million households and the current submodel is auto ownership, then there are 1 million choosers since every household participates in the auto ownership model. In single process mode, ActivitySim would create one chooser table with 1 million rows, assuming this table and the additional extra data such as the skims can fit within the available memory (RAM). If the 1 million row table cannot fit within memory, then chunking needs to be set up to split the choosers table into batches that are processed in sequence and are small enough to fit within the available memory. For example, the choosers table is split into 2 chunks of 500,000 choosers each and then processed in sequence. In multiprocess mode, for example with 10 processes, ActivitySim splits the 1 million households into 10 subprocesses, each with 100,000 households. Then for the auto ownership submodel, the chooser table within each process has 100,000 choosers, and there must be enough RAM to simultaneously solve all 10 processes, each with 100,000 choosers, at once. If not, then chunking can be set up so each subprocess's table of choosers is split into chunks for sequential processing, for example from 10 tables of 100,000 choosers to 20 tables of 50,000 choosers.

If the user desires the fastest runtimes possible given their hardware, model inputs, and model configuration, then ActivitySim should be configured to use most of the CPUs/cores (physical, not virtual), most of the RAM, and the MKL Settings. For example, if the machine has 12 cores and 256 GB of RAM, then try configuring the model with num_processes: 10 and chunk_size: 0 to start and see if the model can fit the problem into the available RAM. If not, then try setting chunk_size to something like 225 GB, chunk_size: 225_000_000_000. Experimentation with the configuration of CPUs and RAM should be done for each new machine and model setup (with respect to the number of households, skims, and model configuration). In general, more processors and more RAM mean faster runtimes, but the relationship of processors to RAM is not linear, because processors can only go so fast and because there is more to runtime than processors and RAM, including cache speed, disk speed, etc. Also, the amount of RAM to use is approximate and ActivitySim often pushes a bit above the user-specified amount due to pandas/numpy memory spikes for memory-intensive operations, so it is recommended to leave some RAM unallocated. The exact amount to leave unallocated depends on the parameters above.

To configure chunking behavior, ActivitySim must first be trained with the model setup and machine. To do so, first run the model with chunk_training_mode: training. This tracks the amount of memory used by each table by submodel and writes the results to a cache file that is then re-used for production runs. This training mode is significantly slower than production mode since it does significantly more memory inspection. For a training mode run, set num_processes to about 80% of the available logical processors and chunk_size to about 80% of the available RAM. This will run the model and create the chunk_cache.csv file in output/cache for reuse. After creating the chunk cache file, the model can be run with chunk_training_mode: production and the desired num_processes and chunk_size. The model will read the chunk cache file from the output/cache folder, similar to how it reads cached skims if specified. The software trains on the size of the problem, so the cache file can be re-used and only needs to be updated after significant revisions to the population, expressions, skims/network LOS, or machine specs. If run in production mode and no cache file is found, then ActivitySim falls back to training mode. A third chunk_training_mode is adaptive, which, if a cache file exists, runs the model with the starting cache settings but also updates the cache settings based on additional memory inspection. This may further improve the cache settings and reduce runtimes when the model is later run in production mode. If resume_after is set, then the chunk cache file is not overwritten in the cache directory since the list of submodels would be incomplete. A fourth chunk_training_mode is disabled, which assumes the model can be run without chunking due to an abundance of RAM.
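As a rough starting point for a training run, the 80% guidance above can be computed directly. The following sketch is not part of ActivitySim; it only assumes the psutil package (also referenced in the chunk methods discussion below) and prints candidate values to copy into the settings file:

import multiprocessing

import psutil

# sketch only, not ActivitySim code: roughly 80% of logical processors and
# 80% of total RAM, per the training-run guidance above
total_ram_bytes = psutil.virtual_memory().total
logical_cpus = multiprocessing.cpu_count()

print("num_processes:", max(1, int(logical_cpus * 0.8)))
print("chunk_size:", int(total_ram_bytes * 0.8))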

The following chunk_methods are supported to calculate memory overhead when chunking is enabled:

  • bytes - expected rowsize based on actual size (as reported by numpy and pandas) of explicitly allocated data; this can underestimate overhead due to transient data requirements of operations (e.g. merge, sort, transpose)

  • uss - expected rowsize based on change in unique set size (uss), both as a result of explicit data allocation and from readings by the MemMonitor sniffer thread that measures transient uss during time-consuming numpy and pandas operations

  • hybrid_uss - hybrid_uss avoids problems with pure uss, especially with small chunk sizes (e.g. initial training chunks) as numpy may recycle cached blocks and show no increase in uss even though data was allocated and logged

  • rss - like uss, but for resident set size (rss), which is the portion of memory occupied by a process that is held in RAM

  • hybrid_rss - like hybrid_uss, but for rss

RSS is reported by psutil.memory_info and USS is reported by psutil.memory_full_info. USS is the memory which is private to a process and which would be freed if the process were terminated. This is the metric that most closely matches the rather vague notion of memory “in use” (the meaning of which is difficult to pin down in operating systems with virtual memory, where memory can, but sometimes can’t, be swapped or mapped to disk). hybrid_uss performs best and is the most reliable, and is therefore the default.
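For reference, both metrics can be read for the current process with psutil; the snippet below only illustrates the two measures and is not ActivitySim's MemMonitor:

import psutil

# illustration only: the rss and uss metrics described above
proc = psutil.Process()                # current process
rss = proc.memory_info().rss           # resident set size, in bytes
uss = proc.memory_full_info().uss      # unique set size, in bytes (slower to collect)
print("rss GB:", rss / 1e9, "uss GB:", uss / 1e9)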

Additional chunking settings:

  • min_available_chunk_ratio: 0.05 - minimum fraction of total chunk_size to reserve for adaptive chunking

  • default_initial_rows_per_chunk: 500 - initial number of chooser rows for first chunk in training mode, when there is no pre-existing chunk_cache to set initial value, ordinarily bigger is better as long as it is not so big it causes memory issues (e.g. accessibility with lots of zones)

  • keep_chunk_logs: True - whether to preserve or delete subprocess chunk logs when they are consolidated at end of multiprocess run

  • keep_mem_logs: True - whether to preserve or delete subprocess mem logs when they are consolidated at end of multiprocess run

API

class activitysim.core.chunk.ChunkHistorian

Utility for estimating row_size

class activitysim.core.chunk.ChunkLedger(trace_label, chunk_size, baseline_rss, baseline_uss, headroom)
class activitysim.core.chunk.ChunkSizer(chunk_tag, trace_label, num_choosers=0, chunk_size=0)
activitysim.core.chunk.DEFAULT_CHUNK_METHOD = 'hybrid_uss'

The chunk_cache table is a record of the memory usage and observed row_size required for chunking the various models. The row size differs depending on whether memory usage is calculated by rss, uss, or explicitly allocated bytes. We record all three during training so the mode can be changed without necessitating retraining.

tag, num_rows, rss, uss, bytes, uss_row_size, hybrid_uss_row_size, bytes_row_size
atwork_subtour_frequency.simple, 3498, 86016, 81920, 811536, 24, 232, 232
atwork_subtour_mode_choice.simple, 704, 20480, 20480, 1796608, 30, 2552, 2552
atwork_subtour_scheduling.tour_1, 701, 24576, 24576, 45294082, 36, 64614, 64614
atwork_subtour_scheduling.tour_n, 3, 20480, 20480, 97734, 6827, 32578, 32578
auto_ownership_simulate.simulate, 5000, 77824, 24576, 1400000, 5, 280, 280

MODE_RETRAIN

rebuild the chunk_cache table and save/replace it in output/cache/chunk_cache.csv. Performs a complete rebuild of the chunk_cache table by doing adaptive chunking starting from the default initial settings (DEFAULT_INITIAL_ROWS_PER_CHUNK) and observing rss, uss, and allocated bytes to compute row_size. This will run somewhat slower than the other modes because of the overhead of the small first chunk, and possible instability in the second chunk due to inaccuracies caused by the small initial chunk_size sample.

MODE_ADAPTIVE

Use the existing chunk_cache to determine the sizing of the first chunk for each model, but also use the observed row_size to adjust the estimated row_size for subsequent chunks. At the end of the run, write the updated chunk_cache to the output directory, but don’t overwrite the ‘official’ cache file. If the user wishes, they can replace the chunk_cache with the updated version, but this is not done automatically as it is not clear this would be the desired behavior. (Might become clearer over time as this is exercised further.)

MODE_PRODUCTION

Since overhead changes, we don’t necessarily want the same number of rows per chunk every time, but we do use the row_size from the cache, which we trust is stable (the whole point of MODE_PRODUCTION is to avoid the cost of observing overhead). That row_size is stored in self.initial_row_size because initial_rows_per_chunk used it for the first chunk.

MODE_CHUNKLESS

Do not do chunking, and also do not check or log memory usage, so ActivitySim can focus on performance assuming there is abundant RAM.

class activitysim.core.chunk.MemMonitor(trace_label, stop_snooping)
run()

Method representing the thread’s activity.

You may override this method in a subclass. The standard run() method invokes the callable object passed to the object’s constructor as the target argument, if any, with positional and keyword arguments taken from the args and kwargs arguments, respectively.

activitysim.core.chunk.adaptive_chunked_choosers_and_alts(choosers, alternatives, chunk_size, trace_label, chunk_tag=None)

generator to iterate over choosers and alternatives in chunk_size chunks

like chunked_choosers, but also chunks alternatives for use with sampled alternatives which will have different alternatives (and numbers of alts)

There may be up to sample_size (or as few as one) alternatives for each chooser because alternatives may have been sampled more than once, but pick_count for those alternatives will always sum to sample_size.

When we chunk the choosers, we need to take care when chunking the alternatives, as there are varying numbers of them for each chooser. Since alternatives appear in the same order as choosers, we can use cumulative pick_counts to identify the boundaries of each chooser’s set of alternatives.

Parameters
choosers
alternatives: pandas DataFrame

sample alternatives including pick_count column in same order as choosers

rows_per_chunk: int
Yields
i: int

one-based index of current chunk

num_chunks: int

total number of chunks that will be yielded

choosers: pandas DataFrame slice

chunk of choosers

alternatives: pandas DataFrame slice

chunk of alternatives for chooser chunk
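The cumulative pick_count bookkeeping described above can be sketched as follows; this is a simplified illustration with made-up data, not the library's implementation:

import numpy as np
import pandas as pd

# illustrative data: 3 choosers, sample_size alternatives each (after pick_count collapsing)
sample_size = 4
choosers = pd.DataFrame(index=[101, 102, 103])
alternatives = pd.DataFrame({
    "chooser_id": [101, 101, 102, 102, 102, 103],
    "pick_count": [3, 1, 2, 1, 1, 4],
})

# each chooser's pick_counts sum to sample_size, so the (exclusive) row index where
# chooser k's alternatives end is where cumulative pick_count reaches k * sample_size
cum_pick = alternatives["pick_count"].cumsum().to_numpy()
alt_ends = np.searchsorted(cum_pick, np.arange(1, len(choosers) + 1) * sample_size) + 1

# iterate over chooser chunks of 2, slicing the matching alternative rows
for start in range(0, len(choosers), 2):
    end = min(start + 2, len(choosers))
    alt_start = 0 if start == 0 else alt_ends[start - 1]
    chooser_chunk = choosers.iloc[start:end]
    alt_chunk = alternatives.iloc[alt_start:alt_ends[end - 1]]
    print(len(chooser_chunk), "choosers ->", len(alt_chunk), "alternative rows")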

activitysim.core.chunk.overhead_for_chunk_method(overhead, method=None)

return appropriate overhead for row_size calculation based on current chunk_method

  • by ChunkSizer.adaptive_rows_per_chunk to determine observed_row_size based on cum_overhead and cum_rows

  • by ChunkSizer.initial_rows_per_chunk to determine initial row_size using cached_history and current chunk_method

  • by consolidate_logs to add informational row_size column to cache file based on chunk_method for training run

Parameters
overhead: dict keyed by metric or DataFrame with columns
Returns
chunk_method overhead (possibly hybrid, depending on chunk_method)

Utilities

Vectorized helper functions

API

activitysim.core.util.assign_in_place(df, df2)

update existing row values in df from df2, adding columns to df if they are not there

Parameters
df: pd.DataFrame

assignment left-hand-side (dest)

df2: pd.DataFrame

assignment right-hand-side (source)

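A minimal usage sketch with illustrative data, assuming the in-place behavior described above:

import pandas as pd

from activitysim.core import util

# illustrative data
df = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 2])
df2 = pd.DataFrame({"a": [10, 20], "b": ["x", "y"]}, index=[0, 2])

util.assign_in_place(df, df2)
# rows 0 and 2 of column 'a' are updated from df2, and the new column 'b' is added to df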
activitysim.core.util.iprod(ints)

Return the product of the ints in the list or tuple as an unlimited precision python int

Specifically intended to compute array/buffer size for skims where np.prod might overflow for default dtypes. (Narrowing rules for np.prod are different on Windows and Linux.) An alternative to the unwieldy: int(np.prod(ints, dtype=np.int64))

Parameters
ints: list or tuple of ints or int wannabees
Returns
returns python int
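A brief illustration of the intended use, computing the number of elements in a large skim-like buffer (the shape is illustrative):

import numpy as np

from activitysim.core import util

shape = (50_000, 50_000, 8)                  # illustrative, e.g. zones x zones x time periods
print(util.iprod(shape))                     # 20000000000 as an exact python int
print(int(np.prod(shape, dtype=np.int64)))   # the unwieldy alternative noted above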
activitysim.core.util.left_merge_on_index_and_col(left_df, right_df, join_col, target_col)

like pandas left merge, but join on both index and a specified join_col

FIXME - for now return a series of values from the specified right_df target_col

Parameters
left_df: pandas DataFrame

index name assumed to be same as that of right_df

right_df: pandas DataFrame

index name assumed to be same as that of left_df

join_col: str

name of column to join on (in addition to index values) should have same name in both dataframes

target_col: str

name of column from right_df whose joined values should be returned as series

Returns
target_series: pandas Series

series of target_col values with same index as left_df i.e. values joined to left_df from right_df with index of left_df

activitysim.core.util.other_than(groups, bools)

Construct a Series that has booleans indicating the presence of something- or someone-else with a certain property within a group.

Parameters
groups: pandas.Series

A column with the same index as bools that defines the grouping of bools. The bools Series will be used to index groups and then the grouped values will be counted.

bools: pandas.Series

A boolean Series indicating where the property of interest is present. Should have the same index as groups.

Returns
others: pandas.Series

A boolean Series with the same index as groups and bools indicating whether there is something- or someone-else within a group with some property (as indicated by bools).
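A small hedged example of the expected behavior, with persons grouped by household and "worker" as the property of interest; the data and the expected values are only an illustration of the description above:

import pandas as pd

from activitysim.core import util

# illustrative data: four persons in two households
household_id = pd.Series([1, 1, 2, 2], index=[10, 11, 12, 13])            # groups
is_worker = pd.Series([True, False, True, True], index=[10, 11, 12, 13])  # bools

other_worker_in_hh = util.other_than(household_id, is_worker)
# expected per the description above:
#   person 10 -> False (the only worker in household 1)
#   person 11 -> True  (person 10 is a worker)
#   persons 12, 13 -> True (each has another worker in household 2)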

activitysim.core.util.quick_loc_df(loc_list, target_df, attribute=None)

faster replacement for target_df.loc[loc_list] or target_df.loc[loc_list][attribute]

pandas DataFrame.loc[] indexing doesn’t scale for large arrays (e.g. > 1,000,000 elements)

Parameters
loc_list: list-like (numpy.ndarray, pandas.Int64Index, or pandas.Series)
target_df: pandas.DataFrame containing column named attribute
attribute: name of column from target_df to return (or None for all columns)
Returns
pandas.DataFrame or, if attribute specified, pandas.Series
activitysim.core.util.quick_loc_series(loc_list, target_series)

faster replacement for target_series.loc[loc_list]

pandas Series.loc[] indexing doesn’t scale for large arrays (e.g. > 1,000,000 elements)

Parameters
loc_list: list-like (numpy.ndarray, pandas.Int64Index, or pandas.Series)
target_series: pandas.Series
Returns
pandas.Series
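A usage sketch for the two quick_loc helpers; the data is illustrative, and the point is only that the resulting values match the equivalent .loc lookup:

import numpy as np
import pandas as pd

from activitysim.core import util

# illustrative data
zone_area = pd.Series([100.0, 250.0, 75.0], index=[1, 2, 3])
loc_list = np.array([3, 1, 3, 2])

fast = util.quick_loc_series(loc_list, zone_area)
slow = zone_area.loc[loc_list]
# both contain the values [75.0, 100.0, 75.0, 250.0]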
activitysim.core.util.reindex(series1, series2)

This reindexes the first series by the second series. This is an extremely common operation that does not appear to be in Pandas at this time. If anyone knows of an easier way to do this in Pandas, please inform the UrbanSim developers.

The canonical example would be a parcel series which has an index which is parcel_ids and a value which you want to fetch, let’s say it’s land_area. Another dataset, let’s say of buildings, has a series which indicates the parcel_ids that the buildings are located on, but which does not have land_area. If you pass parcels.land_area as the first series and buildings.parcel_id as the second series, this function returns a series which is indexed by buildings and has land_area as values and can be added to the buildings dataset.

In short, this is a join on to a different table using a foreign key stored in the current table, but with only one attribute rather than for a full dataset.

This is very similar to the pandas “loc” function or “reindex” function, but neither of those functions return the series indexed on the current table. In both of those cases, the series would be indexed on the foreign table and would require a second step to change the index.

Parameters
series1, series2: pandas.Series
Returns
reindexed: pandas.Series
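The parcels/buildings example above, written out in code (the data is illustrative):

import pandas as pd

from activitysim.core import util

# land_area indexed by parcel_id (illustrative data)
parcels = pd.DataFrame({"land_area": [1000.0, 2500.0]},
                       index=pd.Index([1, 2], name="parcel_id"))
# buildings, each located on a parcel
buildings = pd.DataFrame({"parcel_id": [2, 1, 2]},
                         index=pd.Index([10, 11, 12], name="building_id"))

buildings["land_area"] = util.reindex(parcels.land_area, buildings.parcel_id)
# buildings.land_area is indexed by building_id with values [2500.0, 1000.0, 2500.0]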
activitysim.core.util.reindex_i(series1, series2, dtype=<class 'numpy.int8'>)

version of reindex that replaces missing na values and converts to int; helpful in expression files that compute counts (e.g. num_work_tours)

Config

Helper functions for configuring a model run

API

exception activitysim.core.config.SettingsFileNotFound(file_name, configs_dir)
activitysim.core.config.base_settings_file_path(file_name)
Parameters
file_name
Returns
path to base settings file or None if not found
activitysim.core.config.expand_input_file_list(input_files)

expand list by unglobbing globs

activitysim.core.config.filter_warnings()

set warning filter to ‘strict’ if specified in settings

activitysim.core.config.future_model_settings(model_name, model_settings, future_settings)

Warn users of new required model settings, and substitute default values

Parameters
model_name: str

name of model

model_settings: dict

model_settings from settings file

future_settings: dict

default values for new required settings

Returns
dict

model_settings with any missing future_settings added

activitysim.core.config.get_cache_dir()

return path of cache directory in output_dir (creating it, if need be)

cache directory is used to store

skim memmaps created by skim_dict_factories

tvpb tap_tap table cache

Returns
str path
activitysim.core.config.get_global_constants()

Read global constants from settings file

Returns
constants: dict

dictionary of constants to add to locals for use by expressions in model spec

activitysim.core.config.get_logit_model_settings(model_settings)

Read nest spec (for nested logit) from model settings file

Returns
nests: dict

dictionary specifying nesting structure and nesting coefficients

constants: dict

dictionary of constants to add to locals for use by expressions in model spec

activitysim.core.config.get_model_constants(model_settings)

Read constants from model settings file

Returns
constants: dict

dictionary of constants to add to locals for use by expressions in model spec
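A typical sequence, sketched under the assumption that the model settings file name is illustrative:

from activitysim.core import config

model_settings = config.read_model_settings("auto_ownership.yaml")  # file name is illustrative
constants = config.get_model_constants(model_settings)      # constants for expression locals
nests = config.get_logit_model_settings(model_settings)     # nest spec, if the model is nested logit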

activitysim.core.config.logger = <Logger activitysim.core.config (WARNING)>

default injectables

activitysim.core.config.read_model_settings(file_name, mandatory=False)
Parameters
file_name: str

yaml file name

mandatory: bool

throw error if file empty or not found

Returns
activitysim.core.config.read_settings_file(file_name, mandatory=True, include_stack=[], configs_dir_list=None)

look for the first occurrence of a yaml file named <file_name> in the directories in the configs_dir list, read settings from the yaml file, and return them as a dict.

Settings file may contain directives that affect which file settings are returned:

inherit_settings: boolean

backfill settings in the current file with values from the next settings file in configs_dir list

include_settings: string <include_file_name>

read settings from the specified include_file in place of the current file settings (to avoid confusion, this directive must appear ALONE in the file, without any additional settings or directives)

Parameters
file_name
mandatory: boolean

if true, raise SettingsFileNotFound exception if no settings file, otherwise return empty dict

include_stack: boolean

only used for recursive calls to provide list of files included so far to detect cycles

Returns: dict

settings from specified settings file/s

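A minimal sketch of calling read_settings_file directly (the file name is illustrative):

from activitysim.core import config

# looks for settings.yaml in each directory in the configs_dir list; if the file
# sets inherit_settings: True, missing values are backfilled from the next
# settings.yaml found further down the list (per the directives described above)
settings = config.read_settings_file("settings.yaml", mandatory=True)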

Inject

Model orchestration and data pipeline interaction.

API

activitysim.core.inject.add_table(table_name, table, replace=False)

Add new table and raise assertion error if the table already exists. Silently replace if replace=True.

activitysim.core.inject.reinject_decorated_tables()

reinject the decorated tables (and columns)

Mem

Helper functions for tracking memory usage

API

activitysim.core.mem.consolidate_logs()

Consolidate and aggregate subprocess mem logs

activitysim.core.mem.shared_memory_size(data_buffers=None)

return total size of the multiprocessing shared memory block in data_buffers

Output

Write output files and track skim usage.

API

activitysim.core.steps.output.previous_write_data_dictionary(output_dir)

Write table_name, number of rows, columns, and bytes for each checkpointed table

Parameters
output_dir: str
activitysim.core.steps.output.track_skim_usage(output_dir)

write statistics on skim usage (diagnostic to detect loading of un-needed skims)

FIXME - have not yet implemented a facility to avoid loading of unused skims

FIXME - if resume_after, this will only reflect skims used after resume

Parameters
output_dir: str
activitysim.core.steps.output.write_data_dictionary(output_dir)

Write table schema for all tables

model settings

txt_format: output text file name (default data_dict.txt) or empty to suppress txt output

csv_format: output csv file name (default data_dict.csv) or empty to suppress csv output

schema_tables: list of tables to include in output (defaults to all checkpointed tables)

for each table, write column names, dtype, and checkpoint added

text format writes individual table schemas to a single text file; csv format writes all tables together with an additional table_name column

Parameters
output_dir: str
activitysim.core.steps.output.write_tables(output_dir)

Write pipeline tables as csv files (in output directory) as specified by output_tables list in settings file.

‘output_tables’ can specify either a list of output tables to include or to skip. If no output_tables list is specified, then all checkpointed tables will be written.

To write all output tables EXCEPT the households and persons tables:

output_tables:
  action: skip
  tables:
    - households
    - persons

To write ONLY the households table:

output_tables:
  action: include
  tables:
     - households

To write tables into a single HDF5 store instead of individual CSVs, use the h5_store flag:

output_tables:
  h5_store: True
  action: include
  tables:
     - households
Parameters
output_dir: str

Tests

See activitysim.core.test