Core Components¶
ActivitySim’s core components provide features for multiprocessing, data management, utility expressions, choice models, person time window management, and helper functions. They include the multiprocessor, the network LOS (skim) manager, the data pipeline manager, the random number manager, the tracer, sampling methods, simulation methods, model specification readers and expression evaluators, choice models, the timetable, the transit virtual path builder, and helper functions.
Multiprocessing¶
Parallelization using multiprocessing
API¶
- activitysim.core.mp_tasks.MEM_TRACE_TICKS = 5¶
mp_tasks - activitysim multiprocessing overview
Activitysim runs a list of models sequentially, performing various computational operations on tables. Model steps can modify values in existing tables, add columns, or create additional tables. Activitysim provides the facility, via expression files, to specify vectorized operations on data tables. The ability to vectorize operations depends upon the independence of the computations performed on the vectorized elements.
Python is agonizingly slow performing scalar operations sequentially on large datasets, so vectorization (using pandas and/or numpy) is essential for good performance.
Fortunately most activity based model simulation steps are row independent at the household, person, tour, or trip level. The decisions for one household are independent of the choices made by other households. Thus it is (generally speaking) possible to run an entire simulation on a household sample with only one household, and get the same result for that household as you would running the simulation on a thousand households. (See the shared data section below for an exception to this highly convenient situation.)
The random number generator supports this goal by providing streams of random numbers for each household and person that are mutually independent and repeatable across model runs and processes.
To the extent that simulation model steps are row independent, we can implement most simulations as a series of vectorized operations on pandas DataFrames and numpy arrays. These vectorized operations are much faster than sequential python because they are implemented by native code (compiled C) and are to some extent multi-threaded. But the benefits of numpy multi-threading are limited because they only apply to atomic numpy or pandas calls, and as soon as control returns to python it is single-threaded and slow.
Multi-threading is not an attractive strategy to get around the python performance problem because of the limitations imposed by python’s global interpreter lock (GIL). Rather than struggling with python multi-threading, this module uses the python multiprocessing module to parallelize certain models.
Because of activitysim’s modular and extensible architecture, we don’t hardwire the multiprocessing architecture. The specification of which models should be run in parallel, how many processors should be used, and the segmentation of the data between processes are all specified in the settings config file. For conceptual simplicity, the single-process model is treated as dominant: even though in practice multiprocessing may be the norm for production runs, the single-process model will be used in development and debugging, and keeping it dominant tends to concentrate the multiprocessing-specific code in one place and prevents multiprocessing considerations from permeating the code base and obscuring the model-specific logic.
The primary function of the multiprocessing settings is to identify distinct stages of computation, to specify how many simultaneous processes should be used to perform them, and to specify how the data should be apportioned between those processes. We assume that the data can be apportioned between subprocesses according to the index of a single primary table (e.g. households), or that the data are in derivative or dependent tables that reference that table’s index (primary key) via a ref_col (foreign key) sharing the name of the primary table’s key.
Generally speaking, we assume that any new tables that are created are directly dependent on the previously existing tables, and that all rows in new tables are either attributable to previously existing rows in the pipeline tables, or belong to global utility tables that are identical across sub-processes.
Note: There are a few exceptions to ‘row independence’, such as school and location choice models, where the model behavior is externally constrained or adjusted. For instance, we want school location choice to match known aggregate school enrollments by zone. Similarly, a parking model (not yet implemented) might be constrained by availability. These situations require special handling.
models:
  ### mp_initialize step
  - initialize_landuse
  - compute_accessibility
  - initialize_households
  ### mp_households step
  - school_location
  - workplace_location
  - auto_ownership_simulate
  - free_parking
  ### mp_summarize step
  - write_tables

multiprocess_steps:
  - name: mp_initialize
    begin: initialize_landuse
  - name: mp_households
    begin: school_location
    num_processes: 2
    slice:
      tables:
        - households
        - persons
  - name: mp_summarize
    begin: write_tables
The multiprocess_steps setting above annotates the models list to indicate that the simulation should be broken into three steps.
The first multiprocess_step (mp_initialize) begins with the initialize_landuse step and is implicitly single-process because there is no ‘slice’ key indicating how to apportion the tables. This first step includes all models listed in the ‘models’ setting up until the first step in the next multiprocess_steps.
The second multiprocess_step (mp_households) starts with the school location model and continues through auto_ownership_simulate. The ‘slice’ info indicates that the tables should be sliced by households, and that persons is a dependent table, so any person with a ref_col (foreign key column with the same name as the households table index) referencing a household record should be taken to ‘belong’ to that household. Similarly, any other table that either shares an index (i.e. has an index with the same name) with the households or persons table, or has a ref_col to either of their indexes, should also be considered a dependent table.
The num_processes setting of 2 indicates that the pipeline should be split in two, with half of the households apportioned into each subprocess pipeline and all dependent tables apportioned accordingly. All other tables (e.g. land_use) that neither share an index (name) nor have a ref_col should be considered mirrored and be included in their entirety.
The primary table is sliced by num_processes-sized strides (e.g. for num_processes == 2, the sub-processes get every second record, starting at offsets 0 and 1 respectively). All other dependent table slices are based (directly or indirectly) on this stride segmentation of the primary table index.
Two separate sub-processes are launched (num_processes == 2), each passed the name of its apportioned pipeline file. They execute independently and, if they terminate successfully, their contents are then coalesced into a single pipeline file whose tables should be essentially the same as if they had been generated by a single process.
We assume that any new tables created by the sub-processes are directly dependent on the previously existing primary tables or are mirrored. Thus we can coalesce the sub-process pipelines by concatenating the primary and dependent tables and simply retaining any one copy of the mirrored tables (since they should all be identical). The sketch below illustrates this slicing and coalescing on toy tables.
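The following is a minimal, self-contained sketch (not the actual mp_tasks code) of the stride slicing and coalescing described above, using toy households and persons tables:

import pandas as pd

# toy primary table (households) and dependent table (persons with a household_id ref_col)
households = pd.DataFrame({'income': [10, 20, 30, 40]},
                          index=pd.Index([1, 2, 3, 4], name='household_id'))
persons = pd.DataFrame({'household_id': [1, 1, 2, 3, 4]},
                       index=pd.Index([10, 11, 12, 13, 14], name='person_id'))

num_processes = 2

# apportion: the primary table is sliced in num_processes-sized strides,
# and dependent tables follow via their ref_col (foreign key) to the primary index
hh_slices = [households.iloc[i::num_processes] for i in range(num_processes)]
per_slices = [persons[persons.household_id.isin(hh.index)] for hh in hh_slices]

# coalesce: concatenate the apportioned tables from each sub-process pipeline;
# mirrored tables (e.g. land_use) would simply be copied from any one sub-process
households_out = pd.concat(hh_slices).sort_index()
persons_out = pd.concat(per_slices).sort_index()
assert households_out.equals(households) and persons_out.equals(persons)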
The third multiprocess_step (mp_summarize) is then handled in single-process mode and runs the write_tables model, writing the results but also leaving the tables in the pipeline, with essentially the same tables and results as if the whole simulation had been run as a single process.
This is called by the main process to allocate a memory buffer to share with subprocs
- Returns
- multiprocessing.RawArray
This is called by the main process to allocate a shared memory buffer to share with subprocs
Note: Buffers must be allocated BEFORE network_los.load_data
- Returns
- skim_buffersdict {<skim_tag>: <multiprocessing.RawArray>}
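As an illustration of the shared-buffer approach (a sketch, not the actual ActivitySim code), a multiprocessing.RawArray can be allocated by the main process and then viewed as a numpy array without copying; the sizes and dtype here are made up:

import multiprocessing

import numpy as np

# allocate a flat float32 buffer sized for a (n_skims, n_zones, n_zones) skim stack
n_zones, n_skims = 100, 5
buffer = multiprocessing.RawArray('f', n_skims * n_zones * n_zones)

# view the shared buffer as a 3D ndarray (no copy); subprocesses that inherit the buffer
# can create the same view and read the skim data loaded by the parent
skim_data = np.frombuffer(buffer, dtype=np.float32).reshape((n_skims, n_zones, n_zones))
skim_data[0, :, :] = 1.0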
- activitysim.core.mp_tasks.apportion_pipeline(sub_proc_names, step_info)¶
apportion pipeline for multiprocessing step
create pipeline files for sub_procs, apportioning data based on slice_rules
Called at the beginning of a multiprocess step, prior to launching the sub-processes. Pipeline files have well-known names (the pipeline file name prefixed by the subjob name).
- Parameters
- sub_proc_nameslist of str
names of the sub processes to apportion
- step_infodict
step_info from multiprocess_steps for step we are apportioning pipeline tables for
- Returns
- creates apportioned pipeline files for each sub job
- activitysim.core.mp_tasks.build_slice_rules(slice_info, pipeline_tables)¶
Based on the slice_info for the current step from the run_list, generate a recipe for slicing the tables in the pipeline (passed in the pipeline_tables parameter).
- slice_info is a dict with two well-known keys:
‘tables’: required list of table names (order matters!)
‘except’: optional list of tables not to slice even if they have a sliceable index name
Note: tables listed in slice_info must appear in same order and before any others in tables dict
The index of the first table in the ‘tables’ list is the primary_slicer.
Any other tables listed are dependent tables with either ref_cols to the primary_slicer or with the same index (i.e. having an index with the same name). This cascades, so any tables dependent on the primary_table can in turn have dependent tables that will be sliced by index or ref_col.
For instance, if the primary_slicer is households, then persons can be sliced because it has a ref_col to (a column with the same name as) the households table index. And the tours table can be sliced since it has a ref_col to persons. Tables can also be sliced by index. For instance, the person_windows table can be sliced because it has an index with the same name as the persons table.
slice_info from multiprocess_steps
slice:
  tables:
    - households
    - persons
tables from pipeline
Table Name     | Index        | ref_col
households     | household_id |
persons        | person_id    | household_id
person_windows | person_id    |
accessibility  | zone_id      |
generated slice_rules dict
households:
  slice_by: primary        <- primary table is sliced in num_processes-sized strides
persons:
  source: households
  slice_by: column
  column: household_id     <- slice by ref_col (foreign key) to households
person_windows:
  source: persons
  slice_by: index          <- slice by index of persons table
accessibility:
  slice_by:                <- mirrored (non-dependent) tables don't get sliced
land_use:
  slice_by:
- Parameters
- slice_infodict
‘slice’ info from run_list for this step
- pipeline_tablesdict {<table_name>, <pandas.DataFrame>}
dict of all tables from the pipeline keyed by table name
- Returns
- slice_rulesdict
- activitysim.core.mp_tasks.coalesce_pipelines(sub_proc_names, slice_info)¶
Coalesce the data in the sub_processes apportioned pipelines back into a single pipeline
We use slice_rules to distinguish sliced (apportioned) tables from mirrored tables.
Sliced tables are concatenated to create a single omnibus table with data from all sub_procs but mirrored tables are the same across all sub_procs, so we can grab a copy from any pipeline.
- Parameters
- sub_proc_nameslist of str
- slice_infodict
slice_info from multiprocess_steps
- Returns
- creates an omnibus pipeline with coalesced data from individual sub_proc pipelines
- activitysim.core.mp_tasks.drop_breadcrumb(step_name, crumb, value=True)¶
Add (crumb: value) to the specified step in breadcrumbs and flush breadcrumbs to file so the run can be resumed with resume_after.
Breadcrumbs provide a record of steps that have been run for use when resuming. Basically, we want to know which steps have been run and which phases completed (i.e. apportion, simulate, coalesce). For multi-processed simulate steps, we also want to know which sub-processes completed successfully, because if resume_after is LAST_CHECKPOINT we don’t have to rerun the successful ones.
- Parameters
- step_namestr
- crumbstr
- valueyaml-writable value
- activitysim.core.mp_tasks.get_breadcrumbs(run_list)¶
Read, validate, and annotate breadcrumb file from previous run
If resume_after specifies a model name, we need to determine which step it falls within, drop any subsequent steps, and set that step’s ‘simulate’ and ‘coalesce’ flags to None so those phases will be re-run.
Extract from breadcrumbs file showing completed mp_households step with 2 processes:
- apportion: true
  completed: [mp_households_0, mp_households_1]
  name: mp_households
  simulate: true
  coalesce: true
- Parameters
- run_listdict
validated and annotated run_list from settings
- Returns
- breadcrumbsdict
validated and annotated breadcrumbs file from previous run
- activitysim.core.mp_tasks.get_run_list()¶
validate and annotate run_list from settings
Assign defaults to missing settings (e.g. chunk_size). Build individual step model lists based on step starts. If resuming, read the breadcrumbs file for info on the previous run’s execution status.
# annotated run_list with two steps, the second with 2 processors
resume_after: None
multiprocess: True
models:
  - initialize_landuse
  - compute_accessibility
  - initialize_households
  - school_location
  - workplace_location

multiprocess_steps:
  step: mp_initialize
    begin: initialize_landuse
    name: mp_initialize
    models:
      - initialize_landuse
      - compute_accessibility
      - initialize_households
    num_processes: 1
    chunk_size: 0
    step_num: 0
  step: mp_households
    begin: school_location
    slice: {'tables': ['households', 'persons']}
    name: mp_households
    models:
      - school_location
      - workplace_location
    num_processes: 2
    chunk_size: 10000
    step_num: 1
- Returns
- run_listdict
validated and annotated run_list
- activitysim.core.mp_tasks.if_sub_task(if_is, if_isnt)¶
select one of two values depending whether current process is primary process or subtask
This is primarily intended for use in yaml files to select between (e.g.) logging levels so main log file can display only warnings and errors from subtasks
In yaml file, it can be used like this:
level: !!python/object/apply:activitysim.core.mp_tasks.if_sub_task [WARNING, NOTSET]
- Parameters
- if_is(any type) value to return if process is a subtask
- if_isnt(any type) value to return if process is not a subtask
- Returns
- (any type) (one of parameters if_is or if_isnt)
- activitysim.core.mp_tasks.mp_apportion_pipeline(injectables, sub_proc_names, step_info)¶
mp entry point for apportion_pipeline
- Parameters
- injectablesdict
injectables from parent
- sub_proc_nameslist of str
names of the sub processes to apportion
- step_infodict
step_info for multiprocess_step we are apportioning
- activitysim.core.mp_tasks.mp_coalesce_pipelines(injectables, sub_proc_names, slice_info)¶
mp entry point for coalesce_pipeline
- Parameters
- injectablesdict
injectables from parent
- sub_proc_nameslist of str
names of the sub processes to apportion
- slice_infodict
slice_info from multiprocess_steps
- activitysim.core.mp_tasks.mp_run_simulation(locutor, queue, injectables, step_info, resume_after, **kwargs)¶
mp entry point for run_simulation
- Parameters
- locutor
- queue
- injectables
- step_info
- resume_afterbool
- kwargsdict
shared_data_buffers passed as kwargs to avoid pickling the dict
- activitysim.core.mp_tasks.mp_setup_skims(injectables, **kwargs)¶
Sub process to load skim data into shared_data
There is no particular necessity to perform this in a sub process instead of the parent except to ensure that this heavyweight task has no side-effects (e.g. loading injectables)
- Parameters
- injectablesdict
injectables from parent
- kwargsdict
shared_data_buffers passed as kwargs to avoid pickling the dict
- activitysim.core.mp_tasks.pipeline_table_keys(pipeline_store)¶
return dict of current (as of last checkpoint) pipeline tables and their checkpoint-specific hdf5_keys
This facilitates reading pipeline tables directly from a ‘raw’ open pandas.HDFStore without opening it as a pipeline (e.g. when apportioning and coalescing pipelines)
We currently only ever need to do this from the last checkpoint, so the ability to specify checkpoint_name is not required, and thus omitted.
- Parameters
- pipeline_storeopen hdf5 pipeline_store
- Returns
- checkpoint_namename of the checkpoint
- checkpoint_tablesdict {<table_name>: <table_key>}
- activitysim.core.mp_tasks.print_run_list(run_list, output_file=None)¶
Print run_list to stdout or file (informational - not read back in)
- Parameters
- run_listdict
- output_fileopen file
- activitysim.core.mp_tasks.read_breadcrumbs()¶
Read breadcrumbs file from previous run
write_breadcrumbs wrote OrderedDict steps as a list so order is preserved (step names are duplicated in steps)
- Returns
- breadcrumbsOrderedDict
- activitysim.core.mp_tasks.run_multiprocess(injectables)¶
run the steps in run_list, possibly resuming after checkpoint specified by resume_after
we never open the pipeline since that is all done within multi-processing steps - mp_apportion_pipeline, run_sub_simulations, mp_coalesce_pipelines - each of which opens the pipeline(s) and closes it/them within the sub-process. This ‘feature’ makes the pipeline state a bit opaque to us, for better or worse…
Steps may be either single or multi process. For multi-process steps, we need to apportion pipelines before running sub processes and coalesce them afterwards
injectables arg allows propagation of setting values that were overridden on the command line (parent process command line arguments are not available to sub-processes in Windows)
allocate shared data buffers for skims and shadow_pricing
load shared skim data from OMX files
run each (single or multiprocess) step in turn
Drop breadcrumbs along the way to facilitate resuming in a later run
- Parameters
- run_listdict
annotated run_list (including prior run breadcrumbs if resuming)
- injectablesdict
dict of values to inject in sub-processes
- activitysim.core.mp_tasks.run_simulation(queue, step_info, resume_after, shared_data_buffer)¶
run step models as subtask
called once to run each individual sub process in multiprocess step
Unless actually resuming, resume_after will be None for the first step, and then FINAL for subsequent steps, so pipelines are opened to resume where the previous step left off
- Parameters
- queuemultiprocessing.Queue
- step_infodict
step_info for current step from multiprocess_steps
- resume_afterstr or None
- shared_data_bufferdict
dict of shared data (e.g. skims and shadow_pricing)
- activitysim.core.mp_tasks.run_sub_simulations(injectables, shared_data_buffers, step_info, process_names, resume_after, previously_completed, fail_fast)¶
Launch sub processes to run models in step according to specification in step_info.
If resume_after is LAST_CHECKPOINT, then pick up where previous run left off, using breadcrumbs from previous run. If some sub-processes completed in the prior run, then skip rerunning them.
If resume_after specifies a checkpoint, skip checkpoints that precede the resume_after
Drop ‘completed’ breadcrumbs for this run as sub-processes terminate
Wait for all sub-processes to terminate and return list of those that completed successfully.
- Parameters
- injectablesdict
values to inject in subprocesses
- shared_data_buffersdict
dict of shared_data for sub-processes (e.g. skim and shadow pricing data)
- step_infodict
step_info from run_list
- process_nameslist of str
list of sub process names to run in parallel
- resume_afterstr or None
name of simulation to resume after, or LAST_CHECKPOINT to resume where previous run left off
- previously_completedlist of str
names of processes that successfully completed in previous run
- fail_fastbool
whether to raise error if a sub process terminates with nonzero exitcode
- Returns
- completedlist of str
names of sub_processes that completed successfully
- activitysim.core.mp_tasks.run_sub_task(p)¶
Run process p synchronously.
Return when sub process terminates, or raise error if exitcode is nonzero
- Parameters
- pmultiprocessing.Process
- activitysim.core.mp_tasks.setup_injectables_and_logging(injectables, locutor=True)¶
Setup injectables (passed by parent process) within sub process
We sometimes want only one of the sub-processes to perform an action (e.g. write shadow prices). The locutor flag indicates that this sub process is the designated singleton spokesperson.
- Parameters
- injectablesdict {<injectable_name>: <value>}
dict of injectables passed by parent process
- locutorbool
is this sub process the designated spokesperson
- Returns
- injects injectables
- activitysim.core.mp_tasks.write_breadcrumbs(breadcrumbs)¶
Write breadcrumbs file with execution history of multiprocess run
Write steps as array so order is preserved (step names are duplicated in steps)
Extract from breadcrumbs file showing completed mp_households step with 2 processes:
- apportion: true
  coalesce: true
  completed: [mp_households_0, mp_households_1]
  name: mp_households
  simulate: true
- Parameters
- breadcrumbsOrderedDict
Data Management¶
Input¶
Input data table functions
API¶
- activitysim.core.input.read_from_table_info(table_info)¶
Read input text files and return cleaned up DataFrame.
table_info is a dictionary that specifies the following input params.
See input_table_list in settings.yaml in the example folder for a working example
key          | description
tablename    | name of pipeline table in which to store dataframe
filename     | name of csv file to read (in data_dir)
column_map   | list of input columns to rename from_name: to_name
index_col    | name of column to set as dataframe index column
drop_columns | list of column names of columns to drop
h5_tablename | name of target table in HDF5 file
- activitysim.core.input.read_input_table(tablename, required=True)¶
Reads input table name and returns cleaned DataFrame.
Uses settings found in input_table_list in global settings file
- Parameters
- tablenamestring
- Returns
- pandas DataFrame
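A hypothetical input_table_list entry using the keys above (file and column names are illustrative, not taken from a real example setup):

input_table_list:
  - tablename: households
    filename: households.csv
    index_col: household_id
    column_map:
      HHID: household_id
      PERSONS: hhsize
    drop_columns:
      - unused_column

The configured table can then be read through the documented helper:

from activitysim.core.input import read_input_table

households = read_input_table('households')   # cleaned DataFrame indexed by household_id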
LOS¶
Network Level of Service (LOS) data access
API¶
- class activitysim.core.los.Network_LOS(los_settings_file_name='network_los.yaml')¶
singleton object to manage skims and skim-related tables

los_settings_file_name: str   # e.g. 'network_los.yaml'
skim_dtype_name: str          # e.g. 'float32'
dict_factory_name: str        # e.g. 'NumpyArraySkimFactory'
zone_system: str              # ONE_ZONE, TWO_ZONE, or THREE_ZONE
skim_time_periods = None      # list of str e.g. ['AM', 'MD', 'PM']
skims_info: dict              # dict of SkimInfo keyed by skim_tag
skim_buffers: dict            # if multiprocessing, dict of multiprocessing.Array buffers keyed by skim_tag
skim_dicts: dict              # dict of SkimDict keyed by skim_tag

# TWO_ZONE and THREE_ZONE
maz_taz_df: pandas.DataFrame      # DataFrame with two columns, MAZ and TAZ, mapping MAZ to containing TAZ
maz_to_maz_df: pandas.DataFrame   # maz_to_maz attributes for MazSkimDict sparse skims
                                  # (indexed by synthetic omaz/dmaz index for faster get_mazpairs lookup)
maz_ceiling: int                  # max maz_id + 1 (to compute synthetic omaz/dmaz index in get_mazpairs)
max_blend_distance: dict          # dict of int maz_to_maz max_blend_distance values keyed by skim_tag

# THREE_ZONE only
tap_df: pandas.DataFrame
tap_lines_df: pandas.DataFrame    # if specified in settings, list of transit lines served, indexed by TAP;
                                  # used to prune maz_to_tap_dfs to drop more distant TAPs with redundant service;
                                  # since a TAP can serve multiple lines, the tap_lines_df TAP index is not unique
maz_to_tap_dfs: dict              # dict of maz_to_tap DataFrames keyed by access mode (e.g. 'walk', 'drive');
                                  # maz_to_tap dfs have OMAZ and DMAZ columns plus additional attribute columns
tap_tap_uid: TapTapUidCalculator
Allocate multiprocessing.RawArray shared data buffers sized to hold data for the omx skims. Only called when multiprocessing - BEFORE load_data()
Returns dict of allocated buffers so that mp_tasks can add them to the dict of data to be shared with subprocesses.
Note: we are only allocating storage, but not loading any skim data into it
- Returns
- dict of multiprocessing.RawArray keyed by skim_tag
- create_skim_dict(skim_tag)¶
Create a new SkimDict of type specified by skim_tag (e.g. ‘taz’, ‘maz’ or ‘tap’)
- Parameters
- skim_tag: str
- Returns
- SkimDict or subclass (e.g. MazSkimDict)
- get_default_skim_dict()¶
Get the default (non-transit) skim dict for the (1, 2, or 3) zone_system
- Returns
- TAZ SkimDict for ONE_ZONE, MazSkimDict for TWO_ZONE and THREE_ZONE
- get_mazpairs(omaz, dmaz, attribute)¶
look up attribute values of maz od pairs in sparse maz_to_maz df
- Parameters
- omaz: array-like list of omaz zone_ids
- dmaz: array-like list of dmaz zone_ids
- attribute: str name of attribute column in maz_to_maz_df
- Returns
- Numpy.ndarray: list of attribute values for od pairs
- get_skim_dict(skim_tag)¶
Get SkimDict for the specified skim_tag (e.g. ‘taz’, ‘maz’, or ‘tap’)
- Returns
- SkimDict or subclass (e.g. MazSkimDict)
- get_tappairs3d(otap, dtap, dim3, key)¶
TAP skim lookup
FIXME - why do we provide this for taps, but use skim wrappers for TAZ?
- Parameters
- otap: pandas.Series
origin (boarding tap) zone_ids
- dtap: pandas.Series
dest (alighting tap) zone_ids
- dim3: pandas.Series or str
dim3 (e.g. tod) str
- key
skim key (e.g. ‘IWAIT_SET1’)
- Returns
- Numpy.ndarray: list of tap skim values for odt tuples
- load_data()¶
Load tables and skims from files specified in network_los settings
- load_settings()¶
Read setting file and initialize object variables (see class docstring for list of object variables)
Load omx skim data into shared_data buffers. Only called when multiprocessing - BEFORE any models are run or any call to load_data()
- Parameters
- shared_data_buffers: dict of multiprocessing.RawArray keyed by skim_tag
- load_skim_info()¶
read skim info from omx files into SkimInfo, and store in self.skims_info dict keyed by skim_tag
ONE_ZONE and TWO_ZONE systems have only TAZ skims; THREE_ZONE systems have both TAZ and TAP skims
- multiprocess()¶
return True if this is a multiprocessing run (even if it is a main or single-process subprocess)
- Returns
- bool
- omx_file_names(skim_tag)¶
Return list of omx file names from network_los settings file for the specified skim_tag (e.g. ‘taz’)
- Parameters
- skim_tag: str (e.g. ‘taz’)
- Returns
- list of str
- skim_time_period_label(time_period)¶
convert time period times to skim time period labels (e.g. 9 -> ‘AM’)
- Parameters
- time_periodpandas Series
- Returns
- numpy.array
string time period labels
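A rough usage sketch assembled from the methods documented above; the call order is an assumption (inside ActivitySim the object is normally set up by the framework), and the settings file name shown is just the default:

import pandas as pd
from activitysim.core.los import Network_LOS

network_los = Network_LOS('network_los.yaml')    # settings describe zone system, skims, tables
network_los.load_settings()                      # initialize object variables from settings
network_los.load_skim_info()                     # read skim info from the omx files
network_los.load_data()                          # load tables and skims (single-process case)

skim_dict = network_los.get_default_skim_dict()  # TAZ SkimDict (or MazSkimDict for 2/3 zone)
labels = network_los.skim_time_period_label(pd.Series([6, 9, 14]))  # e.g. 9 -> 'AM'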
Skims¶
Skims data access
API¶
- class activitysim.core.skim_dict_factory.AbstractSkimFactory(network_los)¶
Provide access to skim data from store.
- load_skim_info(skim_tag: str): dict
Read omx files for skim <skim_tag> (e.g. ‘TAZ’) and build skim_info dict
- get_skim_data(skim_tag: str, skim_info: dict): SkimData
Read skim data from backing store and return it as a 3D ndarray quack-alike SkimData object
- allocate_skim_buffer(skim_info, shared: bool): 1D array buffer sized for 3D SkimData
Allocate a ram skim buffer (ndarray or multiprocessing.Array) to use as frombuffer for SkimData
- allocate_skim_buffer(skim_info, shared=False)¶
For multiprocessing
Does subclass support shareable data for multiprocessing
- Returns
- boolean
- class activitysim.core.skim_dict_factory.JitMemMapSkimData(skim_cache_path, skim_info)¶
SkimData subclass for just-in-time memmap.
Since opening a memmap is fast, open the memmap, read the data on demand, and immediately close it. This essentially eliminates RAM usage, but it means we are loading the data every time we access the skim, which may be significantly slower, depending on patterns of usage.
- property shape¶
- Returns
- list-like shape tuple as returned by numpy.shape
- class activitysim.core.skim_dict_factory.MemMapSkimFactory(network_los)¶
The numpy.memmap docs state: “The memmap object can be used anywhere an ndarray is accepted.” You might think that since memmap duck-types ndarray, we could simply wrap it in a SkimData object.
But, as the numpy.memmap docs also say: “Memory-mapped files are used for accessing small segments of large files on disk, without reading the entire file into memory.”
The words “small segments” are not accidental, because, as you gradually access all the parts of the memmapped array, memory usage increases as all the memory is loaded into RAM.
Under this scenario, the MemMapSkimFactory operates as a just-in-time loader, with no net savings in RAM footprint (other than potentially avoiding loading any unused skims).
Alternatively, since opening a memmap is fast, you could just open the memmap, read the data on demand, and immediately close it. This essentially eliminates RAM usage, but it means you are loading the data every time you access the skim, which, depending on your patterns of usage, may or may not be acceptable.
- get_skim_data(skim_tag, skim_info)¶
Read skim data from backing store and return it as a 3D ndarray quack-alike SkimData object (either a JitMemMapSkimData or a memmap backed SkimData object)
- Parameters
- skim_tag: str
- skim_info: dict
- Returns
- SkimData or subclass
- class activitysim.core.skim_dict_factory.NumpyArraySkimFactory(network_los)¶
- allocate_skim_buffer(skim_info, shared=False)¶
Allocate a ram skim buffer to use as frombuffer for SkimData If shared is True, return a shareable multiprocessing.RawArray, otherwise a numpy.ndarray
- Parameters
- skim_info: dict
- shared: boolean
- Returns
- multiprocessing.RawArray or numpy.ndarray
- get_skim_data(skim_tag, skim_info)¶
Read skim data from backing store and return it as a 3D ndarray quack-alike SkimData object
- Parameters
- skim_tag: str
- skim_info: dict
- Returns
- SkimData
- load_skims_to_buffer(skim_info, skim_buffer)¶
Load skims from disk store (omx or cache) into ram skim buffer (multiprocessing.RawArray or numpy.ndarray)
- Parameters
- skim_info: dict
- skim_buffer: 1D buffer sized to hold all skims (multiprocessing.RawArray or numpy.ndarray)
Does subclass support shareable data for multiprocessing
- Returns
- boolean
- class activitysim.core.skim_dict_factory.SkimData(skim_data)¶
A facade for 3D skim data exposing numpy indexing and shape. The primary purpose is to document and police the api used to access skim data. Subclasses using a different backing store or performing additional/alternative processing need only implement the methods exposed here.
For instance, to open/close memmapped files just in time, or to access backing data via an alternate api
- property shape¶
- Returns
- list-like shape tuple as returned by numpy.shape
- class activitysim.core.skim_dictionary.DataFrameMatrix(df)¶
Utility class to allow a pandas dataframe to be treated like a 2-D array, indexed by rowid, colname
For use in vectorized expressions where the desired values depend on both a row and a column selector, e.g. size_terms.get(df.dest_taz, df.purpose)
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [10, 20, 30, 40, 50]}, index=[100, 101, 102, 103, 104])
dfm = DataFrameMatrix(df)
dfm.get(row_ids=[100, 100, 103], col_ids=['a', 'b', 'a'])
# returns [1, 10, 4]
- get(row_ids, col_ids)¶
- Parameters
- row_ids - list of row_ids (df index values)
- col_ids - list of column names, one per row_id,
specifying column from which the value for that row should be retrieved
- Returns
- series with one row per row_id, with the value from the column specified in col_ids
- class activitysim.core.skim_dictionary.MazSkimDict(skim_tag, network_los, taz_skim_dict)¶
MazSkimDict provides a facade that allows skim-like lookup by maz orig,dest zone_id for zone systems in which there are typically too many maz zones to create full maz-maz skims.
Dependencies: network_los.load_data must have already loaded: taz skim_dict, maz_to_maz_df, and maz_taz_df
It performs lookups from a sparse list of maz-maz od pairs on selected attributes (e.g. WALKDIST) where accuracy for nearby od pairs is critical, and is backed by a fallback taz skim dict to return values for more distant pairs (or for skims that are not attributes in the maz-maz table).
- get_skim_usage()¶
return set of keys of skims looked up. e.g. {‘DIST’, ‘SOV’}
- Returns
- set:
- lookup(orig, dest, key)¶
Return list of skim values of skim(s) at orig/dest in the skim with the specified key (e.g. ‘DIST’)
Look up in sparse table (backed by taz skims) if key is a sparse_key, otherwise look up in taz skims For taz skim lookups, the offset_mapper will convert maz zone_ids directly to taz skim indexes.
- Parameters
- orig: list of orig zone_ids
- dest: list of dest zone_ids
- key: str
- Returns
- Numpy.ndarray: list of skim values for od pairs
- sparse_lookup(orig, dest, key)¶
Get impedance values for a set of origin, destination pairs.
- Parameters
- orig1D array
- dest1D array
- keystr
skim key
- Returns
- valuesnumpy 1D array
- class activitysim.core.skim_dictionary.OffsetMapper(offset_int=None, offset_list=None, offset_series=None)¶
Utility to map skim zone ids to ordinal offsets (e.g. numpy array indices)
Can map either by a fixed offset (e.g. -1 to map 1-based to 0-based) or by an explicit mapping of zone id to offset (slower but more flexible)
Internally, there are two representations:
- offset_int:
int offset which when added to zone_id yields skim array index (e.g. -1 to map 1-based zones to 0-based index)
- offset_series:
pandas series with zone_id index and skim array offset values. Ordinarily, the index is just range(0, omx_size). If the series has duplicate offset values, it can map multiple zone_ids to a single skim array index (e.g. to map maz zone_ids to corresponding taz skim offsets)
- map(zone_ids)¶
map zone_ids to skim indexes
- Parameters
- zone_idslist-like (numpy.ndarray, pandas.Int64Index, or pandas.Series)
- Returns
- offsetsnumpy array of int
- set_offset_int(offset_int)¶
specify int offset which when added to zone_id yields skim array index (e.g. -1 to map 1-based to 0-based)
- Parameters
- offset_intint
- set_offset_list(offset_list)¶
Convenience method to set offset_series using an integer list the same size as target skim dimension with implicit skim index mapping (e.g. an omx mapping as returned by omx_file.mapentries)
- Parameters
- offset_listlist of int
- set_offset_series(offset_series)¶
- Parameters
- offset_series: pandas.Series
series with zone_id index and skim array offset values (can map many zone_ids to skim array index)
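A minimal sketch of the two mapping styles documented above (the zone ids are made up):

import numpy as np
import pandas as pd
from activitysim.core.skim_dictionary import OffsetMapper

# fixed offset: map 1-based zone ids to 0-based skim array indexes
mapper = OffsetMapper()
mapper.set_offset_int(-1)
mapper.map(np.array([1, 2, 5]))          # -> array([0, 1, 4])

# explicit mapping: zone_id index -> skim array offset (can be many-to-one)
mapper = OffsetMapper()
mapper.set_offset_series(pd.Series([0, 0, 1], index=[101, 102, 201]))
mapper.map(np.array([102, 201]))         # -> array([0, 1])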
- class activitysim.core.skim_dictionary.Skim3dWrapper(skim_dict, orig_key, dest_key, dim3_key)¶
This works the same as the SkimWrapper above, except that a third dimension, dim3, is also supplied, and a 3D lookup is performed using orig, dest, and dim3.
- Parameters
- skims: Skims
This is the Skims object to wrap
- dim3_keystr
This identifies the column in the dataframe which is used to select among Skim objects using the SECOND item in each tuple (see above for a more complete description)
- set_df(df)¶
Set the dataframe
- Parameters
- dfDataFrame
The dataframe which contains the orig, dest, and dim3 values
- Returns
- self (to facilitate chaining)
- class activitysim.core.skim_dictionary.SkimDict(skim_tag, skim_info, skim_data)¶
A SkimDict object is a wrapper around a dict of multiple skim objects, where each object is identified by a key.
Note that keys are either strings or tuples of two strings (to support stacking of skims.)
- get_skim_usage()¶
return set of keys of skims looked up. e.g. {‘DIST’, ‘SOV’}
- Returns
- set:
- lookup(orig, dest, key)¶
Return list of skim values of skim(s) at orig/dest in the skim with the specified key (e.g. ‘DIST’)
- Parameters
- orig: list of orig zone_ids
- dest: list of dest zone_ids
- key: str
- Returns
- Numpy.ndarray: list of skim values for od pairs
- lookup_3d(orig, dest, dim3, key)¶
3D lookup of skim values of skim(s) at orig/dest for stacked skims indexed by dim3 selector
The idea is that skims may be stacked in groups with a base key and a dim3 key (usually a time of day key)
On import (from omx), skim stacks are represented by base and dim3 keys separated by a double underscore
e.g. DRV_COM_WLK_BOARDS__AM indicates base skim key DRV_COM_WLK_BOARDS with a time of day (dim3) of ‘AM’
Since all the skims are stored in a single contiguous 3D array, we can use the dim3 key as a third index and thus rapidly get skim values for a list of (orig, dest, tod) tuples using index arrays (‘fancy indexing’)
- Parameters
- orig: list of orig zone_ids
- dest: list of dest zone_ids
- dim3: list with one dim3 key for each orig/dest pair
- Returns
- Numpy.ndarray: list of skim values
- wrap(orig_key, dest_key)¶
return a SkimWrapper for self
- wrap_3d(orig_key, dest_key, dim3_key)¶
return a Skim3dWrapper for self
- property zone_ids¶
Return list of zone_ids we grok in skim index order
- Returns
- ndarray of int domain zone_ids
- class activitysim.core.skim_dictionary.SkimWrapper(skim_dict, orig_key, dest_key)¶
A SkimWrapper object is an access wrapper around a SkimDict of multiple skim objects, where each object is identified by a key.
This is just a way to simplify expression files by hiding the orig and dest arguments when the orig and dest vectors are in a dataframe with known column names (specified at init time). The dataframe is supplied via set_df because it may not be available (e.g. due to chunking) at the time the SkimWrapper is instantiated.
When the user calls skims[key], key is an identifier for which skim to use, and the object automatically looks up impedances of that skim using the specified orig_key column in df as the origin and the dest_key column in df as the destination. In this way, the user does not do the O-D lookup by hand and only specifies which skim to use for this lookup. This is the only purpose of this object: to abstract away the O-D lookup and use skims by specifying which skim to use in the expressions.
Note that keys are either strings or tuples of two strings (to support stacking of skims.)
- lookup(key, reverse=False)¶
Generally not called by the user - use __getitem__ instead
- Parameters
- keyhashable
The key (identifier) for this skim object
- reversebool (optional)
reverse=False means lookup the standard origin-destination skim value; reverse=True means lookup the destination-origin skim value
- Returns
- impedances: pd.Series
A Series of impedances which are elements of the Skim object and with the same index as df
- max(key)¶
return max skim value in either o-d or d-o direction
- reverse(key)¶
return skim value in reverse (d-o) direction
- set_df(df)¶
Set the dataframe
- Parameters
- dfDataFrame
The dataframe which contains the origin and destination ids
- Returns
- self (to facilitate chaining)
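A sketch of the SkimWrapper usage pattern described above; skim_dict is assumed to be an already-loaded SkimDict (e.g. from Network_LOS.get_default_skim_dict()), and the column and skim key names are illustrative:

import pandas as pd

# chooser table with origin and destination zone id columns
df = pd.DataFrame({'orig_zone': [1, 2, 3], 'dest_zone': [4, 5, 6]})

skims = skim_dict.wrap('orig_zone', 'dest_zone')   # SkimWrapper bound to the column names
skims.set_df(df)                                   # supply the chooser dataframe
dist = skims['DIST']                               # Series of O-D impedances, same index as df
dist_reverse = skims.reverse('DIST')               # same lookup in the D-O direction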
Pipeline¶
Data pipeline manager, which manages the list of model steps, runs them, reads and writes data tables from/to the pipeline datastore, and supports restarting of the pipeline at any model step.
API¶
- activitysim.core.pipeline.add_checkpoint(checkpoint_name)¶
Create a new checkpoint with specified name, write all data required to restore the simulation to its current state.
Detect any changed tables, re-wrap them and write the current version to the pipeline store. Write the current state of the random number generator.
- Parameters
- checkpoint_namestr
- activitysim.core.pipeline.checkpointed_tables()¶
Return a list of the names of all checkpointed tables
- activitysim.core.pipeline.cleanup_pipeline()¶
Cleanup pipeline after successful run
Open the main pipeline if not already open (it will be closed if multiprocessing). Create a single-checkpoint pipeline file with the latest version of all checkpointed tables. Delete the main pipeline and any subprocess pipelines.
Called if cleanup_pipeline_after_run setting is True
- Returns
- nothing, but with changed state: pipeline file that was open on call is closed and deleted
- activitysim.core.pipeline.close_pipeline()¶
Close any known open files
- activitysim.core.pipeline.extend_table(table_name, df, axis=0)¶
add new table or extend (add rows) to an existing table
- Parameters
- table_namestr
orca/inject table name
- dfpandas DataFrame
- activitysim.core.pipeline.get_checkpoints()¶
Get pandas dataframe of info about all checkpoints stored in pipeline
pipeline doesn’t have to be open
- Returns
- checkpoints_dfpandas.DataFrame
- activitysim.core.pipeline.get_pipeline_store()¶
Return the open pipeline hdf5 checkpoint store, or None if it has not been opened
- activitysim.core.pipeline.get_rn_generator()¶
Return the singleton random number object
- Returns
- activitysim.random.Random
- activitysim.core.pipeline.get_table(table_name, checkpoint_name=None)¶
Return pandas dataframe corresponding to table_name
if checkpoint_name is None, return the current (most recent) version of the table. The table can be a checkpointed table or any registered orca table (e.g. function table)
if checkpoint_name is specified, return table as it was at that checkpoint (the most recently checkpointed version of the table at or before checkpoint_name)
- Parameters
- table_namestr
- checkpoint_namestr or None
- Returns
- dfpandas.DataFrame
- activitysim.core.pipeline.last_checkpoint()¶
- Returns
- last_checkpoint: str
name of last checkpoint
- activitysim.core.pipeline.load_checkpoint(checkpoint_name)¶
Load dataframes and restore random number channel state from pipeline hdf5 file. This restores the pipeline state that existed at the specified checkpoint in a prior simulation. This allows us to resume the simulation after the specified checkpoint
- Parameters
- checkpoint_namestr
model_name of checkpoint to load (resume_after argument to open_pipeline)
- activitysim.core.pipeline.open_pipeline(resume_after=None)¶
Start pipeline, either for a new run or, if resume_after, loading checkpoint from pipeline.
If resume_after, then we expect the pipeline hdf5 file to exist and contain checkpoints from a previous run, including a checkpoint with name specified in resume_after
- Parameters
- resume_afterstr or None
name of checkpoint to load from pipeline store
- activitysim.core.pipeline.open_pipeline_store(overwrite=False)¶
Open the pipeline checkpoint store
- Parameters
- overwritebool
delete file before opening (unless resuming)
- activitysim.core.pipeline.read_df(table_name, checkpoint_name=None)¶
Read a pandas dataframe from the pipeline store.
We store multiple versions of all simulation tables, for every checkpoint in which they change, so we need to know both the table_name and the checkpoint_name of the desired table.
The only exception is the checkpoints dataframe, which just has a table_name
An error will be raised by HDFStore if the table is not found
- Parameters
- table_namestr
- checkpoint_namestr
- Returns
- dfpandas.DataFrame
the dataframe read from the store
- activitysim.core.pipeline.registered_tables()¶
Return a list of the names of all currently registered dataframe tables
- activitysim.core.pipeline.replace_table(table_name, df)¶
Add or replace a orca table, removing any existing added orca columns
The use case for this function is a method that calls to_frame on an orca table, modifies it, and then saves the modified dataframe.
orca.to_frame returns a copy, so no changes are saved, and adding multiple columns with add_column adds them in an indeterminate order.
Simply replacing an existing table “behind the pipeline’s back” by calling orca.add_table risks the pipeline failing to detect that it has changed, and thus not checkpointing the changes.
- Parameters
- table_namestr
orca/pipeline table name
- dfpandas DataFrame
- activitysim.core.pipeline.rewrap(table_name, df=None)¶
Add or replace an orca registered table as a unitary DataFrame-backed DataFrameWrapper table
if df is None, then get the dataframe from orca (table_name should be registered, or an error will be thrown) which may involve evaluating added columns, etc.
If the orca table already exists, deregister it along with any associated columns before re-registering it.
The net result is that the dataframe is a registered orca DataFrameWrapper table with no computed or added columns.
- Parameters
- table_name
- df
- Returns
- the underlying df of the rewrapped table
- activitysim.core.pipeline.run(models, resume_after=None)¶
run the specified list of models, optionally loading checkpoint and resuming after specified checkpoint.
Since we use model_name as checkpoint name, the same model may not be run more than once.
If resume_after checkpoint is specified and a model with that name appears in the models list, then we only run the models after that point in the list. This allows the user always to pass the same list of models, but specify a resume_after point if desired.
- Parameters
- models[str]
list of model_names
- resume_afterstr or None
model_name of checkpoint to load checkpoint and AFTER WHICH to resume model run
- returns:
nothing, but with pipeline open
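A minimal sketch of driving the pipeline with the functions above, assuming the usual configs, data, and output injectables have already been set up and the listed steps are registered model steps:

from activitysim.core import pipeline

models = ['initialize_landuse', 'initialize_households', 'school_location']

pipeline.run(models)                        # run each step, checkpointing after each one
persons = pipeline.get_table('persons')     # current (most recent checkpoint) version
pipeline.close_pipeline()

# in a later run, resume after a checkpoint instead of rerunning everything
pipeline.run(models, resume_after='initialize_households')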
- activitysim.core.pipeline.run_model(model_name)¶
Run the specified model and add checkpoint for model_name
Since we use model_name as checkpoint name, the same model may not be run more than once.
- Parameters
- model_namestr
model_name is assumed to be the name of a registered orca step
- activitysim.core.pipeline.split_arg(s, sep, default='')¶
split str s in two at first sep, returning empty string as second result if no sep
- activitysim.core.pipeline.write_df(df, table_name, checkpoint_name=None)¶
Write a pandas dataframe to the pipeline store.
We store multiple versions of all simulation tables, for every checkpoint in which they change, so we need to know both the table_name and the checkpoint_name to label the saved table
The only exception is the checkpoints dataframe, which just has a table_name
- Parameters
- dfpandas.DataFrame
dataframe to store
- table_namestr
also conventionally the injected table name
- checkpoint_namestr
the checkpoint at which the table was created/modified
Random¶
ActivitySim’s random number generation has a number of important features unique to AB modeling:
Regression testing, debugging - run the exact model with the same inputs and get exactly the same results.
Debugging models - run the exact model with the same inputs but with changes to expression files and get the same results except where the equations differ.
Since runs can take a while, the above cases need to work with a restartable pipeline.
Debugging Multithreading - run the exact model with different multithreading configurations and get the same results.
Repeatable household-level choices - results for a household are repeatable when run with different sample sizes
Repeatable household level results with different scenarios - results for a household are repeatable with different scenario configurations sequentially up to the point at which those differences emerge, and in alternate submodels in which those differences do not apply.
Random number generation is done using the numpy Mersenne Twister PRNG. ActivitySim seeds on-the-fly and uses a stream of random numbers seeded by the household id, person id, tour id, trip id, the model step offset, and the global seed. The logic for calculating the seed is something along the lines of:
chooser_table.index * number_of_models_for_chooser + chooser_model_offset + global_seed_offset
for example
1425 * 2 + 0 + 1
where:
1425 = household table index - households.id
2 = number of household level models - auto ownership and cdap
0 = first household model - auto ownership
1 = global seed offset for testing the same model under different random global seeds
ActivitySim generates a separate, distinct, and stable random number stream for each tour type and tour number in order to maintain as much stability as is possible across alternative scenarios. This is done for trips as well, by direction (inbound versus outbound).
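To make the arithmetic concrete, here is a small sketch of the seed calculation above (variable names are illustrative, not the actual ActivitySim code):

household_id = 1425        # chooser_table.index value (households.id)
models_for_chooser = 2     # number of household-level models (auto ownership and cdap)
model_offset = 0           # auto ownership is the first household-level model
global_seed_offset = 1     # global offset for testing under different global seeds

seed = household_id * models_for_chooser + model_offset + global_seed_offset
# seed == 2851; in practice the result is also taken modulo 2**32 (see SimpleChannel below)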
Note
The Random module contains max model step constants by chooser type (household, person, tour, trip) that need to be equal to the number of chooser sub-models.
API¶
- class activitysim.core.random.SimpleChannel(channel_name, base_seed, domain_df, step_name)¶
We need to ensure that we generate the same random streams (when re-run, or even across different simulations). We do this by generating a random seed for each domain_df row that is based on the domain_df index (which implies that generated tables like tours and trips are also created with stable, predictable, repeatable row indexes).
Because we need to generate a distinct stream for each step, we can’t just use the domain_df index - we need a strategy for handling multiple steps without generating collisions between streams (i.e. choosing the same seed for more than one stream.)
The easiest way to do this would be to use an array of integers to seed the generator, with a global seed, a channel seed, a row seed, and a step seed. Unfortunately, seeding numpy RandomState with arrays is a LOT slower than with a single integer seed, and speed matters because we reseed on-the-fly for every call, since creating a different RandomState object for each row uses too much memory (5K per RandomState object).
numpy random seeds are unsigned int32 so there are 4,294,967,295 available seeds. That is probably just about enough to distribute evenly, for most cities, depending on the number of households, persons, tours, trips, and steps.
So we use (global_seed + channel_seed + step_seed + row_index) % (1 << 32) to get an int32 seed rather than a tuple.
We do read in the whole households and persons tables at start time, so we could note the max index values. But we might then want a way to ensure stability between the test, example, and full datasets. I am punting on this for now.
- begin_step(step_name)¶
Reset channel state for a new step
- Parameters
- step_namestr
pipeline step name for this step
- choice_for_df(df, step_name, a, size, replace)¶
Apply numpy.random.choice once for each row in df using the appropriate random channel for each row.
Concatenate the choice arrays for every row into a single 1-D ndarray. The resulting array will be of length size * len(df.index). This method is designed to support creation of an interaction_dataset.
The columns in df are ignored; the index name and values are used to determine which random number sequence to use.
- Parameters
- dfpandas.DataFrame
df with index name and values corresponding to a registered channel
- step_namestr
current step name so we can update row_states seed info
- The remaining parameters are passed through as arguments to numpy.random.choice
- a1-D array-like or int
If an ndarray, a random sample is generated from its elements. If an int, the random sample is generated as if a was np.arange(n)
- sizeint or tuple of ints
Output shape
- replaceboolean
Whether the sample is with or without replacement
- Returns
- choices1-D ndarray of length: size * len(df.index)
The generated random samples for each row concatenated into a single (flat) array
- extend_domain(domain_df)¶
Extend or create row_state df by adding seed info for each row in domain_df
If extending, the index values of new tables must be disjoint so there will be no ambiguity/collisions between rows
- Parameters
- domain_dfpandas.DataFrame
domain dataframe with index values for which random streams are to be generated and well-known index name corresponding to the channel
- init_row_states_for_step(row_states)¶
initialize row states (in place) for new step
with stable, predictable, repeatable row_seeds for that domain_df index value
See notes on the seed generation strategy in class comment above.
- Parameters
- row_states
- normal_for_df(df, step_name, mu, sigma, lognormal=False)¶
Return a floating point random number in normal (or lognormal) distribution for each row in df using the appropriate random channel for each row.
Subsequent calls (in the same step) will return the next rand for each df row
The resulting array will be the same length (and order) as df. This method is designed to support alternative selection from a probability array.
The columns in df are ignored; the index name and values are used to determine which random number sequence to use.
If “true pseudo random” behavior is desired (i.e. NOT repeatable) the set_base_seed method (q.v.) may be used to globally reseed all random streams.
- Parameters
- dfpandas.DataFrame or Series
df or series with index name and values corresponding to a registered channel
- mufloat or pd.Series or array of floats with one value per df row
- sigmafloat or array of floats with one value per df row
- Returns
- rands2-D ndarray
array the same length as df, with a normal (or lognormal) draw for each df row
- random_for_df(df, step_name, n=1)¶
Return n floating point random numbers in range [0, 1) for each row in df using the appropriate random channel for each row.
Subsequent calls (in the same step) will return the next rand for each df row
The resulting array will be the same length (and order) as df. This method is designed to support alternative selection from a probability array.
The columns in df are ignored; the index name and values are used to determine which random number sequence to use.
If “true pseudo random” behavior is desired (i.e. NOT repeatable) the set_base_seed method (q.v.) may be used to globally reseed all random streams.
- Parameters
- dfpandas.DataFrame
df with index name and values corresponding to a registered channel
- nint
number of rands desired per df row
- Returns
- rands2-D ndarray
array the same length as df, with n floats in range [0, 1) for each df row
- activitysim.core.random.hash32(s)¶
- Parameters
- s: str
- Returns
- 32 bit unsigned hash
Tracing¶
Household tracer. If a household trace ID is specified, then ActivitySim will output a comprehensive set of trace files for all calculations for all household members:
hhtrace.log
- household trace log file, which specifies the CSV files traced. The order of output files is consistent with the model sequence.
various CSV files
- every input, intermediate, and output data table - chooser, expressions/utilities, probabilities, choices, etc. - for the trace household for every sub-model
With the set of output CSV files, the user can trace ActivitySim’s calculations in order to ensure they are correct and/or to help debug data and/or logic errors.
API¶
- activitysim.core.tracing.config_logger(basic=False)¶
Configure logger
look for conf file in configs_dir, if not found use basicConfig
- Returns
- Nothing
- activitysim.core.tracing.delete_output_files(file_type, ignore=None, subdir=None)¶
Delete files in output directory of specified type
- Parameters
- output_dir: str
Directory of trace output CSVs
- Returns
- Nothing
- activitysim.core.tracing.delete_trace_files()¶
Delete CSV files in output_dir
- Returns
- Nothing
- activitysim.core.tracing.deregister_traceable_table(table_name)¶
un-register traceable table
- Parameters
- table_name: str
name of traceable table to deregister
- Returns
- Nothing
- activitysim.core.tracing.get_trace_target(df, slicer, column=None)¶
get target ids and column or index to identify target trace rows in df
- Parameters
- df: pandas.DataFrame
dataframe to slice
- slicer: str
name of column or index to use for slicing
- Returns
- (target, column) tuple
- targetint or list of ints
id or ids that identify tracer target rows
- columnstr
name of column to search for targets or None to search index
- activitysim.core.tracing.hh_id_for_chooser(id, choosers)¶
- Parameters
- id - scalar id (or list of ids) from chooser index
- choosers - pandas dataframe whose index contains ids
- Returns
- scalar household_id or series of household_ids
- activitysim.core.tracing.interaction_trace_rows(interaction_df, choosers, sample_size=None)¶
Trace model design for interaction_simulate
- Parameters
- interaction_df: pandas.DataFrame
traced model_design dataframe
- choosers: pandas.DataFrame
interaction_simulate choosers (needed to filter the model_design dataframe by traced hh or person id)
- sample_size int or None
int for constant sample size, or None if choosers have different numbers of alternatives
- Returns
- trace_rowsnumpy.ndarray
array of booleans to flag which rows in interaction_df to trace
- trace_idstuple (str, numpy.ndarray)
column name and array of trace_ids mapping trace_rows to their target_id for use by trace_interaction_eval_results which needs to know target_id so it can create separate tables for each distinct target for readability
- activitysim.core.tracing.no_results(trace_label)¶
standard no-op to write tracing when a model produces no results
- activitysim.core.tracing.print_summary(label, df, describe=False, value_counts=False)¶
Print summary
- Parameters
- label: str
tracer name
- df: pandas.DataFrame
traced dataframe
- describe: boolean
print describe?
- value_counts: boolean
print value counts?
- Returns
- Nothing
- activitysim.core.tracing.register_traceable_table(table_name, df)¶
Register traceable table
- Parameters
- df: pandas.DataFrame
traced dataframe
- Returns
- Nothing
- activitysim.core.tracing.slice_ids(df, ids, column=None)¶
slice a dataframe to select only records with the specified ids
- Parameters
- df: pandas.DataFrame
traced dataframe
- ids: int or list of ints
slice ids
- column: str
column to slice (slice using index if None)
- Returns
- df: pandas.DataFrame
sliced dataframe
- activitysim.core.tracing.trace_df(df, label, slicer=None, columns=None, index_label=None, column_labels=None, transpose=True, warn_if_empty=False)¶
Slice dataframe by traced household or person id dataframe and write to CSV
- Parameters
- df: pandas.DataFrame
traced dataframe
- label: str
tracer name
- slicer: Object
slicer for subsetting
- columns: list
columns to write
- index_label: str
index name
- column_labels: [str, str]
labels for columns in csv
- transpose: boolean
whether to transpose file for legibility
- warn_if_empty: boolean
write warning if sliced df is empty
- Returns
- Nothing
- activitysim.core.tracing.trace_id_for_chooser(id, choosers)¶
- Parameters
- id - scalar id (or list of ids) from chooser index
- choosers - pandas dataframe whose index contains ids
- Returns
- scalar household_id or series of household_ids
- activitysim.core.tracing.trace_interaction_eval_results(trace_results, trace_ids, label)¶
Trace model design eval results for interaction_simulate
- Parameters
- trace_results: pandas.DataFrame
traced model_design dataframe
- trace_idstuple (str, numpy.ndarray)
column name and array of trace_ids from interaction_trace_rows() used to filter the trace_results dataframe by traced hh or person id
- label: str
tracer name
- Returns
- Nothing
- activitysim.core.tracing.write_csv(df, file_name, index_label=None, columns=None, column_labels=None, transpose=True)¶
Write a DataFrame or Series to a CSV trace file
- Parameters
- df: pandas.DataFrame or pandas.Series
traced dataframe
- file_name: str
output file name
- index_label: str
index name
- columns: list
columns to write
- transpose: bool
whether to transpose dataframe (ignored for series)
- Returns
- Nothing
Utility Expressions¶
Much of the power of ActivitySim comes from being able to specify Python, pandas, and numpy expressions for calculations. Refer to the pandas help for a general introduction to expressions. ActivitySim provides two ways to evaluate expressions:
Simple table expressions are evaluated using DataFrame.eval(). pandas' eval operates on the current table.
Python expressions, denoted by beginning with @, are evaluated with Python's eval().
Simple table expressions can only refer to columns in the current DataFrame. Python expressions can refer to any Python objects currently in memory.
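The difference between the two styles can be illustrated with plain pandas; this is a minimal sketch with made-up column names, not an ActivitySim spec file.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"drivers": [1, 2, 3], "workers": [0, 2, 5]})

# simple table expression: evaluated with DataFrame.eval() against columns of the current table
simple_result = df.eval("drivers == 2")

# python expression (the kind written with a leading @): evaluated with eval(),
# so it can reference df and any other objects placed in the evaluation environment
python_result = eval("df.workers.clip(upper=3)", {}, {"df": df, "np": np})

print(simple_result.tolist())  # [False, True, False]
print(python_result.tolist())  # [0, 2, 3]
```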
Conventions¶
There are a few conventions for writing expressions in ActivitySim:
each expression is applied to all rows in the table being operated on
expressions must be vectorized expressions and can use most numpy and pandas expressions
global constants are specified in the settings file
comments are specified with #
you can refer to the current table being operated on as df
often an object called skims, skims_od, or similar is available and is used to look up the relevant skim information. See LOS for more information.
when editing the CSV files in Excel, use single quote ' or space at the start of a cell to get Excel to accept the expression
Example Expressions File¶
An expressions file has the following basic form:
Label | Description | Expression | cars0 | cars1
---|---|---|---|---
util_drivers_2 | 2 Adults (age 16+) | drivers==2 | | coef_cars1_drivers_2
util_persons_25_34 | Persons age 25-34 | num_young_adults | | coef_cars1_persons_25_34
util_num_workers_clip_3 | Number of workers, capped at 3 | @df.workers.clip(upper=3) | | coef_cars1_num_workers_clip_3
util_dist_0_1 | Distance, from 0 to 1 miles | @skims['DIST'].clip(1) | | coef_dist_0_1
In the Tour Mode Choice model expression file example shown below, the odt_skims['SOV_TIME'] + dot_skims['SOV_TIME'] expression is the travel time from the tour origin to the destination at the tour start time plus the travel time from the tour destination back to the tour origin at the tour end time.
The odt_skims and dot_skims objects are set up ahead of time to refer to the relevant skims for this model. The coef_ivt coefficient comes from the
tour mode choice coefficient file. The tour mode choice model is a nested logit (NL) model and the nesting structure (including nesting
coefficients) is specified in the YAML settings file.
Label | Description | Expression | DRIVEALONEFREE | DRIVEALONEPAY
---|---|---|---|---
util_DRIVEALONEFREE_Unavailable | DRIVEALONEFREE - Unavailable | sov_available == False | -999 |
util_DRIVEALONEFREE_In_vehicle_time | DRIVEALONEFREE - In-vehicle time | odt_skims['SOV_TIME'] + dot_skims['SOV_TIME'] | coef_ivt |
util_DRIVEALONEFREE_Unavailable_for_persons_less_than_16 | DRIVEALONEFREE - Unavailable for persons less than 16 | age < 16 | -999 |
util_DRIVEALONEFREE_Unavailable_for_joint_tours | DRIVEALONEFREE - Unavailable for joint tours | is_joint == True | -999 |
Rows are vectorized expressions that will be calculated for every record in the current table being operated on
The Label column is the unique expression name (used for model estimation integration)
The Description column describes the expression
The Expression column contains a valid vectorized Python/pandas/numpy expression. In the example above, drivers is a column in the current table. Use @ to refer to data outside the current table
There is a column for each alternative and its relevant coefficient from the submodel coefficient file
There are some variations on this setup, but the functionality is similar. For example, in the example destination choice model, the size terms expressions file has market segments as rows and employment type coefficients as columns. Broadly speaking, there are currently three types of model expression configurations:
Simple Simulate choice model - select from a fixed set of choices defined in the specification file, such as the example above.
Simulate with Interaction choice model - combine the choice expressions with the choice alternatives files since the alternatives are not listed in the expressions file. The Non-Mandatory Tour Destination Choice model implements this approach.
Combinatorial choice model - first generate a set of alternatives based on a combination of alternatives across choosers, and then make choices. The Coordinated Daily Activity Pattern model implements this approach.
Expressions¶
The expressions class is often used for pre- and post-processor table annotation, which reads a CSV file of expressions, calculates a number of additional table fields, and joins the fields to the target table. An example table annotation expressions file is found in the example configuration files for households for the CDAP model - annotate_households_cdap.csv.
- activitysim.core.expressions.assign_columns(df, model_settings, locals_dict={}, trace_label=None)¶
Evaluate expressions in context of df and assign resulting target columns to df
Can add new or modify existing columns (if target same as existing df column name)
Parameters - same as for compute_columns except df must not be None Returns - nothing since we modify df in place
- activitysim.core.expressions.compute_columns(df, model_settings, locals_dict={}, trace_label=None)¶
Evaluate expressions_spec in context of df, with optional additional pipeline tables in locals
- Parameters
- df: pandas.DataFrame
or if None, expect name of pipeline table to be specified by DF in model_settings
- model_settings: dict or str
- dict with keys:
DF - df_alias and (additionally, if df is None) name of pipeline table to load as df
SPEC - name of expressions file (csv suffix optional) if different from model_settings
TABLES - list of pipeline tables to load and make available as (read only) locals
- str:
name of yaml file in configs_dir to load dict from
- locals_dict: dict
dict of locals (e.g. utility functions) to add to the execution environment
- trace_label
- Returns
- results: pandas.DataFrame
one column for each expression (except temps with ALL_CAP target names) same index as df
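As a rough illustration of the model_settings structure described above, the dictionary below is a hypothetical sketch (the table, file, and column names are invented); the DF, SPEC, and TABLES keys are the ones documented in the docstring.

```python
# hypothetical model_settings dict for compute_columns / assign_columns
model_settings = {
    "DF": "persons",                       # df alias (and pipeline table name if df is None)
    "SPEC": "annotate_persons_example",    # expressions CSV (csv suffix optional)
    "TABLES": ["households", "land_use"],  # pipeline tables exposed as read-only locals
}

# typical call pattern (sketch; only meaningful inside a running pipeline):
# from activitysim.core import expressions
# expressions.assign_columns(df=persons_df, model_settings=model_settings,
#                            trace_label="annotate_persons")
```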
Sampling with Interaction¶
Methods for expression handling, solving, and sampling (i.e. making multiple choices), with interaction with the chooser table.
Sampling is done with replacement and a sample correction factor is calculated. The factor is calculated as follows:
freq = how often an alternative is sampled (i.e. the pick_count)
prob = probability of the alternative
correction_factor = log(freq/prob)
#for example:
freq 1.00 2.00 3.00 4.00 5.00
prob 0.30 0.30 0.30 0.30 0.30
correction factor 1.20 1.90 2.30 2.59 2.81
As the alternative is oversampled, its utility goes up for final selection. The unique set of alternatives is passed to the final choice model and the correction factor is included in the utility.
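The correction factor can be reproduced with a couple of lines of numpy; this is just a sketch of the arithmetic shown above, not ActivitySim code.

```python
import numpy as np

freq = np.array([1, 2, 3, 4, 5])   # pick_count: how often the alternative was sampled
prob = np.full(5, 0.3)             # sampling probability of the alternative

correction_factor = np.log(freq / prob)
print(np.round(correction_factor, 2))  # [1.2  1.9  2.3  2.59 2.81]
```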
API¶
- activitysim.core.interaction_sample.interaction_sample(choosers, alternatives, spec, sample_size, alt_col_name, allow_zero_probs=False, log_alt_losers=False, skims=None, locals_d=None, chunk_size=0, chunk_tag=None, trace_label=None)¶
Run a simulation in the situation in which alternatives must be merged with choosers because there are interaction terms or because alternatives are being sampled.
optionally (if chunk_size > 0) iterates over choosers in chunk_size chunks
- Parameters
- choosers: pandas.DataFrame
DataFrame of choosers
- alternatives: pandas.DataFrame
DataFrame of alternatives - will be merged with choosers and sampled
- spec: pandas.DataFrame
A Pandas DataFrame that gives the specification of the variables to compute and the coefficients for each variable. Variable specifications must be in the table index and the table should have only one column of coefficients.
- sample_size: int, optional
Sample alternatives with sample of given size. By default is None, which does not sample alternatives.
- alt_col_name: str
name to give the sampled_alternative column
- skims: Skims object
The skims object is used to contain multiple matrices of origin-destination impedances. Make sure to also add it to the locals_d below in order to access it in expressions. The only job of this method in regards to skims is to call set_df with the dataframe that comes back from interacting choosers with alternatives. See the skims module for more documentation on how the skims object is intended to be used.
- locals_d: Dict
This is a dictionary of local variables that will be the environment for an evaluation of an expression that begins with @
- chunk_size: int
if chunk_size > 0 iterates over choosers in chunk_size chunks
- trace_label: str
This is the label to be used for trace log file entries and dump file names when household tracing enabled. No tracing occurs if label is empty or None.
- Returns
- choices_df: pandas.DataFrame
A DataFrame where index should match the index of the choosers DataFrame (except with sample_size rows for each chooser row, one row for each alt sample) and columns alt_col_name, prob, rand, pick_count
- <alt_col_name>:
alt identifier from alternatives[<alt_col_name>]
- prob: float
the probability of the chosen alternative
- pick_count: int
number of duplicate picks for chooser, alt
- activitysim.core.interaction_sample.make_sample_choices(choosers, probs, alternatives, sample_size, alternative_count, alt_col_name, allow_zero_probs, trace_label)¶
- Parameters
- choosers
- probs: pandas.DataFrame
one row per chooser and one column per alternative
- alternatives
dataframe with index containing alt ids
- sample_size: int
number of samples/choices to make
- alternative_count
- alt_col_name: str
- trace_label
Simulate¶
Methods for expression handling, solving, choosing (i.e. making choices) from a fixed set of choices defined in the specification file.
API¶
- activitysim.core.simulate.compute_base_probabilities(nested_probabilities, nests, spec)¶
compute base probabilities for nest leaves. Base probabilities will be the nest-adjusted probabilities of all leaves. This flattens or normalizes all the nested probabilities so that they have the proper global relative values (the leaf probabilities sum to 1 for each row).
- Parameters
- nested_probabilities: pandas.DataFrame
dataframe with the nested probabilities for nest leaves and nodes
- nests: dict
Nest tree dict from the model spec yaml file
- spec: pandas.DataFrame
simple simulate spec so we can return columns in appropriate order
- Returns
- base_probabilities: pandas.DataFrame
Will have the index of nested_probabilities and columns for leaf base probabilities
- activitysim.core.simulate.compute_nested_exp_utilities(raw_utilities, nest_spec)¶
compute exponentiated nest utilities based on nesting coefficients
For nest nodes this is the exponentiated logsum of alternatives adjusted by nesting coefficient
leaf <- exp( raw_utility )
nest <- exp( ln(sum of exponentiated raw_utility of leaves) * nest_coefficient )
- Parameters
- raw_utilities: pandas.DataFrame
dataframe with the raw alternative utilities of all leaves (what in non-nested logit would be the utilities of all the alternatives)
- nest_spec: dict
Nest tree dict from the model spec yaml file
- Returns
- nested_utilities: pandas.DataFrame
Will have the index of raw_utilities and columns for exponentiated leaf and node utilities
- activitysim.core.simulate.compute_nested_probabilities(nested_exp_utilities, nest_spec, trace_label)¶
Compute nested probabilities for nest leaves and nodes. The probability for nest alternatives is simply the alternative's local (to nest) probability, computed in the same way as the probability of non-nested alternatives in multinomial logit, i.e. the fractional share of the sum of the exponentiated utility of itself and its siblings, except that in nested logit the sibling group is restricted to the nest.
- Parameters
- nested_exp_utilities: pandas.DataFrame
dataframe with the exponentiated nested utilities of all leaves and nodes
- nest_spec: dict
Nest tree dict from the model spec yaml file
- Returns
- nested_probabilities: pandas.DataFrame
Will have the index of nested_exp_utilities and columns for leaf and node probabilities
- activitysim.core.simulate.dump_mapped_coefficients(model_settings)¶
dump template_df with coefficient values
- activitysim.core.simulate.eval_mnl(choosers, spec, locals_d, custom_chooser, estimator, log_alt_losers=False, want_logsums=False, trace_label=None, trace_choice_name=None, trace_column_names=None)¶
Run a simulation for when the model spec does not involve alternative specific data, e.g. there are no interactions with alternative properties and no need to sample from alternatives.
Each row in spec computes a partial utility for each alternative, by providing a spec expression (often a boolean 0-1 trigger) and a column of utility coefficients for each alternative.
We compute the utility of each alternative by matrix-multiplication of eval results with the utility coefficients in the spec alternative columns yielding one row per chooser and one column per alternative
- Parameters
- choosers: pandas.DataFrame
- spec: pandas.DataFrame
A table of variable specifications and coefficient values. Variable expressions should be in the table index and the table should have a column for each alternative.
- locals_d: Dict or None
This is a dictionary of local variables that will be the environment for an evaluation of an expression that begins with @
- custom_chooser: function(probs, choosers, spec, trace_label) returns choices, rands
custom alternative to logit.make_choices
- estimator: Estimator object
called to report intermediate table results (used for estimation)
- trace_label: str
This is the label to be used for trace log file entries and dump file names when household tracing enabled. No tracing occurs if label is empty or None.
- trace_choice_name: str
This is the column label to be used in trace file csv dump of choices
- trace_column_names: str or list of str
chooser columns to include when tracing expression_values
- Returns
- choices: pandas.Series
Index will be that of choosers, values will match the columns of spec.
- activitysim.core.simulate.eval_mnl_logsums(choosers, spec, locals_d, trace_label=None)¶
like eval_nl except return logsums instead of making choices
- Returns
- logsums: pandas.Series
Index will be that of choosers, values will be logsum across spec column values
- activitysim.core.simulate.eval_nl(choosers, spec, nest_spec, locals_d, custom_chooser, estimator, log_alt_losers=False, want_logsums=False, trace_label=None, trace_choice_name=None, trace_column_names=None)¶
Run a nested-logit simulation for when the model spec does not involve alternative specific data, e.g. there are no interactions with alternative properties and no need to sample from alternatives.
- Parameters
- choosers: pandas.DataFrame
- spec: pandas.DataFrame
A table of variable specifications and coefficient values. Variable expressions should be in the table index and the table should have a column for each alternative.
- nest_spec:
dictionary specifying nesting structure and nesting coefficients (from the model spec yaml file)
- locals_d: Dict or None
This is a dictionary of local variables that will be the environment for an evaluation of an expression that begins with @
- custom_chooser: function(probs, choosers, spec, trace_label) returns choices, rands
custom alternative to logit.make_choices
- estimator: Estimator object
called to report intermediate table results (used for estimation)
- trace_label: str
This is the label to be used for trace log file entries and dump file names when household tracing enabled. No tracing occurs if label is empty or None.
- trace_choice_name: str
This is the column label to be used in trace file csv dump of choices
- trace_column_names: str or list of str
chooser columns to include when tracing expression_values
- Returns
- choices: pandas.Series
Index will be that of choosers, values will match the columns of spec.
- activitysim.core.simulate.eval_nl_logsums(choosers, spec, nest_spec, locals_d, trace_label=None)¶
like eval_nl except return logsums instead of making choices
- Returns
- logsums: pandas.Series
Index will be that of choosers, values will be nest logsum based on spec column values
- activitysim.core.simulate.eval_utilities(spec, choosers, locals_d=None, trace_label=None, have_trace_targets=False, trace_all_rows=False, estimator=None, trace_column_names=None, log_alt_losers=False)¶
- Parameters
- spec: pandas.DataFrame
A table of variable specifications and coefficient values. Variable expressions should be in the table index and the table should have a column for each alternative.
- choosers: pandas.DataFrame
- locals_d: Dict or None
This is a dictionary of local variables that will be the environment for an evaluation of an expression that begins with @
- trace_label: str
- have_trace_targets: boolean - choosers has targets to trace
- trace_all_rows: boolean - trace all chooser rows, bypassing tracing.trace_targets
- estimator :
called to report intermediate table results (used for estimation)
- trace_column_names: str or list of str
chooser columns to include when tracing expression_values
- activitysim.core.simulate.eval_variables(exprs, df, locals_d=None)¶
Evaluate a set of variable expressions from a spec in the context of a given data table.
There are two kinds of supported expressions: “simple” expressions are evaluated in the context of the DataFrame using DataFrame.eval. This is the default type of expression.
Python expressions are evaluated in the context of this function using Python’s eval function. Because we use Python’s eval this type of expression supports more complex operations than a simple expression. Python expressions are denoted by beginning with the @ character. Users should take care that these expressions must result in a Pandas Series.
# FIXME - for performance, it is essential that spec and expression_values
# FIXME - not contain booleans when dotted with spec values
# FIXME - or the arrays will be converted to dtype=object within dot()
- Parameters
- exprs: sequence of str
- df: pandas.DataFrame
- locals_d: Dict
This is a dictionary of local variables that will be the environment for an evaluation of an expression that begins with @
- Returns
- variables: pandas.DataFrame
Will have the index of df and columns of eval results of exprs.
- activitysim.core.simulate.get_segment_coefficients(model_settings, segment_name)¶
Return a dict mapping generic coefficient names to segment-specific coefficient values
Some specs (e.g. mode_choice logsums) have the same expression values with different coefficients for various segments (e.g. eatout, ..., atwork) and a template file that maps a flat list of coefficients into segment columns.
This allows us to provide a coefficient file with just the coefficients for a specific segment, that works with generic coefficient names in the spec. For instance coef_ivt can take on the values of segment-specific coefficients coef_ivt_school_univ, coef_ivt_work, coef_ivt_atwork,…
coefficients_df:

coefficient_name | value | constrain
---|---|---
coef_ivt_eatout_escort_... | -0.0175 | F
coef_ivt_school_univ | -0.0224 | F
coef_ivt_work | -0.0134 | F
coef_ivt_atwork | -0.0188 | F

template_df:

coefficient_name | eatout | school | school | work
---|---|---|---|---
coef_ivt | coef_ivt_eatout_escort_... | coef_ivt_school_univ | coef_ivt_school_univ | coef_ivt_work

For the school segment this will return the generic coefficient name with the segment-specific coefficient value, e.g. {'coef_ivt': -0.0224, ...}
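Conceptually, the mapping amounts to looking up each template column entry in the coefficients table. The sketch below reproduces the example above with plain pandas; the frames are built inline for illustration and this is not the actual file-reading code.

```python
import pandas as pd

# illustrative coefficients and template frames (see example above)
coefficients_df = pd.DataFrame(
    {"value": [-0.0224, -0.0134]},
    index=pd.Index(["coef_ivt_school_univ", "coef_ivt_work"], name="coefficient_name"),
)
template_df = pd.DataFrame(
    {"school": ["coef_ivt_school_univ"], "work": ["coef_ivt_work"]},
    index=pd.Index(["coef_ivt"], name="coefficient_name"),
)

segment_name = "school"
segment_coefficients = {
    generic_name: coefficients_df.loc[segment_specific_name, "value"]
    for generic_name, segment_specific_name in template_df[segment_name].items()
}
print(segment_coefficients)  # {'coef_ivt': -0.0224}
```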
- activitysim.core.simulate.read_model_coefficient_template(model_settings)¶
Read the coefficient template specified by COEFFICIENT_TEMPLATE model setting
- activitysim.core.simulate.read_model_coefficients(model_settings=None, file_name=None)¶
Read the coefficient file specified by COEFFICIENTS model setting
- activitysim.core.simulate.read_model_spec(file_name)¶
Read a CSV model specification into a Pandas DataFrame or Series.
file_path : str absolute or relative path to file
The CSV is expected to have columns for component descriptions and expressions, plus one or more alternatives.
The CSV is required to have a header with column names. For example:
Description,Expression,alt0,alt1,alt2
- Parameters
- model_settings: dict
name of spec_file is in model_settings['SPEC'] and file is relative to configs
- file_name: str
file_name of spec file in configs folder
- description_name: str, optional
Name of the column in fname that contains the component description.
- expression_name: str, optional
Name of the column in fname that contains the component expression.
- Returns
- spec: pandas.DataFrame
The description column is dropped from the returned data and the expression values are set as the table index.
- activitysim.core.simulate.set_skim_wrapper_targets(df, skims)¶
Add the dataframe to the SkimWrapper object so that it can be dereferenced using the parameters of the skims object.
- Parameters
- dfpandas.DataFrame
Table to which to add skim data as new columns. df is modified in-place.
- skimsSkimWrapper or Skim3dWrapper object, or a list or dict of skims
The skims object is used to contain multiple matrices of origin-destination impedances. Make sure to also add it to the locals_d below in order to access it in expressions. The only job of this method in regards to skims is to call set_df with the dataframe that comes back from interacting choosers with alternatives. See the skims module for more documentation on how the skims object is intended to be used.
- activitysim.core.simulate.simple_simulate(choosers, spec, nest_spec, skims=None, locals_d=None, chunk_size=0, custom_chooser=None, log_alt_losers=False, want_logsums=False, estimator=None, trace_label=None, trace_choice_name=None, trace_column_names=None)¶
Run an MNL or NL simulation for when the model spec does not involve alternative specific data, e.g. there are no interactions with alternative properties and no need to sample from alternatives.
- activitysim.core.simulate.simple_simulate_by_chunk_id(choosers, spec, nest_spec, skims=None, locals_d=None, chunk_size=0, custom_chooser=None, log_alt_losers=False, want_logsums=False, estimator=None, trace_label=None, trace_choice_name=None)¶
chunk_by_chunk_id wrapper for simple_simulate
- activitysim.core.simulate.simple_simulate_logsums(choosers, spec, nest_spec, skims=None, locals_d=None, chunk_size=0, trace_label=None, chunk_tag=None)¶
like simple_simulate except return logsums instead of making choices
- Returns
- logsums: pandas.Series
Index will be that of choosers, values will be nest logsum based on spec column values
- activitysim.core.simulate.spec_for_segment(model_settings, spec_id, segment_name, estimator)¶
Select spec for specified segment from omnibus spec containing columns for each segment
- Parameters
- model_spec: pandas.DataFrame
omnibus spec file with expressions in index and one column per segment
- segment_name: str
segment_name that is also column name in model_spec
- Returns
- pandas.DataFrame
canonical spec file with expressions in index and single column with utility coefficients
Simulate with Interaction¶
Methods for expression handling, solving, choosing (i.e. making choices), with interaction with the chooser table.
API¶
- activitysim.core.interaction_simulate.eval_interaction_utilities(spec, df, locals_d, trace_label, trace_rows, estimator=None, log_alt_losers=False)¶
Compute the utilities for a single-alternative spec evaluated in the context of df
We could compute the utilities for interaction datasets just as we do for simple_simulate specs with multiple alternative columns by calling eval_variables and then computing the utilities by matrix-multiplication of eval results with the utility coefficients in the spec alternative columns.
But interaction simulate computes the utilities of each alternative in the context of a separate row in interaction dataset df, and so there is only one alternative in spec. This turns out to be quite a bit faster (in this special case) than the pandas dot function.
For efficiency, we combine eval_variables and multiplication of coefficients into a single step, so we don’t have to create a separate column for each partial utility. Instead, we simply multiply the eval result by a single alternative coefficient and sum the partial utilities.
- Parameters
- spec: dataframe
one row per spec expression and one col with utility coefficient
- df: dataframe
cross join (cartesian product) of choosers with alternatives combines columns of choosers and alternatives len(df) == len(choosers) * len(alternatives) index values (non-unique) are index values from alternatives df
- interaction_utilities: dataframe
the utility of each alternative is sum of the partial utilities determined by the various spec expressions and their corresponding coefficients yielding a dataframe with len(interaction_df) rows and one utility column having the same index as interaction_df (non-unique values from alternatives df)
- Returns
- utilities: pandas.DataFrame
Will have the index of df and a single column of utilities
- activitysim.core.interaction_simulate.interaction_simulate(choosers, alternatives, spec, log_alt_losers=False, skims=None, locals_d=None, sample_size=None, chunk_size=0, trace_label=None, trace_choice_name=None, estimator=None)¶
Run a simulation in the situation in which alternatives must be merged with choosers because there are interaction terms or because alternatives are being sampled.
optionally (if chunk_size > 0) iterates over choosers in chunk_size chunks
- Parameters
- choosers: pandas.DataFrame
DataFrame of choosers
- alternatives: pandas.DataFrame
DataFrame of alternatives - will be merged with choosers, currently without sampling
- spec: pandas.DataFrame
A Pandas DataFrame that gives the specification of the variables to compute and the coefficients for each variable. Variable specifications must be in the table index and the table should have only one column of coefficients.
- skims: Skims object
The skims object is used to contain multiple matrices of origin-destination impedances. Make sure to also add it to the locals_d below in order to access it in expressions. The only job of this method in regards to skims is to call set_df with the dataframe that comes back from interacting choosers with alternatives. See the skims module for more documentation on how the skims object is intended to be used.
- locals_d: Dict
This is a dictionary of local variables that will be the environment for an evaluation of an expression that begins with @
- sample_size: int, optional
Sample alternatives with sample of given size. By default is None, which does not sample alternatives.
- chunk_size: int
if chunk_size > 0 iterates over choosers in chunk_size chunks
- trace_label: str
This is the label to be used for trace log file entries and dump file names when household tracing enabled. No tracing occurs if label is empty or None.
- trace_choice_name: str
This is the column label to be used in trace file csv dump of choices
- Returns
- choices: pandas.Series
A series where index should match the index of the choosers DataFrame and values will match the index of the alternatives DataFrame - choices are simulated in the standard Monte Carlo fashion
Simulate with Sampling and Interaction¶
Methods for expression handling, solving, sampling (i.e. making multiple choices), and choosing (i.e. making choices), with interaction with the chooser table.
API¶
- activitysim.core.interaction_sample_simulate.interaction_sample_simulate(choosers, alternatives, spec, choice_column, allow_zero_probs=False, zero_prob_choice_val=None, log_alt_losers=False, want_logsums=False, skims=None, locals_d=None, chunk_size=0, chunk_tag=None, trace_label=None, trace_choice_name=None, estimator=None)¶
Run a simulation in the situation in which alternatives must be merged with choosers because there are interaction terms or because alternatives are being sampled.
optionally (if chunk_size > 0) iterates over choosers in chunk_size chunks
- Parameters
- choosers: pandas.DataFrame
DataFrame of choosers
- alternatives: pandas.DataFrame
DataFrame of alternatives - will be merged with choosers; index domain same as choosers, but repeated for each alternative
- spec: pandas.DataFrame
A Pandas DataFrame that gives the specification of the variables to compute and the coefficients for each variable. Variable specifications must be in the table index and the table should have only one column of coefficients.
- skims: Skims object
The skims object is used to contain multiple matrices of origin-destination impedances. Make sure to also add it to the locals_d below in order to access it in expressions. The only job of this method in regards to skims is to call set_df with the dataframe that comes back from interacting choosers with alternatives. See the skims module for more documentation on how the skims object is intended to be used.
- locals_d: Dict
This is a dictionary of local variables that will be the environment for an evaluation of an expression that begins with @
- chunk_size: int
if chunk_size > 0 iterates over choosers in chunk_size chunks
- trace_label: str
This is the label to be used for trace log file entries and dump file names when household tracing enabled. No tracing occurs if label is empty or None.
- trace_choice_name: str
This is the column label to be used in trace file csv dump of choices
- Returns
- if want_logsums is False:
- choices: pandas.Series
A series where index should match the index of the choosers DataFrame and values will match the index of the alternatives DataFrame - choices are simulated in the standard Monte Carlo fashion
- if want_logsums is True:
- choices: pandas.DataFrame
choices['choice'] : same as choices series when want_logsums is False
choices['logsum'] : float logsum of choice utilities across alternatives
Assign¶
Alternative version of the expression evaluators in activitysim.core.simulate
that supports temporary variable assignment.
Temporary variables are identified in the expressions as starting with “_”, such as “_hh_density_bin”. These
fields are not saved to the data pipeline store. This feature is used by the Accessibility model.
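The temporary-variable convention can be illustrated with a hypothetical two-row assignment spec built inline; the target and expression names below are invented, and a real spec would normally be read from a CSV via read_assignment_spec.

```python
import pandas as pd

# hypothetical assignment spec: targets starting with "_" are temporary and are
# not written back to the pipeline store
assignment_expressions = pd.DataFrame({
    "target": ["_hh_density_bin", "density_index"],
    "expression": ["pd.cut(df.hh_density, bins=3, labels=False)",
                   "_hh_density_bin + 1"],
})

# a call would look roughly like this (sketch; df is a pipeline table and the
# function also accepts tracing and chunking arguments not shown here):
# from activitysim.core import assign
# results = assign.assign_variables(assignment_expressions, df, locals_dict={"pd": pd})
```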
API¶
- activitysim.core.assign.assign_variables(assignment_expressions, df, locals_dict, df_alias=None, trace_rows=None, trace_label=None, chunk_log=None)¶
Evaluate a set of variable expressions from a spec in the context of a given data table.
Expressions are evaluated using Python’s eval function. Python expressions have access to variables in locals_d (and df being accessible as variable df.) They also have access to previously assigned targets as the assigned target name.
lowercase variables starting with underscore are temp variables (e.g. _local_var) and not returned except in trace_results
uppercase variables starting with underscore are temp singular variables (e.g. _LOCAL_SCALAR) and not returned except in trace_assigned_locals. This is useful for defining general purpose local variables that don't vary across choosers or alternatives and therefore don't need to be stored as series/columns in the main choosers dataframe from which utilities are computed.
Users should take care that expressions (other than temp scalar variables) should result in a Pandas Series (scalars will be automatically promoted to series.)
- Parameters
- assignment_expressions: pandas.DataFrame of target assignment expressions
target: target column names
expression: pandas or python expression to evaluate
- df: pandas.DataFrame
- locals_d: Dict
This is a dictionary of local variables that will be the environment for an evaluation of "python" expression.
- trace_rows: series or array of bools to use as mask to select target rows to trace
- Returns
- variables: pandas.DataFrame
Will have the index of df and columns named by target and containing the result of evaluating expression
- trace_df: pandas.DataFrame or None
a dataframe containing the eval result values for each assignment expression
- activitysim.core.assign.evaluate_constants(expressions, constants)¶
Evaluate a list of constant expressions - each one can depend on the one before it. These are usually used for the coefficients which have relationships to each other. So ivt=.7 and then ivt_lr=ivt*.9.
- Parameters
- expressions: Series
the index are the names of the expressions which are used in subsequent evals - thus naming the expressions is required.
- constants: dict
will be passed as the scope of eval - usually a separate set of constants are passed in here
- Returns
- d: dict
- activitysim.core.assign.local_utilities()¶
Dict of useful modules and functions to provide as locals for use in eval of expressions
- Returns
- utility_dict: dict
name, entity pairs of locals
- activitysim.core.assign.read_assignment_spec(file_name, description_name='Description', target_name='Target', expression_name='Expression')¶
Read a CSV model specification into a Pandas DataFrame or Series.
The CSV is expected to have columns for component descriptions, targets, and expressions.
The CSV is required to have a header with column names. For example:
Description,Target,Expression
- Parameters
- file_name: str
Name of a CSV spec file.
- description_name: str, optional
Name of the column in fname that contains the component description.
- target_name: str, optional
Name of the column in fname that contains the component target.
- expression_name: str, optional
Name of the column in fname that contains the component expression.
- Returns
- spec: pandas.DataFrame
dataframe with three columns: ['description', 'target', 'expression']
- activitysim.core.assign.uniquify_key(dict, key, template='{} ({})')¶
rename key so there are no duplicates with keys in dict
e.g. if there is already a key named “dog”, the second key will be reformatted to “dog (2)”
Choice Models¶
Logit¶
Multinomial logit (MNL) or Nested logit (NL) choice model. These choice models depend on the foundational components of ActivitySim, such as the expressions and data handling described in the Execution Flow section.
To specify and solve an MNL model:
either specify LOGIT_TYPE: MNL in the model configuration YAML file or omit the setting
call either simulate.simple_simulate() or simulate.interaction_simulate() depending on whether the alternatives are interacted with the choosers or the alternatives are sampled
To specify and solve an NL model:
specify LOGIT_TYPE: NL in the model configuration YAML file
specify the nesting structure via the NESTS setting in the model configuration YAML file. An example nested logit NESTS entry can be found in example/configs/tour_mode_choice.yaml
call simulate.simple_simulate(). The simulate.interaction_simulate() functionality is not yet supported for NL.
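The MNL mechanics documented below can be sketched with plain numpy/pandas. The utility values and alternative names in this example are invented; the probability and choice steps mirror what utils_to_probs and make_choices do, except that ActivitySim uses the pipeline's reproducible random number streams rather than np.random.

```python
import numpy as np
import pandas as pd

# three choosers, two alternatives, made-up utilities
utils = pd.DataFrame({"alt_a": [1.0, 0.2, -0.5], "alt_b": [0.5, 0.9, 0.0]},
                     index=pd.Index([1, 2, 3], name="chooser_id"))

# MNL probabilities: exponentiate and normalize each row
exp_utils = np.exp(utils)
probs = exp_utils.div(exp_utils.sum(axis=1), axis=0)

# Monte Carlo choice: pick the first alternative whose cumulative probability
# exceeds a uniform random draw for that chooser
rands = np.random.default_rng(0).random((len(probs), 1))
choices = (probs.cumsum(axis=1).to_numpy() > rands).argmax(axis=1)
print(pd.Series(choices, index=probs.index))  # positional index of the chosen column
```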
API¶
- class activitysim.core.logit.Nest(name=None, level=0)¶
Data for a nest-logit node or leaf
This object is passed on yield when iterating over nest nodes (branch or leaf). The nested logit design is stored in a yaml file as a tree of dict objects, but using an object to pass the nest data makes the code a little more readable.
An example nest specification is in the example tour mode choice model yaml configuration file - example/configs/tour_mode_choice.yaml.
- activitysim.core.logit.count_nests(nest_spec)¶
count the nests in nest_spec, return 0 if nest_spec is none
- activitysim.core.logit.each_nest(nest_spec, type=None, post_order=False)¶
Iterate over each nest or leaf node in the tree (or subtree)
- Parameters
- nest_spec: dict
Nest tree dict from the model spec yaml file
- type: str
Nest class type to yield; None yields all nests, 'leaf' yields only leaf nodes, 'branch' yields only branch nodes
- post_order: Bool
Should we iterate over the nodes of the tree in post-order or pre-order? (post-order means we yield the alternatives sub-tree before current node.)
- Yields
- nest: Nest
Nest object with info about the current node (nest or leaf)
- activitysim.core.logit.interaction_dataset(choosers, alternatives, sample_size=None, alt_index_id=None, chooser_index_id=None)¶
Combine choosers and alternatives into one table for the purposes of creating interaction variables and/or sampling alternatives.
Any duplicate column names in choosers table will be renamed with an ‘_chooser’ suffix.
- Parameters
- choosers: pandas.DataFrame
- alternatives: pandas.DataFrame
- sample_size: int, optional
If sampling from alternatives for each chooser, this is how many to sample.
- Returns
- alts_sample: pandas.DataFrame
Merged choosers and alternatives with data repeated either len(alternatives) or sample_size times.
- activitysim.core.logit.make_choices(probs, trace_label=None, trace_choosers=None, allow_bad_probs=False)¶
Make choices for each chooser from among a set of alternatives.
- Parameters
- probs: pandas.DataFrame
Rows for choosers and columns for the alternatives from which they are choosing. Values are expected to be valid probabilities across each row, e.g. they should sum to 1.
- trace_choosers: pandas.DataFrame
the choosers df (for interaction_simulate) to facilitate the reporting of hh_id by report_bad_choices because it can't deduce hh_id from the interaction_dataset which is indexed on index values from alternatives df
- Returns
- choices: pandas.Series
Maps chooser IDs (from probs index) to a choice, where the choice is an index into the columns of probs.
- rands: pandas.Series
The random numbers used to make the choices (for debugging, tracing)
- activitysim.core.logit.report_bad_choices(bad_row_map, df, trace_label, msg, trace_choosers=None, raise_error=True)¶
- Parameters
- bad_row_map
- df: pandas.DataFrame
utils or probs dataframe
- msg: str
message describing the type of bad choice that necessitates error being thrown
- trace_choosers: pandas.DataFrame
the choosers df (for interaction_simulate) to facilitate the reporting of hh_id because we can’t deduce hh_id from the interaction_dataset which is indexed on index values from alternatives df
- Returns
- raises RuntimeError
- activitysim.core.logit.utils_to_logsums(utils, exponentiated=False, allow_zero_probs=False)¶
Convert a table of utilities to logsum series.
- Parameters
- utils: pandas.DataFrame
Rows should be choosers and columns should be alternatives.
- exponentiated: bool
True if utilities have already been exponentiated
- Returns
- logsums: pandas.Series
Will have the same index as utils.
- activitysim.core.logit.utils_to_probs(utils, trace_label=None, exponentiated=False, allow_zero_probs=False, trace_choosers=None)¶
Convert a table of utilities to probabilities.
- Parameters
- utils: pandas.DataFrame
Rows should be choosers and columns should be alternatives.
- trace_label: str
label for tracing bad utility or probability values
- exponentiated: bool
True if utilities have already been exponentiated
- allow_zero_probs: bool
if True, rows in which all utility alts are EXP_UTIL_MIN will result in rows in probs that have all zero probability (and not sum to 1.0). This is for the benefit of calculating probabilities of nested logit nests
- trace_choosers: pandas.DataFrame
the choosers df (for interaction_simulate) to facilitate the reporting of hh_id by report_bad_choices because it can't deduce hh_id from the interaction_dataset which is indexed on index values from alternatives df
- Returns
- probs: pandas.DataFrame
Will have the same index and columns as utils.
Person Time Windows¶
The departure time and duration models require person time windows. Time windows are adjacent time periods that are available for travel. Time windows are stored in a timetable table in which each row is a person and each time period (in the case of MTC TM1, 5 am to midnight in 1-hour increments) is a column. Each column is coded as follows:
0 - unscheduled, available
2 - scheduled, start of a tour, is available as the last period of another tour
4 - scheduled, end of a tour, is available as the first period of another tour
6 - scheduled, end or start of a tour, available for this period only
7 - scheduled, unavailable, middle of a tour
A good example of a time window expression is @tt.previous_tour_ends(df.person_id, df.start)
. This
uses the person id and the tour start period to check if a previous tour ends in the same time period.
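The window coding can be illustrated with a tiny windows table; this is a conceptual sketch of the table described above (the person ids, periods, and codes are made up), not the TimeTable implementation itself.

```python
import pandas as pd

# one row per person, one column per time period; codes as described above
# (0 unscheduled, 2 tour start, 4 tour end, 6 start and end, 7 middle of tour)
windows = pd.DataFrame(
    [[0, 2, 7, 4, 0],
     [0, 0, 0, 0, 0]],
    index=pd.Index([1, 2], name="person_id"),
    columns=[5, 6, 7, 8, 9],
)

# a period is available for another tour unless it is the middle of a tour (code 7)
available = windows != 7
print(available.loc[1].tolist())  # [True, True, False, True, True]
```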
API¶
- class activitysim.core.timetable.TimeTable(windows_df, tdd_alts_df, table_name=None)¶
tdd_alts_df          tdd_footprints_df
start  end           '0' '1' '2' '3' '4' ...
5      5        ==>   0   6   0   0   0  ...
5      6        ==>   0   2   4   0   0  ...
5      7        ==>   0   2   7   4   0  ...
- adjacent_window_after(window_row_ids, periods)¶
Return number of adjacent periods after specified period that are available (not in the middle of another tour.)
Implements MTC TM1 macro @@adjWindowAfterThisPeriodAlt. The function name is kind of a misnomer, but parallels that used in MTC TM1 UECs.
- Parameters
- window_row_ids: pandas Series int
series of window_row_ids indexed by tour_id
- periods: pandas series int
series of tdd_alt ids, index irrelevant
- Returns
- pandas Series int
Number of adjacent windows indexed by window_row_ids.index
- adjacent_window_before(window_row_ids, periods)¶
Return number of adjacent periods before specified period that are available (not in the middle of another tour.)
Implements MTC TM1 macro @@getAdjWindowBeforeThisPeriodAlt. The function name is kind of a misnomer, but parallels that used in MTC TM1 UECs.
- Parameters
- window_row_ids: pandas Series int
series of window_row_ids indexed by tour_id
- periods: pandas series int
series of tdd_alt ids, index irrelevant
- Returns
- pandas Series int
Number of adjacent windows indexed by window_row_ids.index
- adjacent_window_run_length(window_row_ids, periods, before)¶
Return the number of adjacent periods before or after specified period that are available (not in the middle of another tour.)
Internal DRY method to implement adjacent_window_before and adjacent_window_after
- Parameters
- window_row_ids: pandas Series int
series of window_row_ids indexed by tour_id
- periods: pandas series int
series of tdd_alt ids, index irrelevant
- before: bool
Specify whether the desired run length is of the adjacent window before (True) or after (False)
- assign(window_row_ids, tdds)¶
Assign tours (represented by tdd alt ids) to persons
Updates self.windows numpy array. Assignments will not ‘take’ outside this object until/unless replace_table called or updated timetable retrieved by get_windows_df
- Parameters
- window_row_ids: pandas Series
series of window_row_ids indexed by tour_id
- tdds: pandas series
series of tdd_alt ids, index irrelevant
- assign_footprints(window_row_ids, footprints)¶
assign footprints for specified window_row_ids
This method is used for initialization of joint_tour timetables based on the combined availability of the joint tour participants
- Parameters
- window_row_ids: pandas Series
series of window_row_ids; index irrelevant, but we want to use map()
- footprints: numpy array
with one row per window_row_id and one column per time period
- assign_subtour_mask(window_row_ids, tdds)¶
index      window_row_ids  tdds
20973389   20973389        26
44612864   44612864        3
48954854   48954854        7

tour footprints
[[0 0 2 7 7 7 7 7 7 4 0 0 0 0 0 0 0 0 0 0 0]
 [0 2 7 7 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 2 7 7 7 7 7 7 4 0 0 0 0 0 0 0 0 0 0 0 0]]

subtour_mask
[[7 7 0 0 0 0 0 0 0 0 7 7 7 7 7 7 7 7 7 7 7]
 [7 0 0 0 0 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7]
 [7 0 0 0 0 0 0 0 0 7 7 7 7 7 7 7 7 7 7 7 7]]
- begin_transaction(transaction_loggers)¶
begin a transaction for an estimator or list of estimators; this permits rolling the timetable back to the state at the start of the transaction so that timetables can be built for scheduling override choices
- max_time_block_available(window_row_ids)¶
determine the length of the maximum time block available in the persons day
- Parameters
- window_row_ids: pandas.Series
- Returns
- pandas.Series with same index as window_row_ids, and integer max_run_length of available periods
- previous_tour_begins(window_row_ids, periods)¶
Does a previously scheduled tour begin in the specified period?
Implements MTC TM1 @@prevTourBeginsThisArrivalPeriodAlt
- Parameters
- window_row_ids: pandas Series int
series of window_row_ids indexed by tour_id
- periods: pandas series int
series of tdd_alt ids, index irrelevant
- Returns
- pandas Series boolean
indexed by window_row_ids.index
- previous_tour_ends(window_row_ids, periods)¶
Does a previously scheduled tour end in the specified period?
Implements MTC TM1 @@prevTourEndsThisDeparturePeriodAlt
- Parameters
- window_row_ids: pandas Series int
series of window_row_ids indexed by tour_id
- periods: pandas series int
series of tdd_alt ids, index irrelevant (one period per window_row_id)
- Returns
- pandas Series boolean
indexed by window_row_ids.index
- remaining_periods_available(window_row_ids, starts, ends)¶
Determine number of periods remaining available after the time window from starts to ends is hypothetically scheduled
Implements MTC TM1 @@remainingPeriodsAvailableAlt
The start and end periods will always be available after scheduling, so ignore them. The periods between start and end must be currently unscheduled, so assume they will become unavailable after scheduling this window.
- Parameters
- window_row_ids: pandas Series int
series of window_row_ids indexed by tour_id
- starts: pandas series int
series of tdd_alt ids, index irrelevant (one per window_row_id)
- ends: pandas series int
series of tdd_alt ids, index irrelevant (one per window_row_id)
- Returns
- available: pandas Series int
number of periods available, indexed by window_row_ids.index
- replace_table()¶
Save or replace windows_df DataFrame to pipeline with saved table name (specified when object instantiated.)
This is a convenience function in case caller instantiates object in one context (e.g. dependency injection) where it knows the pipeline table name, but wants to checkpoint the table in another context where it does not know that name.
- slice_windows_by_row_id(window_row_ids)¶
return windows array slice containing rows for specified window_row_ids (in window_row_ids order)
- tour_available(window_row_ids, tdds)¶
test whether time window allows tour with specific tdd alt’s time window
- Parameters
- window_row_ids: pandas Series
series of window_row_ids indexed by tour_id
- tdds: pandas series
series of tdd_alt ids, index irrelevant
- Returns
- available: pandas Series of bool
with same index as window_row_ids.index (presumably tour_id, but we don’t care)
- window_periods_in_states(window_row_ids, periods, states)¶
Return boolean array indicating whether specified window periods are in list of states.
Internal DRY method to implement previous_tour_ends and previous_tour_begins
- Parameters
- window_row_ids: pandas Series int
series of window_row_ids indexed by tour_id
- periods: pandas series int
series of tdd_alt ids, index irrelevant (one period per window_row_id)
- states: list of int
presumably (e.g. I_EMPTY, I_START...)
- Returns
- pandas Series boolean
indexed by window_row_ids.index
- activitysim.core.timetable.create_timetable_windows(rows, tdd_alts)¶
create an empty (all available) timetable with one window row per rows.index
- Parameters
- rows - pd.DataFrame or Series
all we care about is the index
- tdd_alts - pd.DataFrame
We expect a start and end column, and create a timetable to accommodate all alts (with one window of padding at each end)
- so if start is 5 and end is 23, we return something like this:
4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
- person_id
- 30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
- 109 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
- Returns
- pd.DataFrame indexed by rows.index, and one column of int8 for each time window (plus padding)
Transit Virtual Path Builder¶
Transit virtual path builder (TVPB) for three zone system (see example_multiple_zones) transit path utility calculations. TAP to TAP skims and walk access and egress times between MAZs and TAPs are input to the demand model. ActivitySim then assembles the total transit path utility based on the user specified TVPB expression files for the respective components:
from MAZ to first boarding TAP +
from first boarding to final alighting TAP +
from alighting TAP to destination MAZ
This assembly is done by the TVPB, which considers all the possible combinations of nearby boarding and alighting TAPs for each origin-destination MAZ pair and selects the user defined N best paths to represent the transit mode. After selecting the N best paths, the logsum across them is calculated and exposed to the mode choice models; a random number is drawn and a path is chosen. The boarding TAP, alighting TAP, and TAP to TAP skim set for the chosen path are saved to the chooser table.
The initialize TVPB submodel (see Initialize LOS) pre-computes TAP to TAP total utilities for the user defined attribute_segments, which are typically demographic segment (for example, household income bin), time of day, and access/egress mode. This submodel can be run in both single process and multiprocess mode, with single process mode well suited to development/debugging and multiprocess mode well suited to application runs. ActivitySim saves the pre-calculated TAP to TAP total utilities to a memory mapped cache file for reuse by downstream models such as tour mode choice. In tour mode choice, the pre-computed TAP to TAP total utilities for the attribute_segment, along with the access and egress impedances, are used to evaluate the best N TAP pairs for each origin MAZ to destination MAZ pair being evaluated. Assembling the total transit path impedance and then picking the best N is quick since it is done in a de-duplicated manner within each chunk of multiprocessed choosers.
A model with TVPB can take considerably longer to run than a traditional TAZ based model since it does an order of magnitude more calculations. Thus, it is important to be mindful of your approach to your network model as well, especially the number of TAPs accessible to each MAZ, which is the key determinant of runtime.
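The path assembly and N-best selection described above can be sketched with plain numpy/pandas; the impedance values, TAP ids, and the choice of N below are invented for illustration and this is not the pathbuilder's actual implementation.

```python
import numpy as np
import pandas as pd

# hypothetical candidate paths for one origin MAZ / destination MAZ pair
paths = pd.DataFrame({
    "btap": [101, 101, 102, 103],
    "atap": [201, 202, 201, 202],
    "access_util": [-1.2, -1.2, -0.8, -2.0],    # MAZ -> boarding TAP
    "tap_tap_util": [-3.5, -4.1, -3.9, -3.0],   # boarding TAP -> alighting TAP
    "egress_util": [-0.5, -0.3, -0.5, -0.3],    # alighting TAP -> destination MAZ
})
paths["total_util"] = paths[["access_util", "tap_tap_util", "egress_util"]].sum(axis=1)

N = 3
best = paths.nlargest(N, "total_util")              # keep the N best paths
logsum = np.log(np.exp(best["total_util"]).sum())   # logsum exposed to mode choice
print(best[["btap", "atap", "total_util"]])
print(round(logsum, 3))
```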
API¶
- class activitysim.core.pathbuilder.TransitVirtualPathBuilder(network_los)¶
Transit virtual path builder for three zone systems
- compute_tap_tap_utilities(recipe, access_df, egress_df, chooser_attributes, path_info, trace_label, trace)¶
create transit_df and compute utilities for all atap-btap pairs between omaz in access_df and dmaz in egress_df; compute the utilities using the tap_tap utility expressions file specified in tap_tap_settings
transit_df contains all possible access omaz/btap to egress dmaz/atap transit path pairs for each chooser
trace should be True as we don’t encourage/support dynamic utility computation except when tracing (precompute being fairly fast)
- Parameters
- recipe: str
‘recipe’ key in network_los.yaml TVPB_SETTINGS e.g. tour_mode_choice
- access_df: pandas.DataFrame
dataframe with ‘idx’ and ‘omaz’ columns
- egress_df: pandas.DataFrame
dataframe with ‘idx’ and ‘dmaz’ columns
- chooser_attributes: dict
- path_info
- trace_label: str
- trace: boolean
- Returns
- transit_df: pandas.DataFrame
- lookup_tap_tap_utilities(recipe, maz_od_df, access_df, egress_df, chooser_attributes, path_info, trace_label)¶
create transit_df and compute utilities for all atap-btap pairs between omaz in access_df and dmaz in egress_df; look up the utilities in the precomputed tap_cache data (which is indexed by uid_calculator unique_ids; unique_id can be used as a zero-based index into the data array)
transit_df contains all possible access omaz/btap to egress dmaz/atap transit path pairs for each chooser
- Parameters
- recipe
- maz_od_df
- access_df
- egress_df
- chooser_attributes
- path_info
- trace_label
- class activitysim.core.pathbuilder.TransitVirtualPathLogsumWrapper(pathbuilder, orig_key, dest_key, tod_key, segment_key, recipe, cache_choices, trace_label, tag)¶
Transit virtual path builder logsum wrapper for three zone systems
- set_df(df)¶
Set the dataframe
- Parameters
- df: DataFrame
The dataframe which contains the origin and destination ids
- Returns
- self (to facilitate chaining)
- activitysim.core.pathbuilder.compute_utilities(network_los, model_settings, choosers, model_constants, trace_label, trace=False, trace_column_names=None)¶
Compute utilities
Cache API¶
- class activitysim.core.pathbuilder_cache.TVPBCache(network_los, uid_calculator, cache_tag)¶
Transit virtual path builder cache for three zone systems
- allocate_data_buffer(shared=False)¶
allocate fully_populated_shape data buffer for cached data
if shared, return a multiprocessing.Array that can be shared across subprocesses; if not shared, return a numpy ndarray
- Parameters
- shared: boolean
- Returns
- multiprocessing.Array or numpy ndarray sized to hold the fully_populated utility array
- cleanup()¶
Called prior to
- close(trace=False)¶
write any changes, free data, and mark as closed
- get_data_and_lock_from_buffers()¶
return shared data buffer previously allocated by allocate_data_buffer and injected by mp_tasks.run_simulation
- Returns
- either multiprocessing.Array and lock, or multiprocessing.RawArray and None, according to RAWARRAY
- open()¶
open STATIC cache and populate with cached data
- if multiprocessing
always a STATIC cache with fully_populated data preloaded into the shared data buffer
- class activitysim.core.pathbuilder_cache.TapTapUidCalculator(network_los)¶
Transit virtual path builder TAP to TAP unique ID calculator for three zone systems
- get_od_dataframe(scalar_attributes)¶
return tap-tap od dataframe with unique_id index for ‘skim_offset’ for scalar_attributes
i.e. a dataframe which may be used to compute utilities, together with scalar or column attributes
- Parameters
- scalar_attributes: dict of scalar attribute name:value pairs
- Returns
- pandas.DataFrame
- get_unique_ids(df, scalar_attributes)¶
compute canonical unique_id for each row in df; btap and atap will be in the dataframe, but the other attributes may be either df columns or scalar_attributes
- Parameters
- df: pandas DataFrame
with btap, atap, and optionally additional attribute columns
- scalar_attributes: dict
dict of scalar attributes e.g. {‘tod’: ‘AM’, ‘demographic_segment’: 0}
- Returns
- ndarray of integer uids
Helpers¶
Chunk¶
Chunking management.
Note
The definition of chunk_size has changed from previous versions of ActivitySim. The revised definition of chunk_size simplifies model setup since it is the approximate amount of RAM available to ActivitySim as opposed to the obscure number of doubles (64-bit numbers) in a chunk of a choosers table.
The chunk_size is the approximate amount of RAM to allocate to ActivitySim for batch
processing choosers across all processes. It is specified in bytes; for example, chunk_size: 500_000_000_000
is 500 GB.
If chunk_training_mode: disabled is set, then no chunking will be performed and ActivitySim will attempt to solve all the
choosers at once across all the processes. Chunking is required when all the chooser data required to process all the
choosers cannot fit within the available RAM, in which case ActivitySim must split the choosers into batches and then process the batches in sequence.
Configuration of the chunk_size
depends on several parameters:
The amount of machine RAM
The number of machine processors (CPUs/cores)
The number of households (and number of zones for aggregate models)
The amount of headroom required for shared data across processes, such as the skims/network LOS data
The desired runtimes
An example helps illustrate configuration of the chunk_size. If the example model has 1 million households and the current submodel is auto ownership, then there are 1 million choosers since every household participates in the auto ownership model. In single process mode, ActivitySim would create one chooser table with 1 million rows, assuming this table and the additional extra data such as the skims can fit within the available memory (RAM). If the 1 million row table cannot fit within memory, then chunking needs to be set up to split the choosers table into batches that are processed in sequence and are small enough to fit within the available memory. For example, the choosers table is split into 2 chunks of 500,000 choosers each and then processed in sequence. In multi process mode, for example with 10 processes, ActivitySim splits the 1 million households into 10 mini processes, each with 100,000 households. Then for the auto ownership submodel, the chooser table within each process is the 100,000 choosers, and there must be enough RAM to simultaneously solve all 10 processes of 100,000 choosers each at once. If not, then chunking can be set up so each mini process table of choosers is split into chunks for sequential processing, for example from 10 tables of 100,000 choosers to 20 tables of 50,000 choosers.
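The apportionment arithmetic in this example can be written out as a short sketch (the household, process, and chunk counts are the illustrative values above, not defaults):

    # worked arithmetic for the example above (illustrative values only)
    households = 1_000_000
    num_processes = 10
    households_per_process = households // num_processes               # 100,000 choosers per process

    # if a 100,000-row chooser table is too large for the available RAM,
    # chunking splits it into batches that are solved in sequence
    chunks_per_process = 2
    choosers_per_chunk = households_per_process // chunks_per_process  # 50,000 choosers per chunk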
If the user desires the fastest runtimes possible given their hardware, model inputs, and model configuration, then ActivitySim should be configured to use most of the CPUs/cores (physical, not virtual), most of the RAM, and the MKL settings. For example, if the machine has 12 cores and 256 GB of RAM, then try configuring the model with num_processes: 10 and chunk_size: 0 to start, and see if the model can fit the problem into the available RAM. If not, then try setting chunk_size to something like 225 GB, i.e. chunk_size: 225_000_000_000. Experimentation with the configuration of CPUs and RAM should be done for each new machine and model setup (with respect to the number of households, skims, and model configuration). In general, more processors means faster runtimes and more RAM means faster runtimes, but the relationship of processors to RAM is not linear, as processors can only go so fast and there is more to runtime than processors and RAM, including cache speed, disk speed, etc. Also, the amount of RAM to use is approximate and ActivitySim often pushes a bit above the user-specified amount due to pandas/numpy memory spikes for memory-intensive operations, so it is recommended to leave some RAM unallocated. The exact amount to leave unallocated depends on the parameters above.
To configure chunking behavior, ActivitySim must first be trained with the model setup and machine. To do so, first run the model with chunk_training_mode: training. This tracks the amount of memory used by each table by submodel and writes the results to a cache file that is then re-used for production runs. This training mode is significantly slower than production mode since it does significantly more memory inspection. For a training mode run, set num_processes to about 80% of the available logical processors and chunk_size to about 80% of the available RAM. This will run the model and create the chunk_cache.csv file in output/cache for reuse. After creating the chunk cache file, the model can be run with chunk_training_mode: production and the desired num_processes and chunk_size. The model will read the chunk cache file from the output/cache folder, similar to how it reads cached skims if specified. The software trains on the size of the problem, so the cache file can be re-used and only needs to be updated after significant revisions to the population, expressions, or skims/network LOS, or changes in machine specs. If run in production mode and no cache file is found, then ActivitySim falls back to training mode. A third chunk_training_mode is adaptive, which, if a cache file exists, runs the model with the starting cache settings but also updates the cache settings based on additional memory inspection. This may further improve the cache settings and reduce runtimes when run in production mode. If resume_after is set, then the chunk cache file is not overwritten in the cache directory since the list of submodels would be incomplete. A fourth chunk_training_mode is disabled, which assumes the model can be run without chunking due to an abundance of RAM.
The following chunk_methods
are supported to calculate memory overhead when chunking is enabled:
bytes - expected rowsize based on the actual size (as reported by numpy and pandas) of explicitly allocated data. This can underestimate overhead due to transient data requirements of operations (e.g. merge, sort, transpose)
uss - expected rowsize based on the change in unique set size (uss), both as a result of explicit data allocation and readings by the MemMonitor sniffer thread that measures transient uss during time-consuming numpy and pandas operations
hybrid_uss - hybrid_uss avoids problems with pure uss, especially with small chunk sizes (e.g. initial training chunks) as numpy may recycle cached blocks and show no increase in uss even though data was allocated and logged
rss - like uss, but for resident set size (rss), which is the portion of memory occupied by a process that is held in RAM
hybrid_rss - like hybrid_uss, but for rss
RSS is reported by psutil.memory_info and USS is reported by psutil.memory_full_info. USS is the memory which is private to a process and which would be freed if the process were terminated. This is the metric that most closely matches the rather vague notion of memory “in use” (the meaning of which is difficult to pin down in operating systems with virtual memory, where memory can, but sometimes can’t, be swapped or mapped to disk). hybrid_uss performs best and is most reliable and is therefore the default.
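The exact formula ActivitySim uses to combine these readings is internal to the chunking code, but a sketch consistent with the hybrid_uss description above (count whichever of the explicitly allocated bytes or the observed uss growth is larger, so allocations hidden by numpy block recycling are still counted) might look like the following. This is an illustration under that assumption, not necessarily the implementation:

    import numpy as np

    # hedged sketch of a hybrid_uss-style overhead measure: element-wise maximum of
    # observed uss growth and explicitly allocated bytes, so an allocation that shows
    # no uss increase (recycled numpy blocks) still contributes its logged bytes
    def hybrid_uss_overhead(uss_overhead, bytes_overhead):
        return np.maximum(uss_overhead, bytes_overhead)

    print(hybrid_uss_overhead(np.array([0, 2_000_000]), np.array([811_536, 811_536])))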
Additional chunking settings:
min_available_chunk_ratio: 0.05 - minimum fraction of total chunk_size to reserve for adaptive chunking
default_initial_rows_per_chunk: 500 - initial number of chooser rows for the first chunk in training mode, when there is no pre-existing chunk_cache to set the initial value. Ordinarily bigger is better, as long as it is not so big that it causes memory issues (e.g. accessibility with lots of zones)
keep_chunk_logs: True - whether to preserve or delete subprocess chunk logs when they are consolidated at end of multiprocess run
keep_mem_logs: True - whether to preserve or delete subprocess mem logs when they are consolidated at end of multiprocess run
API¶
- class activitysim.core.chunk.ChunkHistorian¶
Utility for estimating row_size
- class activitysim.core.chunk.ChunkLedger(trace_label, chunk_size, baseline_rss, baseline_uss, headroom)¶
- class activitysim.core.chunk.ChunkSizer(chunk_tag, trace_label, num_choosers=0, chunk_size=0)¶
- activitysim.core.chunk.DEFAULT_CHUNK_METHOD = 'hybrid_uss'¶
The chunk_cache table is a record of the memory usage and observed row_size required for chunking the various models. The row size differs depending on whether memory usage is calculated by rss, uss, or explicitly allocated bytes. We record all three during training so the mode can be changed without necessitating retraining.
tag, num_rows, rss, uss, bytes, uss_row_size, hybrid_uss_row_size, bytes_row_size
atwork_subtour_frequency.simple, 3498, 86016, 81920, 811536, 24, 232, 232
atwork_subtour_mode_choice.simple, 704, 20480, 20480, 1796608, 30, 2552, 2552
atwork_subtour_scheduling.tour_1, 701, 24576, 24576, 45294082, 36, 64614, 64614
atwork_subtour_scheduling.tour_n, 3, 20480, 20480, 97734, 6827, 32578, 32578
auto_ownership_simulate.simulate, 5000, 77824, 24576, 1400000, 5, 280, 280
- MODE_RETRAIN
rebuild the chunk_cache table and save/replace it in output/cache/chunk_cache.csv. Performs a complete rebuild of the chunk_cache table by doing adaptive chunking, starting with the default initial settings (DEFAULT_INITIAL_ROWS_PER_CHUNK) and observing rss, uss, and allocated bytes to compute row_size. This will run somewhat slower than the other modes because of the overhead of the small first chunk, and possible instability in the second chunk due to inaccuracies caused by the small initial chunk_size sample
- MODE_ADAPTIVE
Use the existing chunk_cache to determine the sizing for the first chunk for each model, but also use the observed row_size to adjust the estimated row_size for subsequent chunks. At the end of the run, writes the updated chunk_cache to the output directory, but doesn’t overwrite the ‘official’ cache file. If the user wishes, they can replace the chunk_cache with the updated version, but this is not done automatically as it is not clear this would be the desired behavior. (Might become clearer over time as this is exercised further.)
- MODE_PRODUCTION
Since overhead changes, we don’t necessarily want the same number of rows per chunk every time, but we do use the row_size from the cache, which we trust is stable (the whole point of MODE_PRODUCTION is to avoid the cost of observing overhead). The cached row_size is stored in self.initial_row_size because initial_rows_per_chunk used it for the first chunk.
- MODE_CHUNKLESS
Do not do chunking, and also do not check or log memory usage, so ActivitySim can focus on performance assuming there is abundant RAM.
- class activitysim.core.chunk.MemMonitor(trace_label, stop_snooping)¶
- run()¶
Method representing the thread’s activity.
You may override this method in a subclass. The standard run() method invokes the callable object passed to the object’s constructor as the target argument, if any, with sequential and keyword arguments taken from the args and kwargs arguments, respectively.
- activitysim.core.chunk.adaptive_chunked_choosers_and_alts(choosers, alternatives, chunk_size, trace_label, chunk_tag=None)¶
generator to iterate over choosers and alternatives in chunk_size chunks
like chunked_choosers, but also chunks alternatives for use with sampled alternatives, which will have different alternatives (and numbers of alts) for each chooser
There may be up to sample_size (or as few as one) alternatives for each chooser because alternatives may have been sampled more than once, but pick_count for those alternatives will always sum to sample_size.
When we chunk the choosers, we need to take care when chunking the alternatives, as there are varying numbers of them for each chooser. Since alternatives appear in the same order as choosers, we can use cumulative pick_counts to identify the boundaries of sets of alternatives.
- Parameters
- choosers
- alternatives: pandas DataFrame
sample alternatives including pick_count column in same order as choosers
- rows_per_chunk: int
- Yields
- i: int
one-based index of current chunk
- num_chunks: int
total number of chunks that will be yielded
- choosers: pandas DataFrame slice
chunk of choosers
- alternatives: pandas DataFrame slice
chunk of alternatives for chooser chunk
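A minimal usage sketch, assuming the generator is called inside a model step where the pipeline, settings, and chunk infrastructure have already been initialized; the chooser and alternative frames and the trace label here are hypothetical placeholders:

    # iterate over chooser/alternative batches inside a model step (hedged sketch);
    # `choosers` and `alternatives` are pandas DataFrames prepared by the caller, with
    # `alternatives` carrying a pick_count column and ordered to match `choosers`
    from activitysim.core import chunk

    for i, num_chunks, chooser_chunk, alt_chunk in chunk.adaptive_chunked_choosers_and_alts(
            choosers, alternatives, chunk_size, trace_label='example.interaction_sample'):
        # evaluate utilities and make choices for this batch only
        print(f"chunk {i} of {num_chunks}: {len(chooser_chunk)} choosers, {len(alt_chunk)} alternative rows")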
- activitysim.core.chunk.overhead_for_chunk_method(overhead, method=None)¶
return appropriate overhead for row_size calculation based on current chunk_method
Used by ChunkSizer.adaptive_rows_per_chunk to determine observed_row_size based on cum_overhead and cum_rows, by ChunkSizer.initial_rows_per_chunk to determine the initial row_size using cached_history and the current chunk_method, and by consolidate_logs to add an informational row_size column to the cache file based on chunk_method for a training run.
- Parameters
- overhead: dict keyed by metric or DataFrame with columns
- Returns
- chunk_method overhead (possibly hybrid, depending on chunk_method)
Utilities¶
Vectorized helper functions
API¶
- activitysim.core.util.assign_in_place(df, df2)¶
update existing row values in df from df2, adding columns to df if they are not there
- Parameters
- df: pd.DataFrame
assignment left-hand-side (dest)
- df2: pd.DataFrame
assignment right-hand-side (source)
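A small illustration of the in-place update and column addition (the column names are hypothetical):

    import pandas as pd
    from activitysim.core.util import assign_in_place

    df = pd.DataFrame({'auto_ownership': [0, 1, 2]}, index=[10, 20, 30])
    df2 = pd.DataFrame({'auto_ownership': [3], 'has_driver': [True]}, index=[20])

    # row 20's auto_ownership is updated and a new has_driver column is added to df
    assign_in_place(df, df2)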
- activitysim.core.util.iprod(ints)¶
Return the product of the ints in the list or tuple as an unlimited precision python int
Specifically intended to compute array/buffer size for skims where np.prod might overflow for default dtypes. (Narrowing rules for np.prod are different on Windows and Linux.) An alternative to the unwieldy: int(np.prod(ints, dtype=np.int64))
- Parameters
- ints: list or tuple of ints or int wannabes
- Returns
- returns python int
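For example, sizing a large skim buffer without risking numpy integer overflow (the dimensions are hypothetical):

    from activitysim.core.util import iprod

    # product as an unlimited-precision python int
    num_origins, num_destinations, num_time_periods = 30_000, 30_000, 5
    buffer_size = iprod([num_origins, num_destinations, num_time_periods])  # 4_500_000_000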
- activitysim.core.util.left_merge_on_index_and_col(left_df, right_df, join_col, target_col)¶
like pandas left merge, but join on both index and a specified join_col
FIXME - for now return a series of values from the specified right_df target_col
- Parameters
- left_df: pandas DataFrame
index name assumed to be same as that of right_df
- right_df: pandas DataFrame
index name assumed to be same as that of left_df
- join_col: str
name of column to join on (in addition to index values); should have same name in both dataframes
- target_col: str
name of column from right_df whose joined values should be returned as series
- Returns
- target_series: pandas Series
series of target_col values with same index as left_df, i.e. values joined to left_df from right_df with index of left_df
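A small illustration with hypothetical person/tour data:

    import pandas as pd
    from activitysim.core.util import left_merge_on_index_and_col

    left_df = pd.DataFrame({'tour_type': ['work', 'shop']},
                           index=pd.Index([1, 2], name='person_id'))
    right_df = pd.DataFrame({'tour_type': ['work', 'shop', 'shop'], 'start': [7, 10, 14]},
                            index=pd.Index([1, 1, 2], name='person_id'))

    # joins on both the person_id index and tour_type, returning right_df.start aligned to left_df
    start = left_merge_on_index_and_col(left_df, right_df, 'tour_type', 'start')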
- activitysim.core.util.other_than(groups, bools)¶
Construct a Series that has booleans indicating the presence of something- or someone-else with a certain property within a group.
- Parameters
- groups: pandas.Series
A column with the same index as bools that defines the grouping of bools. The bools Series will be used to index groups and then the grouped values will be counted.
- bools: pandas.Series
A boolean Series indicating where the property of interest is present. Should have the same index as groups.
- Returns
- others: pandas.Series
A boolean Series with the same index as groups and bools indicating whether there is something- or someone-else within a group with some property (as indicated by bools).
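For example, flagging persons who live with at least one other worker (the data is hypothetical):

    import pandas as pd
    from activitysim.core.util import other_than

    household_id = pd.Series([1, 1, 2], index=['p1', 'p2', 'p3'])
    is_worker = pd.Series([True, False, True], index=['p1', 'p2', 'p3'])

    # p1: False (p1 is the only worker in household 1)
    # p2: True  (p1 is a worker in the same household)
    # p3: False (p3 is the only worker in household 2)
    other_than(household_id, is_worker)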
- activitysim.core.util.quick_loc_df(loc_list, target_df, attribute=None)¶
faster replacement for target_df.loc[loc_list] or target_df.loc[loc_list][attribute]
pandas DataFrame.loc[] indexing doesn’t scale for large arrays (e.g. > 1,000,000 elements)
- Parameters
- loc_list: list-like (numpy.ndarray, pandas.Int64Index, or pandas.Series)
- target_df: pandas.DataFrame containing column named attribute
- attribute: name of column from target_df to return (or None for all columns)
- Returns
- pandas.DataFrame or, if attribute specified, pandas.Series
- activitysim.core.util.quick_loc_series(loc_list, target_series)¶
faster replacement for target_series.loc[loc_list]
pandas Series.loc[] indexing doesn’t scale for large arrays (e.g. > 1,000,000 elements)
- Parameters
- loc_list: list-like (numpy.ndarray, pandas.Int64Index, or pandas.Series)
- target_series: pandas.Series
- Returns
- pandas.Series
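For example (the values returned match target_series.loc[loc_list]; the data is hypothetical):

    import pandas as pd
    from activitysim.core.util import quick_loc_series

    zone_type = pd.Series(['cbd', 'suburb', 'rural'], index=[101, 102, 103])
    loc_list = pd.Series([103, 101, 101])

    # same values as zone_type.loc[loc_list], but scales to very large loc_lists
    quick_loc_series(loc_list, zone_type)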
- activitysim.core.util.reindex(series1, series2)¶
This reindexes the first series by the second series. This is an extremely common operation that does not appear to be in Pandas at this time. If anyone knows of an easier way to do this in Pandas, please inform the UrbanSim developers.
The canonical example would be a parcel series which has an index which is parcel_ids and a value which you want to fetch, let’s say it’s land_area. Another dataset, let’s say of buildings has a series which indicate the parcel_ids that the buildings are located on, but which does not have land_area. If you pass parcels.land_area as the first series and buildings.parcel_id as the second series, this function returns a series which is indexed by buildings and has land_area as values and can be added to the buildings dataset.
In short, this is a join on to a different table using a foreign key stored in the current table, but with only one attribute rather than for a full dataset.
This is very similar to the pandas “loc” function or “reindex” function, but neither of those functions return the series indexed on the current table. In both of those cases, the series would be indexed on the foreign table and would require a second step to change the index.
- Parameters
- series1, series2: pandas.Series
- Returns
- reindexed: pandas.Series
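The canonical parcels/buildings example from the description above, in code:

    import pandas as pd
    from activitysim.core.util import reindex

    # parcels.land_area, indexed by parcel_id
    land_area = pd.Series([5000, 12000], index=pd.Index([1, 2], name='parcel_id'))
    # buildings.parcel_id, indexed by building_id
    parcel_id = pd.Series([2, 1, 2], index=pd.Index([10, 11, 12], name='building_id'))

    # land_area values joined to buildings, indexed by building_id
    building_land_area = reindex(land_area, parcel_id)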
- activitysim.core.util.reindex_i(series1, series2, dtype=<class 'numpy.int8'>)¶
version of reindex that replaces missing na values and converts to int; helpful in expression files that compute counts (e.g. num_work_tours)
Config¶
Helper functions for configuring a model run
API¶
- exception activitysim.core.config.SettingsFileNotFound(file_name, configs_dir)¶
- activitysim.core.config.base_settings_file_path(file_name)¶
- Parameters
- file_name
- Returns
- path to base settings file or None if not found
- activitysim.core.config.expand_input_file_list(input_files)¶
expand list by unglobbing globs
- activitysim.core.config.filter_warnings()¶
set warning filter to ‘strict’ if specified in settings
- activitysim.core.config.future_model_settings(model_name, model_settings, future_settings)¶
Warn users of new required model settings, and substitute default values
- Parameters
- model_name: str
name of model
- model_settings: dict
model_settings from settings file
- future_settings: dict
default values for new required settings
- Returns
- dict
model_settings with any missing future_settings added
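A minimal sketch of how a model step might use this; the model name, setting name, and default are hypothetical:

    from activitysim.core import config

    model_settings = {'SPEC': 'auto_ownership.csv'}
    future_settings = {'preprocessor': None}  # hypothetical new required setting and its default

    # warns that 'preprocessor' will become a required setting and back-fills the default
    model_settings = config.future_model_settings('auto_ownership', model_settings, future_settings)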
- activitysim.core.config.get_cache_dir()¶
return path of cache directory in output_dir (creating it, if need be)
- cache directory is used to store
skim memmaps created by skim_dict_factories, and the tvpb tap_tap table cache
- Returns
- str path
- activitysim.core.config.get_global_constants()¶
Read global constants from settings file
- Returns
- constants: dict
dictionary of constants to add to locals for use by expressions in model spec
- activitysim.core.config.get_logit_model_settings(model_settings)¶
Read nest spec (for nested logit) from model settings file
- Returns
- nests: dict
dictionary specifying nesting structure and nesting coefficients
- constants: dict
dictionary of constants to add to locals for use by expressions in model spec
- activitysim.core.config.get_model_constants(model_settings)¶
Read constants from model settings file
- Returns
- constants: dict
dictionary of constants to add to locals for use by expressions in model spec
- activitysim.core.config.logger = <Logger activitysim.core.config (WARNING)>¶
default injectables
- activitysim.core.config.read_model_settings(file_name, mandatory=False)¶
- Parameters
- file_name: str
yaml file name
- mandatory: bool
throw error if file empty or not found
- activitysim.core.config.read_settings_file(file_name, mandatory=True, include_stack=[], configs_dir_list=None)¶
look for the first occurrence of a yaml file named <file_name> in the directories in the configs_dir list, read settings from the yaml file, and return them as a dict.
Settings file may contain directives that affect which file settings are returned:
- inherit_settings: boolean
backfill settings in the current file with values from the next settings file in configs_dir list
- include_settings: string <include_file_name>
read settings from the specified include_file in place of the current file settings (to avoid confusion, this directive must appear ALONE in the file, without any additional settings or directives.)
- Parameters
- file_name
- mandatory: boolean
if true, raise SettingsFileNotFound exception if no settings file, otherwise return empty dict
- include_stack: boolean
only used for recursive calls to provide list of files included so far to detect cycles
- Returns: dict
settings from the specified settings file/s
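A hedged usage sketch, assuming the configs_dir injectable has been set up by the run environment (for example by the activitysim run command):

    from activitysim.core import config

    # reads the first settings.yaml found in the configs_dir list and returns it as a dict,
    # honoring any inherit_settings / include_settings directives described above
    settings = config.read_settings_file('settings.yaml', mandatory=True)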
Inject¶
Model orchestration and data pipeline interaction.
API¶
- activitysim.core.inject.add_table(table_name, table, replace=False)¶
Add new table and raise assertion error if the table already exists. Silently replace if replace=True.
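A minimal sketch; the table name and contents are hypothetical:

    import pandas as pd
    from activitysim.core import inject

    vehicles = pd.DataFrame({'household_id': [1, 1, 2], 'vehicle_type': ['car', 'bike', 'car']})

    # raises an assertion error if a 'vehicles' table already exists, unless replace=True
    inject.add_table('vehicles', vehicles)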
- activitysim.core.inject.reinject_decorated_tables()¶
reinject the decorated tables (and columns)
Mem¶
Helper functions for tracking memory usage
API¶
- activitysim.core.mem.consolidate_logs()¶
Consolidate and aggregate subprocess mem logs
return total size of the multiprocessing shared memory block in data_buffers
Output¶
Write output files and track skim usage.
API¶
- activitysim.core.steps.output.previous_write_data_dictionary(output_dir)¶
Write table_name, number of rows, columns, and bytes for each checkpointed table
- Parameters
- output_dir: str
- activitysim.core.steps.output.track_skim_usage(output_dir)¶
write statistics on skim usage (diagnostic to detect loading of un-needed skims)
FIXME - have not yet implemented a facility to avoid loading of unused skims
FIXME - if resume_after, this will only reflect skims used after resume
- Parameters
- output_dir: str
- activitysim.core.steps.output.write_data_dictionary(output_dir)¶
Write table schema for all tables
- model settings
txt_format: output text file name (default data_dict.txt) or empty to suppress txt output
csv_format: output csv file name (default data_dict.csv) or empty to suppress csv output
schema_tables: list of tables to include in output (defaults to all checkpointed tables)
for each table, write column names, dtype, and checkpoint added
text format writes individual table schemas to a single text file; csv format writes all tables together with an additional table_name column
- Parameters
- output_dir: str
- activitysim.core.steps.output.write_tables(output_dir)¶
Write pipeline tables as csv files (in output directory) as specified by output_tables list in settings file.
‘output_tables’ can specify either a list of output tables to include or to skip. If no output_tables list is specified, then all checkpointed tables will be written.
To write all output tables EXCEPT the households and persons tables:
output_tables:
  action: skip
  tables:
    - households
    - persons
To write ONLY the households table:
output_tables:
  action: include
  tables:
    - households
To write tables into a single HDF5 store instead of individual CSVs, use the h5_store flag:
output_tables:
  h5_store: True
  action: include
  tables:
    - households
- Parameters
- output_dir: str
Tests¶
See activitysim.core.test