Software Implementation¶

This page describes the PopulationSim software implementation and how to contribute to PopulationSim.

The implementation starts with the ActivitySim framework, which serves as the foundation for the software. The framework, as briefly described below, includes features for data pipeline management, expression handling, multiprocessing, testing, etc. Built upon the framework are additional core components for population synthesis, such as balancers and integerizers. Built upon the population synthesis core components are the model steps that make up a PopulationSim run, such as the inputs pre-processor, setting up the data structures, doing the initial seed balancing, etc.

ActivitySim Framework¶

PopulationSim is implemented in the ActivitySim framework. As summarized here, being implemented in the ActivitySim framework means:

Overall Design
- Implemented in Python, and makes heavy use of the vectorized backend C/C++ libraries in pandas and numpy.
- Vectorization instead of for loops when possible
- Runs sub-models that solve Python expression files that operate on data tables
Data Handling
- Inputs are in CSV format, with the exception of settings
- CSVs are read in as pandas tables and stored in an intermediate HDF5 binary file that is used for data I/O throughout the model run
- Key outputs are written to CSV files
Key Data Structures
- pandas.DataFrame - A data table with rows and columns, similar to an R data frame, Excel worksheet, or database table
- pandas.Series - a vector of data, a column in a DataFrame table or a 1D array
- numpy.array - an N-dimensional array of items of the same type, such as a matrix
Model Orchestrator
- ORCA is used for running the overall model system and for defining dynamic data tables, columns, and injectables (functions). ActivitySim wraps ORCA functionality to make a Data Pipeline tool, which allows for re-starting at any model step.
- Support for multiprocessing to reduce runtime
Expressions
- Model expressions are in CSV files and contain Python expressions, mainly pandas/numpy expressions that operate on the input data tables. This helps to avoid modifying Python code when making changes to the model calculations.
Code Documentation
- Python code follows the pycodestyle style guide
- Documentation is written in reStructuredText markup and built with Sphinx; docstrings follow the numpydoc format
Testing
- A protected master branch that can only be written to after tests have passed
- pytest for tests
- TravisCI for building and testing with each commit

PopulationSim also requires an optimization library for balancing and integerizing. The software uses the open source, easy-to-install ortools package. The ortools integerization results may vary from platform to platform, since edge-case results depend on the exact ortools/cbc version.

Core Components¶

assign¶

populationsim.assign.assign_variable(target, expression, df, locals_dict, df_alias=None, trace_rows=None)¶

Evaluate an expression on a given data table.

Expressions are evaluated using Python’s eval function. Python expressions have access to the variables in locals_dict, and to df as the variable df. They also have access to previously assigned targets under the assigned target name.

Expressions should result in a pandas Series (scalars are automatically promoted to a Series).

Parameters

assignment_expressions: pandas.DataFrame of target assignment expressions, with columns target (target column names) and expression (pandas or python expression to evaluate)
df: pandas.DataFrame
locals_dict: Dict: a dictionary of local variables that forms the environment for evaluating a “python” expression
trace_rows: series or array of bools to use as a mask to select target rows to trace

Returns

result: pandas.Series: Will have the index of df and columns named by target and containing the result of evaluating expression
trace_df: pandas.Series or None: a series containing the eval result values for the assignment expression
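
As a rough illustration of the evaluation model described above, the sketch below evaluates expression strings with Python's eval, exposes the table as df together with the entries of the locals dictionary, and promotes scalar results to a full column. It is not the PopulationSim implementation: a plain dict of lists stands in for a pandas.DataFrame to keep the example dependency-free, and eval_expression_sketch and all column names are hypothetical.

```python
def eval_expression_sketch(expression, df, locals_dict):
    """Evaluate an expression string against a column table (hypothetical helper).

    `df` is a dict of equal-length lists standing in for a pandas.DataFrame.
    """
    n_rows = len(next(iter(df.values())))
    # Build the evaluation environment: user-supplied names plus the table as `df`.
    scope = dict(locals_dict)
    scope["df"] = df
    # Passing a single namespace lets names resolve inside comprehensions too.
    result = eval(expression, scope)
    # Promote a scalar result to one value per row.
    if not isinstance(result, list):
        result = [result] * n_rows
    return result


df = {"persons": [1, 2, 4], "income": [30000, 50000, 90000]}
locals_dict = {"LOW_INCOME": 40000}

# A row-wise expression that uses a name from locals_dict.
df["low_income"] = eval_expression_sketch(
    "[x < LOW_INCOME for x in df['income']]", df, locals_dict
)
# A scalar expression is promoted to a full column.
df["one"] = eval_expression_sketch("1", df, locals_dict)
# A later expression can use a previously assigned target (here through df).
df["low_income_persons"] = eval_expression_sketch(
    "[p if low else 0 for p, low in zip(df['persons'], df['low_income'])]",
    df,
    locals_dict,
)
```

Keeping the expressions as strings is what allows them to live in CSV files rather than in Python code.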

balancer¶

class populationsim.balancer.ListBalancer(incidence_table, initial_weights, control_totals, control_importance_weights, lb_weights, ub_weights, master_control_index, max_iterations)¶

Single-geography list balancer using Newton-Raphson method with control relaxation.

Takes a list of households with initial weights assigned to each household, and updates those weights in such a way as to match the marginal distribution of control variables while minimizing the change to the initial weights. Uses Newton-Raphson method with control relaxation.

The resulting weights are float weights, so they need to be integerized to integer household weights.
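
To make the idea of list balancing concrete, here is a minimal sketch that adjusts household weights until the weighted incidence matches each control total. Note that it uses plain iterative proportional fitting, a simpler technique swapped in for illustration, and not the Newton-Raphson method with control relaxation that ListBalancer uses; it also ignores importance weights and weight bounds. The function name and the example data are hypothetical.

```python
def ipf_balance_sketch(incidence, initial_weights, control_totals,
                       max_iterations=100, tol=1e-9):
    """Iterative proportional fitting over a household list (illustrative only).

    incidence[i][c] is household i's contribution to control c.
    """
    weights = list(initial_weights)
    for _ in range(max_iterations):
        max_gap = 0.0
        for c, target in enumerate(control_totals):
            current = sum(w * row[c] for w, row in zip(weights, incidence))
            if current == 0:
                continue  # no household contributes to this control
            max_gap = max(max_gap, abs(current - target))
            factor = target / current
            # Scale only the households that contribute to this control.
            weights = [w * factor if row[c] else w
                       for w, row in zip(weights, incidence)]
        if max_gap < tol:
            break  # every control was already matched at the start of this sweep
    return weights


# Controls: size-1 households, size-2+ households, with workers, without workers.
incidence = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
]
control_totals = [40, 60, 70, 30]
weights = ipf_balance_sketch(incidence, [1.0] * 4, control_totals)
fitted = [sum(w * row[c] for w, row in zip(weights, incidence))
          for c in range(len(control_totals))]
```

The fitted totals match the controls, but the weights are floats, which is why an integerization step follows balancing.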

integerizer¶

populationsim.integerizer.do_integerizing(trace_label, control_spec, control_totals, incidence_table, float_weights, total_hh_control_col)¶

Parameters

trace_label: str: trace label indicating geography zone being integerized (e.g. PUMA_600)
control_spec: pandas.DataFrame: full control spec with columns ‘target’, ‘seed_table’, ‘importance’, …
control_totals: pandas.Series: control totals explicitly specified for this zone
incidence_table: pandas.DataFrame
float_weights: pandas.Series: balanced float weights to integerize
total_hh_control_col: str: name of total_hh column (preferentially constrain to match this control)

Returns

integerized_weights: pandas.Series
status: str: as defined in integerizer.STATUS_TEXT and STATUS_SUCCESS

populationsim.integerizer.smart_round(int_weights, resid_weights, target_sum)¶

Round weights while ensuring (as far as possible) that the result sums to target_sum.

Parameters

int_weights: numpy.ndarray(int)
resid_weights: numpy.ndarray(float)
target_sum: int

Returns

rounded_weights: numpy.ndarray of ints
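
One plausible way to round toward a target sum is largest-remainder rounding: keep the integer parts and round up the weights with the largest residuals until the total is reached. The sketch below shows that approach on plain Python lists. It is an assumption about the general technique, not a description of smart_round itself, and round_to_target_sketch is a hypothetical name.

```python
def round_to_target_sketch(int_weights, resid_weights, target_sum):
    """Round up the largest residuals until the total reaches target_sum, if possible."""
    rounded = list(int_weights)
    # How many weights must be rounded up, clipped to what rounding can achieve.
    shortfall = target_sum - sum(rounded)
    shortfall = max(0, min(shortfall, len(resid_weights)))
    # Indices ordered from largest to smallest residual.
    by_residual = sorted(range(len(resid_weights)),
                         key=lambda i: resid_weights[i], reverse=True)
    for i in by_residual[:shortfall]:
        rounded[i] += 1
    return rounded


rounded = round_to_target_sketch([1, 2, 0], [0.6, 0.3, 0.9], target_sum=5)
```

Clipping the shortfall is what makes the target a best effort: if the target cannot be reached by rounding each weight up at most once, the result simply gets as close as it can.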

lp¶

populationsim.lp.get_simul_integerizer()¶

Return simul-integerizer function using installed/configured Linear Programming library.

Different LP packages can be used for integerization (e.g. ortools or cvx), and this function hides the specifics of the individual packages so they can be swapped with minimal impact.

Returns

integerizer_func: function pointer to simul-integerizer function with call signature:
def np_simul_integerizer(sub_int_weights, parent_countrol_importance, parent_relax_ge_upper_bound, sub_countrol_importance, sub_float_weights, sub_resid_weights, lp_right_hand_side, parent_hh_constraint_ge_bound, sub_incidence, parent_incidence, total_hh_right_hand_side, relax_ge_upper_bound, parent_lp_right_hand_side, hh_constraint_ge_bound, parent_resid_weights, total_hh_sub_control_index, total_hh_parent_control_index)

populationsim.lp.get_single_integerizer()¶

Return single integerizer function using installed/configured Linear Programming library.

Different LP packages can be used for integerization (e.g. ortools or cvx), and this function hides the specifics of the individual packages so they can be swapped with minimal impact.

Returns

integerizer_func: function pointer to single integerizer function with call signature:

def np_integerizer(incidence, resid_weights, log_resid_weights, control_importance_weights, total_hh_control_index, lp_right_hand_side, relax_ge_upper_bound, hh_constraint_ge_bound)
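
The pattern described here, a getter that returns whichever backend function is configured, can be sketched as a small registry. Everything in this example is hypothetical: the backend names, the two stand-in functions, and the shortened two-argument signature (the real call signature takes the eight arguments listed above). The only point is that callers depend on a shared signature rather than on a particular LP package.

```python
def integerizer_backend_a(incidence, resid_weights):
    """Stand-in for one LP backend (hypothetical): returns the weights unchanged."""
    return list(resid_weights), "OPTIMAL"


def integerizer_backend_b(incidence, resid_weights):
    """Stand-in for another LP backend (hypothetical): rounds each weight."""
    return [round(w) for w in resid_weights], "FEASIBLE"


# Registry of interchangeable backends that share one call signature.
BACKENDS = {
    "backend_a": integerizer_backend_a,
    "backend_b": integerizer_backend_b,
}
CONFIGURED_BACKEND = "backend_a"  # hypothetical configuration value


def get_integerizer_sketch(name=CONFIGURED_BACKEND):
    """Return the integerizer function for the configured backend."""
    try:
        return BACKENDS[name]
    except KeyError:
        raise ValueError(f"unknown integerizer backend: {name!r}") from None


# Calling code never names a backend directly.
integerize = get_integerizer_sketch()
weights_out, status_text = integerize([], [0.2, 0.7])
```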

lp_cvx¶

populationsim.lp_cvx.np_integerizer_cvx(incidence, resid_weights, log_resid_weights, control_importance_weights, total_hh_control_index, lp_right_hand_side, relax_ge_upper_bound, hh_constraint_ge_bound)¶

cvx-based single-integerizer function taking numpy data types and conforming to a standard function signature that allows it to be swapped interchangeably with alternate LP implementations.

Parameters

incidence: numpy.ndarray(control_count, sample_count) float
resid_weights: numpy.ndarray(sample_count,) float
log_resid_weights: numpy.ndarray(sample_count,) float
control_importance_weights: numpy.ndarray(control_count,) float
total_hh_control_index: int
lp_right_hand_side: numpy.ndarray(control_count,) float
relax_ge_upper_bound: numpy.ndarray(control_count,) float
hh_constraint_ge_bound: numpy.ndarray(control_count,) float

Returns

resid_weights_out: numpy.ndarray(sample_count,)
status_text: str

populationsim.lp_cvx.np_simul_integerizer_cvx(sub_int_weights, parent_countrol_importance, parent_relax_ge_upper_bound, sub_countrol_importance, sub_float_weights, sub_resid_weights, lp_right_hand_side, parent_hh_constraint_ge_bound, sub_incidence, parent_incidence, total_hh_right_hand_side, relax_ge_upper_bound, parent_lp_right_hand_side, hh_constraint_ge_bound, parent_resid_weights, total_hh_sub_control_index)¶

cvx-based simul-integerizer function taking numpy data types and conforming to a standard function signature that allows it to be swapped interchangeably with alternate LP implementations.

Parameters

sub_int_weights: numpy.ndarray(sub_zone_count, sample_count) int
parent_countrol_importance: numpy.ndarray(parent_control_count,) float
parent_relax_ge_upper_bound: numpy.ndarray(parent_control_count,) float
sub_countrol_importance: numpy.ndarray(sub_control_count,) float
sub_float_weights: numpy.ndarray(sub_zone_count, sample_count) float
sub_resid_weights: numpy.ndarray(sub_zone_count, sample_count) float
lp_right_hand_side: numpy.ndarray(sub_zone_count, sub_control_count) float
parent_hh_constraint_ge_bound: numpy.ndarray(parent_control_count,) float
sub_incidence: numpy.ndarray(sample_count, sub_control_count) float
parent_incidence: numpy.ndarray(sample_count, parent_control_count) float
total_hh_right_hand_side: numpy.ndarray(sub_zone_count,) float
relax_ge_upper_bound: numpy.ndarray(sub_zone_count, sub_control_count) float
parent_lp_right_hand_side: numpy.ndarray(parent_control_count,) float
hh_constraint_ge_bound: numpy.ndarray(sub_zone_count, sub_control_count) float
parent_resid_weights: numpy.ndarray(sample_count,) float
total_hh_sub_control_index: int

Returns

resid_weights_out: numpy.ndarray of float: residual weights in range [0..1] as solved, or, in case of failure, sub_resid_weights unchanged
status_text: string: STATUS_OPTIMAL, STATUS_FEASIBLE in case of success, or a solver-specific failure status

lp_ortools¶

populationsim.lp_ortools.np_integerizer_ortools(incidence, resid_weights, log_resid_weights, control_importance_weights, total_hh_control_index, lp_right_hand_side, relax_ge_upper_bound, hh_constraint_ge_bound)¶

ortools single-integerizer function taking numpy data types and conforming to a standard function signature that allows it to be swapped interchangeably with alternate LP implementations.

Parameters

incidence: numpy.ndarray(control_count, sample_count) float
resid_weights: numpy.ndarray(sample_count,) float
log_resid_weights: numpy.ndarray(sample_count,) float
control_importance_weights: numpy.ndarray(control_count,) float
total_hh_control_index: int
lp_right_hand_side: numpy.ndarray(control_count,) float
relax_ge_upper_bound: numpy.ndarray(control_count,) float
hh_constraint_ge_bound: numpy.ndarray(control_count,) float

Returns

resid_weights_out: numpy.ndarray(sample_count,)
status_text: str

populationsim.lp_ortools.np_simul_integerizer_ortools(sub_int_weights, parent_countrol_importance, parent_relax_ge_upper_bound, sub_countrol_importance, sub_float_weights, sub_resid_weights, lp_right_hand_side, parent_hh_constraint_ge_bound, sub_incidence, parent_incidence, total_hh_right_hand_side, relax_ge_upper_bound, parent_lp_right_hand_side, hh_constraint_ge_bound, parent_resid_weights, total_hh_sub_control_index, total_hh_parent_control_index)¶

ortools-based simul-integerizer function taking numpy data types and conforming to a standard function signature that allows it to be swapped interchangeably with alternate LP implementations.

Parameters

sub_int_weights: numpy.ndarray(sub_zone_count, sample_count) int
parent_countrol_importance: numpy.ndarray(parent_control_count,) float
parent_relax_ge_upper_bound: numpy.ndarray(parent_control_count,) float
sub_countrol_importance: numpy.ndarray(sub_control_count,) float
sub_float_weights: numpy.ndarray(sub_zone_count, sample_count) float
sub_resid_weights: numpy.ndarray(sub_zone_count, sample_count) float
lp_right_hand_side: numpy.ndarray(sub_zone_count, sub_control_count) float
parent_hh_constraint_ge_bound: numpy.ndarray(parent_control_count,) float
sub_incidence: numpy.ndarray(sample_count, sub_control_count) float
parent_incidence: numpy.ndarray(sample_count, parent_control_count) float
total_hh_right_hand_side: numpy.ndarray(sub_zone_count,) float
relax_ge_upper_bound: numpy.ndarray(sub_zone_count, sub_control_count) float
parent_lp_right_hand_side: numpy.ndarray(parent_control_count,) float
hh_constraint_ge_bound: numpy.ndarray(sub_zone_count, sub_control_count) float
parent_resid_weights: numpy.ndarray(sample_count,) float
total_hh_sub_control_index: int
total_hh_parent_control_index: int

Returns

resid_weights_out: numpy.ndarray of float: residual weights in range [0..1] as solved, or, in case of failure, sub_resid_weights unchanged
status_text: string: STATUS_OPTIMAL, STATUS_FEASIBLE in case of success, or a solver-specific failure status

multi_integerizer¶

populationsim.multi_integerizer.do_no_integerizing(trace_label, incidence_df, sub_weights, sub_controls_df, control_spec, total_hh_control_col, sub_control_zones, sub_geography)¶

populationsim.multi_integerizer.do_sequential_integerizing(trace_label, incidence_df, sub_weights, sub_controls_df, control_spec, total_hh_control_col, sub_control_zones, sub_geography, combine_results=True)¶

note: this method returns different results depending on the value of combine_results

Parameters

incidence_df: pandas.DataFrame: full incidence_df for all hh samples in seed zone
sub_zone_weights: pandas.DataFrame: balanced subzone household sample weights to integerize
sub_controls_df: pandas.DataFrame: sub_geography controls (one row per zone indexed by sub_zone id)
control_spec: pandas.DataFrame: full control spec with columns ‘target’, ‘seed_table’, ‘importance’, …
total_hh_control_col: str: name of total_hh column (so we can preferentially match this control)
sub_geography: str: subzone geography name (e.g. ‘TAZ’)
sub_control_zones: pandas.Series: series mapping zone_id (index) to zone label (value) for use in sub_controls_df column names
combine_results: bool: whether to return all results in a single frame or return infeasible rounded results separately

Returns

For combined results:

integerized_weights_df: pandas.DataFrame: canonical form weight table, with columns for ‘balanced_weight’, ‘integer_weight’ plus columns for household id, and sub_geography zone ids

For segregated results:

integerized_zone_ids: array(int): zone_ids of feasible (integerized) zones
rounded_zone_ids: array(int): zone_ids of infeasible (rounded) zones
integerized_weights_df: pandas.DataFrame or None if all zones infeasible: integerized weights for feasible zones
rounded_weights_df: pandas.DataFrame or None if all zones feasible: rounded weights for infeasible zones

Results dataframes are canonical form weight tables, with columns for ‘balanced_weight’, ‘integer_weight’ plus columns for household id, and sub_geography zone ids.

populationsim.multi_integerizer.do_simul_integerizing(trace_label, incidence_df, sub_weights, sub_controls_df, control_spec, total_hh_control_col, sub_geography, sub_control_zones)¶

Wrapper around simultaneous integerizer to handle solver failure for infeasible subzones.

Simultaneously integerize balanced float sub_weights. If simultaneous integerization fails, integerize serially to identify infeasible subzones, remove and smart_round the infeasible subzones, and try simultaneous integerization again. (That ought to succeed, but if not, fall back to all-sequential integerization.) Finally, combine all results into a single result dataframe.

Parameters

incidence_df: pandas.DataFrame: full incidence_df for all hh samples in seed zone
sub_zone_weights: pandas.DataFrame: balanced subzone household sample weights to integerize
sub_controls_df: pandas.DataFrame: sub_geography controls (one row per zone indexed by sub_zone id)
control_spec: pandas.DataFrame: full control spec with columns ‘target’, ‘seed_table’, ‘importance’, …
total_hh_control_col: str: name of total_hh column (so we can preferentially match this control)
sub_geography: str: subzone geography name (e.g. ‘TAZ’)
sub_control_zones: pandas.Series: index is zone id and value is zone label (e.g. TAZ_101) for use in sub_controls_df column names

Returns

integer_weights_df: pandas.DataFrame: canonical form weight table, with columns for ‘balanced_weight’, ‘integer_weight’ plus columns for household id, and sub_geography zone ids
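
The fallback sequence described for do_simul_integerizing can be sketched as plain control flow. In this sketch the three callables are hypothetical stand-ins for the simultaneous integerizer, the single-zone integerizer, and the rounding step; results are simple dicts keyed by zone rather than dataframes, and failure is signalled with None, which is an assumption made for brevity.

```python
def integerize_with_fallback_sketch(zones, try_simul, integerize_one, round_one):
    """Control-flow sketch; the three callables are hypothetical stand-ins.

    try_simul(zones)     -> dict of zone -> weights, or None on solver failure
    integerize_one(zone) -> weights, or None if the zone is infeasible
    round_one(zone)      -> rounded weights (always succeeds)
    """
    result = try_simul(zones)
    if result is not None:
        return result

    # Integerize serially to identify the infeasible zones.
    serial = {zone: integerize_one(zone) for zone in zones}
    feasible = [zone for zone in zones if serial[zone] is not None]
    rounded = {zone: round_one(zone) for zone in zones if serial[zone] is None}

    # Retry simultaneous integerization without the infeasible zones.
    retry = try_simul(feasible) if feasible else {}
    if retry is None:
        # Fall back to the sequential results for the feasible zones.
        retry = {zone: serial[zone] for zone in feasible}

    # Combine everything into a single result.
    return {**retry, **rounded}


def try_simul(zones):
    # Pretend the solver fails whenever zone "C" is included.
    return None if "C" in zones else {z: f"simul:{z}" for z in zones}


def integerize_one(zone):
    return None if zone == "C" else f"seq:{zone}"


def round_one(zone):
    return f"rounded:{zone}"


combined = integerize_with_fallback_sketch(
    ["A", "B", "C"], try_simul, integerize_one, round_one
)
```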

populationsim.multi_integerizer.multi_integerize(incidence_df, sub_zone_weights, sub_controls_df, control_spec, total_hh_control_col, parent_geography, parent_id, sub_geography, sub_control_zones)¶

Parameters

incidence_df: pandas.DataFrame: full incidence_df for all hh samples in seed zone
sub_zone_weights: pandas.DataFrame: balanced subzone household sample weights to integerize
sub_controls_df: pandas.DataFrame: sub_geography controls (one row per zone indexed by sub_zone id)
control_spec: pandas.DataFrame: full control spec with columns ‘target’, ‘seed_table’, ‘importance’, …
total_hh_control_col: str: name of total_hh column (so we can preferentially match this control)
parent_geography: str: parent geography zone name
parent_id: int: parent geography zone id
sub_geography: str: subzone geography name (e.g. ‘TAZ’)
sub_control_zones: pandas.Series: index is zone id and value is zone label (e.g. TAZ_101) for use in sub_controls_df column names

Returns

integer_weights_df: pandas.DataFrame: canonical form weight table, with columns for ‘balanced_weight’, ‘integer_weight’ plus columns for household id, parent and sub_geography zone ids

populationsim.multi_integerizer.reshape_result(float_weights, integerized_weights, sub_geography, sub_control_zones)¶

Reshape results into unstacked form - (same as that returned by sequential integerizer) with columns for ‘balanced_weight’, ‘integer_weight’ plus columns for household id, and sub_geography zone ids

Parameters

float_weights: pandas.DataFrame: dataframe with one row per sample hh and one column per sub_zone
integerized_weights: pandas.DataFrame: dataframe with one row per sample hh and one column per sub_zone
sub_geography: str: name of sub_geography for result column name
sub_control_zones: pandas.Series: series mapping zone_id (index) to zone label (value)

Returns

integer_weights_df: pandas.DataFrame: canonical form weight table, with columns for ‘balanced_weight’, ‘integer_weight’ plus columns for household id, and sub_geography zone ids

populationsim.multi_integerizer.try_simul_integerizing(trace_label, incidence_df, sub_weights, sub_controls_df, sub_geography, control_spec, total_hh_control_col, sub_control_zones)¶

Attempt simultaneous integerization and return integerized weights if successful

Parameters

incidence_df
sub_weights
sub_controls_df
sub_geography
control_spec
total_hh_control_col
sub_control_zones

Returns

status: str: str value of integerizer status from the STATUS_TEXT dict; integerization was successful if status is in the STATUS_SUCCESS list
integerized_weights_df: pandas.DataFrame or None: canonical form weight table, with columns for ‘balanced_weight’, ‘integer_weight’, or None if integerization failed

simul_balancer¶

class populationsim.simul_balancer.SimultaneousListBalancer(incidence_table, parent_weights, controls, sub_control_zones, total_hh_control_col)¶

Dual-zone simultaneous list balancer using Newton-Raphson method with control relaxation.

Simultaneously balances the household weights across multiple subzones of a parent zone, ensuring that the total weight of each household across sub-zones sums to the parent hh weight.

The resulting weights are float weights, so they need to be integerized to integer household weights.

populationsim.simul_balancer.np_simul_balancer(sample_count, control_count, zone_count, master_control_index, incidence, parent_weights, weights_lower_bound, weights_upper_bound, sub_weights, parent_controls, controls_importance, sub_controls)¶

Simultaneous balancer using only numpy (no pandas) data types. Separate function to ensure that no pandas data types leak in from object instance variables, since they are often silently accepted as numpy arguments but slow things down.

Model Steps¶

input_pre_processor¶

populationsim.steps.input_pre_processor.input_pre_processor()¶

Read input text files and save them as pipeline tables for use in subsequent steps.

The files to read are specified by table_list, an array of dicts that specify the input file name and the name of the pipeline table, along with keys that allow the specification of pre-processing steps.

By default, reads table_list from ‘input_table_list’ in settings.yaml, unless an alternate table_list name is specified as a model step argument ‘table_list’. (This allows alternate/additional input files to be read for repop)

In the case of repop, this step is being run after an initial run has completed, in which case the input_table_list may specify replacement tables. (e.g. lowest geography controls that will replace the previous low controls dataframe.)

See input_table_list in settings.yaml in the example folder for a working example

key	description
tablename	name of pipeline table in which to store dataframe
filename	name of csv file to read (in data_dir)
column_map	list of input columns to rename from_name: to_name
index_col	name of column to set as dataframe index column
drop_columns	list of column names of columns to drop
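
To show how the keys in the table above fit together, the following sketch applies one table_list entry to some CSV text using only the standard library. It is an illustration under stated assumptions, not the step's implementation: load_table_sketch, the file name, and the column names are hypothetical, the CSV is supplied as a string rather than read from data_dir, and applying the rename before the drop is an assumed order.

```python
import csv
import io


def load_table_sketch(csv_text, table_info):
    """Read CSV text and apply the pre-processing keys (illustrative only)."""
    column_map = table_info.get("column_map", {})
    drop_columns = set(table_info.get("drop_columns", []))
    index_col = table_info.get("index_col")

    table = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Rename columns, then drop the unwanted ones.
        renamed = {column_map.get(name, name): value for name, value in row.items()}
        kept = {name: value for name, value in renamed.items()
                if name not in drop_columns}
        # Key each row by the index column when one is given.
        key = kept.pop(index_col) if index_col else len(table)
        table[key] = kept
    return table


table_info = {
    "tablename": "households",
    "filename": "seed_households.csv",  # hypothetical file name; not read here
    "column_map": {"hhnum": "hh_id"},
    "index_col": "hh_id",
    "drop_columns": ["notes"],
}
csv_text = "hhnum,PUMA,notes\n1,600,a\n2,601,b\n"

# Store the result under the configured pipeline table name.
pipeline = {table_info["tablename"]: load_table_sketch(csv_text, table_info)}
```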

setup_data_structures¶

populationsim.steps.setup_data_structures.add_geography_columns(incidence_table, households_df, crosswalk_df)¶

Add seed and meta geography columns to incidence_table

Parameters

incidence_table
households_df
crosswalk_df

populationsim.steps.setup_data_structures.build_crosswalk_table()¶

Build crosswalk table filtered to include only zones in lowest geography.

populationsim.steps.setup_data_structures.filter_households(households_df, persons_df, crosswalk_df)¶

Filter households and persons tables, removing zero weight households and any households not in seed zones.

Returns filtered households_df and persons_df
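
The filtering rule can be sketched with plain lists of dicts. The column names used here (hh_id, WGTP, PUMA) and the function name are hypothetical placeholders, and the set of seed zone ids is passed in directly rather than derived from a crosswalk table.

```python
def filter_households_sketch(households, persons, seed_zone_ids,
                             weight_col="WGTP", zone_col="PUMA"):
    """Keep households with nonzero weight that lie in a seed zone, plus their persons."""
    kept_households = [
        hh for hh in households
        if hh[weight_col] > 0 and hh[zone_col] in seed_zone_ids
    ]
    # Persons are kept only if their household survived the filter.
    kept_ids = {hh["hh_id"] for hh in kept_households}
    kept_persons = [p for p in persons if p["hh_id"] in kept_ids]
    return kept_households, kept_persons


households = [
    {"hh_id": 1, "PUMA": 600, "WGTP": 5},
    {"hh_id": 2, "PUMA": 600, "WGTP": 0},   # zero weight: removed
    {"hh_id": 3, "PUMA": 999, "WGTP": 3},   # not in a seed zone: removed
]
persons = [
    {"person_id": 10, "hh_id": 1},
    {"person_id": 11, "hh_id": 2},
    {"person_id": 12, "hh_id": 3},
]
kept_households, kept_persons = filter_households_sketch(
    households, persons, seed_zone_ids={600, 601}
)
```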

populationsim.steps.setup_data_structures.repop_setup_data_structures(households, persons)¶

Setup geographic correspondence (crosswalk), control sets, and incidence tables for repop run.

The new lowest-level geography control tables should already have been read in by rerunning input_pre_processor with a table_list override. The control table contains one row for each zone, with columns specifying control field totals for that control.

This step reads in the repop control file, which specifies which control fields in the control table should be used for balancing, along with their importance and the recipe (seed table and expression) for determining household incidence for that control.

Parameters

households: pipeline table
persons: pipeline table

populationsim.steps.setup_data_structures.setup_data_structures(settings, households, persons)¶

Setup geographic correspondence (crosswalk), control sets, and incidence tables.

The control tables for target geographies should already have been read in by running input_pre_processor. The zone control tables contain one row for each zone, with columns specifying control field totals for that control.

This step reads in the global control file, which specifies which control fields in the control table should be used for balancing, along with their importance and the recipe (seed table and expression) for determining household incidence for that control.

If the GROUP_BY_INCIDENCE_SIGNATURE setting is enabled, then incidence table rows are household group ids and an additional household_groups table is created mapping hh group ids to actual hh_ids.

Parameters

settings: dict: contents of settings.yaml as dict
households: pipeline table
persons: pipeline table
creates pipeline tables: crosswalk, controls, geography-specific controls, incidence_table, household_groups (if the GROUP_BY_INCIDENCE_SIGNATURE setting is enabled)
modifies tables: households, persons
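
The grouping behind GROUP_BY_INCIDENCE_SIGNATURE can be illustrated by assigning one group id to every distinct incidence row. The sketch below is a guess at the general idea only: the function name and data layout are hypothetical, and it does not address how the weights of grouped households are combined.

```python
def group_by_incidence_sketch(incidence):
    """Group households that share an identical incidence row.

    incidence: dict of hh_id -> tuple of incidence values (one per control).
    Returns (group_incidence, household_groups).
    """
    group_ids = {}         # incidence signature -> group id
    household_groups = {}  # hh_id -> group id
    for hh_id, signature in incidence.items():
        # Reuse the id of a matching signature, or allocate the next id.
        group_id = group_ids.setdefault(signature, len(group_ids))
        household_groups[hh_id] = group_id
    # One incidence row per group instead of one per household.
    group_incidence = {gid: sig for sig, gid in group_ids.items()}
    return group_incidence, household_groups


incidence = {101: (1, 2), 102: (1, 0), 103: (1, 2)}
group_incidence, household_groups = group_by_incidence_sketch(incidence)
```

Households 101 and 103 share a signature, so the grouped incidence table has two rows instead of three, which reduces the size of the balancing problem.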

initial_seed_balancing¶

populationsim.steps.initial_seed_balancing.initial_seed_balancing(settings, crosswalk, control_spec, incidence_table)¶

Balance the household weights for each of the seed geographies (independently) using the seed level controls and the aggregated sub-zone controls totals.

Create the seed_weights table with one row per household and columns containing household_id, seed geography (e.g. PUMA), and float preliminary_balanced_weights.

Adds seed_weights table to pipeline named <seed_geography>_weights (e.g. PUMA_weights):

index hh_id	PUMA	preliminary_balanced_weight	hh_id
0	600	0.313555	0
1	601	0.627110	1
2	602	0.313555	2
…

Parameters

settings: dict (settings.yaml as dict)
crosswalk: pipeline table
control_spec: pipeline table
incidence_table: pipeline table

meta_control_factoring¶

populationsim.steps.meta_control_factoring.meta_control_factoring(settings, control_spec, incidence_table)¶

Apply simple factoring to summed household fractional weights based on original meta control values relative to summed household fractional weights by meta zone.

The resulting factored meta control weights will be new meta controls appended as additional columns to the seed control table, for final balancing.

Parameters

settings: dict (settings.yaml as dict)
control_spec: pipeline table
incidence_table: pipeline table
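
As a worked example of the factoring described above, for a single meta control within a single meta zone: sum the weighted household incidence in each seed zone, compare the meta control total with the sum over all seed zones, and scale each seed zone's sum by that ratio. The function name and the numbers below are hypothetical, and this is a sketch of the arithmetic rather than the step's code.

```python
def factor_meta_control_sketch(seed_weighted_sums, meta_control_total):
    """Distribute a meta control total to seed zones by simple factoring.

    seed_weighted_sums: dict of seed zone id -> sum of (weight * incidence)
    """
    meta_sum = sum(seed_weighted_sums.values())
    factor = meta_control_total / meta_sum
    # Each seed zone keeps its share of the meta zone total.
    return {zone: total * factor for zone, total in seed_weighted_sums.items()}


# Two seed zones whose weighted sums total 200, against a meta control of 250.
seed_controls = factor_meta_control_sketch({600: 120.0, 601: 80.0}, 250)
```

Here the factor is 250 / 200 = 1.25, so the new seed-level controls are 150 and 100, which add back up to the meta control total.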

final_seed_balancing¶

populationsim.steps.final_seed_balancing.final_seed_balancing(settings, crosswalk, control_spec, incidence_table)¶

Balance the household weights for each of the seed geographies (independently) using the seed level controls and the aggregated sub-zone controls totals.

Create the seed_weights table with one row per household and columns containing household_id, seed geography (e.g. PUMA), and float preliminary_balanced_weights.

Adds column balanced_weight to the seed_weights table

Parameters

settings: dict (settings.yaml as dict)
crosswalk: pipeline table
control_spec: pipeline table
incidence_table: pipeline table

integerize_final_seed_weights¶

populationsim.steps.integerize_final_seed_weights.integerize_final_seed_weights(settings, crosswalk, control_spec, incidence_table)¶

Final balancing for each seed (puma) zone with aggregated low and mid-level controls and distributed meta-level controls.

Adds integer_weight column to seed-level weight table

Parameters

settings: dict (settings.yaml as dict)
crosswalk: pipeline table
control_spec: pipeline table
incidence_table: pipeline table

sub_balancing¶

populationsim.steps.sub_balancing.balance(incidence_df, parent_weights, sub_controls_df, control_spec, total_hh_control_col, parent_geography, parent_id, sub_geographies, sub_control_zones)¶

Parameters

incidence_df: pandas.DataFrame: full incidence_df for all hh samples in seed zone
parent_weights: pandas.Series: parent zone balanced (possibly integerized) aggregate target weights
sub_controls_df: pandas.DataFrame: sub_geography controls (one row per zone indexed by sub_zone id)
control_spec: pandas.DataFrame: full control spec with columns ‘target’, ‘seed_table’, ‘importance’, …
total_hh_control_col: str: name of total_hh column (so we can preferentially match this control)
parent_geography: str: parent geography zone name
parent_id: int: parent geography zone id
sub_geographies: list(str): list of subgeographies in descending order
sub_control_zones: pandas.Series: index is zone id and value is zone label (e.g. TAZ_101) for use in sub_controls_df column names

Returns

sub_zone_weights: pandas.DataFrame: balanced subzone household float sample weights

populationsim.steps.sub_balancing.balance_and_integerize(incidence_df, parent_weights, sub_controls_df, control_spec, total_hh_control_col, parent_geography, parent_id, sub_geographies, crosswalk_df)¶

Parameters

incidence_df: pandas.DataFrame: full incidence_df for all hh samples in seed zone
parent_weights: pandas.Series: parent zone balanced (possibly integerized) aggregate target weights
sub_controls_df: pandas.DataFrame: sub_geography controls (one row per zone indexed by sub_zone id)
control_spec: pandas.DataFrame: full control spec with columns ‘target’, ‘seed_table’, ‘importance’, …
total_hh_control_col: str: name of total_hh column (so we can preferentially match this control)
parent_geography: str: parent geography zone name
parent_id: int: parent geography zone id
sub_geographies: list(str): list of subgeographies in descending order
crosswalk_df: pandas.DataFrame: geo crosswalk table sliced to current seed geography

Returns

integerized_sub_zone_weights_df: pandas.DataFrame: canonical form weight table, with columns for ‘balanced_weight’, ‘integer_weight’ plus columns for household id and sub_geography zone ids

populationsim.steps.sub_balancing.sub_balancing(settings, crosswalk, control_spec, incidence_table)¶

Simul-balance and integerize all zones at a specified geographic level in groups by parent zone.

For instance, if the ‘geography’ step arg is ‘TRACT’ and the parent geography is ‘SEED’, then for each seed zone, we simul-balance the TRACTS it contains.

Creates a weight table for the target geography with float ‘balanced_weight’ and ‘integer_weight’ columns.

Parameters

settings: dict (settings.yaml as dict)
crosswalk: pipeline table
control_spec: pipeline table
incidence_table: pipeline table

expand_households¶

populationsim.steps.expand_households.expand_households()¶

Create a complete expanded synthetic household list with their assigned geographic zone ids.

This is the skeleton synthetic household id list with no household or person attributes, one row per household, with geography columns and seed household table household_id.

Creates pipeline table expanded_household_ids
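
Expansion amounts to repeating each household row as many times as its integer weight in a zone. The sketch below shows this on a short list of tuples; the column names (hh_id, TAZ), the input layout, and the function name are hypothetical.

```python
def expand_households_sketch(weight_rows, geography="TAZ"):
    """Repeat each (hh_id, zone_id, integer_weight) row integer_weight times."""
    expanded = []
    for hh_id, zone_id, integer_weight in weight_rows:
        # One output row per synthetic household; zero weights add nothing.
        expanded.extend(
            {"hh_id": hh_id, geography: zone_id} for _ in range(integer_weight)
        )
    return expanded


weight_rows = [
    (1, 101, 2),  # seed household 1 appears twice in zone 101
    (2, 101, 0),  # seed household 2 is not used in zone 101
    (1, 102, 1),  # seed household 1 appears once in zone 102
]
expanded_household_ids = expand_households_sketch(weight_rows)
```

The result carries only ids and geography; household and person attributes can be joined on later through the seed household id.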

write_tables¶

Write pipeline tables as csv files (in output directory) as specified by output_tables list in settings file.

‘output_tables’ can specify either a list of output tables to include or a list to skip. If no output_tables list is specified, then all checkpointed tables will be written.

To write all output tables EXCEPT the households and persons tables:

output_tables:
  action: skip
  tables:
    - households
    - persons

To write ONLY the households table:

output_tables:
  action: include
  tables:
    - households

To write tables into a single HDF5 store instead of individual CSVs, use the h5_store flag:

output_tables:
  h5_store: True
  action: include
  tables:
    - households

Parameters¶

output_dir: str
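
The include/skip rule described for output_tables can be sketched as a small selection function. This is only an illustration of the rule as stated above: select_output_tables_sketch is a hypothetical helper, the settings are passed in as an already-parsed dict, the h5_store flag is ignored, and the output order is an arbitrary choice.

```python
def select_output_tables_sketch(checkpointed_tables, output_tables=None):
    """Decide which checkpointed tables to write (illustrative only)."""
    if not output_tables:
        # No output_tables setting: write every checkpointed table.
        return list(checkpointed_tables)
    action = output_tables.get("action")
    tables = output_tables.get("tables", [])
    if action == "include":
        return [t for t in checkpointed_tables if t in tables]
    if action == "skip":
        return [t for t in checkpointed_tables if t not in tables]
    raise ValueError(f"unexpected output_tables action: {action!r}")


checkpointed = ["households", "persons", "crosswalk", "incidence_table"]

skipped = select_output_tables_sketch(
    checkpointed, {"action": "skip", "tables": ["households", "persons"]}
)
included = select_output_tables_sketch(
    checkpointed, {"action": "include", "tables": ["households"]}
)
everything = select_output_tables_sketch(checkpointed)
```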

write_synthetic_population¶

populationsim.steps.write_synthetic_population.write_synthetic_population(expanded_household_ids, households, persons, output_dir)¶

Write synthetic households and persons tables to output dir as csv files. The settings file allows specification of output file names, household_id column name, and seed data attribute columns to include in output files.

Parameters

expanded_household_ids: pipeline table
households: pipeline table
persons: pipeline table
output_dir: str

summarize¶

populationsim.steps.summarize.summarize(crosswalk, incidence_table, control_spec)¶

Write aggregate summary files of controls and weights for all geographic levels to output dir

Parameters

crosswalk: pipeline table
incidence_table: pipeline table
control_spec: pipeline table

repop_balancing¶

populationsim.steps.repop_balancing.repop_balancing(settings, crosswalk, control_spec, incidence_table)¶

Balance and integerize all zones at the lowest geographic level.

Creates a weight table for the repop zones target geography with float ‘balanced_weight’ and ‘integer_weight’ columns.

Parameters

settings: dict (settings.yaml as dict)
crosswalk: pipeline table
control_spec: pipeline table
incidence_table: pipeline table

Contribution Guidelines¶

PopulationSim development follows the same development guidelines as ActivitySim.