Software Implementation

This page describes the PopulationSim software implementation and how to contribute to PopulationSim.

The implementation starts with the ActivitySim framework, which serves as the foundation for the software. The framework, as briefly described below, includes features for data pipeline management, expression handling, multiprocessing, and testing. Built upon the framework are additional core components for population synthesis, such as balancers and integerizers. Built upon these core components are the model steps that make up a PopulationSim run, such as the input pre-processor, setting up the data structures, and the initial seed balancing.

ActivitySim Framework

PopulationSim is implemented in the ActivitySim framework. As summarized here, being implemented in the ActivitySim framework means:

  • Overall Design

    • Implemented in Python, and makes heavy use of the vectorized backend C/C++ libraries in pandas and numpy.

    • Vectorization instead of for loops when possible

    • Runs sub-models that solve Python expression files that operate on data tables

  • Data Handling

    • Inputs are in CSV format, with the exception of settings

    • CSVs are read in as pandas tables and stored in an intermediate HDF5 binary file that is used for data I/O throughout the model run

    • Key outputs are written to CSV files

  • Key Data Structures

    • pandas.DataFrame - A data table with rows and columns, similar to an R data frame, Excel worksheet, or database table

    • pandas.Series - a vector of data, a column in a DataFrame table or a 1D array

    • numpy.array - an N-dimensional array of items of the same type, such as a matrix

  • Model Orchestrator

    • ORCA is used for running the overall model system and for defining dynamic data tables, columns, and injectables (functions). ActivitySim wraps ORCA functionality to make a Data Pipeline tool, which allows for re-starting at any model step.

    • Support for multiprocessing to reduce runtime

  • Expressions

    • Model expressions are in CSV files and contain Python expressions, mainly pandas/numpy expressions that operate on the input data tables. This helps to avoid modifying Python code when making changes to the model calculations.

  • Code Documentation

  • Testing

    • A protected master branch that can only be written to after tests have passed

    • pytest for tests

    • TravisCI for building and testing with each commit

PopulationSim also requires an optimization library for balancing and integerizing. The software makes use of the open source and easy to install ortools package. The ortools integerization results vary from platform to platform, since edge case results depend on the exact ortools/cbc version.

Core Components

assign

populationsim.assign.assign_variable(target, expression, df, locals_dict, df_alias=None, trace_rows=None)

Evaluate an expression of a given data table.

Expressions are evaluated using Python’s eval function. Python expressions have access to the variables in locals_dict, and to df as the variable df. They also have access to previously assigned targets under the assigned target name.

Users should ensure that each expression evaluates to a pandas Series (scalars are automatically promoted to Series).

Parameters
assignment_expressions: pandas.DataFrame of target assignment expressions

target: target column names
expression: pandas or python expression to evaluate

df: pandas.DataFrame
locals_dict: Dict

This is a dictionary of local variables that will be the environment for an evaluation of a “python” expression.

trace_rows: series or array of bools to use as mask to select target rows to trace
Returns
result: pandas.Series

Will have the index of df and columns named by target and containing the result of evaluating expression

trace_df: pandas.Series or None

a series containing the eval result values for the assignment expression
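
As a minimal sketch of the evaluation described above (not the actual assign_variable implementation), the snippet below evaluates one expression with eval, exposes the data table as df alongside a dictionary of local variables, and promotes a scalar result to a Series. The helper name evaluate_expression and the example column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def evaluate_expression(target, expression, df, locals_dict):
    """Evaluate one expression against df and return a Series named target.

    Hypothetical helper for illustration; not part of PopulationSim.
    """
    # Expose the table as `df` next to the caller-supplied local variables.
    env = dict(locals_dict)
    env["df"] = df
    values = eval(expression, {}, env)
    # Promote scalars so the result always shares the index of df.
    if np.isscalar(values):
        values = pd.Series(values, index=df.index)
    return values.rename(target)


households = pd.DataFrame({"persons": [1, 4], "workers": [1, 2]})
share = evaluate_expression(
    "worker_share", "df.workers / df.persons", households, {"np": np}
)
max_size = evaluate_expression("max_size", "MAX_SIZE", households, {"MAX_SIZE": 7})
```

Because eval runs arbitrary Python, expression files of this kind should only come from trusted sources.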

balancer

class populationsim.balancer.ListBalancer(incidence_table, initial_weights, control_totals, control_importance_weights, lb_weights, ub_weights, master_control_index, max_iterations)

Single-geography list balancer using Newton-Raphson method with control relaxation.

Takes a list of households with initial weights assigned to each household, and updates those weights in such a way as to match the marginal distribution of control variables while minimizing the change to the initial weights. Uses Newton-Raphson method with control relaxation.

The resulting weights are floats, so they need to be integerized to obtain integer household weights.
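
The sketch below illustrates the idea of list balancing with a simpler, swapped-in technique: iterative proportional fitting, rather than the Newton-Raphson method with control relaxation that ListBalancer uses. It assumes 0/1 incidence and mutually consistent controls, and the function name and example data are hypothetical.

```python
import numpy as np


def ipf_list_balance(incidence, initial_weights, control_totals, max_iterations=200):
    """Scale household weights so weighted incidence matches the controls.

    Simplified illustration using iterative proportional fitting; it assumes
    0/1 incidence and does not relax inconsistent controls.
    """
    weights = np.asarray(initial_weights, dtype=float).copy()
    incidence = np.asarray(incidence, dtype=float)
    for _ in range(max_iterations):
        for c, total in enumerate(control_totals):
            members = incidence[:, c] > 0
            current = weights[members].sum()
            if current > 0:
                # Rescale only the households that contribute to control c.
                weights[members] *= total / current
    return weights


# Four households; columns are [total households, households with a worker].
incidence = np.array([[1, 1], [1, 1], [1, 0], [1, 0]])
weights = ipf_list_balance(incidence, np.ones(4), control_totals=[100, 60])
```

In this example the two worker households end up sharing the 60 worker-household control and the remaining two households share the rest of the 100 total.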

integerizer

populationsim.integerizer.do_integerizing(trace_label, control_spec, control_totals, incidence_table, float_weights, total_hh_control_col)
Parameters
trace_label: str

trace label indicating geography zone being integerized (e.g. PUMA_600)

control_spec: pandas.DataFrame

full control spec with columns ‘target’, ‘seed_table’, ‘importance’, …

control_totals: pandas.Series

control totals explicitly specified for this zone

incidence_table: pandas.DataFrame
float_weights: pandas.Series

balanced float weights to integerize

total_hh_control_col: str

name of total_hh column (preferentially constrain to match this control)

Returns
integerized_weights: pandas.Series
status: str

as defined in integerizer.STATUS_TEXT and STATUS_SUCCESS

populationsim.integerizer.smart_round(int_weights, resid_weights, target_sum)

Round weights while ensuring, as far as possible, that the result sums to target_sum.

Parameters
int_weights: numpy.ndarray(int)
resid_weights: numpy.ndarray(float)
target_sum: int
Returns
rounded_weights: numpy.ndarray of ints
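
One plausible way to round toward a target sum is sketched below, under the assumption that any shortfall is made up by rounding up the households with the largest residual weights. The actual smart_round logic may differ, and the function name round_to_target is hypothetical.

```python
import numpy as np


def round_to_target(int_weights, resid_weights, target_sum):
    """Add 1 to the households with the largest residuals until the
    weights sum to target_sum (as far as possible)."""
    int_weights = np.asarray(int_weights, dtype=int)
    resid_weights = np.asarray(resid_weights, dtype=float)
    # Each household can be rounded up at most once.
    shortfall = int(target_sum - int_weights.sum())
    shortfall = max(0, min(shortfall, len(int_weights)))
    rounded = int_weights.copy()
    if shortfall > 0:
        largest = np.argsort(-resid_weights, kind="stable")[:shortfall]
        rounded[largest] += 1
    return rounded


rounded_weights = round_to_target([1, 2, 0], [0.2, 0.7, 0.6], target_sum=5)
```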

lp

populationsim.lp.get_simul_integerizer()

Return simul-integerizer function using installed/configured Linear Programming library.

Different LP packages can be used for integerization (e.g. ortools or cvx), and this function hides the specifics of the individual packages so they can be swapped with minimal impact.

Returns
integerizer_func: function pointer to simul-integerizer function with call signature:

def np_simul_integerizer(sub_int_weights, parent_countrol_importance, parent_relax_ge_upper_bound, sub_countrol_importance, sub_float_weights, sub_resid_weights, lp_right_hand_side, parent_hh_constraint_ge_bound, sub_incidence, parent_incidence, total_hh_right_hand_side, relax_ge_upper_bound, parent_lp_right_hand_side, hh_constraint_ge_bound, parent_resid_weights, total_hh_sub_control_index, total_hh_parent_control_index)

populationsim.lp.get_single_integerizer()

Return single integerizer function using installed/configured Linear Programming library.

Different LP packages can be used for integerization (e.g. ortools or cvx), and this function hides the specifics of the individual packages so they can be swapped with minimal impact.

Returns
integerizer_func: function pointer to single integerizer function with call signature:

def np_integerizer(incidence, resid_weights, log_resid_weights, control_importance_weights, total_hh_control_index, lp_right_hand_side, relax_ge_upper_bound, hh_constraint_ge_bound)
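
The pattern of returning a backend function behind a common signature can be sketched as follows. This is only an illustration of the dispatch idea: the two backends are trivial stand-ins that do not solve a linear program, and all names here are hypothetical rather than part of populationsim.lp.

```python
def _integerizer_round(resid_weights):
    """Stand-in backend: rounds each residual (no LP solve)."""
    return [int(w >= 0.5) for w in resid_weights], "OPTIMAL"


def _integerizer_floor(resid_weights):
    """Stand-in backend: truncates every residual to zero."""
    return [0 for _ in resid_weights], "FEASIBLE"


# Registry of interchangeable backends sharing one call signature.
_BACKENDS = {"round": _integerizer_round, "floor": _integerizer_floor}


def get_integerizer(backend="round"):
    """Return the integerizer function for the configured backend."""
    if backend not in _BACKENDS:
        raise ValueError(f"unknown integerizer backend: {backend}")
    return _BACKENDS[backend]


integerizer_func = get_integerizer("round")
resid_weights_out, status_text = integerizer_func([0.2, 0.8, 0.5])
```

Callers only depend on the shared signature and return values, so the solver can be replaced without touching the calling code.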

lp_cvx

populationsim.lp_cvx.np_integerizer_cvx(incidence, resid_weights, log_resid_weights, control_importance_weights, total_hh_control_index, lp_right_hand_side, relax_ge_upper_bound, hh_constraint_ge_bound)

cvx-based single-integerizer function taking numpy data types and conforming to a standard function signature that allows it to be swapped interchangeably with alternate LP implementations.

Parameters
incidence: numpy.ndarray(control_count, sample_count) float
resid_weights: numpy.ndarray(sample_count,) float
log_resid_weights: numpy.ndarray(sample_count,) float
control_importance_weights: numpy.ndarray(control_count,) float
total_hh_control_index: int
lp_right_hand_side: numpy.ndarray(control_count,) float
relax_ge_upper_bound: numpy.ndarray(control_count,) float
hh_constraint_ge_bound: numpy.ndarray(control_count,) float
Returns
resid_weights_out: numpy.ndarray(sample_count,)
status_text: str

populationsim.lp_cvx.np_simul_integerizer_cvx(sub_int_weights, parent_countrol_importance, parent_relax_ge_upper_bound, sub_countrol_importance, sub_float_weights, sub_resid_weights, lp_right_hand_side, parent_hh_constraint_ge_bound, sub_incidence, parent_incidence, total_hh_right_hand_side, relax_ge_upper_bound, parent_lp_right_hand_side, hh_constraint_ge_bound, parent_resid_weights, total_hh_sub_control_index)

cvx-based simul-integerizer function taking numpy data types and conforming to a standard function signature that allows it to be swapped interchangeably with alternate LP implementations.

Parameters
sub_int_weights: numpy.ndarray(sub_zone_count, sample_count) int
parent_countrol_importance: numpy.ndarray(parent_control_count,) float
parent_relax_ge_upper_bound: numpy.ndarray(parent_control_count,) float
sub_countrol_importance: numpy.ndarray(sub_control_count,) float
sub_float_weights: numpy.ndarray(sub_zone_count, sample_count) float
sub_resid_weights: numpy.ndarray(sub_zone_count, sample_count) float
lp_right_hand_side: numpy.ndarray(sub_zone_count, sub_control_count) float
parent_hh_constraint_ge_bound: numpy.ndarray(parent_control_count,) float
sub_incidence: numpy.ndarray(sample_count, sub_control_count) float
parent_incidence: numpy.ndarray(sample_count, parent_control_count) float
total_hh_right_hand_side: numpy.ndarray(sub_zone_count,) float
relax_ge_upper_bound: numpy.ndarray(sub_zone_count, sub_control_count) float
parent_lp_right_hand_side: numpy.ndarray(parent_control_count,) float
hh_constraint_ge_bound: numpy.ndarray(sub_zone_count, sub_control_count) float
parent_resid_weights: numpy.ndarray(sample_count,) float
total_hh_sub_control_index: int
Returns
resid_weights_out: numpy.ndarray of float

residual weights in range [0..1] as solved, or, in case of failure, sub_resid_weights unchanged

status_text: string

STATUS_OPTIMAL, STATUS_FEASIBLE in case of success, or a solver-specific failure status

lp_ortools

populationsim.lp_ortools.np_integerizer_ortools(incidence, resid_weights, log_resid_weights, control_importance_weights, total_hh_control_index, lp_right_hand_side, relax_ge_upper_bound, hh_constraint_ge_bound)

ortools single-integerizer function taking numpy data types and conforming to a standard function signature that allows it to be swapped interchangeably with alternate LP implementations.

Parameters
incidence: numpy.ndarray(control_count, sample_count) float
resid_weights: numpy.ndarray(sample_count,) float
log_resid_weights: numpy.ndarray(sample_count,) float
control_importance_weights: numpy.ndarray(control_count,) float
total_hh_control_index: int
lp_right_hand_side: numpy.ndarray(control_count,) float
relax_ge_upper_bound: numpy.ndarray(control_count,) float
hh_constraint_ge_bound: numpy.ndarray(control_count,) float
Returns
resid_weights_out: numpy.ndarray(sample_count,)
status_text: str

populationsim.lp_ortools.np_simul_integerizer_ortools(sub_int_weights, parent_countrol_importance, parent_relax_ge_upper_bound, sub_countrol_importance, sub_float_weights, sub_resid_weights, lp_right_hand_side, parent_hh_constraint_ge_bound, sub_incidence, parent_incidence, total_hh_right_hand_side, relax_ge_upper_bound, parent_lp_right_hand_side, hh_constraint_ge_bound, parent_resid_weights, total_hh_sub_control_index, total_hh_parent_control_index)

ortools-based simul-integerizer function taking numpy data types and conforming to a standard function signature that allows it to be swapped interchangeably with alternate LP implementations.

Parameters
sub_int_weights: numpy.ndarray(sub_zone_count, sample_count) int
parent_countrol_importance: numpy.ndarray(parent_control_count,) float
parent_relax_ge_upper_bound: numpy.ndarray(parent_control_count,) float
sub_countrol_importance: numpy.ndarray(sub_control_count,) float
sub_float_weights: numpy.ndarray(sub_zone_count, sample_count) float
sub_resid_weights: numpy.ndarray(sub_zone_count, sample_count) float
lp_right_hand_side: numpy.ndarray(sub_zone_count, sub_control_count) float
parent_hh_constraint_ge_bound: numpy.ndarray(parent_control_count,) float
sub_incidence: numpy.ndarray(sample_count, sub_control_count) float
parent_incidence: numpy.ndarray(sample_count, parent_control_count) float
total_hh_right_hand_side: numpy.ndarray(sub_zone_count,) float
relax_ge_upper_bound: numpy.ndarray(sub_zone_count, sub_control_count) float
parent_lp_right_hand_side: numpy.ndarray(parent_control_count,) float
hh_constraint_ge_bound: numpy.ndarray(sub_zone_count, sub_control_count) float
parent_resid_weights: numpy.ndarray(sample_count,) float
total_hh_sub_control_index: int
total_hh_parent_control_index: int
Returns
resid_weights_out: numpy.ndarray of float

residual weights in range [0..1] as solved, or, in case of failure, sub_resid_weights unchanged

status_text: string

STATUS_OPTIMAL, STATUS_FEASIBLE in case of success, or a solver-specific failure status

multi_integerizer

populationsim.multi_integerizer.do_no_integerizing(trace_label, incidence_df, sub_weights, sub_controls_df, control_spec, total_hh_control_col, sub_control_zones, sub_geography)

populationsim.multi_integerizer.do_sequential_integerizing(trace_label, incidence_df, sub_weights, sub_controls_df, control_spec, total_hh_control_col, sub_control_zones, sub_geography, combine_results=True)

Note: this method returns different results depending on the value of combine_results.

Parameters
incidence_df: pandas.DataFrame

full incidence_df for all hh samples in seed zone

sub_zone_weights: pandas.DataFrame

balanced subzone household sample weights to integerize

sub_controls_df: pandas.DataFrame

sub_geography controls (one row per zone indexed by sub_zone id)

control_spec: pandas.DataFrame

full control spec with columns ‘target’, ‘seed_table’, ‘importance’, …

total_hh_control_col: str

name of total_hh column (so we can preferentially match this control)

sub_geography: str

subzone geography name (e.g. ‘TAZ’)

sub_control_zones: pandas.Series

series mapping zone_id (index) to zone label (value) for use in sub_controls_df column names

combine_results: bool

whether to return all results in a single frame (True) or to return infeasible rounded results separately (False)

Returns
For combined results:
integerized_weights_df: pandas.DataFrame

canonical form weight table, with columns for ‘balanced_weight’, ‘integer_weight’ plus columns for household id, and sub_geography zone ids

For segregated results:
integerized_zone_ids: array(int)

zone_ids of feasible (integerized) zones

rounded_zone_ids: array(int)

zone_ids of infeasible (rounded) zones

integerized_weights_df: pandas.DataFrame or None if all zones infeasible

integerized weights for feasible zones

rounded_weights_df: pandas.DataFrame or None if all zones feasible

rounded weights for infeasible zones

Results dataframes are canonical form weight tables, with columns for ‘balanced_weight’, ‘integer_weight’ plus columns for household id, and sub_geography zone ids

populationsim.multi_integerizer.do_simul_integerizing(trace_label, incidence_df, sub_weights, sub_controls_df, control_spec, total_hh_control_col, sub_geography, sub_control_zones)

Wrapper around simultaneous integerizer to handle solver failure for infeasible subzones.

Simultaneously integerize balanced float sub_weights. If simultaneous integerization fails, integerize serially to identify infeasible subzones, remove and smart_round the infeasible subzones, and try simultaneous integerization again. (That ought to succeed, but if not, fall back to all-sequential integerization.) Finally, combine all results into a single result dataframe.

Parameters
incidence_df: pandas.DataFrame

full incidence_df for all hh samples in seed zone

sub_zone_weights: pandas.DataFrame

balanced subzone household sample weights to integerize

sub_controls_df: pandas.DataFrame

sub_geography controls (one row per zone indexed by sub_zone id)

control_spec: pandas.DataFrame

full control spec with columns ‘target’, ‘seed_table’, ‘importance’, …

total_hh_control_col: str

name of total_hh column (so we can preferentially match this control)

sub_geography: str

subzone geography name (e.g. ‘TAZ’)

sub_control_zones: pandas.Series

index is zone id and value is zone label (e.g. TAZ_101) for use in sub_controls_df column names

Returns
integer_weights_df: pandas.DataFrame

canonical form weight table, with columns for ‘balanced_weight’, ‘integer_weight’ plus columns for household id, and sub_geography zone ids

populationsim.multi_integerizer.multi_integerize(incidence_df, sub_zone_weights, sub_controls_df, control_spec, total_hh_control_col, parent_geography, parent_id, sub_geography, sub_control_zones)
Parameters
incidence_df: pandas.DataFrame

full incidence_df for all hh samples in seed zone

sub_zone_weights: pandas.DataFrame

balanced subzone household sample weights to integerize

sub_controls_df: pandas.DataFrame

sub_geography controls (one row per zone indexed by sub_zone id)

control_spec: pandas.DataFrame

full control spec with columns ‘target’, ‘seed_table’, ‘importance’, …

total_hh_control_col: str

name of total_hh column (so we can preferentially match this control)

parent_geography: str

parent geography zone name

parent_id: int

parent geography zone id

sub_geography: str

subzone geography name (e.g. ‘TAZ’)

sub_control_zones: pandas.Series

index is zone id and value is zone label (e.g. TAZ_101) for use in sub_controls_df column names

Returns
integer_weights_df: pandas.DataFrame

canonical form weight table, with columns for ‘balanced_weight’, ‘integer_weight’ plus columns for household id, parent and sub_geography zone ids

populationsim.multi_integerizer.reshape_result(float_weights, integerized_weights, sub_geography, sub_control_zones)

Reshape results into unstacked form - (same as that returned by sequential integerizer) with columns for ‘balanced_weight’, ‘integer_weight’ plus columns for household id, and sub_geography zone ids

Parameters
float_weights: pandas.DataFrame

dataframe with one row per sample hh and one column per sub_zone

integerized_weights: pandas.DataFrame

dataframe with one row per sample hh and one column per sub_zone

sub_geography: str

name of sub_geography for result column name

sub_control_zones: pandas.Series

series mapping zone_id (index) to zone label (value)

Returns
integer_weights_df: pandas.DataFrame

canonical form weight table, with columns for ‘balanced_weight’, ‘integer_weight’ plus columns for household id, and sub_geography zone ids
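
The reshaping from one column per sub_zone to a long weight table can be sketched with pandas as follows. This is an illustrative approximation rather than the reshape_result source: the function name reshape_weights, the hh_id column name, and the assumption that the wide columns are labeled by zone label are all hypothetical.

```python
import pandas as pd


def reshape_weights(float_weights, integerized_weights, sub_geography, sub_control_zones):
    """Stack wide per-zone weight columns into one long weight table."""
    frames = []
    for zone_id, zone_label in sub_control_zones.items():
        frames.append(pd.DataFrame({
            "hh_id": float_weights.index.to_numpy(),  # assumed id column name
            sub_geography: zone_id,
            "balanced_weight": float_weights[zone_label].to_numpy(),
            "integer_weight": integerized_weights[zone_label].to_numpy(),
        }))
    return pd.concat(frames, ignore_index=True)


zones = pd.Series({101: "TAZ_101", 102: "TAZ_102"})
float_w = pd.DataFrame({"TAZ_101": [0.4, 1.6], "TAZ_102": [0.6, 0.4]}, index=[10, 11])
int_w = pd.DataFrame({"TAZ_101": [0, 2], "TAZ_102": [1, 0]}, index=[10, 11])
long_df = reshape_weights(float_w, int_w, "TAZ", zones)
```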

populationsim.multi_integerizer.try_simul_integerizing(trace_label, incidence_df, sub_weights, sub_controls_df, sub_geography, control_spec, total_hh_control_col, sub_control_zones)

Attempt simultaneous integerization and return integerized weights if successful

Parameters
incidence_df
sub_weights
sub_controls_df
sub_geography
control_spec
total_hh_control_col
sub_control_zones
Returns
status: str

str value of integerizer status from the STATUS_TEXT dict; integerization was successful if status is in the STATUS_SUCCESS list

integerized_weights_df: pandas.DataFrame or None

canonical form weight table, with columns for ‘balanced_weight’, ‘integer_weight’, or None if integerization failed

simul_balancer

class populationsim.simul_balancer.SimultaneousListBalancer(incidence_table, parent_weights, controls, sub_control_zones, total_hh_control_col)

Dual-zone simultaneous list balancer using Newton-Raphson method with control relaxation.

Simultaneously balances the household weights across multiple subzones of a parent zone, ensuring that the total weight of each household across sub-zones sums to the parent hh weight.

The resulting weights are floats, so they need to be integerized to obtain integer household weights.

populationsim.simul_balancer.np_simul_balancer(sample_count, control_count, zone_count, master_control_index, incidence, parent_weights, weights_lower_bound, weights_upper_bound, sub_weights, parent_controls, controls_importance, sub_controls)

Simultaneous balancer using only numpy (no pandas) data types. It is a separate function to ensure that no pandas data types leak in from object instance variables, since they are often silently accepted as numpy arguments but slow things down.

Model Steps

input_pre_processor

populationsim.steps.input_pre_processor.input_pre_processor()

Read input text files and save them as pipeline tables for use in subsequent steps.

The files to read are specified by table_list, an array of dicts that specify the input file name and the name of the pipeline table, along with keys that allow the specification of pre-processing steps.

By default, reads table_list from ‘input_table_list’ in settings.yaml, unless an alternate table_list name is specified as a model step argument ‘table_list’. (This allows alternate/additional input files to be read for repop)

In the case of repop, this step is being run after an initial run has completed, in which case the input_table_list may specify replacement tables. (e.g. lowest geography controls that will replace the previous low controls dataframe.)

See input_table_list in settings.yaml in the example folder for a working example

key           description
tablename     name of pipeline table in which to store dataframe
filename      name of csv file to read (in data_dir)
column_map    list of input columns to rename from_name: to_name
index_col     name of column to set as dataframe index column
drop_columns  list of column names of columns to drop
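
To show how the keys above might be applied to a single input table, here is a small sketch using pandas. It is not the input_pre_processor source: the helper read_input_table, the order in which the keys are applied, and the example file and column names are assumptions, and an in-memory CSV stands in for a file in data_dir.

```python
import io

import pandas as pd


def read_input_table(table_info, csv_source):
    """Read one CSV and apply the optional pre-processing keys."""
    df = pd.read_csv(csv_source)
    if table_info.get("column_map"):
        df = df.rename(columns=table_info["column_map"])
    if table_info.get("drop_columns"):
        df = df.drop(columns=table_info["drop_columns"])
    if table_info.get("index_col"):
        df = df.set_index(table_info["index_col"])
    return df


table_info = {
    "tablename": "households",
    "filename": "seed_households.csv",  # hypothetical file name
    "column_map": {"hhnum": "hh_id"},
    "index_col": "hh_id",
    "drop_columns": ["unused"],
}
csv_text = "hhnum,PUMA,unused\n1,600,a\n2,601,b\n"
households = read_input_table(table_info, io.StringIO(csv_text))
```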

setup_data_structures

populationsim.steps.setup_data_structures.add_geography_columns(incidence_table, households_df, crosswalk_df)

Add seed and meta geography columns to incidence_table

Parameters
incidence_table
households_df
crosswalk_df

populationsim.steps.setup_data_structures.build_crosswalk_table()

Build crosswalk table filtered to include only zones in lowest geography.

populationsim.steps.setup_data_structures.filter_households(households_df, persons_df, crosswalk_df)

Filter households and persons tables, removing zero weight households and any households not in seed zones.

Returns filtered households_df and persons_df
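
A rough pandas sketch of this kind of filtering is shown below. The column names (PUMA, WGTP, hh_id) and the function name filter_seed_households are assumptions for the example; in PopulationSim the relevant names come from the settings file.

```python
import pandas as pd


def filter_seed_households(households_df, persons_df, crosswalk_df,
                           seed_col="PUMA", weight_col="WGTP", hh_id_col="hh_id"):
    """Drop zero-weight households, households outside the seed zones,
    and the persons that belong to dropped households."""
    in_seed_zone = households_df[seed_col].isin(crosswalk_df[seed_col])
    has_weight = households_df[weight_col] > 0
    households_df = households_df[in_seed_zone & has_weight]
    persons_df = persons_df[persons_df[hh_id_col].isin(households_df[hh_id_col])]
    return households_df, persons_df


households = pd.DataFrame(
    {"hh_id": [1, 2, 3], "PUMA": [600, 600, 999], "WGTP": [10, 0, 5]}
)
persons = pd.DataFrame({"hh_id": [1, 1, 2, 3]})
crosswalk = pd.DataFrame({"PUMA": [600, 601]})
kept_households, kept_persons = filter_seed_households(households, persons, crosswalk)
```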

populationsim.steps.setup_data_structures.repop_setup_data_structures(households, persons)

Setup geographic correspondence (crosswalk), control sets, and incidence tables for repop run.

New lowest-level geography control tables should already have been read in by rerunning input_pre_processor with a table_list override. The control table contains one row for each zone, with columns specifying control field totals for that control.

This step reads in the repop control file, which specifies which control fields in the control table should be used for balancing, along with their importance and the recipe (seed table and expression) for determining household incidence for that control.

Parameters
households: pipeline table
persons: pipeline table

populationsim.steps.setup_data_structures.setup_data_structures(settings, households, persons)

Setup geographic correspondence (crosswalk), control sets, and incidence tables.

Control tables for target geographies should already have been read in by running input_pre_processor. Each zone control table contains one row for each zone, with columns specifying control field totals for that control.

This step reads in the global control file, which specifies which control fields in the control table should be used for balancing, along with their importance and the recipe (seed table and expression) for determining household incidence for that control.

If the GROUP_BY_INCIDENCE_SIGNATURE setting is enabled, then incidence table rows are household group ids, and an additional household_groups table is created mapping hh group ids to actual hh_ids.

Parameters
settings: dict

contents of settings.yaml as dict

households: pipeline table
persons: pipeline table

creates pipeline tables:

crosswalk
controls
geography-specific controls
incidence_table
household_groups (if GROUP_BY_INCIDENCE_SIGNATURE setting is enabled)

modifies tables:

households
persons
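
Grouping households by incidence signature can be pictured with the pandas sketch below. It is only an illustration of the idea, with hypothetical control names and a hypothetical group_id column; it ignores how weights and geography are handled for the groups in the actual step.

```python
import pandas as pd

# Incidence of each household in two hypothetical controls.
incidence = pd.DataFrame(
    {"num_hh": [1, 1, 1, 1], "hh_size_1": [1, 0, 1, 0]},
    index=pd.Index([11, 12, 13, 14], name="hh_id"),
)

# Households with identical incidence rows share a group id.
group_id = incidence.groupby(list(incidence.columns), sort=False).ngroup()

# Mapping from household group ids back to the actual hh_ids.
household_groups = pd.DataFrame(
    {"hh_id": incidence.index.to_numpy(), "group_id": group_id.to_numpy()}
)

# One incidence row per group instead of one per household.
grouped_incidence = incidence.groupby(group_id.to_numpy()).first()
```

Balancing the two grouped rows instead of the four household rows is what reduces the problem size when many households share a signature.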

initial_seed_balancing

populationsim.steps.initial_seed_balancing.initial_seed_balancing(settings, crosswalk, control_spec, incidence_table)

Balance the household weights for each of the seed geographies (independently) using the seed level controls and the aggregated sub-zone controls totals.

Create the seed_weights table with one row per household and columns containing household_id, seed geography (e.g. PUMA), and float preliminary_balanced_weight.

Adds seed_weights table to pipeline named <seed_geography>_weights (e.g. PUMA_weights):

index hh_id  PUMA  preliminary_balanced_weight  hh_id
0            600   0.313555                     0
1            601   0.627110                     1
2            602   0.313555                     2
…

Parameters
settings: dict (settings.yaml as dict)
crosswalk: pipeline table
control_spec: pipeline table
incidence_table: pipeline table

meta_control_factoring

populationsim.steps.meta_control_factoring.meta_control_factoring(settings, control_spec, incidence_table)

Apply simple factoring to summed household fractional weights based on original meta control values relative to summed household fractional weights by meta zone.

The resulting factored meta control weights will be new meta controls appended as additional columns to the seed control table, for final balancing.

Parameters
settings: dict (settings.yaml as dict)
control_spec: pipeline table
incidence_table: pipeline table
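
The factoring described above can be illustrated with a small numeric sketch. The geography names (PUMA, REGION), the control name num_workers, and the values are made up for the example, and the real step operates on pipeline tables rather than these ad hoc frames.

```python
import pandas as pd

# Summed household fractional weights for one control, by seed zone,
# together with the meta zone each seed zone belongs to.
seed_totals = pd.DataFrame(
    {"REGION": [1, 1, 2], "num_workers": [400.0, 600.0, 500.0]},
    index=pd.Index([600, 601, 602], name="PUMA"),
)
# Original meta-level control values, indexed by meta zone.
meta_controls = pd.Series({1: 1100.0, 2: 450.0})

# Factor = meta control / summed household fractional weights in that meta zone.
meta_sums = seed_totals.groupby("REGION")["num_workers"].sum()
factors = meta_controls / meta_sums

# Factored seed-level controls sum to the meta control within each meta zone.
seed_totals["num_workers_factored"] = (
    seed_totals["num_workers"] * seed_totals["REGION"].map(factors)
)
```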

final_seed_balancing

populationsim.steps.final_seed_balancing.final_seed_balancing(settings, crosswalk, control_spec, incidence_table)

Balance the household weights for each of the seed geographies (independently) using the seed level controls and the aggregated sub-zone controls totals.

Create the seed_weights table with one row per household and columns containing household_id, seed geography (e.g. PUMA), and float preliminary_balanced_weights

Adds column balanced_weight to the seed_weights table

Parameters
settings: dict (settings.yaml as dict)
crosswalk: pipeline table
control_spec: pipeline table
incidence_table: pipeline table

integerize_final_seed_weights

populationsim.steps.integerize_final_seed_weights.integerize_final_seed_weights(settings, crosswalk, control_spec, incidence_table)

Final balancing for each seed (puma) zone with aggregated low and mid-level controls and distributed meta-level controls.

Adds integer_weight column to seed-level weight table

Parameters
settings: dict (settings.yaml as dict)
crosswalk: pipeline table
control_spec: pipeline table
incidence_table: pipeline table

sub_balancing

populationsim.steps.sub_balancing.balance(incidence_df, parent_weights, sub_controls_df, control_spec, total_hh_control_col, parent_geography, parent_id, sub_geographies, sub_control_zones)
Parameters
incidence_df: pandas.DataFrame

full incidence_df for all hh samples in seed zone

parent_weights: pandas.Series

parent zone balanced (possibly integerized) aggregate target weights

sub_controls_df: pandas.DataFrame

sub_geography controls (one row per zone indexed by sub_zone id)

control_spec: pandas.DataFrame

full control spec with columns ‘target’, ‘seed_table’, ‘importance’, …

total_hh_control_col: str

name of total_hh column (so we can preferentially match this control)

parent_geography: str

parent geography zone name

parent_id: int

parent geography zone id

sub_geographies: list(str)

list of subgeographies in descending order

sub_control_zones: pandas.Series

index is zone id and value is zone label (e.g. TAZ_101) for use in sub_controls_df column names

Returns
sub_zone_weights: pandas.DataFrame

balanced subzone household float sample weights

populationsim.steps.sub_balancing.balance_and_integerize(incidence_df, parent_weights, sub_controls_df, control_spec, total_hh_control_col, parent_geography, parent_id, sub_geographies, crosswalk_df)
Parameters
incidence_df: pandas.DataFrame

full incidence_df for all hh samples in seed zone

parent_weights: pandas.Series

parent zone balanced (possibly integerized) aggregate target weights

sub_controls_df: pandas.DataFrame

sub_geography controls (one row per zone indexed by sub_zone id)

control_spec: pandas.DataFrame

full control spec with columns ‘target’, ‘seed_table’, ‘importance’, …

total_hh_control_col: str

name of total_hh column (so we can preferentially match this control)

parent_geography: str

parent geography zone name

parent_id: int

parent geography zone id

sub_geographies: list(str)

list of subgeographies in descending order

crosswalk_df: pandas.DataFrame

geo crosswalk table sliced to current seed geography

Returns
integerized_sub_zone_weights_df: pandas.DataFrame

canonical form weight table, with columns for ‘balanced_weight’, ‘integer_weight’ plus columns for household id and sub_geography zone ids

populationsim.steps.sub_balancing.sub_balancing(settings, crosswalk, control_spec, incidence_table)

Simul-balance and integerize all zones at a specified geographic level in groups by parent zone.

For instance, if the ‘geography’ step arg is ‘TRACT’ and the parent geography is ‘SEED’, then for each seed zone, we simul-balance the TRACTS it contains.

Creates a weight table for the target geography with float ‘balanced_weight’ and ‘integer_weight’ columns.

Parameters
settings: dict (settings.yaml as dict)
crosswalk: pipeline table
control_spec: pipeline table
incidence_table: pipeline table

expand_households

populationsim.steps.expand_households.expand_households()

Create a complete expanded synthetic household list with their assigned geographic zone ids.

This is the skeleton synthetic household id list with no household or person attributes, one row per household, with geography columns and seed household table household_id.

Creates pipeline table expanded_household_ids
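
The expansion amounts to repeating each household id once per unit of integer weight. A minimal pandas sketch, with hypothetical column names and data rather than the actual pipeline tables:

```python
import pandas as pd

# Integerized weights by household and zone (hypothetical example data).
weights = pd.DataFrame(
    {"hh_id": [1, 2, 3], "TAZ": [101, 101, 102], "integer_weight": [2, 0, 1]}
)

# Repeat each row label integer_weight times, keeping only id and geography.
repeats = weights["integer_weight"].to_numpy()
expanded_household_ids = (
    weights.loc[weights.index.repeat(repeats), ["hh_id", "TAZ"]]
    .reset_index(drop=True)
)
```

Households with an integer weight of zero drop out, and the number of rows equals the sum of the integer weights.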

write_tables

Write pipeline tables as csv files (in output directory) as specified by output_tables list in settings file.

‘output_tables’ can specify either a list of output tables to include or a list to skip. If no output_tables list is specified, then all checkpointed tables will be written.

To write all output tables EXCEPT the households and persons tables:

output_tables:
  action: skip
  tables:
    - households
    - persons

To write ONLY the households table:

output_tables:
  action: include
  tables:
     - households

To write tables into a single HDF5 store instead of individual CSVs, use the h5_store flag:

output_tables:
  h5_store: True
  action: include
  tables:
     - households

Parameters

output_dir: str
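
The include/skip rule for output_tables can be expressed as a small selection function. This is a sketch of the rule as described above, not the write_tables source; the function name select_output_tables and the example table names are assumptions.

```python
def select_output_tables(checkpointed_tables, output_tables=None):
    """Return the names of the tables to write, given an output_tables setting."""
    # With no output_tables setting, every checkpointed table is written.
    if not output_tables:
        return list(checkpointed_tables)
    action = output_tables.get("action")
    tables = output_tables.get("tables", [])
    if action == "include":
        return [t for t in checkpointed_tables if t in tables]
    if action == "skip":
        return [t for t in checkpointed_tables if t not in tables]
    raise ValueError(f"unexpected output_tables action: {action}")


checkpointed = ["households", "persons", "crosswalk"]
skipped = select_output_tables(
    checkpointed, {"action": "skip", "tables": ["households", "persons"]}
)
included = select_output_tables(
    checkpointed, {"action": "include", "tables": ["households"]}
)
everything = select_output_tables(checkpointed)
```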

write_synthetic_population

populationsim.steps.write_synthetic_population.write_synthetic_population(expanded_household_ids, households, persons, output_dir)

Write synthetic households and persons tables to output dir as csv files. The settings file allows specification of output file names, household_id column name, and seed data attribute columns to include in output files.

Parameters
expanded_household_ids: pipeline table
households: pipeline table
persons: pipeline table
output_dir: str

summarize

populationsim.steps.summarize.summarize(crosswalk, incidence_table, control_spec)

Write aggregate summary files of controls and weights for all geographic levels to output dir

Parameters
crosswalk: pipeline table
incidence_table: pipeline table
control_spec: pipeline table

repop_balancing

populationsim.steps.repop_balancing.repop_balancing(settings, crosswalk, control_spec, incidence_table)

Balance and integerize all zones at the lowest geographic level.

Creates a weight table for the repop zones target geography with float ‘balanced_weight’ and ‘integer_weight’ columns.

Parameters
settings: dict (settings.yaml as dict)
crosswalk: pipeline table
control_spec: pipeline table
incidence_table: pipeline table

Contribution Guidelines

PopulationSim development follows the same development guidelines as ActivitySim.