Software Implementation¶
This page describes the PopulationSim software implementation and how to contribute to PopulationSim.
The implementation starts with the ActivitySim framework, which serves as the foundation for the software. The framework, as briefly described below, includes features for data pipeline management, expression handling, multiprocessing, testing, etc. Built upon the framework are additional core components for population synthesis such as balancers and integerizers. Built upon the population synthesis core components are the model steps that make up a PopulationSim run, such as the inputs pre-processor, setting up the data structures, doing the initial seed balancing, etc.
ActivitySim Framework¶
PopulationSim is implemented in the ActivitySim framework. As summarized here, being implemented in the ActivitySim framework means:
Overall Design
Data Handling
Inputs are in CSV format, with the exception of settings
CSVs are read-in as pandas tables and stored in an intermediate HDF5 binary file that is used for data I/O throughout the model run
Key outputs are written to CSV files
Key Data Structures
pandas.DataFrame - A data table with rows and columns, similar to an R data frame, Excel worksheet, or database table
pandas.Series - a vector of data, a column in a DataFrame table or a 1D array
numpy.array - an N-dimensional array of items of the same type, such as a matrix
Model Orchestrator
ORCA is used for running the overall model system and for defining dynamic data tables, columns, and injectables (functions). ActivitySim wraps ORCA functionality to make a Data Pipeline tool, which allows for re-starting at any model step.
Support for multiprocessing to reduce runtime
Expressions
Model expressions are in CSV files and contain Python expressions, mainly pandas/numpy expressions that operate on the input data tables. This helps to avoid modifying Python code when making changes to the model calculations.
Python code follows the pycodestyle style guide
Documentation is written in reStructuredText markup and built with Sphinx; docstrings follow numpydoc conventions
PopulationSim also requires an optimization library for balancing and integerizing. The software makes use of the open source and easy to install ortools package. The ortools integerization results vary from platform to platform since edge case results depend on the exact ortools/cbc version.
Core Components¶
assign¶
- populationsim.assign.assign_variable(target, expression, df, locals_dict, df_alias=None, trace_rows=None)¶
Evaluate an expression of a given data table.
Expressions are evaluated using Python’s eval function. Python expressions have access to the variables in locals_dict, and df is accessible as the variable df. They also have access to previously assigned targets under the assigned target name.
Take care that each expression results in a pandas Series (scalars are automatically promoted to Series).
- Parameters
- assignment_expressions: pandas.DataFrame of target assignment expressions
target: target column names; expression: pandas or python expression to evaluate
- df: pandas.DataFrame
- locals_dict: Dict
This is a dictionary of local variables that will be the environment for an evaluation of “python” expression.
- trace_rows: series or array of bools to use as mask to select target rows to trace
- Returns
- result: pandas.Series
Will have the index of df and columns named by target and containing the result of evaluating expression
- trace_df: pandas.Series or None
a series containing the eval result values for the assignment expression
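The evaluation pattern described above can be sketched as follows. This is a minimal illustration, not the populationsim implementation: the helper name evaluate_expression, the households table, and the LARGE_HH constant are all hypothetical.

```python
import numpy as np
import pandas as pd


def evaluate_expression(target, expression, df, locals_dict):
    """Evaluate one expression against df and return a Series named target.

    Hypothetical helper for illustration; not the populationsim function.
    """
    # Expose the table as `df` alongside the caller-supplied names.
    env = dict(locals_dict, df=df)
    value = eval(expression, {"np": np, "pd": pd}, env)
    # Promote scalars (or plain arrays) to a Series aligned with df's index.
    if not isinstance(value, pd.Series):
        value = pd.Series(value, index=df.index)
    return value.rename(target)


# Hypothetical input table and constants.
households = pd.DataFrame({"persons": [1, 3, 5], "income": [20000, 55000, 90000]})
constants = {"LARGE_HH": 4}

is_large = evaluate_expression("is_large", "df.persons >= LARGE_HH", households, constants)
one = evaluate_expression("one", "1", households, constants)
```

Because the expression is a plain string, it can live in a CSV file next to its target name, which is how the expression files described earlier avoid changes to Python code.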
balancer¶
- class populationsim.balancer.ListBalancer(incidence_table, initial_weights, control_totals, control_importance_weights, lb_weights, ub_weights, master_control_index, max_iterations)¶
Single-geography list balancer using Newton-Raphson method with control relaxation.
Takes a list of households with initial weights assigned to each household, and updates those weights in such a way as to match the marginal distribution of control variables while minimizing the change to the initial weights. Uses Newton-Raphson method with control relaxation.
The resulting weights are float weights, so they need to be integerized to integer household weights.
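To make the idea of list balancing concrete, here is a deliberately simplified sketch. It uses plain iterative proportional scaling on a 0/1 incidence table rather than the Newton-Raphson method with control relaxation that ListBalancer uses, and it ignores importance weights and weight bounds. The function name and the example data are hypothetical.

```python
import numpy as np


def simple_list_balance(incidence, initial_weights, control_totals, max_iterations=100):
    """Scale household weights so weighted incidence matches each control.

    incidence: (household_count, control_count) array of 0/1 flags.
    This is a simplified stand-in, not the ListBalancer algorithm.
    """
    weights = np.asarray(initial_weights, dtype=float).copy()
    incidence = np.asarray(incidence, dtype=float)
    for _ in range(max_iterations):
        for c, target in enumerate(control_totals):
            current = weights @ incidence[:, c]
            if current > 0:
                # Scale only the households that count toward control c.
                factor = target / current
                weights = np.where(incidence[:, c] > 0, weights * factor, weights)
    return weights


# Four households; control 0 counts every household, control 1 counts the first two.
incidence = np.array([[1, 1], [1, 1], [1, 0], [1, 0]])
weights = simple_list_balance(incidence, [1, 1, 1, 1], control_totals=[10, 4])
```

In this small case the weights settle near 2 for the first two households and 3 for the others, which are float values that would still need integerizing.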
integerizer¶
- populationsim.integerizer.do_integerizing(trace_label, control_spec, control_totals, incidence_table, float_weights, total_hh_control_col)¶
- Parameters
- trace_label: str
trace label indicating geography zone being integerized (e.g. PUMA_600)
- control_spec: pandas.DataFrame
full control spec with columns ‘target’, ‘seed_table’, ‘importance’, …
- control_totals: pandas.Series
control totals explicitly specified for this zone
- incidence_table: pandas.DataFrame
- float_weights: pandas.Series
balanced float weights to integerize
- total_hh_control_col: str
name of total_hh column (preferentially constrain to match this control)
- Returns
- integerized_weights: pandas.Series
- status: str
as defined in integerizer.STATUS_TEXT and STATUS_SUCCESS
- populationsim.integerizer.smart_round(int_weights, resid_weights, target_sum)¶
Round weights while ensuring (as far as possible) that the result sums to target_sum
- Parameters
- int_weights: numpy.ndarray(int)
- resid_weights: numpy.ndarray(float)
- target_sum: int
- Returns
- rounded_weights: numpy.ndarray of ints
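One way to round toward a target sum is a largest-remainder rule: round up the households with the largest residuals until the total is reached. The sketch below illustrates that rule under the assumption that it is a reasonable reading of the description above; it is not taken from the smart_round source, and the function name is hypothetical.

```python
import numpy as np


def round_to_target(int_weights, resid_weights, target_sum):
    """Round up the largest residuals until the weights sum to target_sum.

    Hypothetical largest-remainder sketch, not the smart_round source.
    """
    int_weights = np.asarray(int_weights, dtype=int)
    resid_weights = np.asarray(resid_weights, dtype=float)
    # Number of households that must be rounded up to reach the target,
    # clipped because each household can be rounded up at most once.
    round_up_count = int(target_sum) - int(int_weights.sum())
    round_up_count = max(0, min(round_up_count, len(int_weights)))
    rounded = int_weights.copy()
    if round_up_count > 0:
        # Indices of the residuals sorted from largest to smallest.
        order = np.argsort(-resid_weights, kind="stable")
        rounded[order[:round_up_count]] += 1
    return rounded


rounded_weights = round_to_target([1, 2, 0], [0.6, 0.3, 0.9], target_sum=5)
```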
lp¶
- populationsim.lp.get_simul_integerizer()¶
Return simul-integerizer function using installed/configured Linear Programming library.
Different LP packages can be used for integerization (e.g. ortools or cvx) and this function hides the specifics of the individual packages so they can be swapped with minimal impact.
- Returns
- integerizer_func: function pointer to simul-integerizer function with call signature:
def np_simul_integerizer(sub_int_weights, parent_countrol_importance, parent_relax_ge_upper_bound, sub_countrol_importance, sub_float_weights, sub_resid_weights, lp_right_hand_side, parent_hh_constraint_ge_bound, sub_incidence, parent_incidence, total_hh_right_hand_side, relax_ge_upper_bound, parent_lp_right_hand_side, hh_constraint_ge_bound, parent_resid_weights, total_hh_sub_control_index, total_hh_parent_control_index)
- populationsim.lp.get_single_integerizer()¶
Return single integerizer function using installed/configured Linear Programming library.
Different LP packages can be used for integerization (e.g. ortools or cvx) and this function hides the specifics of the individual packages so they can be swapped with minimal impact.
- Returns
- integerizer_func: function pointer to single integerizer function with call signature:
def np_integerizer(incidence, resid_weights, log_resid_weights, control_importance_weights, total_hh_control_index, lp_right_hand_side, relax_ge_upper_bound, hh_constraint_ge_bound)
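The idea of returning a solver function so that callers never import an LP package directly can be sketched generically. Everything below is hypothetical: pick_backend, the two stand-in solver functions, and the package names are illustrative only, and the selection logic in populationsim.lp may differ.

```python
import importlib.util


def pick_backend(candidates, fallback):
    """Return the function paired with the first importable package.

    candidates: list of (package_name, function) pairs in order of preference.
    Hypothetical helper; not the populationsim.lp selection logic.
    """
    for package_name, func in candidates:
        if importlib.util.find_spec(package_name) is not None:
            return func
    return fallback


def solve_with_stdlib(values):
    # Stand-in "backend" tied to a package that is always available.
    return sorted(values)


def solve_fallback(values):
    # Stand-in used when no preferred package can be imported.
    return list(values)


integerizer_func = pick_backend(
    [("package_that_is_not_installed", solve_fallback), ("json", solve_with_stdlib)],
    fallback=solve_fallback,
)
```

Callers only ever see integerizer_func, so every backend has to accept the same arguments and return the same kind of result, which is why the documented call signatures are fixed.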
lp_cvx¶
- populationsim.lp_cvx.np_integerizer_cvx(incidence, resid_weights, log_resid_weights, control_importance_weights, total_hh_control_index, lp_right_hand_side, relax_ge_upper_bound, hh_constraint_ge_bound)¶
cvx-based single-integerizer function taking numpy data types and conforming to a standard function signature that allows it to be swapped interchangeably with alternate LP implementations.
- Parameters
- incidence: numpy.ndarray(control_count, sample_count) float
- resid_weights: numpy.ndarray(sample_count,) float
- log_resid_weights: numpy.ndarray(sample_count,) float
- control_importance_weights: numpy.ndarray(control_count,) float
- total_hh_control_index: int
- lp_right_hand_side: numpy.ndarray(control_count,) float
- relax_ge_upper_bound: numpy.ndarray(control_count,) float
- hh_constraint_ge_bound: numpy.ndarray(control_count,) float
- Returns
- resid_weights_out: numpy.ndarray(sample_count,)
- status_text: str
- populationsim.lp_cvx.np_simul_integerizer_cvx(sub_int_weights, parent_countrol_importance, parent_relax_ge_upper_bound, sub_countrol_importance, sub_float_weights, sub_resid_weights, lp_right_hand_side, parent_hh_constraint_ge_bound, sub_incidence, parent_incidence, total_hh_right_hand_side, relax_ge_upper_bound, parent_lp_right_hand_side, hh_constraint_ge_bound, parent_resid_weights, total_hh_sub_control_index)¶
cvx-based simul-integerizer function taking numpy data types and conforming to a standard function signature that allows it to be swapped interchangeably with alternate LP implementations.
- Parameters
- sub_int_weights: numpy.ndarray(sub_zone_count, sample_count) int
- parent_countrol_importance: numpy.ndarray(parent_control_count,) float
- parent_relax_ge_upper_bound: numpy.ndarray(parent_control_count,) float
- sub_countrol_importance: numpy.ndarray(sub_control_count,) float
- sub_float_weights: numpy.ndarray(sub_zone_count, sample_count) float
- sub_resid_weights: numpy.ndarray(sub_zone_count, sample_count) float
- lp_right_hand_side: numpy.ndarray(sub_zone_count, sub_control_count) float
- parent_hh_constraint_ge_bound: numpy.ndarray(parent_control_count,) float
- sub_incidence: numpy.ndarray(sample_count, sub_control_count) float
- parent_incidence: numpy.ndarray(sample_count, parent_control_count) float
- total_hh_right_hand_side: numpy.ndarray(sub_zone_count,) float
- relax_ge_upper_bound: numpy.ndarray(sub_zone_count, sub_control_count) float
- parent_lp_right_hand_side: numpy.ndarray(parent_control_count,) float
- hh_constraint_ge_bound: numpy.ndarray(sub_zone_count, sub_control_count) float
- parent_resid_weights: numpy.ndarray(sample_count,) float
- total_hh_sub_control_index: int
- Returns
- resid_weights_out: numpy.ndarray of float
residual weights in range [0..1] as solved, or, in case of failure, sub_resid_weights unchanged
- status_text: string
STATUS_OPTIMAL, STATUS_FEASIBLE in case of success, or a solver-specific failure status
lp_ortools¶
- populationsim.lp_ortools.np_integerizer_ortools(incidence, resid_weights, log_resid_weights, control_importance_weights, total_hh_control_index, lp_right_hand_side, relax_ge_upper_bound, hh_constraint_ge_bound)¶
ortools single-integerizer function taking numpy data types and conforming to a standard function signature that allows it to be swapped interchangeably with alternate LP implementations.
- Parameters
- incidence: numpy.ndarray(control_count, sample_count) float
- resid_weights: numpy.ndarray(sample_count,) float
- log_resid_weights: numpy.ndarray(sample_count,) float
- control_importance_weights: numpy.ndarray(control_count,) float
- total_hh_control_index: int
- lp_right_hand_side: numpy.ndarray(control_count,) float
- relax_ge_upper_bound: numpy.ndarray(control_count,) float
- hh_constraint_ge_bound: numpy.ndarray(control_count,) float
- Returns
- resid_weights_out: numpy.ndarray(sample_count,)
- status_text: str
- populationsim.lp_ortools.np_simul_integerizer_ortools(sub_int_weights, parent_countrol_importance, parent_relax_ge_upper_bound, sub_countrol_importance, sub_float_weights, sub_resid_weights, lp_right_hand_side, parent_hh_constraint_ge_bound, sub_incidence, parent_incidence, total_hh_right_hand_side, relax_ge_upper_bound, parent_lp_right_hand_side, hh_constraint_ge_bound, parent_resid_weights, total_hh_sub_control_index, total_hh_parent_control_index)¶
ortools-based simul-integerizer function taking numpy data types and conforming to a standard function signature that allows it to be swapped interchangeably with alternate LP implementations.
- Parameters
- sub_int_weights: numpy.ndarray(sub_zone_count, sample_count) int
- parent_countrol_importance: numpy.ndarray(parent_control_count,) float
- parent_relax_ge_upper_bound: numpy.ndarray(parent_control_count,) float
- sub_countrol_importance: numpy.ndarray(sub_control_count,) float
- sub_float_weights: numpy.ndarray(sub_zone_count, sample_count) float
- sub_resid_weights: numpy.ndarray(sub_zone_count, sample_count) float
- lp_right_hand_side: numpy.ndarray(sub_zone_count, sub_control_count) float
- parent_hh_constraint_ge_bound: numpy.ndarray(parent_control_count,) float
- sub_incidence: numpy.ndarray(sample_count, sub_control_count) float
- parent_incidence: numpy.ndarray(sample_count, parent_control_count) float
- total_hh_right_hand_side: numpy.ndarray(sub_zone_count,) float
- relax_ge_upper_bound: numpy.ndarray(sub_zone_count, sub_control_count) float
- parent_lp_right_hand_side: numpy.ndarray(parent_control_count,) float
- hh_constraint_ge_bound: numpy.ndarray(sub_zone_count, sub_control_count) float
- parent_resid_weights: numpy.ndarray(sample_count,) float
- total_hh_sub_control_index: int
- total_hh_parent_control_index: int
- Returns
- resid_weights_out: numpy.ndarray of float
residual weights in range [0..1] as solved, or, in case of failure, sub_resid_weights unchanged
- status_text: string
STATUS_OPTIMAL, STATUS_FEASIBLE in case of success, or a solver-specific failure status
multi_integerizer¶
- populationsim.multi_integerizer.do_no_integerizing(trace_label, incidence_df, sub_weights, sub_controls_df, control_spec, total_hh_control_col, sub_control_zones, sub_geography)¶
- populationsim.multi_integerizer.do_sequential_integerizing(trace_label, incidence_df, sub_weights, sub_controls_df, control_spec, total_hh_control_col, sub_control_zones, sub_geography, combine_results=True)¶
note: this method returns different results depending on the value of combine_results
- Parameters
- incidence_df: pandas.DataFrame
full incidence_df for all hh samples in seed zone
- sub_zone_weights: pandas.DataFrame
balanced subzone household sample weights to integerize
- sub_controls_df: pandas.DataFrame
sub_geography controls (one row per zone indexed by sub_zone id)
- control_spec: pandas.DataFrame
full control spec with columns ‘target’, ‘seed_table’, ‘importance’, …
- total_hh_control_col: str
name of total_hh column (so we can preferentially match this control)
- sub_geography: str
subzone geography name (e.g. ‘TAZ’)
- sub_control_zones: pandas.Series
series mapping zone_id (index) to zone label (value) for use in sub_controls_df column names
- combine_results: bool
return all results in a single frame or return infeasible rounded results separately?
- Returns
- For combined results:
- integerized_weights_df: pandas.DataFrame
canonical form weight table, with columns for ‘balanced_weight’, ‘integer_weight’ plus columns for household id, and sub_geography zone ids
- For segregated results:
- integerized_zone_ids: array(int)
zone_ids of feasible (integerized) zones
- rounded_zone_ids: array(int)
zone_ids of infeasible (rounded) zones
- integerized_weights_df: pandas.DataFrame or None if all zones infeasible
integerized weights for feasible zones
- rounded_weights_df: pandas.DataFrame or None if all zones feasible
rounded weights for infeasible zones
Result dataframes are canonical form weight tables, with columns for ‘balanced_weight’, ‘integer_weight’ plus columns for household id, and sub_geography zone ids
- populationsim.multi_integerizer.do_simul_integerizing(trace_label, incidence_df, sub_weights, sub_controls_df, control_spec, total_hh_control_col, sub_geography, sub_control_zones)¶
Wrapper around simultaneous integerizer to handle solver failure for infeasible subzones.
Simultaneously integerize balanced float sub_weights. If simultaneous integerization fails, integerize serially to identify infeasible subzones, remove and smart_round the infeasible subzones, and try simultaneous integerization again. (That ought to succeed, but if not, fall back to all sequential integerization.) Finally, combine all results into a single result dataframe.
- Parameters
- incidence_df: pandas.DataFrame
full incidence_df for all hh samples in seed zone
- sub_zone_weights: pandas.DataFrame
balanced subzone household sample weights to integerize
- sub_controls_df: pandas.DataFrame
sub_geography controls (one row per zone indexed by sub_zone id)
- control_spec: pandas.DataFrame
full control spec with columns ‘target’, ‘seed_table’, ‘importance’, …
- total_hh_control_col: str
name of total_hh column (so we can preferentially match this control)
- sub_geography: str
subzone geography name (e.g. ‘TAZ’)
- sub_control_zones: pandas.Series
index is zone id and value is zone label (e.g. TAZ_101) for use in sub_controls_df column names
- Returns
- integer_weights_df: pandas.DataFrame
canonical form weight table, with columns for ‘balanced_weight’, ‘integer_weight’ plus columns for household id, and sub_geography zone ids
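The fallback strategy this wrapper describes can be sketched as plain control flow. The sketch below takes the solver steps as injected callables so it stays self-contained; integerize_with_fallback and the three callables are hypothetical stand-ins, not the populationsim functions, and results are kept in a dict per zone rather than a dataframe.

```python
def integerize_with_fallback(zones, try_simul, integerize_one, round_one):
    """Try simultaneous integerization, falling back zone by zone on failure.

    try_simul(zones) returns a {zone: weights} dict, or None on failure.
    integerize_one(zone) returns weights, or None if the zone is infeasible.
    round_one(zone) returns rounded weights for an infeasible zone.
    All three callables are hypothetical stand-ins.
    """
    result = try_simul(zones)
    if result is not None:
        return result

    # Serial pass to find out which zones are infeasible.
    sequential = {zone: integerize_one(zone) for zone in zones}
    infeasible = [zone for zone in zones if sequential[zone] is None]
    feasible = [zone for zone in zones if sequential[zone] is not None]
    rounded = {zone: round_one(zone) for zone in infeasible}

    # Retry the simultaneous solve without the infeasible zones.
    retry = try_simul(feasible) if feasible else {}
    if retry is None:
        # Last resort: keep the sequential results for the feasible zones.
        retry = {zone: sequential[zone] for zone in feasible}
    return {**retry, **rounded}


def fake_simul(zones):
    return None if "bad" in zones else {zone: "simul:" + zone for zone in zones}


def fake_single(zone):
    return None if zone == "bad" else "seq:" + zone


def fake_round(zone):
    return "rounded:" + zone


combined = integerize_with_fallback(["a", "bad", "b"], fake_simul, fake_single, fake_round)
```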
- populationsim.multi_integerizer.multi_integerize(incidence_df, sub_zone_weights, sub_controls_df, control_spec, total_hh_control_col, parent_geography, parent_id, sub_geography, sub_control_zones)¶
- Parameters
- incidence_df: pandas.DataFrame
full incidence_df for all hh samples in seed zone
- sub_zone_weights: pandas.DataFrame
balanced subzone household sample weights to integerize
- sub_controls_df: pandas.DataFrame
sub_geography controls (one row per zone indexed by sub_zone id)
- control_spec: pandas.DataFrame
full control spec with columns ‘target’, ‘seed_table’, ‘importance’, …
- total_hh_control_col: str
name of total_hh column (so we can preferentially match this control)
- parent_geography: str
parent geography zone name
- parent_id: int
parent geography zone id
- sub_geography: str
subzone geography name (e.g. ‘TAZ’)
- sub_control_zones: pandas.Series
index is zone id and value is zone label (e.g. TAZ_101) for use in sub_controls_df column names
- Returns
- integer_weights_df: pandas.DataFrame
canonical form weight table, with columns for ‘balanced_weight’, ‘integer_weight’ plus columns for household id, parent and sub_geography zone ids
- populationsim.multi_integerizer.reshape_result(float_weights, integerized_weights, sub_geography, sub_control_zones)¶
Reshape results into unstacked form - (same as that returned by sequential integerizer) with columns for ‘balanced_weight’, ‘integer_weight’ plus columns for household id, and sub_geography zone ids
- Parameters
- float_weights: pandas.DataFrame
dataframe with one row per sample hh and one column per sub_zone
- integerized_weights: pandas.DataFrame
dataframe with one row per sample hh and one column per sub_zone
- sub_geography: str
name of sub_geography for result column name
- sub_control_zones: pandas.Series
series mapping zone_id (index) to zone label (value)
- Returns
- integer_weights_df: pandas.DataFrame
canonical form weight table, with columns for ‘balanced_weight’, ‘integer_weight’ plus columns for household id, and sub_geography zone ids
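The reshaping from one column per sub_zone to the canonical long form can be sketched with pandas melt. This is an illustration under assumptions: the helper name to_long_weights and the hh_id column name are hypothetical, the wide tables are assumed to use zone ids as column names, and the zone label mapping in sub_control_zones is left out.

```python
import pandas as pd


def to_long_weights(float_weights, integerized_weights, sub_geography, hh_id_col="hh_id"):
    """Turn two wide (household x zone) tables into one long weight table.

    Hypothetical sketch; not the reshape_result source.
    """
    def melt(wide, value_name):
        # One output row per (household, zone) pair.
        return (
            wide.rename_axis(index=hh_id_col)
            .reset_index()
            .melt(id_vars=hh_id_col, var_name=sub_geography, value_name=value_name)
        )

    balanced = melt(float_weights, "balanced_weight")
    integer = melt(integerized_weights, "integer_weight")
    return balanced.merge(integer, on=[hh_id_col, sub_geography])


# Two sample households (10, 11) and two hypothetical zones (101, 102).
float_weights = pd.DataFrame({101: [0.4, 1.7], 102: [0.6, 0.3]}, index=[10, 11])
integerized_weights = pd.DataFrame({101: [0, 2], 102: [1, 0]}, index=[10, 11])
long_df = to_long_weights(float_weights, integerized_weights, "TAZ")
```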
- populationsim.multi_integerizer.try_simul_integerizing(trace_label, incidence_df, sub_weights, sub_controls_df, sub_geography, control_spec, total_hh_control_col, sub_control_zones)¶
Attempt simultaneous integerization and return integerized weights if successful
- Parameters
- incidence_df
- sub_weights
- sub_controls_df
- sub_geography
- control_spec
- total_hh_control_col
- sub_control_zones
- Returns
- status: str
str value of integerizer status from STATUS_TEXT dict; integerization was successful if status is in the STATUS_SUCCESS list
- integerized_weights_df: pandas.DataFrame or None
canonical form weight table, with columns for ‘balanced_weight’, ‘integer_weight’, or None if integerization failed
simul_balancer¶
- class populationsim.simul_balancer.SimultaneousListBalancer(incidence_table, parent_weights, controls, sub_control_zones, total_hh_control_col)¶
Dual-zone simultaneous list balancer using Newton-Raphson method with control relaxation.
Simultaneously balances the household weights across multiple subzones of a parent zone, ensuring that the total weight of each household across sub-zones sums to the parent hh weight.
The resulting weights are float weights, so they need to be integerized to integer household weights.
- populationsim.simul_balancer.np_simul_balancer(sample_count, control_count, zone_count, master_control_index, incidence, parent_weights, weights_lower_bound, weights_upper_bound, sub_weights, parent_controls, controls_importance, sub_controls)¶
Simultaneous balancer using only numpy (no pandas) data types. Separate function to ensure that no pandas data types leak in from object instance variables since they are often silently accepted as numpy arguments but slow things down
Model Steps¶
input_pre_processor¶
- populationsim.steps.input_pre_processor.input_pre_processor()¶
Read input text files and save them as pipeline tables for use in subsequent steps.
The files to read are specified by table_list, an array of dicts that specify the input file name and the name of the pipeline table, along with keys that allow the specification of pre-processing steps.
By default, reads table_list from ‘input_table_list’ in settings.yaml, unless an alternate table_list name is specified as a model step argument ‘table_list’. (This allows alternate/additional input files to be read for repop)
In the case of repop, this step is being run after an initial run has completed, in which case the input_table_list may specify replacement tables. (e.g. lowest geography controls that will replace the previous low controls dataframe.)
See input_table_list in settings.yaml in the example folder for a working example
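============  ==================================================
key           description
============  ==================================================
tablename     name of pipeline table in which to store dataframe
filename      name of csv file to read (in data_dir)
column_map    list of input columns to rename from_name: to_name
index_col     name of column to set as dataframe index column
drop_columns  list of column names of columns to drop
============  ==================================================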
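How the keys above might be applied to one input file can be sketched with pandas. This is not the input_pre_processor source: the helper name read_input_table, the file and column names, and the order in which the keys are applied are all assumptions made for illustration.

```python
import io

import pandas as pd


def read_input_table(table_info, source):
    """Read one CSV and apply the optional pre-processing keys.

    Hypothetical helper; the order of operations is an assumption.
    """
    df = pd.read_csv(source)
    if table_info.get("column_map"):
        df = df.rename(columns=table_info["column_map"])
    if table_info.get("drop_columns"):
        df = df.drop(columns=table_info["drop_columns"])
    if table_info.get("index_col"):
        df = df.set_index(table_info["index_col"])
    return df


# Hypothetical table_list entry and file contents.
table_info = {
    "tablename": "households",
    "filename": "seed_households.csv",
    "column_map": {"SERIALNO": "hh_id"},
    "index_col": "hh_id",
    "drop_columns": ["unused"],
}
csv_text = "SERIALNO,PUMA,unused\n1,600,a\n2,601,b\n"
households = read_input_table(table_info, io.StringIO(csv_text))
```

In a real run the source would be the file named by filename in the data directory, and the result would be stored in the pipeline under tablename.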
setup_data_structures¶
- populationsim.steps.setup_data_structures.add_geography_columns(incidence_table, households_df, crosswalk_df)¶
Add seed and meta geography columns to incidence_table
- Parameters
- incidence_table
- households_df
- crosswalk_df
- populationsim.steps.setup_data_structures.build_crosswalk_table()¶
build crosswalk table filtered to include only zones in lowest geography
- populationsim.steps.setup_data_structures.filter_households(households_df, persons_df, crosswalk_df)¶
Filter households and persons tables, removing zero weight households and any households not in seed zones.
Returns filtered households_df and persons_df
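A filter of this kind can be sketched in a few lines of pandas. The column names WGTP, PUMA, and hh_id, the helper name, and the example tables are hypothetical; the real step takes its column names from the settings.

```python
import pandas as pd


def filter_seed_tables(households_df, persons_df, crosswalk_df,
                       weight_col="WGTP", seed_col="PUMA", hh_id_col="hh_id"):
    """Drop zero-weight households, households outside the seed zones,
    and the persons that belong to dropped households.

    Hypothetical sketch with assumed column names.
    """
    seed_zones = crosswalk_df[seed_col].unique()
    keep = (households_df[weight_col] > 0) & households_df[seed_col].isin(seed_zones)
    households_df = households_df[keep]
    persons_df = persons_df[persons_df[hh_id_col].isin(households_df[hh_id_col])]
    return households_df, persons_df


households_df = pd.DataFrame({"hh_id": [1, 2, 3], "PUMA": [600, 600, 999], "WGTP": [5, 0, 7]})
persons_df = pd.DataFrame({"hh_id": [1, 1, 2, 3], "age": [40, 12, 65, 30]})
crosswalk_df = pd.DataFrame({"PUMA": [600, 601], "TAZ": [101, 102]})

kept_households, kept_persons = filter_seed_tables(households_df, persons_df, crosswalk_df)
```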
- populationsim.steps.setup_data_structures.repop_setup_data_structures(households, persons)¶
Setup geographic correspondence (crosswalk), control sets, and incidence tables for repop run.
New lowest-level geography control tables should already have been read in by rerunning input_pre_processor with a table_list override. The control table contains one row for each zone, with columns specifying control field totals for that control.
This step reads in the repop control file, which specifies which control fields in the control table should be used for balancing, along with their importance and the recipe (seed table and expression) for determining household incidence for that control.
- Parameters
- households: pipeline table
- persons: pipeline table
- populationsim.steps.setup_data_structures.setup_data_structures(settings, households, persons)¶
Setup geographic correspondence (crosswalk), control sets, and incidence tables.
Control tables for target geographies should already have been read in by running input_pre_processor. The zone control tables contain one row for each zone, with columns specifying control field totals for that control.
This step reads in the global control file, which specifies which control fields in the control table should be used for balancing, along with their importance and the recipe (seed table and expression) for determining household incidence for that control.
If the GROUP_BY_INCIDENCE_SIGNATURE setting is enabled, then incidence table rows are household group ids and an additional household_groups table is created mapping hh group ids to actual hh_ids.
- Parameters
- settings: dict
contents of settings.yaml as dict
- households: pipeline table
- persons: pipeline table
- creates pipeline tables:
crosswalk, controls, geography-specific controls, incidence_table, household_groups (if GROUP_BY_INCIDENCE_SIGNATURE setting is enabled)
- modifies tables:
households, persons
initial_seed_balancing¶
- populationsim.steps.initial_seed_balancing.initial_seed_balancing(settings, crosswalk, control_spec, incidence_table)¶
Balance the household weights for each of the seed geographies (independently) using the seed level controls and the aggregated sub-zone controls totals.
Create the seed_weights table with one row per household and columns containing household_id, seed geography (e.g. PUMA), and float preliminary_balanced_weights
Adds seed_weights table to pipeline named <seed_geography>_weights (e.g. PUMA_weights):
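===========  ====  ===========================  =====
index hh_id  PUMA  preliminary_balanced_weight  hh_id
===========  ====  ===========================  =====
0            600   0.313555                     0
1            601   0.627110                     1
2            602   0.313555                     2
…
===========  ====  ===========================  =====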
- Parameters
- settings: dict (settings.yaml as dict)
- crosswalk: pipeline table
- control_spec: pipeline table
- incidence_table: pipeline table
meta_control_factoring¶
- populationsim.steps.meta_control_factoring.meta_control_factoring(settings, control_spec, incidence_table)¶
Apply simple factoring to summed household fractional weights based on original meta control values relative to summed household fractional weights by meta zone.
The resulting factored meta control weights will be new meta controls appended as additional columns to the seed control table, for final balancing.
- Parameters
- settings: dict (settings.yaml as dict)
- control_spec: pipeline table
- incidence_table: pipeline table
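The factoring idea can be illustrated with a small worked example: for each meta zone, divide the original meta control by the summed weighted household total, then apply that factor to each seed zone in the meta zone. The table layout, the column names (PUMA, REGION, weighted_total, factored_control), and the numbers below are hypothetical.

```python
import pandas as pd

# Hypothetical seed-level totals for one meta-level control: household
# weights multiplied by incidence and summed by seed zone.
seed_totals = pd.DataFrame(
    {"PUMA": [600, 601, 602], "REGION": [1, 1, 2], "weighted_total": [40.0, 60.0, 50.0]}
)
# Hypothetical original meta controls, indexed by meta zone id.
meta_controls = pd.Series({1: 120.0, 2: 45.0}, name="meta_control")

# Sum the weighted totals by meta zone and compare with the meta controls.
meta_sums = seed_totals.groupby("REGION")["weighted_total"].sum()
factors = meta_controls / meta_sums

# Apply each meta zone's factor to the seed zones it contains.
seed_totals["factored_control"] = (
    seed_totals["weighted_total"] * seed_totals["REGION"].map(factors)
)
```

Here region 1 is scaled up by 120/100 and region 2 is scaled down by 45/50, so the factored seed-level controls add up to the meta controls within each region.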
final_seed_balancing¶
- populationsim.steps.final_seed_balancing.final_seed_balancing(settings, crosswalk, control_spec, incidence_table)¶
Balance the household weights for each of the seed geographies (independently) using the seed level controls and the aggregated sub-zone controls totals.
Create the seed_weights table with one row per household and columns containing household_id, seed geography (e.g. PUMA), and float preliminary_balanced_weights
Adds column balanced_weight to the seed_weights table
- Parameters
- settings: dict (settings.yaml as dict)
- crosswalk: pipeline table
- control_spec: pipeline table
- incidence_table: pipeline table
integerize_final_seed_weights¶
- populationsim.steps.integerize_final_seed_weights.integerize_final_seed_weights(settings, crosswalk, control_spec, incidence_table)¶
Final balancing for each seed (puma) zone with aggregated low and mid-level controls and distributed meta-level controls.
Adds integer_weight column to seed-level weight table
- Parameters
- settings: dict (settings.yaml as dict)
- crosswalk: pipeline table
- control_spec: pipeline table
- incidence_table: pipeline table
sub_balancing¶
- populationsim.steps.sub_balancing.balance(incidence_df, parent_weights, sub_controls_df, control_spec, total_hh_control_col, parent_geography, parent_id, sub_geographies, sub_control_zones)¶
- Parameters
- incidence_df: pandas.DataFrame
full incidence_df for all hh samples in seed zone
- parent_weights: pandas.Series
parent zone balanced (possibly integerized) aggregate target weights
- sub_controls_df: pandas.DataFrame
sub_geography controls (one row per zone indexed by sub_zone id)
- control_spec: pandas.DataFrame
full control spec with columns ‘target’, ‘seed_table’, ‘importance’, …
- total_hh_control_col: str
name of total_hh column (so we can preferentially match this control)
- parent_geography: str
parent geography zone name
- parent_id: int
parent geography zone id
- sub_geographies: list(str)
list of subgeographies in descending order
- sub_control_zones: pandas.Series
index is zone id and value is zone label (e.g. TAZ_101) for use in sub_controls_df column names
- Returns
- sub_zone_weights: pandas.DataFrame
balanced subzone household float sample weights
- populationsim.steps.sub_balancing.balance_and_integerize(incidence_df, parent_weights, sub_controls_df, control_spec, total_hh_control_col, parent_geography, parent_id, sub_geographies, crosswalk_df)¶
- Parameters
- incidence_df: pandas.DataFrame
full incidence_df for all hh samples in seed zone
- parent_weights: pandas.Series
parent zone balanced (possibly integerized) aggregate target weights
- sub_controls_df: pandas.DataFrame
sub_geography controls (one row per zone indexed by sub_zone id)
- control_spec: pandas.DataFrame
full control spec with columns ‘target’, ‘seed_table’, ‘importance’, …
- total_hh_control_col: str
name of total_hh column (so we can preferentially match this control)
- parent_geography: str
parent geography zone name
- parent_id: int
parent geography zone id
- sub_geographies: list(str)
list of subgeographies in descending order
- crosswalk_df: pandas.DataFrame
geo crosswalk table sliced to current seed geography
- Returns
- integerized_sub_zone_weights_df: pandas.DataFrame
canonical form weight table, with columns for ‘balanced_weight’, ‘integer_weight’ plus columns for household id and sub_geography zone ids
- populationsim.steps.sub_balancing.sub_balancing(settings, crosswalk, control_spec, incidence_table)¶
Simul-balance and integerize all zones at a specified geographic level in groups by parent zone.
For instance, if the ‘geography’ step arg is ‘TRACT’ and the parent geography is ‘SEED’, then for each seed zone, we simul-balance the TRACTS it contains.
Creates a weight table for the target geography with float ‘balanced_weight’ and ‘integer_weight’ columns.
- Parameters
- settings: dict (settings.yaml as dict)
- crosswalk: pipeline table
- control_spec: pipeline table
- incidence_table: pipeline table
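The grouping of sub-zones by parent zone that this step relies on can be sketched from a crosswalk table. The helper name and the example crosswalk below are hypothetical; the sketch only shows which zones would be balanced together, not the balancing itself.

```python
import pandas as pd


def sub_zones_by_parent(crosswalk_df, parent_geography, sub_geography):
    """Map each parent zone id to the sorted list of sub-zone ids it contains.

    Hypothetical helper for illustration.
    """
    grouped = crosswalk_df.groupby(parent_geography)[sub_geography]
    return {parent_id: sorted(zones.unique().tolist()) for parent_id, zones in grouped}


# Hypothetical crosswalk with one row per lowest-level zone.
crosswalk_df = pd.DataFrame(
    {
        "PUMA": [600, 600, 600, 601],
        "TRACT": [10, 10, 11, 12],
        "TAZ": [100, 101, 102, 103],
    }
)
groups = sub_zones_by_parent(crosswalk_df, "PUMA", "TRACT")
```

With these groups, tracts 10 and 11 would be balanced together against the weights of seed zone 600, and tract 12 against seed zone 601.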
expand_households¶
- populationsim.steps.expand_households.expand_households()¶
Create a complete expanded synthetic household list with their assigned geographic zone ids.
This is the skeleton synthetic household id list with no household or person attributes, one row per household, with geography columns and seed household table household_id.
Creates pipeline table expanded_household_ids
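Expanding integer weights into one row per synthetic household can be sketched with an index repeat. The weight table layout and the column names hh_id, TAZ, and integer_weight are assumptions for illustration, not the pipeline table definitions.

```python
import pandas as pd

# Hypothetical integerized weight table: one row per (seed household, zone).
weights = pd.DataFrame(
    {"hh_id": [10, 11, 12], "TAZ": [101, 101, 102], "integer_weight": [2, 0, 1]}
)

# Repeat each row integer_weight times; rows with weight 0 disappear.
repeated_index = weights.index.repeat(weights["integer_weight"].to_numpy())
expanded = weights.loc[repeated_index, ["hh_id", "TAZ"]].reset_index(drop=True)
```

The result has one row per synthetic household and carries only the geography and the seed household id, so household and person attributes can be joined on later.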
write_tables¶
Write pipeline tables as csv files (in output directory) as specified by output_tables list in settings file.
‘output_tables’ can specify either a list of output tables to include or a list to skip. If no output_tables list is specified, then all checkpointed tables will be written.
To write all output tables EXCEPT the households and persons tables:
output_tables:
  action: skip
  tables:
    - households
    - persons
To write ONLY the households table:
output_tables:
  action: include
  tables:
    - households
To write tables into a single HDF5 store instead of individual CSVs, use the h5_store flag:
output_tables:
  h5_store: True
  action: include
  tables:
    - households
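The include and skip behavior of the output_tables setting can be sketched as a small selection function. The function name is hypothetical, and raising an error for an unrecognized action is an assumption rather than documented behavior.

```python
def select_output_tables(checkpointed_tables, output_tables=None):
    """Return the names of the tables to write, given the output_tables setting.

    Hypothetical sketch of the include/skip logic described above.
    """
    if not output_tables:
        # No output_tables setting: write every checkpointed table.
        return list(checkpointed_tables)
    action = output_tables.get("action")
    tables = output_tables.get("tables", [])
    if action == "include":
        return [name for name in checkpointed_tables if name in tables]
    if action == "skip":
        return [name for name in checkpointed_tables if name not in tables]
    # Assumption: anything other than include or skip is treated as an error.
    raise ValueError(f"unexpected output_tables action: {action!r}")


checkpointed = ["households", "persons", "crosswalk"]
skipped = select_output_tables(checkpointed, {"action": "skip", "tables": ["households", "persons"]})
included = select_output_tables(checkpointed, {"action": "include", "tables": ["households"]})
everything = select_output_tables(checkpointed)
```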
Parameters¶
output_dir: str
write_synthetic_population¶
- populationsim.steps.write_synthetic_population.write_synthetic_population(expanded_household_ids, households, persons, output_dir)¶
Write synthetic households and persons tables to output dir as csv files. The settings file allows specification of output file names, household_id column name, and seed data attribute columns to include in output files.
- Parameters
- expanded_household_ids: pipeline table
- households: pipeline table
- persons: pipeline table
- output_dir: str
summarize¶
- populationsim.steps.summarize.summarize(crosswalk, incidence_table, control_spec)¶
Write aggregate summary files of controls and weights for all geographic levels to output dir
- Parameters
- crosswalk: pipeline table
- incidence_table: pipeline table
- control_spec: pipeline table
repop_balancing¶
- populationsim.steps.repop_balancing.repop_balancing(settings, crosswalk, control_spec, incidence_table)¶
Balance and integerize all zones at the lowest geographic level.
Creates a weight table for the repop zones target geography with float ‘balanced_weight’ and ‘integer_weight’ columns.
- Parameters
- settings: dict (settings.yaml as dict)
- crosswalk: pipeline table
- control_spec: pipeline table
- incidence_table: pipeline table
Contribution Guidelines¶
PopulationSim development follows the same development guidelines as ActivitySim.