# Using Larch

ActivitySim component models are mostly built as discrete choice models. The
parameters for these models typically need to be estimated based on observed
survey data. The estimation process is facilitated by the Larch package, which
is a Python package for estimating discrete choice models. Larch is a
general-purpose package that can be used to estimate a wide variety of discrete
choice models, including the multinomial logit and nested logit models that are
commonly used in ActivitySim. This section highlights some of the features of
Larch, particularly as they relate to ActivitySim, as there are a few subtle
differences between the two packages that users should be aware of when
estimating models.

## Setting up Larch Models

ActivitySim includes a number of scripts and tools to set up Larch models for
estimation. The `activitysim.estimation.larch` library includes functions to
read the EDBs written by ActivitySim and convert them into Larch models,
including a generic [`component_model`](activitysim.estimation.larch.component_model)
function that can be used to load the data and set up Larch for any standard
ActivitySim component. This function is demonstrated in the
[example notebooks](#example-notebooks).

When given a truthy `return_data` argument, the `component_model` function will
return a 2-tuple of the Larch model and the data used to create it. The data as
the second element of this tuple should be treated as a *copy* of the data used
to create the model, and is provided primarily for the user to review and use in
debugging if needed. If it is necessary to modify the data (e.g. to recreate
temporary variables), the user should modify the `data` attribute of the model
itself (i.e. `model.data` if `model` is the first element of the returned
tuple), not the data returned in the second element of the tuple.

## Model Specification

By default, the process of estimating parameters for ActivitySim model
components with Larch is based on the existing model specification files. These
are the CSV files that are used to define the utility function for each logit
component. When running ActivitySim, these files are typically found in the
configs directory, but when running in estimation mode, they are written out to
the EDB as well, which is where the `activitysim.estimation.larch` library
functions look for these input files.

Users are not limited to using the existing model specification files, however.
The Larch tools for model estimation now allow users to modify the model
specification files, and then re-estimate the model, including existing and new
parameters. The revised model specification files must rely on the same data
that has already been written out to the EDB, but the user can add new
expressions to the specification to transform the data, or to create new
variables. This is particularly useful for creating new piecewise linear
transformations, or for creating new categorical variables from continuous
variables. The user can also add new variables to the specification that are not
in the EDB, but this will require re-running ActivitySim in estimation mode to
write a new EDB. Examples of how to revise the model specification files
are included in [selected example notebooks](#examples-that-include-re-specification).

## Maximum Likelihood Estimation

The approach used to estimate the parameters of a discrete choice model is
maximum likelihood estimation (MLE). The goal of MLE is to find the set of
parameters that maximize the likelihood of observing the choice data that we
have collected.

Finding the maximum likelihood estimates of the parameters is a non-linear
optimization problem. To solve this problem, Larch primarily relies on the
widely-used `scipy.optimize` package, which provides a number of
[optimization algorithms](https://docs.scipy.org/doc/scipy/reference/optimize.html#local-multivariate-optimization)
that can be used to find the maximum likelihood estimates. Different algorithms
have different strengths and weaknesses, and the choice of algorithm can have a
significant impact on the speed and accuracy of the estimation process. By
default, when no constraints or bounds are present, Larch uses an
implementation of the [BHHH algorithm](https://en.wikipedia.org/wiki/Berndt–Hall–Hall–Hausman_algorithm),
which is not included in scipy but is usually efficient for simple, well
specified choice models without any constraints. When constraints or bounds are
present, by default Larch uses the `scipy.optimize.minimize` function with the
`SLSQP` algorithm. The `larch.Model.estimate` method allows the user to specify
the optimization algorithm to use via the `method` argument, which can be set to
`'BHHH'`, `'SLSQP'`, or any other algorithm supported by `scipy.optimize.minimize`.
If you are estimating a model and find the optimization is not converging as
fast as expected (or at all), you may want to try a different optimization
algorithm.
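
To make the optimization problem concrete, here is a minimal sketch of maximum likelihood estimation for a one-parameter binary logit, written in plain Python. It does not use Larch, BHHH, or `scipy.optimize`; it swaps in a plain Newton-Raphson iteration simply because that is short enough to read in full. The data values and function names are made up for illustration.

```python
import math

# Toy data (made up for illustration): one attribute x per chooser,
# and y = 1 if the chooser picked alternative A, 0 otherwise.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 1.5, -1.5]
ys = [0, 0, 1, 0, 1, 1, 1, 0]


def prob_a(beta, x):
    """Binary logit probability of choosing A when V_A = beta * x and V_B = 0."""
    return 1.0 / (1.0 + math.exp(-beta * x))


def log_likelihood(beta):
    """Sum over choosers of the log probability of the observed choice."""
    return sum(
        math.log(prob_a(beta, x)) if y == 1 else math.log(1.0 - prob_a(beta, x))
        for x, y in zip(xs, ys)
    )


def gradient(beta):
    """First derivative of the log likelihood with respect to beta."""
    return sum((y - prob_a(beta, x)) * x for x, y in zip(xs, ys))


def hessian(beta):
    """Second derivative of the log likelihood (always negative here)."""
    return -sum(prob_a(beta, x) * (1.0 - prob_a(beta, x)) * x * x for x in xs)


def estimate(beta=0.0, tol=1e-10, max_iter=50):
    """Newton-Raphson: step uphill until the gradient is numerically zero."""
    for _ in range(max_iter):
        g = gradient(beta)
        if abs(g) < tol:
            break
        beta -= g / hessian(beta)
    return beta


beta_hat = estimate()
print(f"beta_hat = {beta_hat:.4f}")
print(f"LL(0) = {log_likelihood(0.0):.4f}, LL(beta_hat) = {log_likelihood(beta_hat):.4f}")
```

The estimate is the value of `beta` at which the gradient of the log likelihood vanishes. Real ActivitySim components have many parameters and much more data, which is why the choice of optimization algorithm matters in practice.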

## Model Evaluation

The `larch.Model` class includes a number of methods for evaluating the
quality of each estimated model. These tools are explained in
[detail](https://larch.driftless.xyz/v6.0/user-guide/analysis.html) in the
Larch documentation.

A [simple aggregate analysis](https://larch.driftless.xyz/v6.0/user-guide/analysis.html#choice-and-availability-summary)
of the choice and availability statistics in a Larch model's data is available.

Larch also includes methods to
[analyze model predictions](https://larch.driftless.xyz/v6.0/user-guide/analysis.html#analyze-predictions)
across various dimensions. The `analyze_predictions_co` method can be used to
examine how well the model predicts choices against any available (or
computable) attribute of the chooser. In addition, there are tools to evaluate
[demand elasticity](https://larch.driftless.xyz/v6.0/user-guide/analysis.html#elasticity) with
respect to changes in underlying data.

## Model Overspecification

When using ActivitySim for simulation, there are generally few limitations or
requirements on the uniqueness of data elements. For example, it may end up
being confusing for a user, but there is nothing fundamentally wrong with having
two different variables in the model specification that both represent "income"
but have different scales, or with having alternative-specific constants for all
the alternatives. In model estimation, however, this can lead to problems.
Having two data elements that are perfectly correlated (e.g. two different
variables that both represent "income") or having a full set of
alternative-specific constants for all the alternatives can lead to numerical
problems in the estimation process, as the log likelihood function will have
flat areas and will not have a unique maximum. This is called "model
overspecification". In Larch, the user is warned if an estimated model appears
to be overspecified; see the Larch documentation
[for details](https://larch.driftless.xyz/v6.0/user-guide/choice-models.html#overspecification).
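
The flat log likelihood can be seen directly in a small constants-only example. The sketch below is plain Python, not Larch code, and the choices and constants are made up. It shows that adding the same amount to every alternative-specific constant leaves the log likelihood unchanged, so no unique maximum exists.

```python
import math


def mnl_probabilities(utilities):
    """Multinomial logit probabilities for one chooser."""
    top = max(utilities)  # shift by the max for numerical stability
    expu = [math.exp(u - top) for u in utilities]
    total = sum(expu)
    return [e / total for e in expu]


def log_likelihood(ascs, chosen):
    """Log likelihood of a constants-only model over observed choices."""
    probs = mnl_probabilities(ascs)
    return sum(math.log(probs[c]) for c in chosen)


# Made-up observed choices among three alternatives (indexed 0, 1, 2).
chosen = [0, 0, 1, 2, 1, 0, 2, 0]

# A full set of alternative-specific constants: one for every alternative.
ascs = [0.5, -0.2, 0.1]
shifted = [a + 3.0 for a in ascs]  # add the same amount to every constant

ll_original = log_likelihood(ascs, chosen)
ll_shifted = log_likelihood(shifted, chosen)
print(ll_original, ll_shifted)  # the two values agree up to rounding
```

Fixing one of the constants (conventionally to zero) removes this flat direction, which is why an estimable specification omits the constant for one reference alternative.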

## Recreating Temporary Variables

When writing out estimation data bundles, ActivitySim may omit certain temporary
variables included in a model spec. For example, in the example workplace
location choice model, the spec creates a temporary variable
["_DIST"](https://github.com/ActivitySim/activitysim-prototype-mtc/blob/7da9d6d6deca670cc4701fea749a270ab6fe77aa/configs/workplace_location.csv#L2)
which is then reused in several subsequent expressions. When the model's
estimation data bundle is written out, the "_DIST" variable may not be
included[^1]. This is not a problem when simply re-estimating the parameters of
the current model specification, as all the piecewise linear transformations
that use "_DIST" are included. However, if the user wanted to change those
piecewise linear transformations (e.g. by moving the breakpoints), the absence
of the "_DIST" value will be relevant.

[^1]: Future versions of ActivitySim may include these values in the EDB output.

If the missing temporary value can be reconstructed from the data that *is*
included in the EDB, it can be added back into the model's data. For example,
here we reconstitute the total distance by summing up over the piecewise
component parts:

```{python}
model.data["_DIST"] = (
    model.data.util_dist_0_1
    + model.data.util_dist_1_2
    + model.data.util_dist_2_5
    + model.data.util_dist_5_15
    + model.data.util_dist_15_up
)
```

Note that in this expression, we are modifying `model.data`, i.e. the data attached
to the model. If you have other raw data available in your estimation notebook,
e.g., from running `model, data = component_model(..., return_data=True)`, it is
not sufficient to manipulate `data` itself; you must manipulate `model.data` or
otherwise re-attach any data changes to the model, or else the changes will not
show up in estimation.
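
The sum used to rebuild "_DIST" works because each piecewise component is a clipped segment of the original distance. The plain-Python sketch below illustrates this, assuming the components are defined with breakpoints at 1, 2, 5 and 15 as their names suggest. The `clip` helper and the relocated 8-mile breakpoint are hypothetical, introduced only for this illustration.

```python
def clip(value, lower, upper=None):
    """Clamp value to [lower, upper]; no upper limit when upper is None."""
    value = max(value, lower)
    return value if upper is None else min(value, upper)


def piecewise_components(dist):
    """Split a distance into segments with breakpoints at 1, 2, 5 and 15."""
    return {
        "util_dist_0_1": clip(dist, 0, 1),
        "util_dist_1_2": clip(dist - 1, 0, 1),
        "util_dist_2_5": clip(dist - 2, 0, 3),
        "util_dist_5_15": clip(dist - 5, 0, 10),
        "util_dist_15_up": clip(dist - 15, 0),
    }


def new_breakpoint_components(dist):
    """Hypothetical re-specification: move the 5-mile breakpoint to 8 miles."""
    return {
        "util_dist_2_8": clip(dist - 2, 0, 6),
        "util_dist_8_15": clip(dist - 8, 0, 7),
    }


for true_dist in [0.4, 1.5, 3.0, 12.0, 40.0]:
    parts = piecewise_components(true_dist)
    rebuilt = sum(parts.values())  # same idea as summing the EDB columns
    print(true_dist, rebuilt, new_breakpoint_components(rebuilt))
```

Once the total distance is recovered, any new set of breakpoints can be computed from it, which is what makes a re-specification of the piecewise terms possible without writing a new EDB.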

## Expressing Alternative Availability

In ActivitySim, the unavailability of alternatives is typically expressed in the
utility function given in the model specification, by including an indicator variable
for unavailable alternatives, which is then attached to a large negative coefficient.
This creates a large negative utility for the unavailable alternative, which will
render it effectively unavailable in the choice model. If *all* the alternatives
are made unavailable in this manner, this can result in a condition where no
alternative can be chosen, and ActivitySim will raise an error.

When estimating models in Larch for use with ActivitySim, it is totally acceptable and
appropriate to use this approach to express alternative availability,
by embedding it in the utility function. This will greatly simplify the process
of subsequently transferring the resulting model specification and parameters
back to the ActivitySim model. However, it is important to note that this
approach is not the only way to express alternative availability in Larch.

Larch includes a system to define the availability of alternatives explicitly as a
[separate array of values](https://larch.driftless.xyz/dev/user-guide/choice-models.html#availability),
which is not included in the utility function. This is
more robust in estimation, as the Larch computational engine can (and will)
automatically shift the utility values to avoid numerical underflow or overflow
issues that can arise when some choices are very unlikely but not strictly unavailable.
When using the ActivitySim style of expressing alternative availability, the onus
is entirely on the user to ensure that the utility values are not so large or small
that they cause numerical problems. If this is not checked, it is possible that
the model will appear to be estimating correctly in Larch, but the resulting model
will underflow in ActivitySim, resulting in an error when the model is run.
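
The numerical issue can be illustrated with a few lines of plain Python. This is only a sketch of the general idea, not the actual implementation in Larch or ActivitySim; the function names and utility values are made up. It contrasts the textbook logit formula with a version that takes an explicit availability mask and shifts utilities by their maximum before exponentiating.

```python
import math


def naive_probabilities(utilities):
    """Textbook logit formula with no protection against overflow."""
    expu = [math.exp(u) for u in utilities]
    total = sum(expu)
    return [e / total for e in expu]


def shifted_probabilities(utilities, available):
    """Logit with an explicit availability mask and max-utility shifting."""
    top = max(u for u, a in zip(utilities, available) if a)
    expu = [math.exp(u - top) if a else 0.0 for u, a in zip(utilities, available)]
    total = sum(expu)
    return [e / total for e in expu]


# Utility-based unavailability: a -999 penalty on the second alternative.
penalized = [1.2, 0.7 - 999.0, 0.3]
p = naive_probabilities(penalized)
print(p[1])  # exp(-998.3) underflows to exactly 0.0

# Large utilities break the naive formula outright.
large = [800.0, 799.0, 100.0]
try:
    naive_probabilities(large)
except OverflowError as err:
    print("naive formula failed:", err)

# With shifting and an explicit mask the same utilities are handled cleanly.
print(shifted_probabilities(large, [True, True, False]))
```

A probability of exactly zero is harmless for an alternative that is never chosen, but it makes the log likelihood undefined if that alternative is observed as chosen, so checking the scale of utility values is worthwhile when availability is embedded in the utility function.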

The scripts that build Larch models from estimation data bundles
(`activitysim.estimation.larch`) will attempt to identify unavailability flags
in the utility specifications, and when such flags are found they will automatically
convert them to the Larch availability array format. However, since specification
files can be complex, and the unavailability flags can be expressed in many different
ways, it is possible that the automatic detection will not always work as expected.
It is a good idea to check the
[choice and availability summary](https://larch.driftless.xyz/dev/user-guide/analysis.html#choice-and-availability-summary)
on the Larch model to confirm that the availability of alternatives is being
processed as expected.

## Components that have Related Models

Within ActivitySim, it is possible for multiple parts of model components
to share a common set of coefficients. It is even possible for completely
separate components to do so. For example, in the MTC example model,
the joint tour destination choice model and the non-mandatory tour destination
choice model share a common set of coefficients written in a single file.
To re-estimate these coefficients, the user must simultaneously work with all
the estimation survey data from both models.

In Larch, the case of two models sharing coefficients is handled by creating two separate
`Model` objects, one for each model, and then using the `ModelGroup` object to link them
together. The `ModelGroup` object allows the user to specify a set of common parameters
for two or more models, and then estimate them together. In the case of re-estimating
the joint tour destination choice model and the non-mandatory tour destination
choice model, it may be that both models have a similar (or even identical)
utility structure. In other cases, the linked models may have different utility
structures, which share a subset of parameters, but also may have other parameters
that are unique to each model. In either case, when using the `ModelGroup` object,
parameters are identified as being linked across models by having a common name.

There are also components in ActivitySim where a single component can embed multiple
discrete choice models which share details, but each sub-model can have a different
utility structure or different sets of parameters. For example, the `tour_mode_choice`
component has a coefficient template file, which allows the model developer to
specify a different set of coefficients for each tour purpose, which are otherwise
processed using a common utility function. This is implemented in Larch with the
`ModelGroup` object, where each purpose is represented as a separate `Model` object,
and the `ModelGroup` object is used to link them together. This logic is further
extended by including the at-work subtour mode choice component, which allows for
the joint estimation of the tour mode choice model and the at-work subtour mode
choice model, which very reasonably share numerous parameters, but also have a few
differences. Similarly, the stop frequency and CDAP models are implemented for
Larch estimation as `ModelGroup` objects, segmented on tour purpose and household
size, respectively.

When estimating a `ModelGroup` object, the process for finding the likelihood
maximizing parameters is the same as for a single model: the log likelihood is computed
for each observation (i.e. chooser) in the data set according to the parameters, model,
and data for that chooser, and the overall log likelihood is the sum of all the
chooser log likelihoods. By using this approach, the rest of the estimation process is
the same as for a single model, including finding parameter estimates, the standard
error of those estimates, and any statistical tests or interpretations that are
desired.
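
The pooled log likelihood can be sketched in plain Python as follows. This is a conceptual illustration only and does not use the Larch `ModelGroup` API; the helper functions, parameter names, and data are all hypothetical. Two toy models share one parameter by name, and the group log likelihood is the sum of the chooser log likelihoods from both.

```python
import math


def chooser_loglike(utilities, chosen):
    """Log probability of the chosen alternative under a multinomial logit."""
    top = max(utilities)
    log_denominator = top + math.log(sum(math.exp(u - top) for u in utilities))
    return utilities[chosen] - log_denominator


def model_loglike(params, utility_terms, observations):
    """Sum chooser log likelihoods for one model.

    utility_terms maps a parameter name to the index of the data column it
    multiplies; each observation is (data_rows_per_alternative, chosen).
    """
    total = 0.0
    for alternatives, chosen in observations:
        utilities = [
            sum(params[name] * row[col] for name, col in utility_terms.items())
            for row in alternatives
        ]
        total += chooser_loglike(utilities, chosen)
    return total


# Two toy models (all names and numbers are made up). Both use "coef_dist";
# the second model also has its own "coef_party_size".
terms_a = {"coef_dist": 0}
terms_b = {"coef_dist": 0, "coef_party_size": 1}

obs_a = [([[1.0], [3.0]], 0), ([[2.0], [0.5]], 1)]
obs_b = [([[1.0, 0.0], [2.0, 1.0]], 1)]

# One shared parameter dictionary: a common name means a common value.
params = {"coef_dist": -0.4, "coef_party_size": 0.9}

group_ll = model_loglike(params, terms_a, obs_a) + model_loglike(params, terms_b, obs_b)
print(group_ll)
```

Because both models read `coef_dist` from the same dictionary, any change to that value moves the log likelihood of both models at once, while `coef_party_size` affects only the second.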

## Components with Size Terms

Location choice models in ActivitySim (and in discrete choice modeling
in general) usually include a "size" term. The size term is a measure
of the quantity of opportunities within the alternative, which in location
choice models is typically a geographic area (zone) that contains multiple
distinct elemental destinations.
For example, in a workplace location choice model, the size term might
be the number of jobs in the zone. In practice, the size term is a statistical
approximation of the number of opportunities available in the zone, and
can be composed of multiple components, such as the number of employers, the
number of households, and/or the number of retail establishments.

The size term is included in the utility function of the model, but it is
expressed differently from other qualitative measures. The typical model
specification for a location choice model will include a utility function
given in a spec file, which will represent the quality of the alternative(s)
that are being considered. Put another way, the "regular" utility function
is a measure of the quality of the alternative, while the size term is a
measure of the quantity of the alternative.
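
The distinction between quality and quantity can be sketched numerically. The plain-Python example below assumes the common formulation in which the natural log of the size enters the utility with a coefficient of one; the exact transformation in a given ActivitySim spec may differ. The attribute names and weights are made up for illustration.

```python
import math


def zone_probabilities(quality_utils, sizes):
    """Logit probabilities where each zone's utility is V + ln(size)."""
    utils = [v + math.log(s) for v, s in zip(quality_utils, sizes)]
    top = max(utils)
    expu = [math.exp(u - top) for u in utils]
    total = sum(expu)
    return [e / total for e in expu]


def size_term(attributes, size_coefs):
    """Size as a weighted sum of zone attributes (weights are hypothetical)."""
    return sum(size_coefs[name] * value for name, value in attributes.items())


size_coefs = {"retail_jobs": 1.0, "other_jobs": 0.5}  # made-up weights
zones = [
    {"retail_jobs": 100, "other_jobs": 200},  # size 200
    {"retail_jobs": 300, "other_jobs": 600},  # size 600
]
sizes = [size_term(z, size_coefs) for z in zones]

# Same quality (V = -1.0) in both zones, so only size differentiates them.
probs = zone_probabilities([-1.0, -1.0], sizes)
print(sizes, probs)
```

Under this formulation, two zones of equal quality are chosen in proportion to their sizes, so a zone with three times the opportunities attracts three times the probability.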

In ActivitySim, size terms appear not in the utility spec files, but instead
are expressed in a separate "size term" spec file, typically named
"destination_choice_size_terms.csv". This one file contains all the size
terms for all the location choice models.

When using Larch for model estimation, size terms can be estimated alongside the
other parameters. The `update_size_spec` function in the `activitysim.estimation.larch` library
allows the user to update the size term specification for a model. This function
takes the existing size term specification and updates the appropriate rows
that correspond to the model being re-estimated. The resulting updated size
term output file will also include the (unmodified) size term specification
for all other size-based models. When copying the revised size term specification
to the model configuration, the user should be careful that re-estimation updates
from multiple models are not inadvertently overwriting each other.

As an alternative, users can choose to *not* re-estimate the size terms, by
providing exogenous size terms in the model specification, and instructing Larch
not to re-estimate these parameters. This is done via the `Model.lock_value`
command, which will fix any given named parameter to a specific value. This command
takes two arguments: the name of the parameter to be fixed, and the value to
fix it to. The `lock_value` command can be used to fix the size term parameters
to the values in the size term specification file, and then the model will be
estimated without re-estimating the size terms. If no re-estimation is desired,
users can also safely ignore the `update_size_spec` function.

## Example Notebooks

ActivitySim includes a collection of Jupyter notebooks with interactive
re-estimation examples for many core submodels, which can be found in the GitHub
repository under the [`activitysim/examples/example_estimation/notebooks`](https://github.com/ActivitySim/activitysim/tree/main/activitysim/examples/example_estimation/notebooks)
directory. Most of these notebooks demonstrate the process of re-estimating
model parameters, without changing the model specification, i.e. finding updated
values for coefficients without changing the mathematical form of a model's
utility function.

### Examples that include Re-Specification

A selection of these notebooks has also been updated to demonstrate the process
of estimating model parameters while also *changing the model specification*.
These notebooks generally include instructions and a demonstration of how to
modify the model specification and then re-estimate the model parameters. They
also show how to compare the results of the original and modified models
side-by-side, which can be useful for understanding the impact of the changes
and for conducting statistical tests to determine whether the changes are
statistically significant.

The following notebooks include examples of modifying the model specification:

- [`03_work_location.ipynb`](https://github.com/ActivitySim/activitysim/tree/main/activitysim/examples/example_estimation/notebooks/03_work_location.ipynb):
  This notebook includes a demonstration of modification to the SPEC file for a
  destination choice model, using the "interact-sample-simulate" type model.
- [`04_auto_ownership.ipynb`](https://github.com/ActivitySim/activitysim/tree/main/activitysim/examples/example_estimation/notebooks/04_auto_ownership.ipynb):
  This notebook includes a demonstration of modification to the SPEC file for the
  auto ownership model. It shows an example of an edit in the utility function
  for a "simple simulate" type model.
- [`06_cdap.ipynb`](https://github.com/ActivitySim/activitysim/tree/main/activitysim/examples/example_estimation/notebooks/06_cdap.ipynb):
  This notebook includes a demonstration of modification to the SPEC file for the
  CDAP model. This model has a complex structure that is unique among the
  ActivitySim component models.
- [`17_tour_mode_choice.ipynb`](https://github.com/ActivitySim/activitysim/tree/main/activitysim/examples/example_estimation/notebooks/17_tour_mode_choice.ipynb):
  This notebook includes a demonstration of modification to the spec, coefficients,
  and coefficients template file for the tour mode choice model.