API Reference#

DataTree#

class DataTree(graph=None, root_node_name=None, extra_funcs=(), extra_vars=None, cache_dir=None, relationships=(), force_digitization=False, dim_order=None, aux_vars=None, **kwargs)[source]#

A tree representing linked datasets, from which data can flow.

Parameters:
  • graph (networkx.MultiDiGraph) –

  • root_node_name (str or False) – The name of the node at the root of the tree.

  • extra_funcs (Tuple[Callable]) – Additional functions that can be called by Flow objects created using this DataTree. These functions should have defined __name__ attributes, so they can be called in expressions.

  • extra_vars (Mapping[str,Any], optional) – Additional named constants that can be referenced by expressions in Flow objects created using this DataTree.

  • cache_dir (Path-like, optional) – The default directory where Flow objects are created.

  • relationships (Iterable[str or Relationship]) – The relationship definitions used to define this tree. All dataset nodes named in these relationships should also be included as keyword arguments for this constructor.

  • force_digitization (bool, default False) – Whether to automatically digitize all relationships (converting them from label-based to position-based). Digitization is required to evaluate Flows, but doing so automatically on construction may be inefficient.

  • dim_order (Tuple[str], optional) – The order of dimensions to use in Flow outputs. Generally only needed if there are multiple dimensions in the root dataset.

  • aux_vars (Mapping[str,Any], optional) – Additional named arrays or numba-typable variables that can be referenced by expressions in Flow objects created using this DataTree.
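
A minimal construction sketch (the dataset names, variable names, and the "->" position-based relationship string syntax are illustrative assumptions, not requirements of the API):

```python
import pandas as pd
import xarray as xr
import sharrow as sh

# Hypothetical data: each trip references zones by position in a skims dataset.
trips = pd.DataFrame({
    "otaz_idx": [0, 2, 1],
    "dtaz_idx": [1, 0, 2],
    "household_id": [101, 102, 101],
})
skims = xr.Dataset(
    {"distance": (("otaz", "dtaz"), [[0.0, 1.2, 3.4], [1.2, 0.0, 2.2], [3.4, 2.2, 0.0]])},
    coords={"otaz": [101, 102, 103], "dtaz": [101, 102, 103]},
)

# Dataset nodes are given as keyword arguments; relationships link them.
tree = sh.DataTree(
    trips=trips,
    skims=skims,
    root_node_name="trips",
    relationships=(
        "trips.otaz_idx -> skims.otaz",  # "->" marks position-based indexing
        "trips.dtaz_idx -> skims.dtaz",
    ),
    extra_vars={"cost_per_mile": 0.55},
)
```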

Attributes#

DataTree.root_node_name#

The root node for this data tree, which is only ever a parent.

Type:

str

DataTree.subspaces#

Direct access to node Dataset objects by name.

Type:

Mapping[str,Dataset]

DataTree.relationships_are_digitized#

Whether all relationships are digitized, i.e. defined by position rather than by label.

Type:

bool

DataTree.replacement_filters#

Filters that are automatically applied to data on replacement.

When individual datasets are replaced in the tree, the incoming dataset is passed through the filter with a matching name-key (if it exists). The filter should be a function that accepts one argument (the incoming dataset) and returns one value (the dataset to save in the tree). These filters can be used to ensure data quality, e.g. renaming variables, ensuring particular data types, etc.

Type:

Dict[str,Callable]
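
A hedged sketch of installing a filter (the node and variable names are hypothetical, and it is assumed the dictionary can be populated directly): whenever the "trips" node is replaced, the incoming dataset is passed through the filter before being stored.

```python
def _clean_trips(ds):
    # Rename a legacy variable, if present, before the dataset enters the tree.
    if "origin" in ds:
        ds = ds.rename({"origin": "otaz_idx"})
    return ds

tree.replacement_filters["trips"] = _clean_trips
```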

Setup Flow#

DataTree.setup_flow(definition_spec, *, cache_dir=None, name=None, dtype='float32', boundscheck=False, error_model='numpy', nopython=True, fastmath=True, parallel=True, readme=None, flow_library=None, extra_hash_data=(), write_hash_audit=True, hashing_level=1, dim_exclude=None, with_root_node_name=None)[source]#

Set up a new Flow for analysis using the structure of this DataTree.

Parameters:
  • definition_spec (Dict[str,str]) – Gives the names and expressions that define the variables to create in this new Flow.

  • cache_dir (Path-like, optional) – A location to write out generated python and numba code. If not provided, a unique temporary directory is created.

  • name (str, optional) – The name of this Flow used for writing out cached files. If not provided, a unique name is generated. If cache_dir is given, be sure to avoid name conflicts with other flows in the same directory.

  • dtype (str, default "float32") – The name of the numpy dtype that will be used for the output.

  • boundscheck (bool, default False) – If True, enable bounds checking for array indices, so that out-of-bounds accesses raise IndexError. The default is no bounds checking, which is faster but can produce garbage results or segfaults if there are problems, so try turning this on for debugging if you are getting unexplained errors or crashes.

  • error_model ({'numpy', 'python'}, default 'numpy') – The error_model option controls the divide-by-zero behavior. Setting it to ‘python’ causes divide-by-zero to raise an exception, as in CPython. Setting it to ‘numpy’ causes divide-by-zero to set the result to +/-inf or nan.

  • nopython (bool, default True) – Compile using numba’s nopython mode. Provided for debugging only; there is little point in turning this off for production code, as all the speed benefits of sharrow would be lost.

  • fastmath (bool, default True) – If true, fastmath enables the use of “fast” floating point transforms, which can improve performance but can result in tiny distortions in results. See numba docs for details.

  • parallel (bool, default True) – Enable or disable parallel computation for certain functions.

  • readme (str, optional) – A string to inject as a comment at the top of the flow Python file.

  • flow_library (Mapping[str,Flow], optional) – An in-memory cache of precompiled Flow objects. Using this can result in performance improvements when repeatedly using the same definitions.

  • extra_hash_data (Tuple[Hashable], optional) – Additional data used for generating the flow hash. Useful to prevent conflicts when using a flow_library with multiple similar flows.

  • write_hash_audit (bool, default True) – Writes a hash audit log into a comment in the flow Python file, for debugging purposes.

  • hashing_level (int, default 1) – Level of detail to write into flow hashes. Increase detail to avoid hash conflicts for similar flows. Level 2 adds information about names used in expressions and digital encodings to the flow hash, which prevents conflicts but requires more pre-computation to generate the hash.

  • dim_exclude (Collection[str], optional) – Exclude these root dataset dimensions from this flow.

Returns:

Flow
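
A sketch continuing the DataTree example above; the expressions reference the skims node through the tree's relationships and use the extra_vars constant, with names that are illustrative assumptions:

```python
flow = tree.setup_flow(
    {
        "dist": "skims.distance",                  # pulled through the trips-to-skims links
        "cost": "skims.distance * cost_per_mile",  # uses the extra_vars constant
    },
    # cache_dir="flow_cache",  # optionally persist generated code between runs
)
result = flow.load()  # one row per trip, one column per defined expression
```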

Datasets#

DataTree.add_dataset(name, dataset, relationships=(), as_root=False)[source]#

Add a new Dataset node to this DataTree.

Parameters:
  • name (str) –

  • dataset (Dataset or pandas.DataFrame) – Will be coerced into a Dataset object if it is not already in that format, using a no-copy process if possible.

  • relationships (Tuple[str or Relationship]) – Also add these relationships.

  • as_root (bool, default False) – Set this new node as the root of the tree, displacing any existing root.

DataTree.add_relationship(*args, **kwargs)[source]#

Add a relationship to this DataTree.

The new relationship will point from a variable in one dataset to a dimension of another dataset in this tree. Both the parent and the child datasets should already have been added.

Parameters:
  • *args – All arguments are passed through to the Relationship constructor, unless only a single str argument is provided, in which case the Relationship.from_string class constructor is used.

  • **kwargs – All arguments are passed through to the Relationship constructor, unless only a single str argument is provided, in which case the Relationship.from_string class constructor is used.
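
A sketch of growing a tree incrementally, continuing the example above (the "@" label-based relationship syntax and the names are illustrative assumptions):

```python
households = pd.DataFrame(
    {"income": [30000, 75000]},
    index=pd.Index([101, 102], name="hhid"),
)

tree.add_dataset("households", households)                     # DataFrame coerced to a Dataset
tree.add_relationship("trips.household_id @ households.hhid")  # "@" marks label-based indexing
```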

DataTree.replace_datasets(other=None, validate=True, redigitize=True, **kwargs)[source]#

Replace one or more datasets in the nodes of this tree.

Parameters:
  • other (Mapping[str,Dataset]) – A dictionary of replacement datasets.

  • validate (bool, default True) – Raise an error when replacing downstream datasets that are referenced by position, unless the replacement is identically sized. If validation is deactivated, and an incompatible dataset is placed in this tree, flows that rely on that relationship will give erroneous results or crash with a segfault.

  • redigitize (bool, default True) – Automatically re-digitize relationships that are label-based and were previously digitized.

  • **kwargs (Mapping[str,Dataset]) – Alternative format to other.

Returns:

DataTree – A new DataTree with data replacements completed.
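
A sketch, continuing the example above: swap in updated data without rebuilding the tree.

```python
new_trips = trips.assign(otaz_idx=[2, 1, 0])

tree2 = tree.replace_datasets(trips=new_trips)         # keyword form
# tree2 = tree.replace_datasets({"trips": new_trips})  # equivalent mapping form
```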

Digitization#

DataTree.digitize_relationships(inplace=False, redigitize=True)[source]#

Convert all label-based relationships into position-based.

Parameters:
  • inplace (bool, default False) –

  • redigitize (bool, default True) – Re-compute position-based relationships from labels, even if the relationship had previously been digitized.

Returns:

DataTree or None – Only returns a copy if not digitizing in-place.
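
A sketch of the two calling conventions:

```python
digitized = tree.digitize_relationships()     # returns a modified copy
assert digitized.relationships_are_digitized

tree.digitize_relationships(inplace=True)     # or modify this tree in place (returns None)
```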

Relationship#

class Relationship(parent_data, parent_name, child_data, child_name, indexing='label', analog=None)[source]#

Defines a linkage between datasets in a DataTree.

Attributes#

Relationship.parent_data#

Name of the parent dataset.

Type:

str

Relationship.parent_name#

Variable in the parent dataset that references the child dimension.

Type:

str

Relationship.child_data#

Name of the child dataset.

Type:

str

Relationship.child_name#

Dimension in the child dataset that is used by this relationship.

Type:

str

Relationship.indexing#

How the target dimension is used, either by ‘label’ or ‘position’.

Type:

str

Relationship.analog#

The original variable that defined the label-based relationship before digitization.

Type:

str
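
A sketch showing two equivalent ways to define a link, assuming Relationship is importable from the top-level sharrow namespace and that "->" marks position-based indexing in the string form:

```python
import sharrow as sh

rel = sh.Relationship(
    parent_data="trips",
    parent_name="otaz_idx",
    child_data="skims",
    child_name="otaz",
    indexing="position",
)

rel_from_str = sh.Relationship.from_string("trips.otaz_idx -> skims.otaz")
```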

Flow#

class Flow(tree, defs, error_model='numpy', cache_dir=None, name=None, dtype='float32', boundscheck=False, nopython=True, fastmath=True, parallel=True, readme=None, flow_library=None, extra_hash_data=(), write_hash_audit=True, hashing_level=1, dim_order=None, dim_exclude=None, bool_wrapping=False, with_root_node_name=None)[source]#

A prepared data flow.

Parameters:
  • tree (DataTree) – The tree from whence the output will be constructed.

  • defs (Mapping[str,str]) – Gives the names and definitions for the variables to create in the generated output.

  • error_model ({'numpy', 'python'}, default 'numpy') – The error_model option controls the divide-by-zero behavior. Setting it to ‘python’ causes divide-by-zero to raise an exception, as in CPython. Setting it to ‘numpy’ causes divide-by-zero to set the result to +/-inf or nan.

  • cache_dir (Path-like, optional) – A location to write out generated python and numba code. If not provided, a unique temporary directory is created.

  • name (str, optional) – The name of this Flow used for writing out cached files. If not provided, a unique name is generated. If cache_dir is given, be sure to avoid name conflicts with other flows in the same directory.

  • dtype (str, default "float32") – The name of the numpy dtype that will be used for the output.

  • boundscheck (bool, default False) – If True, enable bounds checking for array indices, so that out-of-bounds accesses raise IndexError. The default is no bounds checking, which is faster but can produce garbage results or segfaults if there are problems, so try turning this on for debugging if you are getting unexplained errors or crashes.

  • nopython (bool, default True) – Compile using numba’s nopython mode. Provided for debugging only; there is little point in turning this off for production code, as all the speed benefits of sharrow would be lost.

  • fastmath (bool, default True) – If true, fastmath enables the use of “fast” floating point transforms, which can improve performance but can result in tiny distortions in results. See numba docs for details.

  • parallel (bool, default True) – Enable or disable parallel computation for certain functions.

  • readme (str, optional) – A string to inject as a comment at the top of the flow Python file.

  • flow_library (Mapping[str,Flow], optional) – An in-memory cache of precompiled Flow objects. Using this can result in performance improvements when repeatedly using the same definitions.

  • extra_hash_data (Tuple[Hashable], optional) – Additional data used for generating the flow hash. Useful to prevent conflicts when using a flow_library with multiple similar flows.

  • write_hash_audit (bool, default True) – Writes a hash audit log into a comment in the flow Python file, for debugging purposes.

  • hashing_level (int, default 1) – Level of detail to write into flow hashes. Increase detail to avoid hash conflicts for similar flows.

Load#

Flow.load(source=None, dtype=None, compile_watch=False, mask=None)[source]#

Compute the flow outputs as a numpy array.

Parameters:
  • source (DataTree, optional) – This is the source of the data for this flow. If not provided, the tree used to initialize this flow is used.

  • dtype (str or dtype) – Override the default dtype for the result. May trigger re-compilation of the underlying code.

  • compile_watch (bool, default False) – Set the compiled_recently flag on this flow to True if any file modification activity is observed in the cache directory.

  • mask (array-like, optional) – Only compute values for items where mask is truthy.

Returns:

numpy.array

Flow.load_dataframe(source=None, dtype=None, compile_watch=False, mask=None)[source]#

Compute the flow outputs as a pandas.DataFrame.

Parameters:
  • source (DataTree, optional) – This is the source of the data for this flow. If not provided, the tree used to initialize this flow is used.

  • dtype (str or dtype) – Override the default dtype for the result. May trigger re-compilation of the underlying code.

  • compile_watch (bool, default False) – Set the compiled_recently flag on this flow to True if any file modification activity is observed in the cache directory.

  • mask (array-like, optional) – Only compute values for items where mask is truthy.

Returns:

pandas.DataFrame

Flow.load_dataarray(source=None, dtype=None, compile_watch=False, mask=None)[source]#

Compute the flow outputs as a xarray.DataArray.

Parameters:
  • source (DataTree, optional) – This is the source of the data for this flow. If not provided, the tree used to initialize this flow is used.

  • dtype (str or dtype) – Override the default dtype for the result. May trigger re-compilation of the underlying code.

  • compile_watch (bool, default False) – Set the compiled_recently flag on this flow to True if any file modification activity is observed in the cache directory.

  • mask (array-like, optional) – Only compute values for items where mask is truthy.

Returns:

xarray.DataArray
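
A sketch, continuing the flow example above: the three load methods compute the same values and differ only in the container returned.

```python
arr = flow.load()                           # numpy.ndarray
df = flow.load_dataframe()                  # pandas.DataFrame, one column per expression
da = flow.load_dataarray(dtype="float64")   # xarray.DataArray; dtype override may re-compile
```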

Dot#

Flow.dot(coefficients, source=None, dtype=None, compile_watch=False)[source]#

Compute the dot-product of expression results and coefficients.

Parameters:
  • coefficients (array-like) – This function will return the dot-product of the computed expressions and this array of coefficients, but without ever materializing the array of computed expression values in memory, achieving significant performance gains.

  • source (DataTree, optional) – This is the source of the data for this flow. If not provided, the tree used to initialize this flow is used.

  • dtype (str or dtype) – Override the default dtype for the result. May trigger re-compilation of the underlying code.

  • compile_watch (bool, default False) – Set the compiled_recently flag on this flow to True if any file modification activity is observed in the cache directory.

Returns:

numpy.ndarray

Flow.dot_dataarray(coefficients, source=None, dtype=None, compile_watch=False)[source]#

Compute the dot-product of expression results and coefficients.

Parameters:
  • coefficients (DataArray) – This function will return the dot-product of the computed expressions and this array of coefficients, but without ever materializing the array of computed expression values in memory, achieving significant performance gains.

  • source (DataTree, optional) – This is the source of the data for this flow. If not provided, the tree used to initialize this flow is used.

  • dtype (str or dtype) – Override the default dtype for the result. May trigger re-compilation of the underlying code.

  • compile_watch (bool, default False) – Set the compiled_recently flag on this flow to True if any file modification activity is observed in the cache directory.

Returns:

xarray.DataArray
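
A sketch, continuing the flow example above; the coefficient shape (expressions by alternatives) is an assumption for illustration:

```python
import numpy as np

# Two defined expressions ("dist", "cost") by three hypothetical alternatives.
coefs = np.asarray(
    [[1.0, 0.5, 0.0],
     [-0.1, -0.2, -0.3]],
    dtype="float32",
)
utility = flow.dot(coefs)  # computed without materializing the full expression array
```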

Logit#

Flow.logit_draws(coefficients, draws=None, source=None, pick_counted=False, logsums=0, dtype=None, compile_watch=False, nesting=None, as_dataarray=False, mask=None)[source]#

Make random simulated choices for a multinomial logit model.

Parameters:
  • coefficients (array-like) – These coefficients are used, as in dot(), to compute the dot-product of the computed expressions, and the result is treated as the utility function for a multinomial logit model.

  • draws (array-like) – A one or two dimensional array of random values in the unit interval. If one dimensional, then it must have length equal to the first dimension of the base shape of source, and a single draw will be applied for each row in that dimension. If two dimensional, the first dimension must match as above, and the second dimension determines the number of draws applied for each row in the first dimension.

  • source (DataTree, optional) – This is the source of the data for this flow. If not provided, the tree used to initialize this flow is used.

  • pick_counted (bool, default False) – Whether to tally multiple repeated choices with a pick count.

  • logsums (int, default 0) – Set to 1 to return only logsums instead of making draws from logit models. Set to 2 to return both logsums and draws.

  • dtype (str or dtype) – Override the default dtype for the probability. May trigger re-compilation of the underlying code. The choices and pick counts (if included) are always integers.

  • compile_watch (bool, default False) – Set the compiled_recently flag on this flow to True if any file modification activity is observed in the cache directory.

  • nesting (dict, optional) – Nesting instructions

  • as_dataarray (bool, default False) –

  • mask (array-like, optional) – Only compute values for items where mask is truthy.

Returns:

  • choices (array[int32]) – The positions of the simulated choices.

  • probs (array[dtype]) – The probability that was associated with each simulated choice.

  • pick_count (array[int32], optional) – A count of how many times this choice was chosen, only included if pick_counted is True.
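
A sketch, continuing the examples above; the coefficient and draw shapes are assumptions based on the parameter descriptions (one draw per row of the root dimension):

```python
rng = np.random.default_rng(42)
draws = rng.random(size=(3, 1))  # first dimension matches the root ("trips") dimension

choices, probs, pick_count = flow.logit_draws(coefs, draws, pick_counted=True)
```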

Convenience#

Flow.show_code(linenos='inline')[source]#

Display the underlying Python code constructed for this flow.

This convenience function is provided primarily to display the underlying source code in a Jupyter notebook, for debugging and educational purposes.

Parameters:

linenos ({'inline', 'table'}) – This argument is passed to the pygments HtmlFormatter. If set to 'table', output line numbers as a table with two cells, one containing the line numbers, the other the whole code. This is copy-and-paste-friendly, but may cause alignment problems with some browsers or fonts. If set to 'inline', the line numbers will be integrated in the <pre> tag that contains the code.

Returns:

IPython.display.HTML

Dataset#

Sharrow uses the xarray.Dataset class extensively. Refer to the xarray documentation for standard usage. The attributes and methods documented here are added to xarray.Dataset when you import sharrow.

Constructors#

The sharrow library provides several constructors for Dataset objects. These functions can be found in the sharrow.dataset module.

construct(source)[source]#

Create Datasets from various similar objects.

Parameters:

source (pandas.DataFrame, pyarrow.Table, xarray.Dataset, or Sequence[str]) – The source from which to create a Dataset. DataFrames and Tables are converted to Datasets that have one dimension (the rows) and separate variables for each of the columns. A list of strings creates a dataset with those named empty variables.

Returns:

Dataset
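
A sketch of the generic constructor, accessed through the sharrow.dataset module noted above:

```python
import pandas as pd
import sharrow as sh

ds = sh.dataset.construct(pd.DataFrame({"x": [1.0, 2.0, 3.0]}))  # rows become one dimension
empty = sh.dataset.construct(["x", "y"])                         # named, empty variables
```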

from_table(tbl, index_name='index', index=None)[source]#

Convert a pyarrow.Table into an xarray.Dataset.

Parameters:
  • tbl (Table) – Table from which to use data and indices.

  • index_name (str, default 'index') – This name will be given to the default dimension index, if none is given. Ignored if index is given explicitly and it already has a name.

  • index (Index-like, optional) – Use this index instead of a default RangeIndex.

Returns:

New Dataset.
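
A sketch, assuming pyarrow is installed:

```python
import pyarrow as pa

tbl = pa.table({"speed": [10.0, 20.0], "grade": [0.01, 0.03]})
ds = sh.dataset.from_table(tbl, index_name="segment")
```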

from_omx(omx, index_names=('otaz', 'dtaz'), indexes='one-based', renames=None)[source]#

Create a Dataset from an OMX file.

Parameters:
  • omx (openmatrix.File or larch.OMX) – An OMX-format file, opened for reading.

  • index_names (tuple, default ("otaz", "dtaz")) – Should be a tuple of length 2, giving the names of the two dimensions (the rows and columns of the matrices in the OMX file).

  • indexes (str or tuple[str], optional) – The name of a ‘lookup’ in the OMX file, which will be used to populate the coordinates for the two native dimensions. Or, specify “one-based” or “zero-based” to assume sequential and consecutive numbering starting with 1 or 0 respectively. For non-square OMX data, this must be given as a tuple, relating indexes as above for each dimension of index_names.

  • renames (Mapping or Collection, optional) – Limit the import only to these data elements. If given as a mapping, the keys will be the names of variables in the resulting dataset, and the values give the names of data matrix tables in the OMX file. If given as a list or other non-mapping collection, elements are not renamed but only elements in the collection are included.

Returns:

Dataset
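
A sketch, assuming openmatrix is installed and an OMX file exists at the hypothetical path shown; the renames mapping keeps only the DIST table and stores it as "distance":

```python
import openmatrix as omx
import sharrow as sh

with omx.open_file("skims.omx", mode="r") as f:
    skims = sh.dataset.from_omx(
        f,
        indexes="zero-based",          # assume zones are numbered sequentially from 0
        renames={"distance": "DIST"},  # dataset variable name <- OMX table name
    )
```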

from_omx_3d(omx, index_names=('otaz', 'dtaz', 'time_period'), indexes=None, *, time_periods=None, time_period_sep='__', max_float_precision=32)[source]#

Create a Dataset from an OMX file with an implicit third dimension.

Parameters:
  • omx (openmatrix.File or larch.OMX) – An OMX-format file, opened for reading.

  • index_names (tuple, default ("otaz", "dtaz", "time_period")) – Should be a tuple of length 3, giving the names of the three dimensions. The first two names are the native dimensions from the open matrix file, the last is the name of the implicit dimension that is created by parsing array names.

  • indexes (str, optional) – The name of a ‘lookup’ in the OMX file, which will be used to populate the coordinates for the two native dimensions. Or, specify “one-based” or “zero-based” to assume sequential and consecutive numbering starting with 1 or 0 respectively.

  • time_periods (list-like, required keyword argument) – A list of index values from which the third dimension is constructed for all variables with a third dimension.

  • time_period_sep (str, default "__" (double underscore)) – The presence of this separator within the name of any table in the OMX file indicates that table is to be considered a page in a three dimensional variable. The portion of the name preceding the first instance of this separator is the name of the resulting variable, and the portion of the name after the first instance of this separator is the label of the position for this page, which should appear in time_periods.

  • max_float_precision (int, default 32) – When loading, reduce all floats in the OMX file to this level of precision, generally to save memory if they were stored as double precision but that level of detail is unneeded in the present application.

Returns:

Dataset
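
A sketch continuing the example above, assuming the OMX file contains tables named like "SOV_TIME__AM", "SOV_TIME__MD", and "SOV_TIME__PM", which are stacked along the implicit time_period dimension:

```python
with omx.open_file("skims.omx", mode="r") as f:
    skims3d = sh.dataset.from_omx_3d(
        f,
        indexes="one-based",
        time_periods=["AM", "MD", "PM"],  # required; labels follow the "__" separator
    )
```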

from_zarr(store, *args, **kwargs)[source]#

Load and decode a dataset from a Zarr store.

The store object should be a valid store for a Zarr group. Variables in the store must contain dimension metadata encoded in the _ARRAY_DIMENSIONS attribute.

Parameters:
  • store (MutableMapping or str) – A MutableMapping where a Zarr Group has been stored or a path to a directory in file system where a Zarr DirectoryStore has been stored.

  • synchronizer (object, optional) – Array synchronizer provided to zarr

  • group (str, optional) – Group path. (a.k.a. path in zarr terminology.)

  • chunks (int or dict or tuple or {None, 'auto'}, optional) – Chunk sizes along each dimension, e.g., 5 or {'x': 5, 'y': 5}. If chunks=’auto’, dask chunks are created based on the variable’s zarr chunks. If chunks=None, zarr array data will lazily convert to numpy arrays upon access. This accepts all the chunk specifications as Dask does.

  • overwrite_encoded_chunks (bool, optional) – Whether to drop the zarr chunks encoded for each variable when a dataset is loaded with specified chunk sizes (default: False)

  • decode_cf (bool, optional) – Whether to decode these variables, assuming they were saved according to CF conventions.

  • mask_and_scale (bool, optional) – If True, replace array values equal to _FillValue with NA and scale values according to the formula original_values * scale_factor + add_offset, where _FillValue, scale_factor and add_offset are taken from variable attributes (if they exist). If the _FillValue or missing_value attribute contains multiple values a warning will be issued and all array values matching one of the multiple values will be replaced by NA.

  • decode_times (bool, optional) – If True, decode times encoded in the standard NetCDF datetime format into datetime objects. Otherwise, leave them encoded as numbers.

  • concat_characters (bool, optional) – If True, concatenate along the last dimension of character arrays to form string arrays. Dimensions will only be concatenated over (and removed) if they have no corresponding variable and if they are only used as the last dimension of character arrays.

  • decode_coords (bool, optional) – If True, decode the ‘coordinates’ attribute to identify coordinates in the resulting dataset.

  • drop_variables (str or iterable, optional) – A variable or list of variables to exclude from being parsed from the dataset. This may be useful to drop variables with problems or inconsistent values.

  • consolidated (bool, optional) – Whether to open the store using zarr’s consolidated metadata capability. Only works for stores that have already been consolidated. By default (consolidated=None), attempts to read consolidated metadata, falling back to read non-consolidated metadata if that fails.

  • chunk_store (MutableMapping, optional) – A separate Zarr store only for chunk data.

  • storage_options (dict, optional) – Any additional parameters for the storage backend (ignored for local paths).

  • decode_timedelta (bool, optional) – If True, decode variables and coordinates with time units in {‘days’, ‘hours’, ‘minutes’, ‘seconds’, ‘milliseconds’, ‘microseconds’} into timedelta objects. If False, leave them encoded as numbers. If None (default), assume the same value as decode_times.

  • use_cftime (bool, optional) – Only relevant if encoded dates come from a standard calendar (e.g. “gregorian”, “proleptic_gregorian”, “standard”, or not specified). If None (default), attempt to decode times to np.datetime64[ns] objects; if this is not possible, decode times to cftime.datetime objects. If True, always decode times to cftime.datetime objects, regardless of whether or not they can be represented using np.datetime64[ns] objects. If False, always decode times to np.datetime64[ns] objects; if this is not possible raise an error.

Returns:

dataset (Dataset) – The newly created dataset.

References

http://zarr.readthedocs.io/
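
A sketch (the path is hypothetical; extra keyword arguments such as consolidated are passed through as documented above):

```python
ds = sh.dataset.from_zarr("cache/skims.zarr", consolidated=True)
```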

from_named_objects(*args)[source]#

Create a Dataset by populating it with named objects.

A mapping of names to values is first created, and then that mapping is used in the standard constructor to initialize a Dataset.

Parameters:

*args (Any) – A collection of objects, each exposing a name attribute.

Returns:

Dataset
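
A sketch using pandas Index objects, which expose the required name attribute:

```python
import pandas as pd

otaz = pd.Index([101, 102, 103], name="otaz")
dtaz = pd.Index([101, 102, 103], name="dtaz")
zones = sh.dataset.from_named_objects(otaz, dtaz)
```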

Editing#

Dataset.ensure_integer(names, bitwidth=32, inplace=False)#

Convert dataset variables to integers, if they are not already integers.

Parameters:
  • names (Iterable[str]) – Variable names in this dataset to convert.

  • bitwidth (int, default 32) – Bit width of integers that are created when a conversion is made. Note that variables that are already integer are not modified, even if their bit width differs from this.

  • inplace (bool, default False) – Whether to make the conversion in-place on this Dataset, or return a copy.

Returns:

Dataset
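
A sketch (the variable names are hypothetical):

```python
# Returns a copy with the named float variables converted to int32.
ds_int = ds.ensure_integer(["otaz_idx", "dtaz_idx"], bitwidth=32)
```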

Indexing#

Dataset.iloc()#

Purely integer-location based indexing for selection by position on 1-d Datasets.

In many ways, a dataset with a single dimension is like a pandas DataFrame, with the one dimension giving the rows and the variables as columns. This analogy eventually breaks down (DataFrame columns are ordered, Dataset variables are not), but the similarities are enough that it is sometimes convenient to have iloc functionality enabled. This only works for indexing on the rows, but if there is only the one dimension, the complexity of isel is not needed.
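
A sketch on a one-dimensional Dataset, assuming the accessor supports bracket indexing like its pandas namesake:

```python
first_two = ds.iloc[:2]  # the first two rows of every variable
```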

Dataset.at()#

Multi-dimensional fancy indexing by label.

Provide the dataset dimensions to index as keyword arguments, each with a value giving a one-dimensional array of labels to extract.

All other arguments are keyword-only arguments beginning with an underscore.

Parameters:
  • _name (str, optional) – Only process this variable of this Dataset, and return a DataArray.

  • _names (Collection[str], optional) – Only include these variables of this Dataset.

  • _load (bool, default False) – Call load on the result, which will trigger a compute operation if the data underlying this Dataset is in dask, otherwise this does nothing.

  • _index_name (str, default "index") – The name to use for the resulting dataset’s dimension.

  • **idxs (Mapping[str, Any]) – Labels to extract.

Returns:

Dataset or DataArray
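
A sketch using the skims example above: extract values for specific origin-destination label pairs.

```python
picked = skims.at(
    otaz=[101, 103, 102],  # labels along the "otaz" dimension
    dtaz=[102, 101, 103],  # labels along the "dtaz" dimension
    _name="distance",      # return just this variable, as a DataArray
)
```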

Dataset.at.df(df, *, append=False, mapping=None)#

Extract values based on the coordinates indicated by columns of a DataFrame.

Parameters:
  • df (pandas.DataFrame or Mapping[str, array-like]) – The columns (or keys) of df should match the named dimensions of this Dataset. The resulting extracted DataFrame will have one row per row of df, columns matching the data variables in this dataset, and each value is looked up from the source Dataset.

  • append (bool or str, default False) – Assign the results of this extraction to variables in a copy of the dataframe df. Set to a string to make that a prefix for the variable names.

  • mapping (dict, optional) – Apply this rename mapping to the column names before extracting data.

Returns:

pandas.DataFrame
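
A sketch of the DataFrame form, looking up one skim value per origin-destination row (continuing the skims example above):

```python
od = pd.DataFrame({"otaz": [101, 103], "dtaz": [102, 101]})

looked_up = skims.at.df(od)                # one row per row of od, one column per variable
with_skims = skims.at.df(od, append=True)  # a copy of od plus the extracted columns
```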

Dataset.iat()#

Multi-dimensional fancy indexing by position.

Provide the dataset dimensions to index as keyword arguments, each with a value giving a one-dimensional array of positions to extract.

All other arguments are keyword-only arguments beginning with an underscore.

Parameters:
  • _name (str, optional) – Only process this variable of this Dataset, and return a DataArray.

  • _names (Collection[str], optional) – Only include these variables of this Dataset.

  • _load (bool, default False) – Call load on the result, which will trigger a compute operation if the data underlying this Dataset is in dask, otherwise this does nothing.

  • _index_name (str, default "index") – The name to use for the resulting dataset’s dimension.

  • **idxs (Mapping[str, Any]) – Positions to extract.

Returns:

Dataset or DataArray

Dataset.iat.df(df, *, append=False, mapping=None)#

Extract values based on the coordinates indicated by columns of a DataFrame.

Parameters:
  • df (pandas.DataFrame or Mapping[str, array-like]) – The columns (or keys) of df should match the named dimensions of this Dataset. The resulting extracted DataFrame will have one row per row of df, columns matching the data variables in this dataset, and each value is looked up from the source Dataset.

  • append (bool or str, default False) – Assign the results of this extraction to variables in a copy of the dataframe df. Set to a string to make that a prefix for the variable names.

  • mapping (dict, optional) – Apply this rename mapping to the column names before extracting data.

Returns:

pandas.DataFrame

Shared Memory#

Sharrow’s shared memory system is consolidated into the Dataset.shm accessor.

Dataset.shm.to_shared_memory(key=None, mode='r+', _dupe=True)#

Load this Dataset into shared memory.

The returned Dataset object references the shared memory and is the “owner” of this data. When this object is destroyed, the data backing it may also be freed, which can result in a segfault or other unfortunate condition if that memory is still accessed from elsewhere.

Parameters:
  • key (str) – An identifying key for this shared memory. Use the same key in from_shared_memory to recreate this Dataset elsewhere.

  • mode ({‘r+’, ‘r’, ‘w+’, ‘c’}, optional) – This method returns a copy of the Dataset in shared memory. If memmapped, that copy can be opened in various modes. See numpy.memmap() for details.

Returns:

Dataset

classmethod Dataset.shm.from_shared_memory(key, own_data=False, mode='r+')#

Connect to an existing Dataset in shared memory.

Parameters:
  • key (str) – The identifying key for this shared memory, as given to to_shared_memory when the Dataset was stored.

  • own_data (bool or memmap, default False) – The returned Dataset object references the shared memory but is not the “owner” of this data unless this flag is set. Pass a memmap to reuse an existing buffer.

Returns:

Dataset
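
A sketch of the typical handoff (the key is illustrative; it is assumed the classmethod can be reached through the accessor on the Dataset class, as the signature above suggests):

```python
import xarray as xr

shared = skims.shm.to_shared_memory("skims_v1")   # in the producing process

# ... later, possibly in a different process:
skims_again = xr.Dataset.shm.from_shared_memory("skims_v1")
```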

Dataset.shm.release_shared_memory()#

Release shared memory allocated to this Dataset.

static Dataset.shm.preload_shared_memory_size(key)#

Compute the size in bytes of a shared Dataset without actually loading it.

Parameters:

key (str) – The identifying key for this shared memory.

Returns:

int

Dataset.shm.shared_memory_key#

Str : The key identifying this Dataset in shared memory, raises ValueError if not shared.

Dataset.shm.shared_memory_size#

Int : Size (in bytes) in shared memory, raises ValueError if not shared.

Dataset.shm.is_shared_memory#

Bool : Whether this Dataset is in shared memory.

Digital Encoding#

Sharrow’s digital encoding management is consolidated into the Dataset.digital_encoding accessor.

Dataset.digital_encoding.info()#

All digital_encoding attributes from Dataset variables.

Returns:

dict

Dataset.digital_encoding.set(name, *args, **kwargs)#

Digitally encode one or more variables in this dataset.

All variables are encoded using the same given parameters. To encode various variables differently, make multiple calls to this function.

Parameters:
  • name (str or Collection[str]) – The name(s) of the variable to be encoded.

  • missing_value (Numeric, optional) – If the current array has “missing” values encoded with something other than NaN, give that value here.

  • bitwidth ({16, 8}) – Number of bits to use in the encoded integers.

  • min_value (Numeric, optional) – Explicitly give the minimum value represented in the array. If not given, it is inferred from the existing data. Giving this value is useful if the current data does not necessarily include all the values that might need to be inserted later.

  • max_value (Numeric, optional) – Explicitly give the maximum value represented in the array. If not given, it is inferred from the existing data. Giving this value is useful if the current data does not necessarily include all the values that might need to be inserted later.

  • scale (Numeric, optional) – Explicitly give the scaling factor. This is inferred from the min and max values if not provided.

  • offset (Numeric, optional) – Explicitly give the offset factor. This is inferred from the min value if not provided.

  • by_dict ({8, 16, 32}, optional) – Encode by dictionary, using this bitwidth. If given, all other encoding arguments are ignored.

  • joint_dict (bool or str, optional) – If given as a string, the variables in name will be encoded with joint dictionary encoding under this name. Or simply give a True value to apply the same with a random unique name.

Returns:

Dataset – A copy of the dataset, with the named variable digitally encoded.
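
A sketch, continuing the skims example above: store a float variable as a scaled 16-bit integer and inspect the encoding metadata.

```python
encoded = skims.digital_encoding.set("distance", bitwidth=16)
print(encoded.digital_encoding.info())  # reports the encoding applied to "distance"
```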