lamindb.Curator

class lamindb.Curator

Bases: BaseCurator

Dataset curator.

A Curator object makes it easy to save validated & annotated artifacts.

Example:

>>> curator = ln.Curator.from_df(
...     df,
...     # define validation criteria as mappings
...     columns=ln.Feature.name,  # map column names
...     categoricals={"perturbation": ln.ULabel.name},  # map categories
... )
>>> curator.validate()  # validate the data in df
>>> artifact = curator.save_artifact(description="my RNA-seq")
>>> artifact.describe()  # see annotations

curator.validate() maps values within df according to the mapping criteria and logs validated & problematic values.

If you find non-validated values, you have several options:

  • validated values not yet in the registry can be automatically registered using add_validated_from()

  • new values found in the data can be registered using add_new_from()

  • non-validated values can be accessed using non_validated() and addressed manually
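
The split between validated and non-validated values can be pictured with a minimal, library-free sketch. This is not lamindb's implementation: the helper split_by_registry, the registry contents, and the column values are all hypothetical and only illustrate the idea of checking each distinct value against a set of known names.

```python
def split_by_registry(values, registry):
    """Partition the distinct values into those found in `registry` and those not."""
    validated, non_validated = [], []
    for value in dict.fromkeys(values):  # de-duplicate while keeping first-seen order
        (validated if value in registry else non_validated).append(value)
    return validated, non_validated


# Hypothetical registry contents and column values, for illustration only.
known_perturbations = {"DMSO", "IFNG"}
column = ["DMSO", "IFNG", "DMSO", "ifng-typo"]

validated, non_validated = split_by_registry(column, known_perturbations)
print("validated:", validated)
print("non-validated:", non_validated)
```

In this picture, the non-validated list is what would need to be registered as new values or fixed by hand.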

Class methods

classmethod from_anndata(data, var_index, categoricals=None, obs_columns=FieldAttr(Feature.name), using_key='default', verbosity='hint', organism=None, sources=None)

Curation flow for AnnData.

See also Curator.

Note that if genes are removed from the AnnData object, the object should be recreated using from_anndata().

See "Curate AnnData based on the CELLxGENE schema" for instructions on how to curate against a specific CELLxGENE schema version.

Parameters:
  • data (ad.AnnData | UPathStr) – The AnnData object or an AnnData-like path.

  • var_index (FieldAttr) – The registry field for mapping the .var index.

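  • categoricals (dict[str, FieldAttr] | None, default: None) – A dictionary mapping .obs.columns to a registry field.

  • obs_columns (FieldAttr, default: FieldAttr(Feature.name)) – The registry field for mapping the .obs.columns.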

  • using_key (str, default: 'default') – A reference LaminDB instance.

  • verbosity (str, default: 'hint') – The verbosity level.

  • organism (str | None, default: None) – The organism name.

  • sources (dict[str, Record] | None, default: None) – A dictionary mapping .obs.columns to Source records.

  • exclude – A dictionary mapping column names to values to exclude.

Return type:

AnnDataCurator

Examples

>>> import bionty as bt
>>> curate = ln.Curator.from_anndata(
...     adata,
...     var_index=bt.Gene.ensembl_gene_id,
...     categoricals={
...         "cell_type_ontology_id": bt.CellType.ontology_id,
...         "donor_id": ln.ULabel.name
...     },
...     organism="human",
... )

classmethod from_df(df, categoricals=None, columns=FieldAttr(Feature.name), using_key=None, verbosity='hint', organism=None)

Curation flow for a DataFrame object.

See also Curator.

Parameters:
  • df (DataFrame) – The DataFrame object to curate.

  • columns (DeferredAttribute, default: FieldAttr(Feature.name)) – The field attribute for the feature column.

  • categoricals (dict[str, DeferredAttribute] | None, default: None) – A dictionary mapping column names to a registry field.

  • using_key (str | None, default: None) – The reference instance containing registries to validate against.

  • verbosity (str, default: 'hint') – The verbosity level.

  • organism (str | None, default: None) – The organism name.

  • sources – A dictionary mapping column names to Source records.

  • exclude – A dictionary mapping column names to values to exclude.

Return type:

DataFrameCurator

Examples

>>> import bionty as bt
>>> curate = ln.Curator.from_df(
...     df,
...     categoricals={
...         "cell_type_ontology_id": bt.CellType.ontology_id,
...         "donor_id": ln.ULabel.name
...     }
... )

classmethod from_mudata(mdata, var_index, categoricals=None, using_key='default', verbosity='hint', organism=None)

Curation flow for a MuData object.

See also Curator.

Note that if genes or other measurements are removed from the MuData object, the object should be recreated using from_mudata().

Parameters:
  • mdata (MuData) – The MuData object to curate.

  • var_index (dict[str, dict[str, DeferredAttribute]]) – The registry field for mapping the .var index for each modality. For example: {"modality_1": bt.Gene.ensembl_gene_id, "modality_2": bt.CellMarker.name}

  • categoricals (dict[str, DeferredAttribute] | None, default: None) – A dictionary mapping .obs.columns to a registry field. Use modality keys to specify categoricals for MuData slots, such as {"rna:cell_type": bt.CellType.name}.

  • using_key (str, default: 'default') – A reference LaminDB instance.

  • verbosity (str, default: 'hint') – The verbosity level.

  • organism (str | None, default: None) – The organism name.

  • sources – A dictionary mapping .obs.columns to Source records.

  • exclude – A dictionary mapping column names to values to exclude.

Return type:

MuDataCurator

Examples

>>> import bionty as bt
>>> curate = ln.Curator.from_mudata(
...     mdata,
...     var_index={
...         "rna": bt.Gene.ensembl_gene_id,
...         "adt": bt.CellMarker.name
...     },
...     categoricals={
...         "cell_type_ontology_id": bt.CellType.ontology_id,
...         "donor_id": ln.ULabel.name
...     },
...     organism="human",
... )
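
The modality-key convention described for categoricals ("rna:cell_type") can be illustrated with a small sketch. How lamindb parses these keys internally is not stated here, so the helpers split_modality_key and group_by_modality are hypothetical, and plain strings stand in for the registry field attributes.

```python
def split_modality_key(key):
    """Return (modality, column); modality is None when the key has no prefix."""
    modality, sep, column = key.partition(":")
    if not sep:  # no ":" present, so the key names a plain column
        return None, key
    return modality, column


def group_by_modality(categoricals):
    """Group a flat categoricals mapping by its modality prefix."""
    grouped = {}
    for key, field in categoricals.items():
        modality, column = split_modality_key(key)
        grouped.setdefault(modality, {})[column] = field
    return grouped


# Strings are placeholders for field attributes such as bt.CellType.name.
categoricals = {
    "rna:cell_type": "bt.CellType.name",
    "donor_id": "ln.ULabel.name",
}
grouped = group_by_modality(categoricals)
print(grouped)
```

Keys without a prefix end up under None in this sketch; which .obs table they would apply to in practice is left open.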

Methods

save_artifact(description=None, key=None, revises=None, run=None)

Save the dataset as an artifact.

Parameters:
  • description (str | None, default: None) – A description of the DataFrame object.

  • key (str | None, default: None) – A path-like key to reference the artifact in default storage, e.g., "myfolder/myfile.fcs". Artifacts with the same key form a revision family.

  • revises (Artifact | None, default: None) – Previous version of the artifact. Triggers a revision.

  • run (Run | None, default: None) – The run that creates the artifact.

Return type:

Artifact

Returns:

A saved artifact record.

validate()

Validate dataset.

Return type:

bool

Returns:

Boolean indicating whether the dataset is validated.
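
Because validate() returns a boolean, one way to use it is as a gate before save_artifact(). The sketch below relies only on the two methods documented above. The helper save_if_valid and the StubCurator class are hypothetical stand-ins, not part of lamindb, so that the pattern can be run without a database instance.

```python
def save_if_valid(curator, **save_kwargs):
    """Call save_artifact() only when validate() returns True; otherwise return None."""
    if curator.validate():
        return curator.save_artifact(**save_kwargs)
    return None


class StubCurator:
    """Hypothetical stand-in exposing the two documented methods; not a lamindb class."""

    def __init__(self, is_valid):
        self.is_valid = is_valid

    def validate(self):
        return self.is_valid

    def save_artifact(self, description=None, key=None, revises=None, run=None):
        # A real curator returns an Artifact record; a dict stands in for it here.
        return {"description": description, "key": key}


saved = save_if_valid(StubCurator(True), description="my RNA-seq")
skipped = save_if_valid(StubCurator(False), description="my RNA-seq")
print(saved, skipped)
```

With a real curator, the same function would be called as save_if_valid(curator, description="my RNA-seq").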