Introduction¶

LaminDB is an open-source data framework for biology.

Manage storage & databases with a unified Python API (“lakehouse”).
Track data lineage across notebooks & pipelines.
Integrate registries for experimental metadata & in-house ontologies.
Validate, standardize & annotate.
Collaborate across distributed databases.

LaminHub is a data collaboration hub built on LaminDB similar to how GitHub is built on git.

Basic features of LaminHub are free. Enterprise features, hosted in your infrastructure or ours, are available on a paid plan!

Quickstart¶

You’ll ingest a small dataset while tracking data lineage, and learn how to validate, annotate, query & search.

Setup¶

Install the lamindb Python package:

pip install 'lamindb[jupyter,bionty]'

Initialize a LaminDB instance that mounts the bionty plugin for biological types:

# store artifacts in a local directory `./lamin-intro`
!lamin init --storage ./lamin-intro --schema bionty

Track¶

Run track() to track the inputs and outputs of your code.

When you first run ln.track(), it raises an exception and creates a stem_uid & version to identify a notebook or script:

import lamindb as ln

# copy-pasted identifiers for your notebook or script
ln.settings.transform.stem_uid = "FPnfDtJz8qbE"  # <-- auto-generated by running ln.track()
ln.settings.transform.version = "1"  # <-- auto-generated by running ln.track()

# track the execution of your notebook or script
run = ln.track()

# get your currently running transform
run.transform

ln.track() added a record to the Transform registry, which lets you query & search your notebook, and a record to the Run registry, which stores this specific notebook run.

Artifacts¶

Use Artifact to manage data in local or remote storage.

import pandas as pd

# a sample dataset
df = pd.DataFrame(
    {"CD8A": [1, 2, 3], "CD4": [3, 4, 5], "CD14": [5, 6, 7], "perturbation": ["DMSO", "IFNG", "DMSO"]},
    index=["observation1", "observation2", "observation3"],
)

# create an artifact from a DataFrame
artifact = ln.Artifact.from_df(df, description="my RNA-seq", version="1")

# artifacts come with typed, relational metadata
artifact.describe()

# save data & metadata in one operation
artifact.save()

View data lineage:

artifact.view_lineage()

Load an artifact:

artifact.load()

	CD8A	CD4	CD14	perturbation
observation1	1	3	5	DMSO
observation2	2	4	6	IFNG
observation3	3	5	7	DMSO

An artifact stores a dataset or model as either a file or a folder.

Labels¶

Label an artifact with a label managed by the ULabel registry.

# create & save a label
candidate_marker_study = ln.ULabel(name="Candidate marker study").save()

# label an artifact
artifact.ulabels.add(candidate_marker_study)
artifact.describe()

Registries¶

LaminDB’s central classes are registries that manage metadata.

The easiest way to see what’s in a registry is to call .df().

ln.Artifact.df()

	uid	version	description	key	suffix	accessor	size	hash	hash_type	n_objects	n_observations	visibility	key_is_virtual	storage_id	transform_id	run_id	created_by_id	updated_at
id
1	TiOXNWydSBKP8ns34Uc8	1	my RNA-seq	None	.parquet	DataFrame	4122	EzUJIW3AamdtaNxG_Bu_nA	md5	None	None	1	True	1	1	1	1	2024-06-05 10:45:11.639054+00:00

ln.Transform.df()

	uid	version	name	key	description	type	reference	reference_type	latest_report_id	source_code_id	created_by_id	updated_at
id
1	FPnfDtJz8qbE5zKv	1	Introduction	introduction	None	notebook	None	None	None	None	1	2024-06-05 10:45:09.541377+00:00

ln.ULabel.df()

	uid	name	description	reference	reference_type	run_id	created_by_id	updated_at
id
1	bP2a1FCb	Candidate marker study	None	None	None	1	1	2024-06-05 10:45:12.247887+00:00
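The hash and hash_type columns in the Artifact registry above suggest that artifacts are identified by a content hash. As a rough sketch, and under the assumption (not confirmed by this page) that the 22-character strings are MD5 digests encoded as URL-safe base64 with the padding stripped, such a value could be computed as follows. The helper name content_hash is hypothetical and not part of the lamindb API.

```python
import base64
import hashlib


def content_hash(data: bytes) -> str:
    """Return a URL-safe base64 MD5 digest without padding (hypothetical helper)."""
    digest = hashlib.md5(data).digest()  # 16 raw bytes
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


# identical bytes give identical hashes, different bytes give different ones
h1 = content_hash(b"CD8A,CD4,CD14\n1,3,5\n")
h2 = content_hash(b"CD8A,CD4,CD14\n1,3,5\n")
h3 = content_hash(b"CD8A,CD4,CD14\n2,4,6\n")
print(h1, len(h1))
```

Because the hash depends only on the bytes, a registry that stores it can detect when the same content is ingested twice.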

Queries¶

You can write arbitrary relational queries using Django’s query syntax.

# get an entity by uid (here, the current notebook)
transform = ln.Transform.get("FPnfDtJz8qbE")

# filter by description
ln.Artifact.filter(description="my RNA-seq").df()

# query all artifacts ingested from the current notebook
artifacts = ln.Artifact.filter(transform=transform).all()

# query all artifacts ingested from a notebook with "intro" in the name and labeled "Candidate marker study"
artifacts = ln.Artifact.filter(
    transform__name__icontains="intro",
    ulabels=candidate_marker_study
).all()
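To make the double-underscore syntax concrete, the sketch below shows roughly what a lookup such as transform__name__icontains="intro" corresponds to in SQL: a join from artifacts to transforms plus a case-insensitive substring match. The two-table schema is a simplified stand-in built with Python's sqlite3 module; it is not LaminDB's actual schema, and the SQL is not necessarily what Django emits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE transform (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE artifact (
        id INTEGER PRIMARY KEY,
        description TEXT,
        transform_id INTEGER REFERENCES transform(id)
    );
    INSERT INTO transform VALUES (1, 'Introduction'), (2, 'Analysis');
    INSERT INTO artifact VALUES (1, 'my RNA-seq', 1), (2, 'other data', 2);
    """
)

# follow the foreign key (transform__), then match the name (name__icontains)
rows = conn.execute(
    """
    SELECT artifact.description
    FROM artifact
    JOIN transform ON artifact.transform_id = transform.id
    WHERE LOWER(transform.name) LIKE '%' || LOWER(?) || '%'
    """,
    ("intro",),
).fetchall()
print(rows)
conn.close()
```

Each double underscore either follows a relation to another table or selects a comparison operator, which is why a single keyword argument can express a join.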

Search¶

# search in a registry
ln.Transform.search("intro").df()

# look up records with auto-complete
ulabels = ln.ULabel.lookup()

Features¶

You can annotate artifacts with features & values.

import pytest

with pytest.raises(ln.core.exceptions.ValidationError) as e:
    artifact.features.add_values({"temperature": 21.6})

print(e.exconly())

LaminDB validates all user input against its registries. As the temperature feature didn’t exist, we got an error.

Let’s follow the hint in the error message:

# register the "temperature" feature
ln.Feature(name='temperature', dtype='float').save()

# now we can annotate with the feature & the value
artifact.features.add_values({"temperature": 21.6})
artifact.describe()

We can also annotate with categorical features:

# register a categorical feature
ln.Feature(name='study', dtype='cat').save()

# add a categorical value
artifact.features.add_values({"study": "Candidate marker study"})

# describe the artifact and add type information
artifact.describe(print_types=True)

Features provide a way to bucket labels beyond their type/registry.
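The register-then-annotate pattern above can be illustrated with a toy registry. This is a minimal sketch of the idea only, not how LaminDB implements validation: feature_registry, add_values and this ValidationError class are all hypothetical stand-ins written for the example.

```python
class ValidationError(Exception):
    """Raised when a value cannot be validated against the registry."""


# toy feature registry: feature name -> expected Python type
feature_registry: dict[str, type] = {}


def add_values(annotations: dict, values: dict) -> None:
    """Validate every key against the registry before storing any value."""
    for name, value in values.items():
        if name not in feature_registry:
            raise ValidationError(f"feature {name!r} is not registered")
        if not isinstance(value, feature_registry[name]):
            raise ValidationError(f"{name!r} expects {feature_registry[name].__name__}")
    annotations.update(values)


annotations: dict = {}
try:
    # fails: "temperature" has not been registered yet
    add_values(annotations, {"temperature": 21.6})
except ValidationError as error:
    first_error = str(error)

feature_registry["temperature"] = float  # register the feature
add_values(annotations, {"temperature": 21.6})
print(first_error, annotations)
```

Validating before storing means a typo in a feature name surfaces as an error instead of silently creating a new, misspelled feature.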

Validate & annotate¶

LaminDB validates & annotates categorical metadata by mapping categories on registries.

Validate¶

Let’s use the high-level Annotate class to validate a DataFrame:

# construct an object to validate & annotate a DataFrame
annotate = ln.Annotate.from_df(
    df,
    # define validation criteria
    columns=ln.Feature.name,  # map column names
    categoricals={"perturbation": ln.ULabel.name},  # map categories
)

# the dataframe doesn't validate because registries don't contain the categories
annotate.validate()

Update registries¶

# add non-validated features based on the DataFrame columns
annotate.add_new_from_columns()
# see the updated content of the features registry
ln.Feature.df()

✅ added 3 records with Feature.name for columns: 'CD8A', 'CD14', 'CD4'

	uid	name	dtype	unit	description	synonyms	run_id	created_by_id	updated_at
id
6	9wsBypCgN0dR	CD14	int	None	None	None	1	1	2024-06-05 10:45:12.809433+00:00
5	U5jviGC7nXpn	CD4	int	None	None	None	1	1	2024-06-05 10:45:12.809301+00:00
4	yn10Pe3pXnob	CD8A	int	None	None	None	1	1	2024-06-05 10:45:12.809159+00:00
3	pqJTbBpw6p4r	perturbation	cat	None	None	None	1	1	2024-06-05 10:45:12.672232+00:00
2	88vg3nyw3Ke2	study	cat[ULabel]	None	None	None	1	1	2024-06-05 10:45:12.578849+00:00
1	RYHcV5pc5ND9	temperature	float	None	None	None	1	1	2024-06-05 10:45:12.523746+00:00

# add non-validated labels based on the perturbations
annotate.add_new_from("perturbation")

# see the updated content of the ULabel registry
ln.ULabel.df()

✅ added 2 records with ULabel.name for perturbation: 'DMSO', 'IFNG'

	uid	name	description	reference	reference_type	run_id	created_by_id	updated_at
id
4	006e1sIK	is_perturbation	None	None	None	1	1	2024-06-05 10:45:12.859182+00:00
3	ac8HuWBc	IFNG	None	None	None	1	1	2024-06-05 10:45:12.848775+00:00
2	1ahjBnLs	DMSO	None	None	None	1	1	2024-06-05 10:45:12.848647+00:00
1	bP2a1FCb	Candidate marker study	None	None	None	1	1	2024-06-05 10:45:12.247887+00:00

Annotate¶

# given the updated registries, the validation passes
annotate.validate()

# save annotated artifact
artifact = annotate.save_artifact(description="my RNA-seq", version="1")
artifact.describe()
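The validate, update, validate cycle shown above boils down to comparing the categories in a column with the names in a registry. A minimal sketch of that comparison, assuming a plain Python set stands in for the ULabel registry (the function non_validated is hypothetical):

```python
# toy registry of known label names (stands in for the ULabel registry)
registered_labels = {"Candidate marker study"}

# the categorical column from the sample DataFrame
perturbation = ["DMSO", "IFNG", "DMSO"]


def non_validated(values, registry):
    """Return the sorted categories that are missing from the registry."""
    return sorted(set(values) - registry)


# first pass: both categories are unknown
missing_before = non_validated(perturbation, registered_labels)

# analogous to adding new records for the column, then validating again
registered_labels.update(missing_before)
missing_after = non_validated(perturbation, registered_labels)
print(missing_before, missing_after)
```

Keeping the "add new records" step explicit is a deliberate design choice: you review what is missing before it enters the registry.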

Query for annotations¶

ulabels = ln.ULabel.lookup()
ln.Artifact.filter(ulabels=ulabels.ifng).one()

Biological registries¶

The generic Feature and ULabel registries will get you pretty far.

But let’s now look at what you can do with a dedicated biological registry like Gene.

Access public ontologies¶

Every bionty registry is based on configurable public ontologies.

import bionty as bt

cell_types = bt.CellType.public()
cell_types

cell_types.search("gamma delta T cell").head(2)

	ontology_id	definition	synonyms	parents	__ratio__
name
gamma-delta T cell	CL:0000798	A T Cell That Expresses A Gamma-Delta T Cell R...	gammadelta T cell\|gamma-delta T-cell\|gamma-del...	[CL:0000084]	100.000000
CD27-negative gamma-delta T cell	CL:0002125	A Circulating Gamma-Delta T Cell That Expresse...	gammadelta-17 cells	[CL:0000800]	86.486486

Validate & annotate with typed features¶

import anndata as ad

# store the dataset as an AnnData object to distinguish data from metadata
adata = ad.AnnData(df[["CD8A", "CD4", "CD14"]], obs=df[["perturbation"]])

# create an annotation flow for an AnnData object
annotate = ln.Annotate.from_anndata(
    adata,
    # define validation criteria
    var_index=bt.Gene.symbol,  # map .var.index onto Gene registry
    categoricals={adata.obs.perturbation.name: ln.ULabel.name},  # map categories
    organism="human",  # specify the organism for the Gene registry
)
annotate.validate()

# save annotated artifact
artifact = annotate.save_artifact(description="my RNA-seq", version="1")
artifact.describe()

Query for typed features¶

# get a lookup object for human genes
genes = bt.Gene.filter(organism__name="human").lookup()
# query for all feature sets that contain CD8A
feature_sets = ln.FeatureSet.filter(genes=genes.cd8a).all()
# query for all artifacts linked to these feature sets
ln.Artifact.filter(feature_sets__in=feature_sets).df()

	uid	version	description	key	suffix	accessor	size	hash	hash_type	n_objects	n_observations	visibility	key_is_virtual	storage_id	transform_id	run_id	created_by_id	updated_at
id
2	3w0dZndQcbKiWLTTmFDa	1	my RNA-seq	None	.h5ad	AnnData	19240	ohAeiVMJZOrc3bFTKmankw	md5	None	3	1	True	1	1	1	1	2024-06-05 10:45:18.822845+00:00

Add new records¶

Create a cell type record and add a new cell state.

# create an ontology-coupled cell type record and save it
neuron = bt.CellType.from_public(name="neuron")
neuron.save()

✅ created 1 CellType record from Bionty matching name: 'neuron'

💡 also saving parents of CellType(uid='3QnZfoBk', name='neuron', ontology_id='CL:0000540', synonyms='nerve cell', description='The Basic Cellular Unit Of Nervous Tissue. Each Neuron Consists Of A Body, An Axon, And Dendrites. Their Purpose Is To Receive, Conduct, And Transmit Impulses In The Nervous System.', created_by_id=1, run_id=1, public_source_id=29, updated_at='2024-06-05 10:45:19 UTC')

✅ created 3 CellType records from Bionty matching ontology_id: 'CL:0000393', 'CL:0002319', 'CL:0000404'

❗ now recursing through parents: this only happens once, but is much slower than bulk saving

💡 you can switch this off via: bt.settings.auto_save_parents = False

💡 also saving parents of CellType(uid='2qSJYeQX', name='electrically responsive cell', ontology_id='CL:0000393', description='A Cell Whose Function Is Determined By Its Response To An Electric Signal.', created_by_id=1, run_id=1, public_source_id=29, updated_at='2024-06-05 10:45:20 UTC')

✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0000211'

💡 also saving parents of CellType(uid='590vrK18', name='electrically active cell', ontology_id='CL:0000211', description='A Cell Whose Function Is Determined By The Generation Or The Reception Of An Electric Signal.', created_by_id=1, run_id=1, public_source_id=29, updated_at='2024-06-05 10:45:21 UTC')

✅ created 1 CellType record from Bionty matching ontology_id: 'CL:0000000'

💡 also saving parents of CellType(uid='7kYbAaTq', name='neural cell', ontology_id='CL:0002319', description='A Cell That Is Part Of The Nervous System.', created_by_id=1, run_id=1, public_source_id=29, updated_at='2024-06-05 10:45:20 UTC')

💡 also saving parents of CellType(uid='5NqNmmSr', name='electrically signaling cell', ontology_id='CL:0000404', description='A Cell That Initiates An Electrical Signal And Passes That Signal To Another Cell.', created_by_id=1, run_id=1, public_source_id=29, updated_at='2024-06-05 10:45:20 UTC')

# create a record to track a new cell state
new_cell_state = bt.CellType(name="my neuron cell state", description="explains X")
new_cell_state.save()

# express that it's a neuron state
new_cell_state.parents.add(neuron)

# view ontological hierarchy
new_cell_state.view_parents(distance=2)

❗ records with similar names exist! did you mean to load one of them?

	uid	name	ontology_id	abbr	synonyms	description	public_source_id	run_id	created_by_id	updated_at
id
1	3QnZfoBk	neuron	CL:0000540	None	nerve cell	The Basic Cellular Unit Of Nervous Tissue. Eac...	29	1	1	2024-06-05 10:45:19.764985+00:00
2	2qSJYeQX	electrically responsive cell	CL:0000393	None	None	A Cell Whose Function Is Determined By Its Res...	29	1	1	2024-06-05 10:45:20.756282+00:00
3	7kYbAaTq	neural cell	CL:0002319	None	None	A Cell That Is Part Of The Nervous System.	29	1	1	2024-06-05 10:45:20.756436+00:00
4	5NqNmmSr	electrically signaling cell	CL:0000404	None	None	A Cell That Initiates An Electrical Signal And...	29	1	1	2024-06-05 10:45:20.756577+00:00
5	590vrK18	electrically active cell	CL:0000211	None	None	A Cell Whose Function Is Determined By The Gen...	29	1	1	2024-06-05 10:45:21.709423+00:00

[Figure: ontological hierarchy of "my neuron cell state", as rendered by view_parents(distance=2)]
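The hierarchy can also be reasoned about without a plot. The sketch below walks a child-to-parents mapping breadth-first, under the assumption that distance counts edges away from the starting record. The mapping is reconstructed from the log output above; the parents of "neural cell" and "electrically signaling cell" are not shown there and are left out. The function ancestors_within is hypothetical, not part of bionty.

```python
from collections import deque

# child -> parents, reconstructed from the log output above
parents = {
    "my neuron cell state": ["neuron"],
    "neuron": [
        "electrically responsive cell",
        "neural cell",
        "electrically signaling cell",
    ],
    "electrically responsive cell": ["electrically active cell"],
}


def ancestors_within(start: str, distance: int) -> dict[str, int]:
    """Breadth-first walk returning each ancestor with its edge distance."""
    found: dict[str, int] = {}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == distance:
            continue  # do not expand beyond the requested distance
        for parent in parents.get(node, []):
            if parent not in found:
                found[parent] = depth + 1
                queue.append((parent, depth + 1))
    return found


result = ancestors_within("my neuron cell state", distance=2)
print(result)
```

With distance=2 the walk stops at the direct parents of neuron, so "electrically active cell" (three edges away) is not included.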

Scale up data & learning¶

How do you learn from new datasets that extend your previous data history? Leverage Collection.

# a new dataset
df = pd.DataFrame(
    {
        "CD8A": [2, 3, 3],
        "CD4": [3, 4, 5],
        "CD38": [4, 2, 3],
        "perturbation": ["DMSO", "IFNG", "IFNG"]
    },
    index=["observation4", "observation5", "observation6"],
)
adata = ad.AnnData(df[["CD8A", "CD4", "CD38"]], obs=df[["perturbation"]])

# validate, annotate and save a new artifact
annotate = ln.Annotate.from_anndata(
    adata,
    var_index=bt.Gene.symbol,
    categoricals={adata.obs.perturbation.name: ln.ULabel.name},
    organism="human"
)
annotate.validate()
artifact2 = annotate.save_artifact(description="my RNA-seq dataset 2")

Collections of artifacts¶

Create a collection using Collection.

collection = ln.Collection([artifact, artifact2], name="my RNA-seq collection", version="1")
collection.save()
collection.describe()
collection.view_lineage()

# if it's small enough, you can load the entire collection into memory as if it was one
collection.load()

# typically, it's too big, hence, iterate over its artifacts
collection.artifacts.all()

# or look at a DataFrame listing the artifacts
collection.artifacts.df()

	uid	version	description	key	suffix	accessor	size	hash	hash_type	n_objects	n_observations	visibility	key_is_virtual	storage_id	transform_id	run_id	created_by_id	updated_at
id
2	3w0dZndQcbKiWLTTmFDa	1	my RNA-seq	None	.h5ad	AnnData	19240	ohAeiVMJZOrc3bFTKmankw	md5	None	3	1	True	1	1	1	1	2024-06-05 10:45:18.822845+00:00
3	MpTniKWTtbcF0Zg8xmOL	None	my RNA-seq dataset 2	None	.h5ad	AnnData	19240	L37UPl4IUH20HkIRzvlRMw	md5	None	3	1	True	1	1	1	1	2024-06-05 10:45:26.615131+00:00

Data loaders¶

# to train models, batch iterate through the collection as if it was one array
from torch.utils.data import DataLoader, WeightedRandomSampler
dataset = collection.mapped(obs_keys=["perturbation"])
sampler = WeightedRandomSampler(
    weights=dataset.get_label_weights("perturbation"), num_samples=len(dataset)
)
data_loader = DataLoader(dataset, batch_size=2, sampler=sampler)
for batch in data_loader:
    pass

Read this blog post for more on training models on sharded datasets.
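To see why a weighted sampler helps with imbalanced labels, here is a small stdlib-only sketch. It assumes inverse-frequency weights, which is one common choice; whether get_label_weights computes exactly this is an assumption, so check the API reference before relying on it.

```python
import random
from collections import Counter

labels = ["DMSO", "IFNG", "DMSO"]  # perturbation column of the first dataset

# inverse-frequency weight per sample: rare labels get larger weights
counts = Counter(labels)
weights = [1.0 / counts[label] for label in labels]

# draw indices with replacement, as a weighted random sampler would
rng = random.Random(0)
sampled = rng.choices(range(len(labels)), weights=weights, k=1000)
share_ifng = sum(labels[i] == "IFNG" for i in sampled) / len(sampled)
print(weights, round(share_ifng, 2))
```

Although IFNG makes up only one third of the observations, it is drawn about half of the time, so each label contributes roughly equally to training batches.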

Data lineage¶

Save notebooks & scripts¶

If you call finish(), you save the run report, source code, and compute environment to your default storage location.

ln.finish()

See an example for this introductory notebook here.

If you want to cache a notebook or script, call:

lamin get https://lamin.ai/laminlabs/lamindata/transform/FPnfDtJz8qbE5zKv

Data lineage across entire projects¶

View the sequence of data transformations (Transform) in a project (from a use case, based on Schmidt et al., 2022):

transform.view_parents()

Or, the generating flow of an artifact:

artifact.view_lineage()

Both figures are based on mere calls to ln.track() in notebooks, pipelines & apps.
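A lineage graph like this can be derived once every run records its inputs and outputs. The following is a toy model of that idea, not LaminDB's data model: the Run dataclass, the file names and the lineage function are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One execution of a transform (notebook, script or pipeline)."""

    transform: str
    inputs: list[str] = field(default_factory=list)
    outputs: list[str] = field(default_factory=list)


# hypothetical runs in a small project
runs = [
    Run("ingest", inputs=[], outputs=["raw.parquet"]),
    Run("normalize", inputs=["raw.parquet"], outputs=["clean.h5ad"]),
    Run("train", inputs=["clean.h5ad"], outputs=["model.pt"]),
]


def lineage(artifact: str) -> list[str]:
    """Return the chain of transforms that led to an artifact, oldest first."""
    producer = next((r for r in runs if artifact in r.outputs), None)
    if producer is None:
        return []  # the artifact was not produced by a tracked run
    upstream = [t for a in producer.inputs for t in lineage(a)]
    return upstream + [producer.transform]


chain = lineage("model.pt")
print(chain)
```

The recursion follows each input back to the run that produced it; for a linear pipeline this yields the transforms in execution order.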

Distributed databases¶

Easily create & access databases¶

LaminDB is a distributed system like git. Similar to cloning a repository, collaborators can connect to your instance via:

ln.connect("account-handle/instance-name")

Or you load an instance on the command line for auto-connecting in a Python session:

lamin load "account-handle/instance-name"

Or you create a new instance:

lamin init --storage ./my-data-folder

Custom schemas and plugins¶

LaminDB can be customized & extended with schema & app plugins building on the Django ecosystem. Examples are:

bionty: Registries for basic biological entities, coupled to public ontologies.
wetlab: Exemplary custom schema to manage samples, treatments, etc.

If you’d like to create your own schema or app:

Create a git repository with registries similar to wetlab
Create & deploy migrations via lamin migrate create and lamin migrate deploy

It’s fastest if we do this for you based on our templates within an enterprise plan.

Design¶

Why?¶

The complexity of modern R&D data often blocks realizing the scientific progress it promises: see this blog post.

More fundamentally, the pydata family of objects (DataFrame, AnnData, pytorch.DataLoader, zarr.Array, pyarrow.Table, xarray.Dataset, etc.) is at the heart of most data science, ML & comp bio workflows. We couldn’t find a tool that links these objects to context so that they can be analyzed in context:

provenance: data sources, data transformations, models, users
domain knowledge & experimental metadata: the features & labels derived from domain entities

Assumptions¶

Batched datasets from physical instruments are transformed (Transform) into useful representations (Artifact)
Learning needs features (Feature, CellMarker, …) and labels (ULabel, CellLine, …)
Insights connect representations to experimental metadata and knowledge (ontologies)

Schema & API¶

LaminDB provides a SQL schema for common entities: Artifact, Collection, Transform, Feature, ULabel etc. - see the API reference or the source code.

The core schema is extendable through plugins (see blue vs. red entities in graphic), e.g., with basic biological (Gene, Protein, CellLine, etc.) & operational entities (Biosample, Techsample, Treatment, etc.).

On top of the schema, LaminDB is a Python API that abstracts over storage & database access, data transformations, and (biological) ontologies.

Repositories¶

LaminDB and its plug-ins consist of open-source Python libraries & publicly hosted metadata assets:

lamindb: Core API, which builds on the core schema.
bionty: Registries for basic biological entities, coupled to public ontologies.
wetlab: An (exemplary) wetlab schema.
guides: Guides.
usecases: Use cases.

LaminHub is not open-sourced.

Influences¶

LaminDB was influenced by many other projects, see Influences.