What does the key parameter do under the hood?
LaminDB is designed around associating biological metadata with artifacts and collections. This enables querying for them in storage by metadata and removes the need for semantic artifact and collection names.
Here, we discuss the trade-offs of using the key parameter, which allows for semantic keys, in various scenarios.
Setup
We’re simulating an artifact system with several nested folders and artifacts. Similar structures appear in, for example, the RxRx: cell imaging guide.
import random
import string
from pathlib import Path
def create_complex_biological_hierarchy(root_folder):
    root_path = Path(root_folder)
    if root_path.exists():
        print("Folder structure already exists. Skipping...")
    else:
        root_path.mkdir()
        raw_folder = root_path / "raw"
        preprocessed_folder = root_path / "preprocessed"
        raw_folder.mkdir()
        preprocessed_folder.mkdir()
        # Four raw files directly under raw/
        for i in range(1, 5):
            artifact_name = f"raw_data_{i}.txt"
            with (raw_folder / artifact_name).open("w") as f:
                random_text = "".join(
                    random.choice(string.ascii_letters) for _ in range(10)
                )
                f.write(random_text)
        # Two nested collection folders, each with four raw files
        for i in range(1, 3):
            collection_folder = raw_folder / f"Collection_{i}"
            collection_folder.mkdir()
            for j in range(1, 5):
                artifact_name = f"raw_data_{j}.txt"
                with (collection_folder / artifact_name).open("w") as f:
                    random_text = "".join(
                        random.choice(string.ascii_letters) for _ in range(10)
                    )
                    f.write(random_text)
        # Four result files under preprocessed/
        for i in range(1, 5):
            artifact_name = f"result_{i}.txt"
            with (preprocessed_folder / artifact_name).open("w") as f:
                random_text = "".join(
                    random.choice(string.ascii_letters) for _ in range(10)
                )
                f.write(random_text)
root_folder = "complex_biological_project"
create_complex_biological_hierarchy(root_folder)
!lamin init --storage ./key-eval
💡 connected lamindb: testuser1/key-eval
import lamindb as ln
ln.settings.verbosity = "hint"
💡 connected lamindb: testuser1/key-eval
ln.UPath("complex_biological_project").view_tree()
4 sub-directories & 8 files with suffixes '.txt'
/home/runner/work/lamindb/lamindb/docs/faq/complex_biological_project
├── raw/
│   ├── Collection_2/
│   ├── Collection_1/
│   ├── raw_data_1.txt
│   ├── raw_data_2.txt
│   ├── raw_data_4.txt
│   └── raw_data_3.txt
└── preprocessed/
    ├── result_2.txt
    ├── result_4.txt
    ├── result_3.txt
    └── result_1.txt
ln.settings.transform.stem_uid = "WIwaNDvlEkwS"
ln.settings.transform.version = "1"
ln.track()
💡 notebook imports: lamindb==0.73.0
💡 saved: Transform(uid='WIwaNDvlEkwS5zKv', version='1', name='What does the key parameter do under the hood?', key='key', type='notebook', created_by_id=1, updated_at='2024-06-05 10:53:33 UTC')
💡 saved: Run(uid='IkyURaSzhBu6J33q5zbx', transform_id=1, created_by_id=1)
💡 tracked pip freeze > /home/runner/.cache/lamindb/run_env_pip_IkyURaSzhBu6J33q5zbx.txt
Run(uid='IkyURaSzhBu6J33q5zbx', started_at='2024-06-05 10:53:33 UTC', is_consecutive=True, transform_id=1, created_by_id=1)
Storing artifacts using Storage, Artifact, and Collection
Lamin has three storage classes that manage different types of in-memory and on-disk objects:
Storage: Manages the default storage root, which can be either local or in the cloud. For more details we refer to Storage FAQ.
Artifact: Manages datasets with an optional key that acts as a relative path within the current default storage root (see Storage). An example is a single h5 artifact.
Collection: Manages a collection of datasets, identified by a name rather than a key. An example is a collection of h5 artifacts.
For more details we refer to Tutorial: Artifacts.
The current storage root is:
ln.settings.storage
PosixUPath('/home/runner/work/lamindb/lamindb/docs/faq/key-eval')
By default, Lamin uses virtual keys that are only reflected in the database but not in storage.
It is possible to turn this behavior off by setting ln.settings.artifact_use_virtual_keys = False.
Generally, we discourage disabling this setting manually. For more details we refer to Storage FAQ.
ln.settings.artifact_use_virtual_keys
True
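The effect of this setting on where an artifact ends up can be modeled in a few lines. The following is a simplified sketch inferred from the log output in this notebook, not LaminDB's actual implementation; the function resolve_storage_path, its parameters, and the uids are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Optional


def resolve_storage_path(
    root: str, uid: str, suffix: str, key: Optional[str], key_is_virtual: bool
) -> PurePosixPath:
    """Toy model of where an artifact lands in storage (hypothetical helper)."""
    if key is None or key_is_virtual:
        # Virtual (or missing) key: the file is stored under a uid-based name;
        # the key, if any, lives only in the database.
        return PurePosixPath(root) / ".lamindb" / f"{uid}{suffix}"
    # Non-virtual key: the key is used as a path relative to the storage root.
    return PurePosixPath(root) / key


virtual = resolve_storage_path(
    "key-eval", "uid0001", ".txt", "raw/raw_data_3.txt", key_is_virtual=True
)
real = resolve_storage_path(
    "key-eval", "uid0002", ".txt", "preprocessed/result_2.txt", key_is_virtual=False
)
print(virtual)  # key-eval/.lamindb/uid0001.txt
print(real)     # key-eval/preprocessed/result_2.txt
```

In this model, renaming a virtual key never touches the stored file, which is the property the rest of this notebook relies on.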
We will now create Artifact objects with and without semantic keys using key and also save them as Collections.
artifact_no_key_1 = ln.Artifact("complex_biological_project/raw/raw_data_1.txt")
artifact_no_key_2 = ln.Artifact("complex_biological_project/raw/raw_data_2.txt")
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/K8hkdSmFNeQ1lgicvvOy.txt')
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/jwJBBsDBcJEyTcdJRubU.txt')
The logging suggests that the artifacts will be saved to our current default storage with auto-generated storage keys.
artifact_no_key_1.save()
artifact_no_key_2.save()
✅ storing artifact 'K8hkdSmFNeQ1lgicvvOy' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/K8hkdSmFNeQ1lgicvvOy.txt'
✅ storing artifact 'jwJBBsDBcJEyTcdJRubU' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/jwJBBsDBcJEyTcdJRubU.txt'
Artifact(uid='jwJBBsDBcJEyTcdJRubU', suffix='.txt', size=10, hash='krapfLSobo-Zz0N-ByDIMw', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-05 10:53:34 UTC')
artifact_key_3 = ln.Artifact(
    "complex_biological_project/raw/raw_data_3.txt", key="raw/raw_data_3.txt"
)
artifact_key_4 = ln.Artifact(
    "complex_biological_project/raw/raw_data_4.txt", key="raw/raw_data_4.txt"
)
artifact_key_3.save()
artifact_key_4.save()
💡 path content will be copied to default storage upon `save()` with key 'raw/raw_data_3.txt'
💡 path content will be copied to default storage upon `save()` with key 'raw/raw_data_4.txt'
✅ storing artifact 'EYaY0xEEWx5dn8DiyDsb' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/EYaY0xEEWx5dn8DiyDsb.txt'
✅ storing artifact '9nvosCWH4pWsCsgewR4Y' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/9nvosCWH4pWsCsgewR4Y.txt'
Artifact(uid='9nvosCWH4pWsCsgewR4Y', key='raw/raw_data_4.txt', suffix='.txt', size=10, hash='KH4K95aN__RwMDnPeep1Iw', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-05 10:53:34 UTC')
Artifacts with keys are not stored in different locations because of the usage of virtual keys.
However, they are still semantically queryable by key.
ln.Artifact.filter(key__contains="raw").df().head()
id | uid | version | description | key | suffix | accessor | size | hash | hash_type | n_objects | n_observations | visibility | key_is_virtual | storage_id | transform_id | run_id | created_by_id | updated_at |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | EYaY0xEEWx5dn8DiyDsb | None | None | raw/raw_data_3.txt | .txt | None | 10 | gHvRrd-17KwsJ74triHs4Q | md5 | None | None | 1 | True | 1 | 1 | 1 | 1 | 2024-06-05 10:53:34.812065+00:00 |
4 | 9nvosCWH4pWsCsgewR4Y | None | None | raw/raw_data_4.txt | .txt | None | 10 | KH4K95aN__RwMDnPeep1Iw | md5 | None | None | 1 | True | 1 | 1 | 1 | 1 | 2024-06-05 10:53:34.816465+00:00 |
Collection does not have a key parameter because it does not store any additional data in Storage.
In contrast, it has a name parameter that serves as a semantic identifier of the collection.
ds_1 = ln.Collection([artifact_no_key_1, artifact_no_key_2], name="no key collection")
ds_2 = ln.Collection([artifact_key_3, artifact_key_4], name="sample collection")
ds_1
Collection(uid='aTFTtdchyCiPTxLmNKci', name='no key collection', hash='hAUZJ_Ny-8Zb9kHfGks2', visibility=1, created_by_id=1, transform_id=1, run_id=1)
Advantages and disadvantages of semantic keys
Semantic keys have several advantages and disadvantages that we will discuss and demonstrate in the remainder of this notebook:
Advantages
Simple: It can be easier to refer to specific collections in conversations
Familiarity: Most people are familiar with the concept of semantic names
Disadvantages
Length: Semantic names can be long with limited aesthetic appeal
Inconsistency: Lack of naming conventions can lead to confusion
Limited metadata: Semantic keys can contain some, but usually not all metadata
Inefficiency: Writing lengthy semantic names is a repetitive process and can be time-consuming
Ambiguity: Overly descriptive artifact names may introduce ambiguity and redundancy
Clashes: Several people may attempt to use the same semantic key, because semantic keys are not unique.
Renaming artifacts
Renaming artifacts that have associated keys can be done on several levels.
In storage
An artifact can be moved or renamed locally:
artifact_key_3.path
PosixUPath('/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/EYaY0xEEWx5dn8DiyDsb.txt')
loaded_artifact = artifact_key_3.load()
!mkdir complex_biological_project/moved_artifacts
!mv complex_biological_project/raw/raw_data_3.txt complex_biological_project/moved_artifacts
artifact_key_3.path
PosixUPath('/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/EYaY0xEEWx5dn8DiyDsb.txt')
After moving the artifact locally, the storage location (the path) has not changed and the artifact can still be loaded.
artifact_3 = artifact_key_3.load()
The same applies to the key, which has not changed.
artifact_key_3.key
'raw/raw_data_3.txt'
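This behavior follows from the copy made on save(): once the content lives under the storage root, the original file is no longer needed. A minimal stand-alone sketch with temporary directories illustrates the idea (plain Python, no LaminDB; the directory layout and the uid abc123 are made up for illustration):

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    source = tmp_path / "project" / "raw_data_3.txt"
    source.parent.mkdir(parents=True)
    source.write_text("ABCDEFGHIJ")

    # "Save": copy the content into the storage root under a uid-based name.
    stored = tmp_path / "storage" / ".lamindb" / "abc123.txt"
    stored.parent.mkdir(parents=True)
    shutil.copy(source, stored)

    # Move the original elsewhere, as done with `mv` above.
    moved = tmp_path / "project" / "moved_artifacts" / source.name
    moved.parent.mkdir()
    shutil.move(str(source), str(moved))

    # The stored copy is unaffected by the move of the original.
    source_exists = source.exists()
    stored_content = stored.read_text()
    print(source_exists, stored_content)
```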
By key
Besides moving the artifact in storage, the key can also be renamed.
artifact_key_4.key
'raw/raw_data_4.txt'
artifact_key_4.key = "bad_samples/sample_data_4.txt"
artifact_key_4.key
'bad_samples/sample_data_4.txt'
Due to the usage of virtual keys, modifying the key does not change the storage location and the artifact stays accessible.
artifact_key_4.path
PosixUPath('/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/9nvosCWH4pWsCsgewR4Y.txt')
artifact_4 = artifact_key_4.load()
Modifying the path attribute
However, modifying the path directly is not allowed:
try:
    artifact_key_4.path = f"{ln.settings.storage}/here_now/sample_data_4.txt"
except AttributeError as e:
    print(e)
property of 'Artifact' object has no setter
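The error message reflects standard Python property semantics: a property defined without a setter rejects assignment. A minimal sketch with a hypothetical class (a stand-in, not LaminDB's Artifact) shows the same pattern:

```python
class ReadOnlyPathRecord:
    """Hypothetical record whose path is derived from other fields."""

    def __init__(self, root: str, uid: str, suffix: str):
        self._root = root
        self.uid = uid
        self.suffix = suffix

    @property
    def path(self) -> str:
        # Computed on access; there is deliberately no setter.
        return f"{self._root}/.lamindb/{self.uid}{self.suffix}"


record = ReadOnlyPathRecord("key-eval", "abc123", ".txt")
try:
    record.path = "key-eval/here_now/sample_data_4.txt"
    assignment_failed = False
except AttributeError:
    assignment_failed = True

print(assignment_failed)
print(record.path)
```

Deriving the path from the storage root and the uid keeps the database record and the stored file from drifting apart.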
Clashing semantic keys
Semantic keys should not clash. Let’s attempt to use the same semantic key twice:
print(artifact_key_3.key)
print(artifact_key_4.key)
raw/raw_data_3.txt
bad_samples/sample_data_4.txt
artifact_key_4.key = "raw/raw_data_3.txt"
print(artifact_key_3.key)
print(artifact_key_4.key)
raw/raw_data_3.txt
raw/raw_data_3.txt
When filtering for this semantic key, it is now unclear which artifact we are referring to:
ln.Artifact.filter(key__icontains="sample_data_3").df()
id | uid | version | description | key | suffix | accessor | size | hash | hash_type | n_objects | n_observations | visibility | key_is_virtual | storage_id | transform_id | run_id | created_by_id | updated_at |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
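When querying by key, LaminDB cannot resolve which artifact we actually wanted.
Note that the query above returns no rows at all: the filter string sample_data_3 matches neither key, and the reassigned key on artifact_key_4 has not yet been saved to the database.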
print(artifact_key_3.uid)
print(artifact_key_4.uid)
EYaY0xEEWx5dn8DiyDsb
9nvosCWH4pWsCsgewR4Y
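Both artifacts still exist, though, with unique uids that can be used to access them.
Saving artifacts with clashing keys may be rejected with an IntegrityError. In the run shown here (lamindb 0.73.0), however, the cell below prints nothing and both artifacts later appear with the key raw/raw_data_3.txt, so do not rely on the database alone to prevent clashes.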
try:
    artifact_key_3.save()
    artifact_key_4.save()
except Exception as e:
    print(
        "It is not possible to save artifacts to the same key. This results in an"
        " Integrity Error!"
    )
We refer to What happens if I save the same artifacts & records twice? for more detailed explanations of behavior when attempting to save artifacts multiple times.
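Since keys are not unique while uids are, a clash can be detected before saving by grouping records by key. The sketch below works on plain (uid, key) tuples; the helper find_key_clashes is hypothetical and not part of LaminDB.

```python
from collections import defaultdict


def find_key_clashes(records):
    """Return {key: [uids]} for every key shared by more than one uid."""
    uids_by_key = defaultdict(list)
    for uid, key in records:
        if key is not None:  # artifacts without a key cannot clash
            uids_by_key[key].append(uid)
    return {key: uids for key, uids in uids_by_key.items() if len(uids) > 1}


# (uid, key) pairs mirroring the state of the artifacts above
records = [
    ("EYaY0xEEWx5dn8DiyDsb", "raw/raw_data_3.txt"),
    ("9nvosCWH4pWsCsgewR4Y", "raw/raw_data_3.txt"),
    ("K8hkdSmFNeQ1lgicvvOy", None),
]
clashes = find_key_clashes(records)
print(clashes)
```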
Hierarchies
Another common use case for keys is artifact hierarchies.
It can be useful to mirror the artifact structure of “complex_biological_project” from above in LaminDB, so that artifacts stored in specific folders can be queried.
Common examples are folders for different processing stages such as raw, preprocessed, or annotated.
Note that this use case overlaps with Collection, which also allows for grouping artifacts.
However, Collection cannot model hierarchical groupings.
Key
import os
for root, _, artifacts in os.walk("complex_biological_project/raw"):
    for artifactname in artifacts:
        file_path = os.path.join(root, artifactname)
        key_path = file_path.removeprefix("complex_biological_project")
        ln_artifact = ln.Artifact(file_path, key=key_path)
        ln_artifact.save()
💡 returning existing artifact with same hash: Artifact(uid='K8hkdSmFNeQ1lgicvvOy', suffix='.txt', size=10, hash='9RX1_cP63CnI0ECQquZAnQ', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-05 10:53:34 UTC')
❗ key None on existing artifact differs from passed key /raw/raw_data_1.txt
💡 returning existing artifact with same hash: Artifact(uid='jwJBBsDBcJEyTcdJRubU', suffix='.txt', size=10, hash='krapfLSobo-Zz0N-ByDIMw', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-05 10:53:34 UTC')
❗ key None on existing artifact differs from passed key /raw/raw_data_2.txt
💡 returning existing artifact with same hash: Artifact(uid='9nvosCWH4pWsCsgewR4Y', key='raw/raw_data_3.txt', suffix='.txt', size=10, hash='KH4K95aN__RwMDnPeep1Iw', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-05 10:53:35 UTC')
❗ key raw/raw_data_3.txt on existing artifact differs from passed key /raw/raw_data_4.txt
💡 path content will be copied to default storage upon `save()` with key '/raw/Collection_2/raw_data_1.txt'
✅ storing artifact 'P56NZfkOldIJfbjwlCLn' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/P56NZfkOldIJfbjwlCLn.txt'
💡 path content will be copied to default storage upon `save()` with key '/raw/Collection_2/raw_data_2.txt'
✅ storing artifact 'uEQDF26HKgon1IkeDukn' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/uEQDF26HKgon1IkeDukn.txt'
💡 path content will be copied to default storage upon `save()` with key '/raw/Collection_2/raw_data_4.txt'
✅ storing artifact 'K5zDWqycRbpbCsxDcPs2' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/K5zDWqycRbpbCsxDcPs2.txt'
💡 path content will be copied to default storage upon `save()` with key '/raw/Collection_2/raw_data_3.txt'
✅ storing artifact '7p9UOHdDuID3V0EqppIK' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/7p9UOHdDuID3V0EqppIK.txt'
💡 path content will be copied to default storage upon `save()` with key '/raw/Collection_1/raw_data_1.txt'
✅ storing artifact '1qCQFeHBVOENwOqSAm2C' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/1qCQFeHBVOENwOqSAm2C.txt'
💡 path content will be copied to default storage upon `save()` with key '/raw/Collection_1/raw_data_2.txt'
✅ storing artifact 'GCuSkC4TR4pNzawj49Qd' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/GCuSkC4TR4pNzawj49Qd.txt'
💡 path content will be copied to default storage upon `save()` with key '/raw/Collection_1/raw_data_4.txt'
✅ storing artifact 'gStJ5tJkl2N0Yl4wjtZD' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/gStJ5tJkl2N0Yl4wjtZD.txt'
💡 path content will be copied to default storage upon `save()` with key '/raw/Collection_1/raw_data_3.txt'
✅ storing artifact 'N184PGRDDPCV0tExoOpk' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/N184PGRDDPCV0tExoOpk.txt'
ln.Artifact.filter(key__startswith="raw").df()
id | uid | version | description | key | suffix | accessor | size | hash | hash_type | n_objects | n_observations | visibility | key_is_virtual | storage_id | transform_id | run_id | created_by_id | updated_at |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | EYaY0xEEWx5dn8DiyDsb | None | None | raw/raw_data_3.txt | .txt | None | 10 | gHvRrd-17KwsJ74triHs4Q | md5 | None | None | 1 | True | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.259544+00:00 |
4 | 9nvosCWH4pWsCsgewR4Y | None | None | raw/raw_data_3.txt | .txt | None | 10 | KH4K95aN__RwMDnPeep1Iw | md5 | None | None | 1 | True | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.302610+00:00 |
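The query returns only the two artifacts saved earlier because the keys passed in the loop start with a slash (for example /raw/Collection_2/raw_data_1.txt): str.removeprefix strips the folder name but leaves the separator, so key__startswith="raw" does not match them. One way to derive keys without the leading slash is pathlib's relative_to. The helper name key_from_path below is hypothetical, and the sketch assumes POSIX-style paths.

```python
from pathlib import PurePosixPath


def key_from_path(file_path: str, project_root: str) -> str:
    """Return file_path relative to project_root, usable as a key (hypothetical helper)."""
    return PurePosixPath(file_path).relative_to(project_root).as_posix()


file_path = "complex_biological_project/raw/Collection_2/raw_data_1.txt"

# What the loop above produces: the leading separator is kept.
prefixed_key = file_path.removeprefix("complex_biological_project")

# Relative key without a leading slash.
relative_key = key_from_path(file_path, "complex_biological_project")

print(prefixed_key)  # /raw/Collection_2/raw_data_1.txt
print(relative_key)  # raw/Collection_2/raw_data_1.txt
print(relative_key.startswith("raw"))  # True
```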
Collection
Alternatively, it would have been possible to create a Collection with a corresponding name:
all_data_paths = []
for root, _, artifacts in os.walk("complex_biological_project/raw"):
    for artifactname in artifacts:
        file_path = os.path.join(root, artifactname)
        all_data_paths.append(file_path)
all_data_artifacts = []
for path in all_data_paths:
    all_data_artifacts.append(ln.Artifact(path))
data_ds = ln.Collection(all_data_artifacts, name="data")
data_ds.save()
💡 returning existing artifact with same hash: Artifact(uid='K8hkdSmFNeQ1lgicvvOy', suffix='.txt', size=10, hash='9RX1_cP63CnI0ECQquZAnQ', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-05 10:53:35 UTC')
💡 returning existing artifact with same hash: Artifact(uid='jwJBBsDBcJEyTcdJRubU', suffix='.txt', size=10, hash='krapfLSobo-Zz0N-ByDIMw', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-05 10:53:35 UTC')
💡 returning existing artifact with same hash: Artifact(uid='9nvosCWH4pWsCsgewR4Y', key='raw/raw_data_3.txt', suffix='.txt', size=10, hash='KH4K95aN__RwMDnPeep1Iw', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-05 10:53:35 UTC')
💡 returning existing artifact with same hash: Artifact(uid='P56NZfkOldIJfbjwlCLn', key='/raw/Collection_2/raw_data_1.txt', suffix='.txt', size=10, hash='3kWliF8jgv82QS8Leh3Vzg', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-05 10:53:35 UTC')
💡 returning existing artifact with same hash: Artifact(uid='uEQDF26HKgon1IkeDukn', key='/raw/Collection_2/raw_data_2.txt', suffix='.txt', size=10, hash='-6Mr716Jv2SRhB-b3j22FQ', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-05 10:53:35 UTC')
💡 returning existing artifact with same hash: Artifact(uid='K5zDWqycRbpbCsxDcPs2', key='/raw/Collection_2/raw_data_4.txt', suffix='.txt', size=10, hash='_diti4NVMjDxBtg2zVftdA', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-05 10:53:35 UTC')
💡 returning existing artifact with same hash: Artifact(uid='7p9UOHdDuID3V0EqppIK', key='/raw/Collection_2/raw_data_3.txt', suffix='.txt', size=10, hash='k1-iaLkG77lywLfB0Ug9VQ', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-05 10:53:35 UTC')
💡 returning existing artifact with same hash: Artifact(uid='1qCQFeHBVOENwOqSAm2C', key='/raw/Collection_1/raw_data_1.txt', suffix='.txt', size=10, hash='ugPIvW0bb0h8kRpe3_oWyA', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-05 10:53:35 UTC')
💡 returning existing artifact with same hash: Artifact(uid='GCuSkC4TR4pNzawj49Qd', key='/raw/Collection_1/raw_data_2.txt', suffix='.txt', size=10, hash='ssJoW3RRg0ywqEb79afI6w', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-05 10:53:35 UTC')
💡 returning existing artifact with same hash: Artifact(uid='gStJ5tJkl2N0Yl4wjtZD', key='/raw/Collection_1/raw_data_4.txt', suffix='.txt', size=10, hash='oUzYbJuYXLKpoxci3MiD3w', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-05 10:53:35 UTC')
💡 returning existing artifact with same hash: Artifact(uid='N184PGRDDPCV0tExoOpk', key='/raw/Collection_1/raw_data_3.txt', suffix='.txt', size=10, hash='k05i0-5NQFLmIm55y_ua-A', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-05 10:53:35 UTC')
ln.Collection.filter(name__icontains="data").df()
id | uid | version | name | description | hash | reference | reference_type | visibility | transform_id | artifact_id | run_id | created_by_id | updated_at |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | fVv0Yg2GSmnNMEM8uA7C | None | data | None | x2c6Wt31oLPQsGu0h5SY | None | None | 1 | 1 | None | 1 | 1 | 2024-06-05 10:53:35.468000+00:00 |
This approach will likely lead to clashes. Alternatively, ULabels can be added to artifacts to mirror hierarchies.
ULabels
for root, _, artifacts in os.walk("complex_biological_project/raw"):
    for artifactname in artifacts:
        file_path = os.path.join(root, artifactname)
        key_path = file_path.removeprefix("complex_biological_project")
        ln_artifact = ln.Artifact(file_path, key=key_path)
        ln_artifact.save()
        data_label = ln.ULabel(name="data")
        data_label.save()
        ln_artifact.ulabels.add(data_label)
💡 returning existing artifact with same hash: Artifact(uid='K8hkdSmFNeQ1lgicvvOy', suffix='.txt', size=10, hash='9RX1_cP63CnI0ECQquZAnQ', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-05 10:53:35 UTC')
❗ key None on existing artifact differs from passed key /raw/raw_data_1.txt
💡 returning existing artifact with same hash: Artifact(uid='jwJBBsDBcJEyTcdJRubU', suffix='.txt', size=10, hash='krapfLSobo-Zz0N-ByDIMw', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-05 10:53:35 UTC')
❗ key None on existing artifact differs from passed key /raw/raw_data_2.txt
💡 returning existing ULabel record with same name: 'data'
💡 returning existing artifact with same hash: Artifact(uid='9nvosCWH4pWsCsgewR4Y', key='raw/raw_data_3.txt', suffix='.txt', size=10, hash='KH4K95aN__RwMDnPeep1Iw', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-05 10:53:35 UTC')
❗ key raw/raw_data_3.txt on existing artifact differs from passed key /raw/raw_data_4.txt
💡 returning existing ULabel record with same name: 'data'
💡 returning existing artifact with same hash: Artifact(uid='P56NZfkOldIJfbjwlCLn', key='/raw/Collection_2/raw_data_1.txt', suffix='.txt', size=10, hash='3kWliF8jgv82QS8Leh3Vzg', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-05 10:53:35 UTC')
💡 returning existing ULabel record with same name: 'data'
💡 returning existing artifact with same hash: Artifact(uid='uEQDF26HKgon1IkeDukn', key='/raw/Collection_2/raw_data_2.txt', suffix='.txt', size=10, hash='-6Mr716Jv2SRhB-b3j22FQ', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-05 10:53:35 UTC')
💡 returning existing ULabel record with same name: 'data'
💡 returning existing artifact with same hash: Artifact(uid='K5zDWqycRbpbCsxDcPs2', key='/raw/Collection_2/raw_data_4.txt', suffix='.txt', size=10, hash='_diti4NVMjDxBtg2zVftdA', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-05 10:53:35 UTC')
💡 returning existing ULabel record with same name: 'data'
💡 returning existing artifact with same hash: Artifact(uid='7p9UOHdDuID3V0EqppIK', key='/raw/Collection_2/raw_data_3.txt', suffix='.txt', size=10, hash='k1-iaLkG77lywLfB0Ug9VQ', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-05 10:53:35 UTC')
💡 returning existing ULabel record with same name: 'data'
💡 returning existing artifact with same hash: Artifact(uid='1qCQFeHBVOENwOqSAm2C', key='/raw/Collection_1/raw_data_1.txt', suffix='.txt', size=10, hash='ugPIvW0bb0h8kRpe3_oWyA', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-05 10:53:35 UTC')
💡 returning existing ULabel record with same name: 'data'
💡 returning existing artifact with same hash: Artifact(uid='GCuSkC4TR4pNzawj49Qd', key='/raw/Collection_1/raw_data_2.txt', suffix='.txt', size=10, hash='ssJoW3RRg0ywqEb79afI6w', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-05 10:53:35 UTC')
💡 returning existing ULabel record with same name: 'data'
💡 returning existing artifact with same hash: Artifact(uid='gStJ5tJkl2N0Yl4wjtZD', key='/raw/Collection_1/raw_data_4.txt', suffix='.txt', size=10, hash='oUzYbJuYXLKpoxci3MiD3w', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-05 10:53:35 UTC')
💡 returning existing ULabel record with same name: 'data'
💡 returning existing artifact with same hash: Artifact(uid='N184PGRDDPCV0tExoOpk', key='/raw/Collection_1/raw_data_3.txt', suffix='.txt', size=10, hash='k05i0-5NQFLmIm55y_ua-A', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-05 10:53:35 UTC')
💡 returning existing ULabel record with same name: 'data'
labels = ln.ULabel.lookup()
ln.Artifact.filter(ulabels__in=[labels.data]).df()
id | uid | version | description | key | suffix | accessor | size | hash | hash_type | n_objects | n_observations | visibility | key_is_virtual | storage_id | transform_id | run_id | created_by_id | updated_at |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | K8hkdSmFNeQ1lgicvvOy | None | None | None | .txt | None | 10 | 9RX1_cP63CnI0ECQquZAnQ | md5 | None | None | 1 | True | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.512018+00:00 |
2 | jwJBBsDBcJEyTcdJRubU | None | None | None | .txt | None | 10 | krapfLSobo-Zz0N-ByDIMw | md5 | None | None | 1 | True | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.532537+00:00 |
4 | 9nvosCWH4pWsCsgewR4Y | None | None | raw/raw_data_3.txt | .txt | None | 10 | KH4K95aN__RwMDnPeep1Iw | md5 | None | None | 1 | True | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.554545+00:00 |
5 | P56NZfkOldIJfbjwlCLn | None | None | /raw/Collection_2/raw_data_1.txt | .txt | None | 10 | 3kWliF8jgv82QS8Leh3Vzg | md5 | None | None | 1 | True | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.576336+00:00 |
6 | uEQDF26HKgon1IkeDukn | None | None | /raw/Collection_2/raw_data_2.txt | .txt | None | 10 | -6Mr716Jv2SRhB-b3j22FQ | md5 | None | None | 1 | True | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.598646+00:00 |
7 | K5zDWqycRbpbCsxDcPs2 | None | None | /raw/Collection_2/raw_data_4.txt | .txt | None | 10 | _diti4NVMjDxBtg2zVftdA | md5 | None | None | 1 | True | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.619821+00:00 |
8 | 7p9UOHdDuID3V0EqppIK | None | None | /raw/Collection_2/raw_data_3.txt | .txt | None | 10 | k1-iaLkG77lywLfB0Ug9VQ | md5 | None | None | 1 | True | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.642792+00:00 |
9 | 1qCQFeHBVOENwOqSAm2C | None | None | /raw/Collection_1/raw_data_1.txt | .txt | None | 10 | ugPIvW0bb0h8kRpe3_oWyA | md5 | None | None | 1 | True | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.664083+00:00 |
10 | GCuSkC4TR4pNzawj49Qd | None | None | /raw/Collection_1/raw_data_2.txt | .txt | None | 10 | ssJoW3RRg0ywqEb79afI6w | md5 | None | None | 1 | True | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.684818+00:00 |
11 | gStJ5tJkl2N0Yl4wjtZD | None | None | /raw/Collection_1/raw_data_4.txt | .txt | None | 10 | oUzYbJuYXLKpoxci3MiD3w | md5 | None | None | 1 | True | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.705203+00:00 |
12 | N184PGRDDPCV0tExoOpk | None | None | /raw/Collection_1/raw_data_3.txt | .txt | None | 10 | k05i0-5NQFLmIm55y_ua-A | md5 | None | None | 1 | True | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.726405+00:00 |
However, ULabels are too versatile for such an approach, and clashes are to be expected here as well.
Metadata
Because the chance of clashes is rather high for the aforementioned approaches, we generally recommend not storing hierarchical data solely with semantic keys.
Biological metadata makes artifacts and collections unambiguous and easily queryable.
Legacy data and multiple storage roots
Distributed Collections
LaminDB can ingest legacy data that already has a structure in its storage.
In such cases, it disables artifact_use_virtual_keys and the artifacts are ingested with their actual storage location.
It is therefore possible that artifacts stored in different storage roots are associated with a single Collection.
To simulate this, we disable artifact_use_virtual_keys and ingest artifacts stored in a different path (the “legacy data”).
ln.settings.artifact_use_virtual_keys = False
for root, _, artifacts in os.walk("complex_biological_project/preprocessed"):
    for artifactname in artifacts:
        file_path = os.path.join(root, artifactname)
        key_path = file_path.removeprefix("complex_biological_project")
        print(file_path)
        print()
        ln_artifact = ln.Artifact(file_path, key=f"./{key_path}")
        ln_artifact.save()
complex_biological_project/preprocessed/result_2.txt
💡 path content will be copied to default storage upon `save()` with key './/preprocessed/result_2.txt'
✅ storing artifact 'M8gOUhgbxPyoXnpDxgc9' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/preprocessed/result_2.txt'
complex_biological_project/preprocessed/result_4.txt
💡 path content will be copied to default storage upon `save()` with key './/preprocessed/result_4.txt'
✅ storing artifact 'BF7mI0g4ztQ1uVSlHK8b' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/preprocessed/result_4.txt'
complex_biological_project/preprocessed/result_3.txt
💡 path content will be copied to default storage upon `save()` with key './/preprocessed/result_3.txt'
✅ storing artifact 'HmZIBETahU3FpZbtDfMw' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/preprocessed/result_3.txt'
complex_biological_project/preprocessed/result_1.txt
💡 path content will be copied to default storage upon `save()` with key './/preprocessed/result_1.txt'
✅ storing artifact 'edd7PpRxkSs8UNRfrLf4' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/preprocessed/result_1.txt'
ln.Artifact.df()
id | uid | version | description | key | suffix | accessor | size | hash | hash_type | n_objects | n_observations | visibility | key_is_virtual | storage_id | transform_id | run_id | created_by_id | updated_at |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
16 | edd7PpRxkSs8UNRfrLf4 | None | None | .//preprocessed/result_1.txt | .txt | None | 10 | 81Wg-SWsUjbAb4HqDnd3dw | md5 | None | None | 1 | False | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.822922+00:00 |
15 | HmZIBETahU3FpZbtDfMw | None | None | .//preprocessed/result_3.txt | .txt | None | 10 | CRUjamGa0kfwFig6m-zP0g | md5 | None | None | 1 | False | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.814986+00:00 |
14 | BF7mI0g4ztQ1uVSlHK8b | None | None | .//preprocessed/result_4.txt | .txt | None | 10 | x0hElZGXalH9cN5IUSJ__w | md5 | None | None | 1 | False | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.806831+00:00 |
13 | M8gOUhgbxPyoXnpDxgc9 | None | None | .//preprocessed/result_2.txt | .txt | None | 10 | 41qHRX9h1_FsUowEQdWlWw | md5 | None | None | 1 | False | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.798567+00:00 |
12 | N184PGRDDPCV0tExoOpk | None | None | /raw/Collection_1/raw_data_3.txt | .txt | None | 10 | k05i0-5NQFLmIm55y_ua-A | md5 | None | None | 1 | True | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.726405+00:00 |
11 | gStJ5tJkl2N0Yl4wjtZD | None | None | /raw/Collection_1/raw_data_4.txt | .txt | None | 10 | oUzYbJuYXLKpoxci3MiD3w | md5 | None | None | 1 | True | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.705203+00:00 |
10 | GCuSkC4TR4pNzawj49Qd | None | None | /raw/Collection_1/raw_data_2.txt | .txt | None | 10 | ssJoW3RRg0ywqEb79afI6w | md5 | None | None | 1 | True | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.684818+00:00 |
9 | 1qCQFeHBVOENwOqSAm2C | None | None | /raw/Collection_1/raw_data_1.txt | .txt | None | 10 | ugPIvW0bb0h8kRpe3_oWyA | md5 | None | None | 1 | True | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.664083+00:00 |
8 | 7p9UOHdDuID3V0EqppIK | None | None | /raw/Collection_2/raw_data_3.txt | .txt | None | 10 | k1-iaLkG77lywLfB0Ug9VQ | md5 | None | None | 1 | True | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.642792+00:00 |
7 | K5zDWqycRbpbCsxDcPs2 | None | None | /raw/Collection_2/raw_data_4.txt | .txt | None | 10 | _diti4NVMjDxBtg2zVftdA | md5 | None | None | 1 | True | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.619821+00:00 |
6 | uEQDF26HKgon1IkeDukn | None | None | /raw/Collection_2/raw_data_2.txt | .txt | None | 10 | -6Mr716Jv2SRhB-b3j22FQ | md5 | None | None | 1 | True | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.598646+00:00 |
5 | P56NZfkOldIJfbjwlCLn | None | None | /raw/Collection_2/raw_data_1.txt | .txt | None | 10 | 3kWliF8jgv82QS8Leh3Vzg | md5 | None | None | 1 | True | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.576336+00:00 |
4 | 9nvosCWH4pWsCsgewR4Y | None | None | raw/raw_data_3.txt | .txt | None | 10 | KH4K95aN__RwMDnPeep1Iw | md5 | None | None | 1 | True | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.554545+00:00 |
2 | jwJBBsDBcJEyTcdJRubU | None | None | None | .txt | None | 10 | krapfLSobo-Zz0N-ByDIMw | md5 | None | None | 1 | True | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.532537+00:00 |
1 | K8hkdSmFNeQ1lgicvvOy | None | None | None | .txt | None | 10 | 9RX1_cP63CnI0ECQquZAnQ | md5 | None | None | 1 | True | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.512018+00:00 |
3 | EYaY0xEEWx5dn8DiyDsb | None | None | raw/raw_data_3.txt | .txt | None | 10 | gHvRrd-17KwsJ74triHs4Q | md5 | None | None | 1 | True | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.259544+00:00 |
artifact_from_raw = ln.Artifact.filter(key__icontains="Collection_2/raw_data_1").first()
artifact_from_preprocessed = ln.Artifact.filter(
    key__icontains="preprocessed/result_1"
).first()
print(artifact_from_raw.path)
print(artifact_from_preprocessed.path)
/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/P56NZfkOldIJfbjwlCLn.txt
/home/runner/work/lamindb/lamindb/docs/faq/key-eval/preprocessed/result_1.txt
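The two paths follow a pattern that can be read off the output above: with a virtual key, the file is stored under `.lamindb/` and named after the artifact's `uid` plus its suffix, whereas a non-virtual key is itself the path relative to the storage root. Below is a minimal sketch of that pattern in plain Python. `resolve_storage_path` is a hypothetical helper written for illustration, not a LaminDB function, and it only covers the file artifacts shown here.

```python
from pathlib import PurePosixPath


def resolve_storage_path(storage_root, uid, suffix, key, key_is_virtual):
    """Hypothetical helper: mimic where a file artifact ends up in storage.

    This is an illustration of the pattern visible in the printed paths,
    not LaminDB's actual implementation.
    """
    root = PurePosixPath(storage_root)
    if key_is_virtual or key is None:
        # Virtual key: the key is metadata only; the file is named by its uid.
        return root / ".lamindb" / f"{uid}{suffix}"
    # Non-virtual key: the key is the relative path inside the storage root.
    # pathlib collapses the leading "./" and the doubled slash.
    return root / key


storage_root = "/home/runner/work/lamindb/lamindb/docs/faq/key-eval"

raw_path = resolve_storage_path(
    storage_root,
    uid="P56NZfkOldIJfbjwlCLn",
    suffix=".txt",
    key="/raw/Collection_2/raw_data_1.txt",
    key_is_virtual=True,
)
preprocessed_path = resolve_storage_path(
    storage_root,
    uid="edd7PpRxkSs8UNRfrLf4",
    suffix=".txt",
    key=".//preprocessed/result_1.txt",
    key_is_virtual=False,
)
print(raw_path)
print(preprocessed_path)
```

With the values from the table, the sketch reproduces both printed paths, which shows why renaming a virtual key would not require moving the file in storage.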
Let's create our Collection:
ds = ln.Collection(
    [artifact_from_raw, artifact_from_preprocessed], name="raw_and_processed_collection_2"
)
ds.save()
ds.artifacts.df()
id | uid | version | description | key | suffix | accessor | size | hash | hash_type | n_objects | n_observations | visibility | key_is_virtual | storage_id | transform_id | run_id | created_by_id | updated_at |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | P56NZfkOldIJfbjwlCLn | None | None | /raw/Collection_2/raw_data_1.txt | .txt | None | 10 | 3kWliF8jgv82QS8Leh3Vzg | md5 | None | None | 1 | True | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.576336+00:00 |
16 | edd7PpRxkSs8UNRfrLf4 | None | None | .//preprocessed/result_1.txt | .txt | None | 10 | 81Wg-SWsUjbAb4HqDnd3dw | md5 | None | None | 1 | False | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.822922+00:00 |
Modeling directories
ln.settings.artifact_use_virtual_keys = True
dir_path = ln.core.datasets.dir_scrnaseq_cellranger("sample_001")
ln.UPath(dir_path).view_tree()
💡 file has more than one suffix (path.suffixes), using only last suffix: '.bai' - if you want your composite suffix to be recognized add it to lamindb.core.storage.VALID_SUFFIXES.add()
3 sub-directories & 15 files with suffixes '.tsv.gz', '.bai', '.bam', '.csv', '.h5', '.mtx.gz', '.cloupe', '.html'
/home/runner/work/lamindb/lamindb/docs/faq/sample_001
├── possorted_genome_bam.bam
├── raw_feature_bc_matrix.h5
├── molecule_info.h5
├── filtered_feature_bc_matrix.h5
├── raw_feature_bc_matrix/
│   ├── matrix.mtx.gz
│   ├── barcodes.tsv.gz
│   └── features.tsv.gz
├── metrics_summary.csv
├── cloupe.cloupe
├── web_summary.html
├── analysis/
│   └── analysis.csv
├── possorted_genome_bam.bam.bai
└── filtered_feature_bc_matrix/
    ├── matrix.mtx.gz
    ├── barcodes.tsv.gz
    └── features.tsv.gz
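There are two ways to create Artifact objects from directories: Artifact.from_dir(), which creates one artifact per file in the directory, and the Artifact constructor, which creates a single artifact for the entire directory.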
cellranger_raw_artifact = ln.Artifact.from_dir("sample_001/raw_feature_bc_matrix/")
❗ this creates one artifact per file in the directory - you might simply call ln.Artifact(dir) to get one artifact for the entire directory
❗ folder is outside existing storage location, will copy files from sample_001/raw_feature_bc_matrix/ to /home/runner/work/lamindb/lamindb/docs/faq/key-eval/raw_feature_bc_matrix
✅ created 3 artifacts from directory using storage /home/runner/work/lamindb/lamindb/docs/faq/key-eval and key = raw_feature_bc_matrix/
for artifact in cellranger_raw_artifact:
    artifact.save()
✅ storing artifact 'bX9fXX2gjeHs35wH476m' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/bX9fXX2gjeHs35wH476m.mtx.gz'
✅ storing artifact 'lfAVu6AxFYFH8Ag3BZ9l' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/lfAVu6AxFYFH8Ag3BZ9l.tsv.gz'
✅ storing artifact 'c2Q3wUuzPfTI6X1Pwzde' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/c2Q3wUuzPfTI6X1Pwzde.tsv.gz'
cellranger_raw_folder = ln.Artifact(
    "sample_001/raw_feature_bc_matrix/", description="cellranger raw"
)
cellranger_raw_folder.save()
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/uQCWS2hX03MaBolN')
✅ storing artifact 'uQCWS2hX03MaBolNBkY9' at '/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/uQCWS2hX03MaBolN'
Artifact(uid='uQCWS2hX03MaBolNBkY9', description='cellranger raw', suffix='', size=18, hash='-3ZErJYvbDRZz4ypdHTbFg', hash_type='md5-d', n_objects=3, visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-05 10:53:35 UTC')
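The two approaches differ in granularity: the per-file route yields three records keyed by `raw_feature_bc_matrix/<file name>`, while the folder route yields one record with `key=None`, `n_objects=3`, and `size=18`. The following sketch imitates that bookkeeping with the standard library only. The helpers `per_file_records` and `folder_record` are hypothetical names for illustration and do not reflect how LaminDB computes these fields internally; the 6-byte dummy files are an assumption chosen to match the sizes in the output.

```python
import tempfile
from pathlib import Path


def per_file_records(folder):
    """Hypothetical helper: one record per file, keyed by '<folder>/<file>'."""
    folder = Path(folder)
    return [
        {"key": f"{folder.name}/{p.name}", "size": p.stat().st_size}
        for p in sorted(folder.iterdir())
        if p.is_file()
    ]


def folder_record(folder):
    """Hypothetical helper: a single record summarizing the whole folder."""
    files = [p for p in Path(folder).rglob("*") if p.is_file()]
    return {
        "key": None,  # no semantic key was passed for the folder artifact
        "n_objects": len(files),
        "size": sum(p.stat().st_size for p in files),
    }


with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "raw_feature_bc_matrix"
    demo.mkdir()
    # Dummy 6-byte files standing in for the real matrix files.
    for name in ("matrix.mtx.gz", "barcodes.tsv.gz", "features.tsv.gz"):
        (demo / name).write_bytes(b"123456")
    file_records = per_file_records(demo)
    dir_record = folder_record(demo)

print([r["key"] for r in file_records])
print(dir_record)
```

Per-file records can each be queried by key, whereas the single folder record has to be found through other metadata such as its description.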
ln.Artifact.filter(key__icontains="raw_feature_bc_matrix").df()
id | uid | version | description | key | suffix | accessor | size | hash | hash_type | n_objects | n_observations | visibility | key_is_virtual | storage_id | transform_id | run_id | created_by_id | updated_at |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
17 | bX9fXX2gjeHs35wH476m | None | None | raw_feature_bc_matrix/matrix.mtx.gz | .mtx.gz | None | 6 | UoLa_8cvjpUynenWYIttRw | md5 | None | None | 1 | True | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.956943+00:00 |
18 | lfAVu6AxFYFH8Ag3BZ9l | None | None | raw_feature_bc_matrix/barcodes.tsv.gz | .tsv.gz | None | 6 | u91hb4oMZjc5zwSSj_eQXg | md5 | None | None | 1 | True | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.962963+00:00 |
19 | c2Q3wUuzPfTI6X1Pwzde | None | None | raw_feature_bc_matrix/features.tsv.gz | .tsv.gz | None | 6 | 8kLsEjlmRZCJWK7ipTQgWg | md5 | None | None | 1 | True | 1 | 1 | 1 | 1 | 2024-06-05 10:53:35.967015+00:00 |
ln.Artifact.filter(key__icontains="raw_feature_bc_matrix/matrix.mtx.gz").one().path
PosixUPath('/home/runner/work/lamindb/lamindb/docs/faq/key-eval/.lamindb/bX9fXX2gjeHs35wH476m.mtx.gz')
artifact = ln.Artifact.filter(description="cellranger raw").one()
artifact.path.glob("*")
<generator object Path.glob at 0x7f2c98b05e00>
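`glob()` returns a lazy generator, which is why the output above shows a generator object rather than file paths. Wrapping the call in `list()` or `sorted()` materializes the entries. The sketch below demonstrates this with the standard library's `pathlib` on a temporary folder; the folder and its three file names are stand-ins that mirror `raw_feature_bc_matrix/`, not the stored artifact itself.

```python
import tempfile
from pathlib import Path


def list_directory_contents(folder):
    """Materialize the lazy glob generator into a sorted list of file names."""
    return sorted(p.name for p in Path(folder).glob("*"))


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in files; the names are taken from the directory tree above.
    for name in ("matrix.mtx.gz", "barcodes.tsv.gz", "features.tsv.gz"):
        (Path(tmp) / name).write_text("dummy")
    contents = list_directory_contents(tmp)

print(contents)
```

Sorting makes the result deterministic, since `glob()` does not guarantee any particular order.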