What happens if I save the same artifacts & records twice?

LaminDB’s operations are idempotent in the sense defined in this document.

This allows you to re-run a notebook or script without raising errors or duplicating data. Similar behavior holds for human data entry.

Summary

Metadata records

If you try to create any metadata record (Registry) and upon_create_search_names is True (the default):

  1. LaminDB will warn you if a record with a similar name exists and display a table of similar existing records.

  2. You can then decide whether you’d like to save a record to the database or rather query an existing one from the table.

  3. If a name already has an exact match in a registry, LaminDB will return the existing record instead of creating a new one. For versioned entities, the version must also be passed.

If you set upon_create_search_names to False, no search is performed and new records populate the DB directly.
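
The lookup order described above can be sketched in plain Python. This is an illustration only, not LaminDB's implementation: the ToyRegistry class, its create and save methods, and the similarity cutoff are hypothetical, and name similarity is approximated with difflib from the standard library.

```python
import difflib


class ToyRegistry:
    """Hypothetical in-memory registry (not a LaminDB class)."""

    def __init__(self, search_names=True):
        self.search_names = search_names  # stands in for upon_create_search_names
        self.records = []  # saved records, each a dict with a "name" key

    def create(self, name):
        """Return (record, similar_names) without modifying the registry."""
        if self.search_names:
            for record in self.records:
                if record["name"] == name:
                    # exact match: return the existing record, not a new one
                    return record, []
            # no exact match: list similar names so the caller can decide
            names = [r["name"] for r in self.records]
            similar = difflib.get_close_matches(name, names, n=5, cutoff=0.6)
            return {"name": name}, similar
        # search switched off: always construct a new record
        return {"name": name}, []

    def save(self, record):
        """Saving a record that is already stored changes nothing."""
        if not any(record is r for r in self.records):
            self.records.append(record)
        return record


registry = ToyRegistry()
first = registry.save(registry.create("My project 1")[0])
candidate, similar = registry.create("My project 1a")  # a similar name is reported
existing, _ = registry.create("My project 1")  # exact match returns `first`
registry.save(existing)  # no duplicate entry is created
```

With search_names set to False, the same create call would return a fresh record even for an existing name, which is why switching the search off can lead to duplicates.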

Data: artifacts & collections

If you try to create an Artifact object from the same content, the result depends on upon_artifact_create_if_hash_exists:

  • you’ll get an existing object, if upon_artifact_create_if_hash_exists = "warn_return_existing" (the default)

  • you’ll get an error, if upon_artifact_create_if_hash_exists = "error"

  • you’ll get a warning and a new object, if upon_artifact_create_if_hash_exists = "warn_create_new"
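
The three settings can be pictured as a policy applied to a lookup by content hash. The sketch below is a toy model under stated assumptions, not LaminDB's code: ToyStore, content_hash, and create are hypothetical names, creation and saving are merged into one step for brevity, and the hash format (MD5 digest, URL-safe base64, padding stripped) is only assumed from the 22-character md5 hashes that appear in the examples.

```python
import base64
import hashlib
import warnings


def content_hash(data: bytes) -> str:
    """MD5 digest as unpadded URL-safe base64 (assumed format, 22 characters)."""
    digest = hashlib.md5(data).digest()
    return base64.urlsafe_b64encode(digest).decode().rstrip("=")


class ToyStore:
    """Hypothetical artifact store (not a LaminDB class)."""

    def __init__(self, if_hash_exists="warn_return_existing"):
        # stands in for upon_artifact_create_if_hash_exists
        self.if_hash_exists = if_hash_exists
        self.artifacts = []

    def create(self, data: bytes, description: str) -> dict:
        h = content_hash(data)
        existing = next((a for a in self.artifacts if a["hash"] == h), None)
        if existing is not None:
            if self.if_hash_exists == "error":
                raise FileExistsError(f"artifact with hash {h} already exists")
            if self.if_hash_exists == "warn_return_existing":
                warnings.warn("returning existing artifact with same hash")
                return existing
            # "warn_create_new": warn, then fall through and create a record
            warnings.warn("creating new artifact despite existing hash")
        artifact = {"id": len(self.artifacts) + 1, "hash": h, "description": description}
        self.artifacts.append(artifact)
        return artifact


store = ToyStore()
a1 = store.create(b"same bytes", "My fcs artifact")
a2 = store.create(b"same bytes", "My fcs artifact")  # the existing artifact comes back

store.if_hash_exists = "error"
try:
    store.create(b"same bytes", "My new fcs artifact")
    error_message = None
except FileExistsError as err:
    error_message = str(err)

store.if_hash_exists = "warn_create_new"
a3 = store.create(b"same bytes", "My new fcs artifact")  # second record, same hash
```

Only the default setting keeps creation idempotent in this model: "error" refuses the second call, and "warn_create_new" deliberately stores a second record with an identical hash.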

Examples

!lamin init --storage ./test-idempotency
💡 connected lamindb: testuser1/test-idempotency
import lamindb as ln
import pytest

ln.settings.verbosity = "hint"
ln.settings.transform.stem_uid = "ANW20Fr4eZgM"
ln.settings.transform.version = "1"
ln.track()
💡 connected lamindb: testuser1/test-idempotency
💡 notebook imports: lamindb==0.73.0 pytest==8.2.2
💡 saved: Transform(uid='ANW20Fr4eZgM5zKv', version='1', name='What happens if I save the same artifacts & records twice?', key='idempotency', type='notebook', created_by_id=1, updated_at='2024-06-05 10:52:43 UTC')
💡 saved: Run(uid='8lyXO5CwQdiQMTjWaH73', transform_id=1, created_by_id=1)
💡 tracked pip freeze > /home/runner/.cache/lamindb/run_env_pip_8lyXO5CwQdiQMTjWaH73.txt
Run(uid='8lyXO5CwQdiQMTjWaH73', started_at='2024-06-05 10:52:43 UTC', is_consecutive=True, transform_id=1, created_by_id=1)

Metadata records

assert ln.settings.upon_create_search_names

Let us add a first record to the ULabel registry:

label = ln.ULabel(name="My project 1")
label.save()
ULabel(uid='JzRCrpib', name='My project 1', created_by_id=1, run_id=1, updated_at='2024-06-05 10:52:45 UTC')

If we create a new record, we’ll automatically get search results that hint at whether we are about to duplicate an entry:

label = ln.ULabel(name="My project 1a")
❗ record with similar name exists! did you mean to load it?
    uid       name          description  reference  reference_type  run_id  created_by_id  updated_at
id
1   JzRCrpib  My project 1  None         None       None            1       1              2024-06-05 10:52:45.422913+00:00
label.save()
ULabel(uid='BPkkAcGO', name='My project 1a', created_by_id=1, run_id=1, updated_at='2024-06-05 10:52:45 UTC')

If the name exactly matches an existing one, we’ll get the existing object:

label = ln.ULabel(name="My project 1")
💡 returning existing ULabel record with same name: 'My project 1'

If we save it again, it will not create a new entry in the registry:

label.save()
ULabel(uid='JzRCrpib', name='My project 1', created_by_id=1, run_id=1, updated_at='2024-06-05 10:52:45 UTC')

Now, if we create a third record, we’ll get two alternatives:

label = ln.ULabel(name="My project 1b")
❗ records with similar names exist! did you mean to load one of them?
    uid       name           description  reference  reference_type  run_id  created_by_id  updated_at
id
1   JzRCrpib  My project 1   None         None       None            1       1              2024-06-05 10:52:45.487034+00:00
2   BPkkAcGO  My project 1a  None         None       None            1       1              2024-06-05 10:52:45.462732+00:00

If we prefer not to perform a search, for example for performance reasons or because the logging is too noisy, we can switch it off:

ln.settings.upon_create_search_names = False
label = ln.ULabel(name="My project 1c")

For the rest of this walkthrough, we switch it back on:

ln.settings.upon_create_search_names = True

Data: artifacts and collections

Warn upon trying to re-ingest an existing artifact

assert ln.settings.upon_artifact_create_if_hash_exists == "warn_return_existing"
filepath = ln.core.datasets.file_fcs()

Create an Artifact:

artifact = ln.Artifact(filepath, description="My fcs artifact")
artifact.save()
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/WXhmbUyehNc1eNCBPrCO.fcs')
✅ storing artifact 'WXhmbUyehNc1eNCBPrCO' at '/home/runner/work/lamindb/lamindb/docs/faq/test-idempotency/.lamindb/WXhmbUyehNc1eNCBPrCO.fcs'
Artifact(uid='WXhmbUyehNc1eNCBPrCO', description='My fcs artifact', suffix='.fcs', size=6785467, hash='KCEXRahJ-Ui9Y6nksQ8z1A', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-05 10:52:45 UTC')
assert artifact.hash == "KCEXRahJ-Ui9Y6nksQ8z1A"

Create an Artifact from the same path:

artifact2 = ln.Artifact(filepath, description="My fcs artifact")
💡 returning existing artifact with same hash: Artifact(uid='WXhmbUyehNc1eNCBPrCO', description='My fcs artifact', suffix='.fcs', size=6785467, hash='KCEXRahJ-Ui9Y6nksQ8z1A', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-05 10:52:45 UTC')

It gives us the existing object:

assert artifact.id == artifact2.id
assert artifact.run == artifact2.run

If you save it again, nothing will happen (the operation is idempotent):

artifact2.save()
Artifact(uid='WXhmbUyehNc1eNCBPrCO', description='My fcs artifact', suffix='.fcs', size=6785467, hash='KCEXRahJ-Ui9Y6nksQ8z1A', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-05 10:52:45 UTC')

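The next cell shows how this interplays with data lineage: creating the artifact again in a new run returns the same record, now linked to the new run, with the earlier run available via previous_runs.
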
ln.track(new_run=True)
artifact3 = ln.Artifact(filepath, description="My fcs artifact")
assert artifact3.id == artifact2.id
assert artifact3.run != artifact2.run
assert artifact3.previous_runs.first() == artifact2.run
💡 notebook imports: lamindb==0.73.0 pytest==8.2.2
💡 loaded: Transform(uid='ANW20Fr4eZgM5zKv', version='1', name='What happens if I save the same artifacts & records twice?', key='idempotency', type='notebook', created_by_id=1, updated_at='2024-06-05 10:52:43 UTC')
💡 saved: Run(uid='FHtlUaXnwTzH7SlMhxBw', transform_id=1, created_by_id=1)
💡 tracked pip freeze > /home/runner/.cache/lamindb/run_env_pip_FHtlUaXnwTzH7SlMhxBw.txt
💡 returning existing artifact with same hash: Artifact(uid='WXhmbUyehNc1eNCBPrCO', description='My fcs artifact', suffix='.fcs', size=6785467, hash='KCEXRahJ-Ui9Y6nksQ8z1A', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-05 10:52:45 UTC')
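
The assertions in this cell suggest the following bookkeeping, shown here as a minimal sketch rather than LaminDB's actual logic. ToyArtifact and return_existing are hypothetical names, and runs are represented by plain integer ids.

```python
from dataclasses import dataclass, field


@dataclass
class ToyArtifact:
    """Hypothetical stand-in for an artifact record (not a LaminDB class)."""

    id: int
    run: int  # id of the run the artifact is currently linked to
    previous_runs: list = field(default_factory=list)


def return_existing(artifact: ToyArtifact, current_run: int) -> ToyArtifact:
    """Link an existing artifact to the current run while keeping its history."""
    if artifact.run == current_run:
        return artifact  # same run: nothing changes
    return ToyArtifact(
        id=artifact.id,  # still the same record
        run=current_run,  # linked to the new run
        previous_runs=artifact.previous_runs + [artifact.run],
    )


artifact2 = ToyArtifact(id=1, run=1)
artifact3 = return_existing(artifact2, current_run=2)
```

The record identity stays fixed while the run link moves forward, so re-running a notebook does not duplicate data but still leaves a trace of every run that produced the same content.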

Error upon trying to re-ingest an existing artifact

ln.settings.upon_artifact_create_if_hash_exists = "error"

In this case, you won’t be able to create an object from the same content:

with pytest.raises(FileExistsError):
    artifact3 = ln.Artifact(filepath, description="My new fcs artifact")

Warn and create a new artifact

Lastly, let us discuss the following setting:

ln.settings.upon_artifact_create_if_hash_exists = "warn_create_new"

In this case, you’ll create a new object:

artifact4 = ln.Artifact(filepath, description="My new fcs artifact")
artifact4.save()
❗ creating new Artifact object despite existing artifact with same hash: Artifact(uid='WXhmbUyehNc1eNCBPrCO', description='My fcs artifact', suffix='.fcs', size=6785467, hash='KCEXRahJ-Ui9Y6nksQ8z1A', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-05 10:52:45 UTC')
💡 path content will be copied to default storage upon `save()` with key `None` ('.lamindb/g9iwz0V9TMDvjaFvdl3Q.fcs')
✅ storing artifact 'g9iwz0V9TMDvjaFvdl3Q' at '/home/runner/work/lamindb/lamindb/docs/faq/test-idempotency/.lamindb/g9iwz0V9TMDvjaFvdl3Q.fcs'
Artifact(uid='g9iwz0V9TMDvjaFvdl3Q', description='My new fcs artifact', suffix='.fcs', size=6785467, hash='KCEXRahJ-Ui9Y6nksQ8z1A', hash_type='md5', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=2, updated_at='2024-06-05 10:52:46 UTC')

You can verify that it’s a new entry by comparing the ids:

assert artifact4.id != artifact.id
ln.Artifact.filter(hash="KCEXRahJ-Ui9Y6nksQ8z1A").df()
    uid                   version  description          key   suffix  accessor  size     hash                    hash_type  n_objects  n_observations  visibility  key_is_virtual  storage_id  transform_id  run_id  created_by_id  updated_at
id
1   WXhmbUyehNc1eNCBPrCO  None     My fcs artifact      None  .fcs    None      6785467  KCEXRahJ-Ui9Y6nksQ8z1A  md5        None       None            1           True            1           1             1       1              2024-06-05 10:52:45.906855+00:00
2   g9iwz0V9TMDvjaFvdl3Q  None     My new fcs artifact  None  .fcs    None      6785467  KCEXRahJ-Ui9Y6nksQ8z1A  md5        None       None            1           True            1           1             2       1              2024-06-05 10:52:46.978567+00:00
assert len(ln.Artifact.filter(hash="KCEXRahJ-Ui9Y6nksQ8z1A").list()) == 2
!lamin delete --force test-idempotency
!rm -r test-idempotency
Traceback (most recent call last):
  ...
lamindb_setup.core.upath.InstanceNotEmpty: Storage /home/runner/work/lamindb/lamindb/docs/faq/test-idempotency/.lamindb contains 2 objects ('_is_initialized' ignored) - delete them prior to deleting the instance
['/home/runner/work/lamindb/lamindb/docs/faq/test-idempotency/.lamindb/WXhmbUyehNc1eNCBPrCO.fcs', '/home/runner/work/lamindb/lamindb/docs/faq/test-idempotency/.lamindb/_is_initialized', '/home/runner/work/lamindb/lamindb/docs/faq/test-idempotency/.lamindb/g9iwz0V9TMDvjaFvdl3Q.fcs']