
Comment by bahmboo

1 day ago

I'm just dipping my toe into Data Version Control (DVC). It is aimed at data science and large digital asset management, using configurable storage backends under a git metadata layer. The goal is separation of concerns: git handles versioning, and the storage layers are dumb storage.

Does anyone have feedback about personally using DVC vs LFS?

I'm in the same boat - I decided this week to go with DVC over LFS.

For me, the deciding factor was that with LFS, if you want to delete objects from storage, you have to rewrite git history. At least, that's what both the GitHub and GitLab docs say.

DVC adds a layer of indirection, so its structure is not directly tied to git. If I change my mind and delete the objects from S3, DVC might stop working, but git will be fine.

Some extra pluses about DVC:

- It can point to versioned S3 objects that you might already have as part of existing data pipelines.
- It integrates with the Python fsspec library to read files on demand using paths like "dvc://path/to/file.parquet". This feels nicer than needing to download all the files up front (see the sketch below).
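
For anyone curious what the on-demand reading looks like, here is a minimal sketch using the fsspec-compatible DVCFileSystem that ships with the dvc package; the repo URL, revision, and file path are hypothetical placeholders, not something from the parent comment.

```python
import pandas as pd
from dvc.api import DVCFileSystem  # fsspec-compatible filesystem provided by the dvc package

# Point at a DVC repo (a local path or a git URL) and an optional revision.
# The URL and path below are placeholders.
fs = DVCFileSystem("https://github.com/example/repo", rev="main")

# Stream the tracked file on demand instead of running `dvc pull` first.
with fs.open("data/file.parquet", "rb") as f:
    df = pd.read_parquet(f)

print(df.head())
```

Anything else that speaks fsspec should be able to consume the "dvc://" style paths mentioned above as well, though the exact URL form depends on how the repo and remote are configured.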

I did a simple test tracking a few hundred gigs of random data from /dev/urandom. LFS choked on upload speed while DVC worked fine. My team is using DVC now.

My main complaint about DVC is that it's hard to manage the files: if you keep modifying a big file, you end up with every revision stored in S3 (or whichever storage you choose). This is by design, but I wish it were easier to configure something like "store only the latest 3 revisions".