Comment by bahmboo
1 day ago
I'm just dipping my toe into Data Version Control (DVC). It is aimed at data science and large digital asset management, using configurable storage backends under a git metadata layer. The goal is separation of concerns: git handles versioning and the storage layers are just dumb storage.
Does anyone have feedback about personally using DVC vs LFS?
I'm in the same boat - I decided this week for DVC over LFS.
For me, the deciding factor was that with LFS, if you want to delete objects from storage, you have to rewrite git history. At least, that's what both the GitHub and GitLab docs specify.
DVC adds a layer of indirection, so that its structure is not directly tied to git. If I change my mind and delete the objects from S3, dvc might stop working, but git will be fine.
Some extra pluses about DVC:
- It can point to versioned S3 objects that you might already have as part of existing data pipelines.
- It integrates with the Python fsspec library to read the files on demand using paths like "dvc://path/to/file.parquet". This feels nicer than needing to download all the files up front.
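To make the "layer of indirection" point above concrete, here is a toy sketch in Python. It is not DVC's actual file format or API: the `.ptr` suffix, the `add_to_store` and `resolve` helpers, and the use of SHA-256 are all assumptions made for illustration. The idea is only that the small pointer file (the thing git would track) stays intact even when the blob it names is deleted from storage.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def add_to_store(data_file: Path, store: Path) -> Path:
    """Copy data_file into a content-addressed store and write a small pointer file.

    Hypothetical helper: the pointer is what would be committed to git,
    while the blob lives in separate "dumb" storage.
    """
    payload = data_file.read_bytes()
    digest = hashlib.sha256(payload).hexdigest()
    store.mkdir(parents=True, exist_ok=True)
    (store / digest).write_bytes(payload)
    pointer = data_file.with_name(data_file.name + ".ptr")
    pointer.write_text(json.dumps({"sha256": digest, "path": data_file.name}))
    return pointer


def resolve(pointer: Path, store: Path) -> Optional[bytes]:
    """Return the blob a pointer refers to, or None if it is gone from storage."""
    digest = json.loads(pointer.read_text())["sha256"]
    blob = store / digest
    return blob.read_bytes() if blob.exists() else None


def demo() -> Tuple[bool, bool, bool]:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "big.bin"
        data.write_bytes(b"x" * 1024)
        store = root / "store"

        pointer = add_to_store(data, store)
        found_before = resolve(pointer, store) == data.read_bytes()

        # Simulate deleting the objects from remote storage.
        for blob in store.iterdir():
            blob.unlink()

        missing_after = resolve(pointer, store) is None
        # The pointer (the git-tracked part) is untouched by the deletion.
        return found_before, missing_after, pointer.exists()


if __name__ == "__main__":
    print(demo())
```

After the blobs are removed, looking up the data fails, but nothing that git tracks had to change, which is the property described above.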
When I tried DVC ~5 years ago it was very slow as it constantly hashed files for some reason.
I switched to https://github.com/kevin-hanselman/dud and have been happy ever since.
Dud author here. Happy to hear it's working well for you!
I did a simple test tracking a few hundred gigs of random data from /dev/urandom. LFS choked on upload speed while DVC worked fine. My team is using DVC now.
It sounds like git-annex might be a good option for you. There is also https://www.datalad.org/ built on top of git-annex for large data management.
Writing type-free Python in 2025 is malpractice
To be fair, they use types for the complex parts. [0]
[0]: https://github.com/datalad/datalad/blob/maint/datalad/suppor...
We built `oxen` to solve the issues we (and many others) had with DVC and LFS. The highlights: open source CLI and server, mirrors git for easy learning, handles large files and millions of files, performant af. Would love feedback if you check it out.
https://github.com/Oxen-AI/Oxen
Or check out the performance numbers: https://docs.oxen.ai/features/performance
My main complaint about DVC is that it's hard to manage the files: if you keep modifying a big file, you end up with every revision stored in S3 (or whichever storage you choose). This is by design, but I wish it were easier to set up a policy like "store only the latest 3 revisions".
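For what it's worth, the "latest 3 revisions" policy itself is simple to state. Below is a rough Python sketch of the selection logic only. It is not a DVC feature or API: `split_revisions` is a hypothetical helper, and it assumes a single linear history for one tracked file, given as a list of content hashes from oldest to newest. The one subtlety is that a hash can reappear (for example after reverting the file), so anything still referenced by a kept revision must not be deleted.

```python
from typing import List, Set, Tuple


def split_revisions(hashes_oldest_first: List[str], keep: int = 3) -> Tuple[Set[str], Set[str]]:
    """Split stored object hashes into (kept, deletable) for a keep-latest-N policy.

    Hypothetical helper: only looks at one linear history and ignores
    branches, tags, or other files that might share the same objects.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    # Guard against keep == 0, because list[-0:] would return the whole list.
    kept = set(hashes_oldest_first[-keep:]) if keep else set()
    # An older hash is deletable only if no kept revision still uses it.
    deletable = {h for h in hashes_oldest_first if h not in kept}
    return kept, deletable


if __name__ == "__main__":
    history = ["a", "b", "c", "a", "d"]  # "a" reappears after a revert
    print(split_revisions(history, keep=3))
```

With that history and `keep=3`, only "b" ends up deletable, since "a" is still used by one of the three most recent revisions.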