Perkeep – Open-source data modeling, storing, search, sharing and synchronizing

8 years ago (perkeep.org)

Since people are confused about what this is, I'll write a summary (from old memory, so it's probably 80% correct).

It is a consumer-oriented storage system that is:

- Content addressable

- Indexed

- Tag-oriented (vs. hierarchical)

- Permissions, encryption, compression, sharing, etc.

- Spans storage across machines and clouds

- FUSE mountable

- Has CLI and Web interfaces built-in

The intent is to be a personal data dumpster that you can throw all of your files and other data (tweets, etc.) into for search and backup.
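To make "content addressable" and "tag-oriented" concrete, here is a minimal sketch in Python. It is not Perkeep's API: the `BlobStore` class, its method names, and the `sha256-` ref prefix are assumptions chosen for illustration only.

```python
import hashlib


class BlobStore:
    """Toy content-addressed store with a tag index (hypothetical, not Perkeep's API)."""

    def __init__(self):
        self.blobs = {}  # ref -> bytes
        self.tags = {}   # tag -> set of refs

    def put(self, data: bytes, tags=()) -> str:
        # The address is derived from the content itself (prefix is an assumption).
        ref = "sha256-" + hashlib.sha256(data).hexdigest()
        self.blobs[ref] = data  # storing identical bytes again changes nothing
        for tag in tags:
            self.tags.setdefault(tag, set()).add(ref)
        return ref

    def get(self, ref: str) -> bytes:
        return self.blobs[ref]

    def find(self, tag: str) -> set:
        # Lookup goes through the tag index, not a directory hierarchy.
        return set(self.tags.get(tag, ()))


store = BlobStore()
photo = store.put(b"fake jpeg bytes", tags=["photos", "2013"])
tweet = store.put(b"hello world", tags=["tweets", "2013"])
again = store.put(b"hello world")  # same content, so same ref as `tweet`
```

Because the ref is a hash of the bytes, uploading the same item twice deduplicates for free, and an item can carry any number of tags without living in any one folder.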

The website could be better organized to convey this information quickly.

  • Camlistore (renamed to Perkeep) author here.

    It is true that the website needs some love & updated docs. We've been working on Camlistore for 8 years now (with a few drier spells) but our focus has never been marketing. If anything, we didn't want too many non-nerd users for a number of years because it wasn't ready for non-developer usage. That's starting to change.

    We have pretty good docs for configuration and such, but we lack some concise high-level text about what the project is and why.

    I'll prioritize that.

    • For everyone else reading this, here's more context. I once tried creating durable physical storage that spanned multiple external hard-disks with a single logical schema, but then discovered Camlistore and git-annex and decided to let more competent people build it.

      The idea is that we should be able to own and manage our personal data - which runs into terabytes across one lifetime - without having to trust and/or pay the big cloud companies. So Camlistore from its earliest days had an integrated photo gallery, since multimedia is where most of the bytes are consumed.

      The whole thing once went under the label of the IndieWeb movement (which we should revive), and Wired wrote about it here - https://www.wired.com/2013/08/indie-web/

      Brad Fitzpatrick is also the creator of LiveJournal where he wrote the original version of Memcached in Perl. He also wrote OpenID, and then went on to work with Rob Pike and team on the Go Programming language. Camlistore was one of the earliest projects written in Go (before Hashicorp made it cool) and I imagine that had something to do with him getting into the language itself, but that's for Brad to clarify :)

The thing is that nothing is good enough to keep data for a lifetime. Hardware might break, a supply might be discontinued, and a software maintainer might disappear. You'll need to keep migrating the data from one device to another for the rest of your life. That said, I'm curious how easily this system can handle porting from one device or service to another, in varying formats and architectures. The only way to stay relevant is to keep adapting to new things.

  • A huge focus of the project is on human-readable schemas and formats. Even if all specs & source code of the project are lost, the data should still be recoverable by a curious archaeologist.

    Between replicating across several companies as well as your own hardware & having friends & family mirror your stuff (encrypted or not), the idea is that some copies will continue to exist.

    Hardware failures are a given. Companies failing and friends & family dying is also a given. Natural disasters too. The only option seems to be trusting nothing and replicating all your data to lots of places, in future-friendly formats, and that's what Perkeep aims to do. And then a ton of tooling on top of that.

    • Interesting. I thought plaintext + .tar.gz or .zip format on either FAT or ext2 fs is the best bet for forward compatibility, and anything beyond that is too complex or obscure for future archaeologists. The obvious problem is the searchability, but I'd imagine in future that indexing a few TB of text/image will be a breeze.

Looks like there's been some nice progress since I last looked at Camlistore! The importers from cloud services like Twitter look really interesting.

Camlistore & Brad Fitzpatrick's original writings are what initially got me into decentralized web advocacy. Since then, I've moved on from this project, since it seems to move at a very slow pace and the authors do not seem very interested in widespread user adoption.

With this name change, I'm slightly more interested again. We'll have to see in the coming months whether they become ready to displace actual large social media platforms or whether it remains a toy project.

How does this work?

I've been watching Camlistore for a few years. I peek in on it every once in a while, long enough between that I usually can't remember the name. I like the look of it, but haven't been convinced to go from my decade old ZFS setup to Camlistore.

I feel like OwnCloud is more compelling, from a glance. Anyone use one or both and able to comment?

  • Camlistore author here.

    If you only store files, sure, use ZFS.

    Perkeep (Camlistore) doesn't write to a block device. It has storage backends for a filesystem (which can be ZFS) and any number of cloud object storage providers (S3, GCS, etc).

    Perkeep's main value over a fancy POSIX filesystem is storing nameless things (tweets, other social media content + interactions, bookmarks) in common schemas, and permitting search over it all, and then having a variety of ways to browse it (CLI, FUSE, API, web UI, etc).

    It's also good at sync to & from things any which way without merge conflicts.
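One way to see why sync "any which way" can avoid merge conflicts: if every item is immutable and keyed by a hash of its bytes, syncing two stores reduces to copying whatever the other side is missing. The sketch below illustrates that idea only. The dict-based stores and the `blobref` and `sync` helpers are hypothetical, not Perkeep's implementation.

```python
import hashlib


def blobref(data: bytes) -> str:
    # Hypothetical ref format: hash name plus hex digest of the content.
    return "sha256-" + hashlib.sha256(data).hexdigest()


def sync(src: dict, dst: dict) -> int:
    """Copy every blob that dst lacks; return how many were copied."""
    missing = src.keys() - dst.keys()
    for ref in missing:
        dst[ref] = src[ref]
    return len(missing)


laptop = {blobref(b): b for b in (b"photo-1", b"tweet-1")}
cloud = {blobref(b): b for b in (b"tweet-1", b"bookmark-1")}

copied_up = sync(laptop, cloud)    # laptop -> cloud
copied_down = sync(cloud, laptop)  # cloud -> laptop
```

Two stores can never hold different bytes under the same ref, so there is nothing to merge: after syncing in both directions, each side holds the union of the two.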

How is this any better than just burning your data to a Blu-ray, which lasts centuries when stored under proper conditions (theoretically, anyway)? I need to give this a closer look.

  • Not having to worry if there will be any Blu-Ray readers available in a century.

    • >> Not having to worry if there will be any Blu-Ray readers available in a century.

      Century? Startup sites like the one above last on average 6 months, that is, until they find out that their $6/mo DigitalOcean droplet suddenly costs... $10/mo! Or $100/mo or whatever and then they find out they cannot fund their $100/mo droplet and call it quits.

      So... if you need the data to be around for 100 years, maybe not give it to the random startup.

  • M-DISC is even better. Burnable discs use an organic dye which oxidizes over time. M-DISC uses a "glassy carbon" layer that is inert to oxidation.

    They adhere to DVD-R, BD-R, and BD-XL standards so it's readable in standard disc drives. You need a special drive to burn them, however (requires a high-power laser).

    • > Burnable discs use an organic dye which oxidizes over time.

      This is only true of DVDs and a rare variant of Blu-Ray called LTH. Even cheap shitty Blu-Rays from Chinese manufacturers use inorganic dyes these days.

      Also, the French Archives did a test of a variety of DVDs for longevity in adverse conditions and found that M-DISC didn't last significantly longer than competitors, even those with inorganic dyes: https://documents.lne.fr/publications/guides-documents-techn...

      The US DoD also did a similar test under different conditions and found it performed much better than the competition though: http://www.esystor.com/images/China_Lake_Full_Report.pdf

      I suspect the difference between the French and US tests might be the French using a longer test duration and the Americans using light. The French went up to 1000h while the Americans only went to 24 as far as I can tell.

      And unlike DVDs, I haven't seen any studies of longevity for M-DISC Blu-Rays.

  • It's different (better?) in that it doesn't rely on you remembering to actually burn that data, then store it safely. It comes with an app you can run on your phone to upload all your photos immediately, for instance. It has importers to archive all your tweets automatically. It allows you to outsource the task of "Keep this blu-ray safe" to a cloud provider (or a friend) while encrypting your data to keep it private.

I've been keeping an eye on this project for years, because it seems well-designed, and the authors are very capable developers.

The biggest problem I found was getting documentation on replication. Having two or more servers mirror each other across the internet seems like a good idea, given that otherwise you have a single point of failure as you import all your media/files.

The perfect tool for a digital hoarder like myself. Will follow this with attention.

So, it's just a document server that can be run over multiple computers? I was expecting something peer to peer. If I understand correctly, you can think of this as a Dropbox that you can self-host?

What is the target audience of this? What are the intended use cases?

Is this supposed to be used directly by users or as an API for a user-facing application? How is this different from a document DB like MongoDB?

  • Long time follower of the project here... So far it's been aimed at geeks who want to archive their content from the cloud, eg tweets, but it also stores files. Because of the way it is designed I've always thought there is a compelling use case for its use as a file and object store for organizations where auditing of data records is expected and sharing of data is a requirement.

So is this ready for prime time yet? I used to follow camlistore, and it was still a little rough even for CLI nerds.

  • So I just downloaded it and played around and as far as I can tell there is no way to delete files. Or, more specifically there is a way but it's not implemented or otherwise accessible as far as I can figure from the rather sparse documentation.

    If someone would like to explain to me how (if?) the garbage collection works I'd appreciate it, because I like the concept and kinda want to use this, but deleting stuff is a rather important feature for me. All I could find searching was a post by the devs saying it was already mostly implemented but not finished and not a priority...

    https://github.com/camlistore/camlistore/issues/792

    Like, I understand that this is a spare time project (I think) but not considering deleting/pruning files to be an important feature is really confusing to me. In its current state, if I accidentally upload the wrong file, am I now stuck with it forever?

    Edit: ok I figured out how to at least delete things in the UI (clicking the check mark opens a side menu apparently, `camput delete` doesn't seem to do anything), but as far as I can tell it doesn't actually delete them from the database without running a garbage collect, which isn't implemented so it just hangs around in purgatory.
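For readers wondering what "running a garbage collect" would involve in a store like this, here is a generic mark-and-sweep sketch over content-addressed blobs. It is not Perkeep's design or code: the JSON `children` field, the `put_node` helper, and the notion of an explicit root set are all assumptions made for illustration.

```python
import hashlib
import json


def put(store: dict, data: bytes) -> str:
    ref = "sha256-" + hashlib.sha256(data).hexdigest()
    store[ref] = data
    return ref


def put_node(store: dict, children: list) -> str:
    """Store a JSON blob that points at other blobs by ref (hypothetical format)."""
    return put(store, json.dumps({"children": children}).encode())


def collect_garbage(store: dict, roots: set) -> set:
    """Mark everything reachable from roots, then sweep (delete) the rest."""
    marked = set()
    stack = list(roots)
    while stack:
        ref = stack.pop()
        if ref in marked or ref not in store:
            continue
        marked.add(ref)
        try:
            node = json.loads(store[ref])
        except ValueError:
            continue  # opaque data blob with no outgoing refs
        if isinstance(node, dict):
            stack.extend(node.get("children", []))
    swept = set(store) - marked
    for ref in swept:
        del store[ref]
    return swept


store = {}
keep = put(store, b"holiday photo")
oops = put(store, b"wrong file")      # uploaded by accident, referenced by nothing
root = put_node(store, [keep])
swept = collect_garbage(store, {root})
```

Under this model, "deleting" in a UI only drops a reference; the bytes stay on disk until a sweep like this removes everything no root can reach, which matches the purgatory described above.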

Is this possibly a Dropbox replacement? Do I have to host the files on my own server?

Alternatively: "Hard-drives let you permanently keep your stuff, for life"

  • Check out M-DISC https://en.wikipedia.org/wiki/M-DISC

    • It's not clear how much better they are than regular media since there haven't been many tests. There are two that I'm aware of, one by the French Archives (who've done this a few times, it so happens) and one by the US DoD.

      The French found that M-DISC didn't perform much better than regular DVDs and that a weird kind of glass DVD beat everything else hands down.

      The Americans found no errors at all in their tests of M-DISC while all other disks encountered them.

      I suspect the important differences were:

      - The Americans tested the discs after light exposure, the French did not. It may be that the light caused the regular DVDs to fail but not the M-DISC.

      - The French tests were far longer (1000h) than the Americans' (24h). It may be that M-DISC can't survive the adverse conditions past a certain point that the Americans didn't reach.

      Also as far as I'm aware, there are no tests of the Blu-Ray variant of M-DISC.

      Personally, given the cost of M-DISC, I'd buy a few cheap terrible Blu-Rays instead and just make sure they're not exposed to too much light.

      French test: https://documents.lne.fr/publications/guides-documents-techn...

      American: http://www.esystor.com/images/China_Lake_Full_Report.pdf

Question if anybody gets to this: I'm taking a break from work and computers for a year. How would you guys suggest I store my kdbx data securely, in a fail-safe manner, without worrying about forgetting passwords or losing paper chits or USB keys?

Edit: after seeing some good suggestions about physical storage, I've decided to increase the difficulty of the question, hard mode- How would you do this without physical stuff? (more, new answers about physical welcome too)

  • For something on the timescale of a year I would just keep the system that you already have up and running. If it were much longer than that I'd go with a bank vault that contains the access keys and something like tarsnap and yet another backup with another cloud provider.

    • I'm assuming all my electronics fry, papers burn and memory goes away. (to be safe)

      Bank vault might be a good idea (assuming they id me fine)

  • > Edit: after seeing some good suggestions about physical storage, I've decided to increase the difficulty of the question, hard mode- How would you do this without physical stuff? (more, new answers about physical welcome too)

    Store one copy in a gmail account, and another on imgur.

    > assuming [...] memory goes away. (to be safe)

    And tattoo the site+username+pass on your thigh.

  • I wonder if a system like this would be good for your general problem:

    Generate a random seed sentence of so many words. From the secret seed + site domain name, generate a password.

    Store a piece of paper with:

    - Algorithm (could be public on GitHub too)
    - Seed word
    - Site names
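The seed-plus-site-name scheme above could be sketched roughly as follows. The comment does not name an algorithm, so everything specific here is an assumption: PBKDF2-HMAC-SHA256 with the domain as salt, 200,000 iterations, and a 20-character URL-safe base64 output. `site_password` is a hypothetical helper, not an existing tool.

```python
import base64
import hashlib


def site_password(seed: str, domain: str, length: int = 20) -> str:
    """Derive a per-site password from a secret seed and a domain name."""
    raw = hashlib.pbkdf2_hmac(
        "sha256",
        seed.encode("utf-8"),            # the secret seed sentence
        domain.lower().encode("utf-8"),  # the site name acts as the salt
        200_000,                         # iteration count (assumed, tune to taste)
    )
    # URL-safe base64 gives letters, digits, '-' and '_'.
    return base64.urlsafe_b64encode(raw).decode("ascii")[:length]


seed = "correct horse battery staple"  # example only, never reuse a published seed
pw_mail = site_password(seed, "mail.example.com")
pw_bank = site_password(seed, "bank.example.com")
```

The same seed and domain always give the same password, so only the seed and the list of site names need to survive. The trade-off is that anyone who finds the seed and knows the algorithm can regenerate every password, and sites with unusual password rules need a manual tweak that also has to be written down.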

  • For a year? a burned CD in a safe deposit box. Also a USB key there for convenience. Basically paying for physical security of the devices/data.