Comment by nullnix
10 years ago
Yeah. I suspect the answer is 'store all binary data in BAM', which then uses some different encoding for the binary stuff -- but that then makes my gittish soul wonder why not just use that encoding for everything. (It works for git packfiles... though 'git gc' on large repos is a total memory and CPU hog, one presumes that whatever delta encoding BAM uses is not.)
We support the uuencode horror for compat (and for smaller binaries that don't change), but the answer for binaries is BAM; there is no data in the weave for BAM files.
I don't agree that the weave is horrible; it's fantastic for text. Try git blame on a file in a repo with a lot of history, then try the same thing in BK. Orders and orders of magnitude faster.
And go understand smerge.c and the weave lightbulb will come on.
Yeah, that's the problem; it's optimizing for the wrong thing. It speeds up blame at the expense of absolutely every other operation you ever need to carry out; the only thing which avoids reading (or, for checkins, writing) the whole file is a simple log. Blame is a relatively rare operation: its needs should not dominate the representation.
The fact that the largest file you mention is frankly tiny shows why your performance was good. We had ~50,000-line text files (yeah, I know, damn copy-and-paste coders) with a thousand-odd revisions and a resulting SCCS file size exceeding three million lines, and every one of those lines had to be read on every checkout. That's dozens to hundreds of megabytes, and of course the cache would hardly ever be hot where that much data was concerned, so it all had to come off the disk and/or across NFS, taking tens of seconds or more in many cases. RCS could have avoided reading all but 50,000 of them in the common case of checkouts of the most recent changes. (git would have reduced read volume even more: although it is deltified, its delta chains are of bounded length, unlike the weave, and all the data is compressed.)
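To make the read-volume argument concrete, here is a rough sketch of an SCCS-style weave in Python. It is a simplified in-memory model, not BitKeeper's or SCCS's actual on-disk format: the tuple layout and the names `walk`, `checkout`, and `blame` are assumptions made for illustration, and it only handles linear history. The point it shows is that a single pass over every entry yields any version (and its blame) at once, but that pass cannot skip anything.

```python
# Simplified weave: ("I", rev) opens lines inserted by rev, ("D", rev) opens
# lines deleted by rev, ("E", rev) closes the latest open block for rev,
# and ("T", text) is a line of file content. (Hypothetical representation.)

def walk(weave, wanted):
    """Yield (inserting_rev, text) for each line visible in `wanted`."""
    open_blocks = []
    for kind, value in weave:  # every entry is visited, whatever is requested
        if kind in ("I", "D"):
            open_blocks.append((kind, value))
        elif kind == "E":
            # Close the most recently opened block belonging to this revision.
            for i in range(len(open_blocks) - 1, -1, -1):
                if open_blocks[i][1] == value:
                    del open_blocks[i]
                    break
        else:
            # Visible if every enclosing insert is wanted and no enclosing
            # delete is wanted.
            visible = all((rev in wanted) == (k == "I") for k, rev in open_blocks)
            if visible:
                # Assumes text always sits inside at least one insert block.
                inserted_by = [rev for k, rev in open_blocks if k == "I"][-1]
                yield inserted_by, value


def checkout(weave, wanted):
    return [text for _, text in walk(weave, wanted)]


def blame(weave, wanted):
    return list(walk(weave, wanted))


# Rev 1 adds a, b, c; rev 2 replaces b with B.
weave = [
    ("I", 1), ("T", "a"),
    ("D", 2), ("T", "b"), ("E", 2),
    ("I", 2), ("T", "B"), ("E", 2),
    ("T", "c"), ("E", 1),
]
```

With this model, `checkout(weave, {1})` and `checkout(weave, {1, 2})` both iterate over all ten entries, which is the cost being complained about; on the other hand `blame` falls out of the same pass for free, which is the benefit being defended.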
Give me a file that was slow and let's see how it is in BitKeeper. I bet you'll be impressed.
50K lines is not even 3x bigger than the file I mentioned, which we check out in 20 milliseconds.
As for optimizing blame, you are missing the point: it's not blame, it's merge. It's copy by reference rather than copy by value.