I have been thinking for a while that a completely new (experimental) “repository” format would be in order for darcs 2.4. I have previously outlined a way I’d like to go about building up new things within the darcs 2.x series. Now a darcs repository has two basic “components”: the “file” part of the layout, which is truly a repository format, and a “patch format”, which determines not only how patches are written out to disk but, more importantly, their exact semantics. Once you pick a “patch format”, it is set in stone, and repositories with different patch formats cannot exchange patches (at least not without an in-between conversion). This is the case between darcs-1 and darcs-2 format repositories, as they use different patch formats. The difference between darcs-1 and “hashed” repositories, as darcs calls them, is only on the file level, though: the patch formats are identical, and that’s why hashed and plain darcs-1 repositories can exchange patches just fine. (I will from now on refer to repository format and patch format as two orthogonal things, as they mostly are.)
Now I have been working on a packed repository format, one that would allow storing the repository (regardless of the patch format used) in a compact form suitable for HTTP-based retrieval. In this post, I’d like to address the other thing: the patch format. It seems worthwhile to improve our current system, since it has a number of weak spots. Currently, we have a number of “primitive” patch types, and some more complicated ones: conflictors in darcs-2 or mergers in darcs-1. I am not going to talk about the latter; we’ll focus on the primitive patches for now.
The primitives in darcs are addfile, rmfile, adddir, rmdir, move, replace and, most importantly, hunk. (You should be able to look up what these roughly represent somewhere else.)
Let’s address addfile and friends: the patches that create and remove files or directories. Obviously, addfile foo.txt and a different addfile foo.txt are going to conflict. Also, all hunks for foo.txt obviously depend on addfile foo.txt, which means that if you pull together two branches containing nontrivial files of the same name, you are going to end up with a massive conflict (at least in terms of darcs data structures) for virtually no reason.
So my proposal is to divorce the filename from file identity: this is something that has been pondered before, I believe. The result would look something like:
hunk fileid 1
- a
+ b
This means that hunks would exist without any dependence on addfile: the abstract file would pop into existence with the first hunk touching its identity. Of course, this alone would be no good, since you have just lost the relation between the working copy and whatever darcs tracks. To put that relation back on track, we add two patch types:
manifest fileid ./file/path
demanifest fileid ./file/path
A manifest patch will tell darcs to associate the fileid with your working file at ./file/path. The inverse operation is demanifest, which would remove the association, along with your working copy file. The abstract identity continues to exist just fine, and can be manifested again (under the same or a different filepath). Basically, this completely de-couples the “hunk-space” from the “filepath-space”: manifest/demanifest/move(/adddir/rmdir) patches commute completely freely with hunk/replace patches. To make the de-coupling complete, you want a “manifest” of a non-existent fileid to pop that fileid into existence as well. No problem.
Basically, this means that as far as darcs is concerned, file content manipulation is orthogonal to the directory tree manipulation: and this is good and well, since it allows us to solve conflicts on both of those levels separately, without dragging in a lot of stuff from the other level. Moreover, the add-add conflict no longer exists.
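To make the de-coupling concrete, here is a minimal Python sketch of the idea. It is only a toy model, not darcs code: the Repo class, the edit/manifest/demanifest functions and the byte-offset edit operation are all names I made up for illustration. Content lives in a table keyed by fileid, paths live in a separate table, and the two kinds of operations never touch each other's table.

```python
from dataclasses import dataclass, field


@dataclass
class Repo:
    # "hunk-space": fileid -> content (hypothetical model, not darcs internals)
    contents: dict = field(default_factory=dict)
    # "filepath-space": path -> fileid
    paths: dict = field(default_factory=dict)


def edit(repo, fileid, offset, old, new):
    """Replace `old` with `new` at `offset`; the first touch creates the fileid."""
    data = repo.contents.get(fileid, b"")
    if data[offset:offset + len(old)] != old:
        raise ValueError("content does not match")
    repo.contents[fileid] = data[:offset] + new + data[offset + len(old):]


def manifest(repo, fileid, path):
    """Associate a fileid with a working-copy path (creating the fileid if needed)."""
    if path in repo.paths:
        raise ValueError("path already manifested")
    repo.contents.setdefault(fileid, b"")
    repo.paths[path] = fileid


def demanifest(repo, fileid, path):
    """Drop the path association; the abstract file content stays around."""
    if repo.paths.get(path) != fileid:
        raise ValueError("path is not manifested for this fileid")
    del repo.paths[path]


def working_copy(repo):
    """What the user would see on disk: path -> content."""
    return {path: repo.contents[fid] for path, fid in repo.paths.items()}


# Content edits and path operations give the same result in either order.
a = Repo()
edit(a, "f1", 0, b"", b"b\n")
manifest(a, "f1", "./foo.txt")

b = Repo()
manifest(b, "f1", "./foo.txt")
edit(b, "f1", 0, b"", b"b\n")

assert working_copy(a) == working_copy(b)

# Two branches adding a file under the same name get two distinct fileids,
# so their content edits never collide; only the path is contested.
demanifest(a, "f1", "./foo.txt")
manifest(a, "f2", "./foo.txt")
assert a.contents["f1"] == b"b\n"
```

In this model the only possible clash between two "added" files of the same name is over the path entry itself, which is a much smaller thing to resolve than a pile of dependent hunks.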
As for the hunk format itself, there are also a number of issues: it uses a GNU-patch-like format with a ‘+’ or ‘-’ sign in front of each line. A hunk will usually look like a block of ‘-’ lines followed by a block of ‘+’ lines (either of these may be empty). Parsing this format is not quite simple: you have to look up all the newlines, chop off the ‘+’ and ‘-’ signs, etc. Lots of work for darcs.
hunk ./foo.txt 1
- a
+ b
Now a friendly-to-parse alternative would be to use a chunk kind of patch:
chunk fileid - 0 2
a\n
chunk fileid + 0 2
b\n
The 0 and the 2 are an offset and a length (in bytes), respectively. What follows the patch header is a monolithic block of text to be removed from (or pasted at) the given offset; the block is of the given length. This would produce more primitives than the original hunks, but they would be vastly simpler to process in bulk. Each basically represents a string “splice” operation. No newline shuffling whatsoever. The commutation rules should be as simple as they were with hunks: you just wibble the offset (after checking for overlap).
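The splice and the offset wibbling can be sketched in a few lines of Python. This is an illustration under my own assumptions, not a specification: Chunk, apply_chunk and commute are hypothetical names, and the overlap check is deliberately conservative (chunks that merely touch are treated as non-commuting).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Chunk:
    fileid: str
    sign: str      # "+" pastes data at offset, "-" removes it
    offset: int    # byte offset into the file
    data: bytes

    @property
    def delta(self):
        """How much this chunk changes the file length."""
        return len(self.data) if self.sign == "+" else -len(self.data)


def apply_chunk(chunk, content):
    """Apply one splice to the content of the file it refers to."""
    end = chunk.offset + len(chunk.data)
    if chunk.sign == "+":
        if chunk.offset > len(content):
            raise ValueError("offset past end of file")
        return content[:chunk.offset] + chunk.data + content[chunk.offset:]
    if content[chunk.offset:end] != chunk.data:
        raise ValueError("removed data does not match")
    return content[:chunk.offset] + content[end:]


def commute(p, q):
    """Given the sequence p;q, return (q', p') with the same effect, or None."""
    if p.fileid != q.fileid:
        return q, p
    # Region p occupies after it has been applied (empty for a removal).
    p_end = p.offset + (len(p.data) if p.sign == "+" else 0)
    # Region q occupies before it is applied (empty for an insertion).
    q_end = q.offset + (len(q.data) if q.sign == "-" else 0)
    if q_end < p.offset:      # q lies strictly before p: shift p
        return q, replace(p, offset=p.offset + q.delta)
    if q.offset > p_end:      # q lies strictly after p: shift q
        return replace(q, offset=q.offset - p.delta), p
    return None               # overlapping or touching: no commutation


# The example patch from above: remove "a\n", then paste "b\n" in its place.
rm = Chunk("fileid", "-", 0, b"a\n")
add = Chunk("fileid", "+", 0, b"b\n")
assert apply_chunk(add, apply_chunk(rm, b"a\n")) == b"b\n"
assert commute(rm, add) is None   # same spot, so the second depends on the first

# Two far-apart chunks commute; only the offset of one of them changes.
p = Chunk("fileid", "-", 0, b"ab")
q = Chunk("fileid", "+", 3, b"X")
q2, p2 = commute(p, q)
assert q2.offset == 5 and p2 == p
```

Note that commute never looks inside the data, only at its length; that observation is what makes the next step possible.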
But now that the data in those chunks is basically a blob of (anything), there is an extra thing that can be done. Instead of keeping this data inline in the patch, we could refer to it by a hash and store it elsewhere:
chunk fileid - 0 2 hash
chunk fileid + 0 2 hash
Of course, for the example patch, this is going to inflate the patch size quite a bit. But let’s not care for a while. We now have a chunk format that is O(1) in size (well, O(log n) for purists, given the length needs to be represented, but we don’t care about that either). We can still commute and invert it just fine: inversion is flip-flopping the ‘-’/‘+’ sign, and commutation just wibbles the offset. Awesome. (Besides that, it also effectively obliterates any need for a “binary” kind of patch: chunks will do just fine for that, and we could even use a binary diff algorithm if we wanted…)
We will need to dereference that hash sometimes, of course: when we want to actually apply the patch, and when we want to show it to the user. The latter is a non-issue: since our processing power vastly exceeds that of the user, we can play with the patch as much as we like for presentation purposes. So when we want to apply that patch, we need to fetch its content… but wait, we already have a mechanism for storing buckets of bits by their hashes: the hashed pristine! So yes, we can just dump the data bits of the chunks into the hashed pristine (plus some wibbling of garbage collection, see my previous post about that).
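A rough Python sketch of the hash-referenced variant follows. Everything here is an assumption made for illustration: BlobStore is just an in-memory dictionary standing in for the hashed pristine, and HashedChunk, invert and apply_chunk are hypothetical names. The point it tries to show is that inversion never touches the store, and only application has to dereference the hash.

```python
import hashlib
from dataclasses import dataclass, replace


class BlobStore:
    """In-memory stand-in for a content-addressed store such as the hashed pristine."""

    def __init__(self):
        self._blobs = {}

    def put(self, data):
        digest = hashlib.sha256(data).hexdigest()
        self._blobs[digest] = data
        return digest

    def get(self, digest):
        return self._blobs[digest]


@dataclass(frozen=True)
class HashedChunk:
    fileid: str
    sign: str      # "+" or "-"
    offset: int
    length: int    # length of the referenced data, in bytes
    digest: str    # hex SHA-256 of the data, which lives in the store


def invert(chunk):
    """Inversion flips the sign; the data is never looked at."""
    return replace(chunk, sign="-" if chunk.sign == "+" else "+")


def apply_chunk(chunk, content, store):
    """Applying is the only step that needs to dereference the hash."""
    data = store.get(chunk.digest)
    if len(data) != chunk.length:
        raise ValueError("stored blob has unexpected length")
    end = chunk.offset + chunk.length
    if chunk.sign == "+":
        if chunk.offset > len(content):
            raise ValueError("offset past end of file")
        return content[:chunk.offset] + data + content[chunk.offset:]
    if content[chunk.offset:end] != data:
        raise ValueError("removed data does not match")
    return content[:chunk.offset] + content[end:]


store = BlobStore()
add_b = HashedChunk("fileid", "+", 0, 2, store.put(b"b\n"))

after = apply_chunk(add_b, b"", store)
assert after == b"b\n"
# Applying the inverse takes us back to where we started.
assert apply_chunk(invert(add_b), after, store) == b""
```

Commutation works the same way as for inline chunks, using the length field where the inline version used the length of the data.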
Now that no actual file content is ever part of any patch representation, we can consider some new options. One would be to store patches inline in the inventory files: this would probably inflate their sizes by a small factor. Looking at the darcs repository itself, we have 750K of compressed inventories (1.5M uncompressed). There are 43225 hunks; this would add about 6M of uncompressed (but relatively compressible) text to the inventories (considering about 150 bytes per chunk patch: 2x sha256 + a little). That is about a factor of 5. We would probably have to play around a little to find out how (un)reasonable this is. We could also cut a bit of the cost by using a less-stupid encoding of hashes than ascii-hex (aka base16)… say base64 (see RFC 3548/4).
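The back-of-the-envelope numbers above can be checked quickly in Python. The hunk count and the 150-byte figure are the ones quoted in the text; treating 1.5M as 1,500,000 bytes and using the URL-safe base64 alphabet are my own assumptions for the sake of the calculation.

```python
import base64
import hashlib

digest = hashlib.sha256(b"a\n").digest()      # raw SHA-256 is 32 bytes

hex_len = len(digest.hex())                   # base16: 2 characters per byte
b64_len = len(base64.urlsafe_b64encode(digest))  # base64: 4 characters per 3 bytes, padded
print(hex_len, b64_len)                       # prints: 64 44

HUNKS = 43225                 # hunk count quoted above
BYTES_PER_CHUNK_PATCH = 150   # 2x hex sha256 (128 bytes) plus a little
INVENTORY_BYTES = 1_500_000   # assumption: "1.5M uncompressed" read as decimal

added = HUNKS * BYTES_PER_CHUNK_PATCH
print(added)                                  # prints: 6483750

growth = (INVENTORY_BYTES + added) / INVENTORY_BYTES
print(round(growth, 1))                       # prints: 5.3

# Switching both hashes from base16 to base64 saves 20 characters each.
saved = HUNKS * 2 * (hex_len - b64_len)
print(saved)                                  # prints: 1729000
```

So base64 would shave roughly 1.7M off the estimated 6.5M of added text, before compression.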
I should probably also note that this would save some bits on conflictor representation (no copies of hunk data), and it should also solve the “big initial patch” problem: the patch itself would be O(n) in the number of files (instead of O(n) in the number of bytes of the initial tree). There are of course some drawbacks and some other advantages, but I don’t have the time to go into more detail just now. Instead, I’ll let people think about this for a while. Comments are definitely welcome (probably best addressed to our darcs-users@ mailing list).