This year, I have accepted Eric’s invitation to submit a proposal to Google Summer of Code again. I am not going to repeat the proposal itself here, so please read that as well. This post is more about filling in the technical details and setting out a plan of work.
First a clarification: even though there are no V2 prims, I call those V3, because the V1 prims have slightly (and somewhat confusingly) different semantics in Darcs1 and Darcs2 repositories; if nothing else filename encoding has changed incompatibly. There have been some commute rule changes as well, although I am not sure this wasn’t retroactively changed even for Darcs1 repos. Not important anyway.
In the rest of the post, I will lay out how I anticipate the V3 prims to work and further what I intend to implement and roughly how.
Primitive patches shall operate on a collection of abstract objects (which define the “pristine” state), each object in the collection being uniquely identified. Objects come into existence the first time they are referenced and they are never destroyed. We assign a type to each object (and object patches get corresponding arrow types).
I imagine there would be a few object types: binary, text, directory. We can add “bugs” objects and stuff like that later. The patch types should be monomorphic to simplify things. We can share implementations between different patch types if they are identical apart from their type.
Directories
A directory object is a map from names (strings) to object ids. (I say map and not bimap since there seems no good reason to prevent multiple manifestations of a single object.) We should however take care to avoid loops in the structure and such. We could even tie hardlinking to this, although that’s probably pretty useless in practice. We definitely should take care to avoid loops and similar abominations in the directory structure.
Among other repository properties, we keep a “root” object — this is the UUID of a (directory) object that represents the root of the working copy of the repository. The directory can map names to things, like text or binary files, or other directories.
Akin to the “root” object, we may want to keep track of a “preferences” object as well. Again, this would be just an UUID of a directory object. The directory object could then list individual preference files.
Some examples:
- bhunk (binary hunk) :: binary -> binary
- hunk (text hunk) :: text -> text
- bin2text :: binary -> text
- text2bin :: text -> binary
- manifest :: directory -> directory
- demanifest :: directory -> directory
Patches of different types on the same object clearly don’t need to commute. If there is a binary -> binary patch and a text -> text patch affecting the same object, they can never change their order. In fact, a -> b patches for a != b can’t realistically commute with any a -> a patch. This should drastically reduce the exponential number of commute rules we’d otherwise need to deal with, and should make the primitive commute function much more modular. In fact, only a -> a patches for same a become involved in the exponential commute definition blowup. This should be manageable.
Moreover, if we impose a map from patches to the object they affect, we can also trivially commute patches that affect different objects. We will need to generalise this later, however, since some patch types may involve multiple objects (even though our type system can’t handle that yet, either), or even involve a list of objects variable under commute.
Patch application needs to obey the type restrictions of course.
We will possibly want to attach a UUID to each primitive patch as well, so it can be readily identified. Of course, this increases the space overhead, but presumably not exceedingly so.
Hunks
The basic patch type is the hunk: the representation may be identical for both binary and text objects. What is not the same is how binary and text hunks are obtained. For text objects, we should use a whitespace-sensitive diffing algorithm (line diff, most likely; either the one we already have in darcs, or alternatively patience diff). For binary objects, we can use one of the binary delta algorithms. It may be prudent to disallow commute for binary hunks, too.
The format is still not defined, although there is a “first shot” at http://web.mornfall.net/blog/patch_formats.html. But in the end, we probably want a somewhat different format anyway, or an additional hunk type, because we apparently want both removal and addition to happen in a single Prim, for commute to make more sense. So the format could instead look like:
hunk 123 "old text" "new text"
or
hunk 123 old_hash new_hash
with the quoted-string version being a text-escaped, “inline” variant of the
patch to be used when length of old\_text
+ new\_text
is less than two
times the length of a hash. Even though patches of this textual form are not of
constant byte width, that doesn’t matter since other patch types cannot be made
to be that way either (like replace, or manifest). Any “substantial” text that
is the actual problematic part to parse is moved away by indirecting it through
hashes.
Multi-object patches
Until now, we restricted ourselves to patches that affect a single object. This may be genuinely impractical for patches that move around things, be it complete files (move) or pieces of content (hunk move). We want such patches to commute as a single unit, either commuting completely or not at all. This could be achieved differently, by adding a concept of atomic patch group. I am not entirely sure if that is right or not, but it currently seems like the more complicated option.
Therefore, we can go on adding multi-object patches. Presumably, the correct type would be (a, b) -> (c, d). Most commonly of course (i.e. in the two abovementioned cases), this would end up being (a, a) -> (a, a).
Generic commute rules
Let’s assume a function
touches :: Prim -> [UUID]
we can say that
commute (a :> b) | null (touches a `intersect` touches b) = (b :> a)
Now we can also add the type restrictions. We demand that for each touched object, all the types in both patches match for the commute to be allowed.
commute (a :> b) | not (a `typematch` b) = fail
Where
typematch a b = all match (touches a `union` touches b)
where match x | type a x /= type b x = False
| fst (type a x) /= snd (type a x) = False
| fst (type b x) /= snd (type b x) = False
| otherwise = True
With type :: Patch -> UUID -> (ObjectType, ObjectType)
.
A few more restrictions will need to be applied in case we need a patch type that may operate on object collections such as a “subtree”, whose composition can change in time. I have a rather vague idea that some kind of general mechanism could be used for designating such an object collection could be used, which could then be built into the framework, making it also possible to define that kind of patches without introducing exponentially big commute definitions.
Adding new patch/object types
With the generic commute rules, it becomes possible (and easy) to add new object types and corresponding patch types to the system, without ending up in an exponential tangle. One such object type could be “haskell” (holding a representation of a Haskell AST), or “bug” (for in-repo bug tracker, ala bugs everywhere). Another useful object type could be “set”, keeping a sorted set of lines, or a “changelog”, keeping a timestamped list of textual chunks.
Of course, there are other options to achieve a similar effect, but I like how this “falls out” more or less naturally from the design.
Optimisations
The suggested patch representation allows for some optimisations in the way patches are stored. This could include per-object buckets, detached hunk storage (which is more or less implied by the hunk format) or the like. Per-object buckets are slightly complicated by multi-object patches, but probably not ruled out. With per-object buckets, UUIDs based on hashing and minimal context for prims may be feasible, granting a number of desirable security and convenience properties to the system as a whole.
There’s a couple of issues that have been raised about the proposed design so far. I think the major one is that merge behaviour might be somewhat surprising in some respects, especially if root and preference object UUIDs are picked fresh, instead of being hardcoded: it is not clear how to manage these cases. On the repository level, any two repositories with disjoint sets of live UUIDs merge cleanly. However, they don’t necessarily merge cleanly on the working directory level.
The most counter-intuitive case seems to be when you initialise an empty
repository and then immediately pull from somewhere: you get a conflict here. A
possible solution: darcs init
should create a repo with no objects. Any and
all objects come into existence through patches. And once you have two patches
that add a root object, you get a reasonable conflict that can be reasonably
resolved. (The preferred solution probably entailing recursive merge of the two
trees, with conflicting leaves being renamed out of way; this would constitute
“conflict markup” and would need to be explicitly recorded, of course, like
with other conflicts… another desirable solution may be renaming one of the
roots and making it a child of the other, confirming the other as the
repository root.)
About preferences, that’s a bit more tricky. The tradeoff is slightly different, since pivoting repository roots sanely is probably significantly more valuable than it is for preferences. More discussion is probably needed on how to exactly arrange this, but the exact details are not important for the project. Most of the available options can be implemented relatively painlessly on top of the proposed infrastructure.
So in the above, I have outlined the design of what I intend to implement. I think I am now reasonably happy with the abstract level of design. There is of course a number of implementation challenges to be solved.
Nevertheless, since it is really getting late now, so I will keep the “battle plan” for another post. I intend to spell out roughly what and how I intend to do on the implementation level, and break this down into a couple of phases, time-wise. A very quick summary for now…
The core of the implementation work will happen directly in the darcs
repository, since each “version” of primitive patches comes with a separate
module hierarchy (even if there is only V1 so far). I can work in
Darcs.Patch.Prim.V3
more or less without disturbing other work or the release
process. I expect that coverage of the newly written code will come mostly from
QC and unit tests, which will likely live in the darcs source tree as well.
An external, experimental “client” may be an option, depending on how things go. It would probably implement a simple repository format without any of the higher level patch types (i.e. no conflict handling etc.), mostly for demonstration purposes.
A semi-independent vein of the implementation work would entail pushing out lazy IO out of the patch code in darcs per se. This is tied in with decoupling the hunk content (the text or binary blobs) from the patches themselves in memory. Even though it may well be possible to achieve using IO interleaving, an explicit approach will be more transparent, presumably much easier to reason about and, ultimately, debug, than the current lazy IO code.