patch formats

I have been thinking for a while that a completely new (experimental) “repository” format would be in order for darcs 2.4. I have previously outlined how I’d like to go about building up new things within the darcs 2.x series. Now, a darcs repository has two basic “components”: the “file” part of the layout, which is truly a repository format, and a “patch format”, which determines not only how patches are written out to disk but, more importantly, their exact semantics. Once you settle on a “patch format”, it is set in stone, and repositories with different patch formats cannot exchange patches (at least not without an intermediate conversion). This is the case between darcs-1 and darcs-2 format repositories, as they use different patch formats. The difference between darcs-1 and “hashed” repositories, as darcs calls them, is only on the file level, though: the patch formats are identical, and that’s why hashed and plain darcs-1 repositories can exchange patches just fine. (From now on I will refer to repository format and patch format as two orthogonal things, as they mostly are.)

Now I have been working on a packed repository format, one that would allow storing the repository (regardless of the patch format used) in a compact form suitable for HTTP-based retrieval. In this post, I’d like to address the other thing: the patch format. It seems worthwhile to improve our current system, since it has a number of weak spots. Currently, we have a number of “primitive” patch types and some more complicated ones: conflictors in darcs-2 or mergers in darcs-1. I am not going to talk about the complicated ones; we’ll focus on the primitive patches for now.

The primitives in darcs are addfile, rmfile, adddir, rmdir, move, replace and, most importantly, hunk. (You should be able to look up roughly what these represent somewhere else.)

Let’s address addfile and friends, the patches that create and remove files or directories. Obviously, addfile foo.txt and a different addfile foo.txt are going to conflict. Also, all hunks for foo.txt obviously depend on addfile foo.txt, which means that if you pull two branches together with nontrivial files of the same name, you are going to end up with a massive conflict (at least in terms of darcs data structures) for virtually no reason.

So my proposal is to divorce the filename from file identity: this is something that has been pondered before, I believe. The result would look something like:

hunk fileid 1
- a
+ b

This means that hunks would exist without any dependence on addfile: the abstract file would pop into existence with the first hunk touching its identity. Of course, this alone would be no good, since you would lose the relation between a working copy and whatever darcs tracks. To put that relation back on track, we add two patch types:

manifest fileid ./file/path
demanifest fileid ./file/path

A manifest patch tells darcs to associate the fileid with your working file at ./file/path. The inverse operation is demanifest, which removes the association, and with it your working copy file. The abstract identity continues to exist just fine, and can be manifested again (under the same or a different filepath). Basically, this completely de-couples the “hunk-space” from the “filepath-space”: manifest/demanifest/move (and adddir/rmdir) patches commute completely freely with hunk/replace patches. To make the de-coupling complete, you want a “manifest” of a non-existent fileid to pop that fileid into existence as well. No problem.
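The de-coupling described above can be illustrated with a small toy model. This is only a sketch under assumptions: the `Repo` class, its method names, and the use of byte offsets for hunks are all made up for illustration and are not darcs code.

```python
from dataclasses import dataclass, field


@dataclass
class Repo:
    """Toy model (hypothetical): contents keyed by fileid, paths kept separately."""
    contents: dict = field(default_factory=dict)  # fileid -> bytes ("hunk-space")
    paths: dict = field(default_factory=dict)     # path -> fileid ("filepath-space")

    def hunk(self, fileid, offset, old, new):
        # The first hunk touching a fileid pops the abstract file into existence.
        data = self.contents.setdefault(fileid, b"")
        if data[offset:offset + len(old)] != old:
            raise ValueError("hunk does not apply")
        self.contents[fileid] = data[:offset] + new + data[offset + len(old):]

    def manifest(self, fileid, path):
        if path in self.paths:
            raise ValueError("path already manifested")
        # Manifesting an unknown fileid creates it as well.
        self.contents.setdefault(fileid, b"")
        self.paths[path] = fileid

    def demanifest(self, fileid, path):
        if self.paths.get(path) != fileid:
            raise ValueError("path is not a manifestation of this fileid")
        del self.paths[path]  # the abstract file itself stays around

    def working_copy(self):
        return {path: self.contents[fid] for path, fid in self.paths.items()}


# Hunk first, then manifest ...
a = Repo()
a.hunk("f1", 0, b"", b"b\n")
a.manifest("f1", "./file/path")

# ... or manifest first, then hunk: both orders give the same repository state.
b = Repo()
b.manifest("f1", "./file/path")
b.hunk("f1", 0, b"", b"b\n")
```

Because `hunk` only ever touches `contents` and `demanifest` only ever touches `paths`, the two kinds of operation cannot get in each other's way in this model.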

Basically, this means that as far as darcs is concerned, file content manipulation is orthogonal to directory tree manipulation. This is all well and good, since it allows us to solve conflicts on each of those levels separately, without dragging in a lot of stuff from the other level. Moreover, the add-add conflict no longer exists.

As for the hunk format itself, there are also a number of issues. It uses a GNU-patch-like format with a ‘+’ or ‘-’ sign in front of each line. It will usually look like a block of ‘-’ lines followed by a block of ‘+’ lines (either of these may be empty). Parsing this format is not quite simple: you have to look up all the newlines, chop off the ‘+’ and ‘-’ signs, etc. Lots of work for darcs.

hunk ./foo.txt 1
- a
+ b

Now a friendly-to-parse alternative would be to use a chunk kind of patch:

chunk fileid - 0 2
a\n
chunk fileid + 0 2
b\n

The 0 and the 2 are an offset and a length (in bytes), respectively. What follows the patch header is a monolithic block of text of the given length, to be removed from (or pasted at) the given offset. This would produce more primitives than the original hunks, but they would be vastly simpler to process in bulk. Each basically represents a string “splice” operation. No newline shuffling whatsoever. The commutation rules should be as simple as they were with hunks: you just wibble the offset (after checking for overlap).
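To make the splice, inversion and offset-wibbling ideas concrete, here is a minimal sketch. The `Chunk` class and the `apply_chunk` and `commute` functions are hypothetical names invented for this illustration. The overlap rule is also an assumption: chunks that merely touch are allowed to commute here, and a real patch theory may well need to be stricter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    fileid: str
    sign: str     # "+" pastes data at offset, "-" removes it from there
    offset: int
    data: bytes

    def invert(self):
        flipped = "-" if self.sign == "+" else "+"
        return Chunk(self.fileid, flipped, self.offset, self.data)


def apply_chunk(chunk, content):
    """Splice chunk.data into or out of content at chunk.offset."""
    end = chunk.offset + len(chunk.data)
    if chunk.sign == "+":
        if chunk.offset > len(content):
            raise ValueError("offset past end of file")
        return content[:chunk.offset] + chunk.data + content[chunk.offset:]
    if content[chunk.offset:end] != chunk.data:
        raise ValueError("removed data does not match")
    return content[:chunk.offset] + content[end:]


def commute(p, q):
    """Given p followed by q, return (q2, p2) with the same effect, or None."""
    if p.fileid != q.fileid:
        return q, p
    n, m = len(p.data), len(q.data)
    # Footprints in the state between p and q: text inserted by p occupies
    # n bytes there, a removal by p is a single point; for q it is reversed.
    p_lo, p_hi = p.offset, p.offset + (n if p.sign == "+" else 0)
    q_lo, q_hi = q.offset, q.offset + (m if q.sign == "-" else 0)
    p_delta = n if p.sign == "+" else -n
    q_delta = m if q.sign == "+" else -m
    if q_hi <= p_lo:  # q works before p: q is unchanged, p shifts by q's size change
        return q, Chunk(p.fileid, p.sign, p.offset + q_delta, p.data)
    if q_lo >= p_hi:  # q works after p: take p's size change out of q's offset
        return Chunk(q.fileid, q.sign, q.offset - p_delta, q.data), p
    return None       # overlap: the chunks do not commute


original = b"a\nrest\n"
minus = Chunk("fileid", "-", 0, b"a\n")
plus = Chunk("fileid", "+", 0, b"b\n")
patched = apply_chunk(plus, apply_chunk(minus, original))

# Undo by applying the inverses in reverse order.
restored = apply_chunk(minus.invert(), apply_chunk(plus.invert(), patched))

# A later chunk appending at the end commutes past `plus` with a shifted offset.
tail = Chunk("fileid", "+", 7, b"end\n")
tail2, plus2 = commute(plus, tail)
```

Here `tail2` carries offset 5 instead of 7, because it now applies before the two bytes of `plus` have been pasted in.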

But now that the data in those chunks is basically a blob of (anything), there is an extra thing that can be done. Instead of keeping this data inline in the patch, we could refer to it by a hash and store it elsewhere:

chunk fileid - 0 2 hash
chunk fileid + 0 2 hash

Of course, for the example patch, this is going to inflate the patch size quite a bit. But let’s not care about that for now. We now have a chunk format that is O(1) in size (well, O(log n) for purists, given that the length needs to be represented, but we don’t care about that either). We can still commute and invert it just fine: inversion is flip-flopping the ‘-’/‘+’ sign, and commutation just wibbles the offset. Awesome. (Besides that, it also effectively obliterates any need for a “binary” kind of patch: chunks will do just fine for that, and we could even use a binary diff algorithm if we wanted…)

We will need to dereference that hash sometimes, of course: when we want to actually apply the patch, and when we want to show it to the user. The latter is a non-issue: since our processing power vastly exceeds that of the user, we can play with the patch as much as we like for presentation purposes. So when we want to apply that patch, we need to fetch its content… but wait, we already have a mechanism for storing buckets of bits by their hashes: the hashed pristine! So yes, we can just dump the data bits of the chunks into the hashed pristine (plus some wibbling of garbage collection, see my previous post about that).
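A rough sketch of how hash-referenced chunks could work follows. Everything in it is an assumption made for illustration: `HashedStore` is an in-memory stand-in for the hashed pristine, SHA-256 in hex is just one possible hash encoding, and the helper names and the whitespace-separated header layout (with a fileid that contains no spaces) are hypothetical.

```python
import hashlib


class HashedStore:
    """In-memory stand-in (hypothetical) for a content-addressed store."""

    def __init__(self):
        self._blobs = {}

    def put(self, data):
        digest = hashlib.sha256(data).hexdigest()
        self._blobs[digest] = data
        return digest

    def get(self, digest):
        return self._blobs[digest]


def make_chunk(store, fileid, sign, offset, data):
    """Store the data out of line and return the one-line patch text."""
    return f"chunk {fileid} {sign} {offset} {len(data)} {store.put(data)}"


def invert_chunk(line):
    """Inversion only flips the sign field; the store is never consulted."""
    fields = line.split()
    fields[2] = "-" if fields[2] == "+" else "+"
    return " ".join(fields)


def apply_chunk_line(store, line, content):
    """Dereference the hash and splice the data into or out of content."""
    _, _fileid, sign, offset, length, digest = line.split()
    offset, length = int(offset), int(length)
    data = store.get(digest)  # the only point where the actual bytes are needed
    if len(data) != length:
        raise ValueError("stored data has the wrong length")
    if sign == "+":
        return content[:offset] + data + content[offset:]
    if content[offset:offset + length] != data:
        raise ValueError("removed data does not match")
    return content[:offset] + content[offset + length:]


store = HashedStore()
minus = make_chunk(store, "fileid", "-", 0, b"a\n")
plus = make_chunk(store, "fileid", "+", 0, b"b\n")
result = apply_chunk_line(store, plus, apply_chunk_line(store, minus, b"a\nrest\n"))
```

Note that `invert_chunk` works on the patch text alone, which is the property that makes the patch representation independent of the size of the data it refers to.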

Now that no actual file content is ever part of any patch representation, we can consider some new options. One would be to store patches inline in the inventory files: this would probably inflate their sizes by a small factor. Looking at the darcs repository itself, we have 750K of compressed inventories (1.5M uncompressed). There are 43225 hunks, so this would add about 6M of uncompressed (but relatively compressible) text to the inventories (assuming about 150 bytes per chunk patch: 2x sha256 + a little). That is about a factor of 5. We would probably have to play around a little to find out how (un)reasonable this is. We could also cut a bit of the cost by using a less stupid encoding of hashes than ascii-hex (aka base16)… say base64 (see RFC 3548/4).
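The back-of-the-envelope numbers above can be re-derived in a few lines. This is only a sketch: the hunk count and the 150-byte figure are taken from the paragraph above, reading “1.5M” as 1.5 million bytes is an assumption, and the variable names are made up.

```python
import base64
import hashlib

digest = hashlib.sha256(b"a\n").digest()  # 32 raw bytes for any input
hex_len = len(digest.hex())               # characters per hash in ascii-hex
b64_len = len(base64.b64encode(digest))   # characters per hash in base64, with padding

hunks = 43225
bytes_per_patch = 150                     # estimate from the text: 2x sha256 + a little
added = hunks * bytes_per_patch           # extra uncompressed inventory text, in bytes

uncompressed_inventories = 1.5e6          # "1.5M", read as 1.5 million bytes (assumption)
factor = (uncompressed_inventories + added) / uncompressed_inventories

# Characters saved per patch by switching both hashes from hex to base64.
saved_per_patch = 2 * (hex_len - b64_len)

print(hex_len, b64_len, added, round(factor, 2), saved_per_patch)
```

With two hashes per patch, the encoding choice affects most of the estimated 150 bytes, which is why a denser hash encoding is worth considering.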

I should probably also note that this would save some bits on conflictor representation (no copies of hunk data), and it should also solve the “big initial patch” problem: the patch itself would be O(n) in the number of files (instead of O(n) in the number of bytes of the initial tree). There are of course some drawbacks and some other advantages, but I don’t have the time to go into more detail just now. Instead, I’ll let people think about this for a while. Comments are definitely welcome (probably best addressed to our darcs-users@ mailing list).
