This blog is kind of a final report for the summer. I had a progress report drafted for about two weeks, so let me paste that here just for the record. To read about the actual results, please skip to the “State of the Code” section below.
The Last Progress Report
This is the second (and last) progress update for this summer of code project. It was written something like a week before the pencil’s down, but I got disheartened and instead of finishing and posting, I went on to code some more. Here it goes…
Since my last report, I have decided to turn somewhat more radical again. The original plan was to stick with the darcs codebase and do most (all) of the work within that, based primarily on writing tests for the testsuite and not exposing anything of the new functionality in a user-visible fashion. I changed my mind about this. The main reason was that the test environment, as it is, makes certain properties hard to express: a typical test-suite works with assertions (HUnit) and invariants (QC). In this environment, expressing ideas like “the displayed patches are aesthetically pleasing” or “the files in the repository have reasonable shape” is impractical at best.
An alternative would have been to make myself a playground using the darcs library to expose the new code. But the fact is, our current codebase is entrenched in all kinds of legacy issues, like handling filenames and duplicated code. It makes the experimenter’s life harder than necessary, and it also involves rebuilding a whole lot of code that I never use, over and over.
All in all, I made a somewhat bold decision to cut everything that lived under
Darcs.Patch (plus a few dependencies, as few as possible) into a new library,
which I named patchlib
, in the best tradition of cmdlib
, pathlib
and
fslib
. At that point, I also removed custom file path handling from that
portion of code, removed the use of a custom Printer
(a pretty-printer
implementation) module and a made few other incompatible changes.
Of course, the testing code went along. The net result, at least for me, was an ability to build and test a much smaller piece of self-contained code. It also allowed me to experiment with APIs a bit, where those were used all over darcs, which made it, within the big codebase, impossible to advance without expending disproportionate amount of work on every change. Of course, part of that will be paid back when we decide to port darcs over to use patchlib.
I originally planned this report for the start of this week, but then I got
caught in a big refactor of the ApplyMonad/Apply classes (again). The refactor
was triggered by the need to pretty-print patches, which is not a completely
easy task (made more complicated by the fact that UUIDs are meaningless for the
user, so formatting patches without context is essentially useless
now). Anyway, I am now much happier with how the ApplyMonad class looks (the
ApplyMonadBase
thing was genuinely hideous… good riddance). As a net result
the ApplyMonad class and, even more importantly, the ApplyMonad transformer
(used in applyTo{Tree,State}
among other things) is substantially easier to
use on the client side (while maybe very slightly harder on the provider side,
it is also much clearer, IMHO). Overall win.
As for formatting and summarising patches, I have created a new Display
class
(I plan to nuke the existing viewing classes more or less, when I manage to
make meaningful Display
instances for V1 Prim). The API lets you format patches
based on their ending or starting context. Since bits of the patches need to be
fetched from the hashed store, the display needs to run in a LoadMonad
. Since
any reasonable patch formatter also needs to pass state around (and since the
actual type of state passed around will be different for different Prim
implementations, we hide the state in a type family of monad transformers; this
also opens the option to use something else than StateT
when appropriate).
(This is the end of the “progress” post. The remainder more or less describes the end state of the project.)
State of the Code
You can look at the code in my darcs repositories. Specifically, to play
around with a “demo”, you need pathlib
, fslib
, patchlib
, cmdlib
and
gorsvet
. A bit more about the individual libraries:
pathlib
is small library for handling file paths; it uses strict bytestring representation, and adds a certain amount of type safety and a few form invariantsfslib
is the successor of hashed-storage; it deals with accessing the filesystem in an efficient manner, with good support for hashing files and using the hashes for efficient comparisonspatchlib
, as outlined above, came into existence as a “fork” of the Darcs.Patch hierarchy, add and take some; the code is self-contained, and has a set of QC/HUnit tests, although these definitely need extending to cover more of the functionality (and some of the functionality needs to be removed)cmdlib
is my commandline parsing library, and is needed by the test clientgorsvet
(see next point)gorsvet
is an experimental client using version 3 primitive patches, and a very simple hashed repository format; its main purpose is to demo the implementation of V3 prims, in addition to existing QC coverage;
About patchlib
One of the main problems of the darcs codebase today is insufficient modularity, and patchlib is an experiment in an attempt to address that concern. Efforts to bring about significant improvement of the situation from “inside” (by restructuring existing code) have to date failed. Even though there have been local improvements, the overall problem stubbornly persists.
Hand in hand with modularity problems come issues with unclear and underspecified (both internal end external) APIs. Since the separation between different components of darcs is blurry at best, the pressure to introduce clean, testable interfaces is minimal. The external library, on the other hand, is forced to put up a presentable façade. Luckily, the Patch subhierarchy in darcs is, compared to remainder of darcs, in a fairly good shape in this respect.
Eventually, patchlib should provide leverage to work with darcs(-style) patches, including at least:
- Implementation of primitive patches, both version 1 (as used by darcs 1 and 2), compatible with existing repositories, and version 3 (with better semantics and more efficient representation; subject of this SoC project).
- Implementation of the “real” patches, in version 1 (as used by darcs 1) and version 2 (as used by darcs 2), and at some point, version 3 (of as yet unspecified properties).
- Implementation of PatchInfo and “named” patches, which implement changesets in the darcs sense, and allow tracking their metadata. With PatchInfo (and to a limited extent, patch implementations themselves) goes a set of matchers and support code for interactive dependency-aware patch selection.
With a homogenous set of APIs, mostly mediated by type classes and appropriate instances, to:
- Create primitive patches (mainly through diffing, but also by
version-specific direct construction). The most important type class is
Diff
, defined inData.Patch.Diff
. The interface allows multiple equivalent representations of a single change, intended to provide an interface for heuristic detection of high-level patch types like hunk move or token replace. - Create real and named patches by building them from constituent primitives.
- Store and load all types of patches. This role used to be filled by the
ShowPatchBasic
andShowPatch
classes, but is being superseded byStore
andLoad
instances (the classes are provided byfslib
for generic hash-based stores). - Format (and summarise) patches for user-friendly display, using relevant
context to improve the legibility and usefulness of the output. The new
class to achieve this is
Display
. Previously, this role was filled byShowPatch
, with itsshowContextPatch
method (which is, however, significantly under-used by darcs). Since V3 primitive patches are substantially harder to read without context, the interface for user-level rendering of patches mediated byDisplay
mandates context use. - Commute, invert (all patch types) and merge (“real” and “named”
patches). The primary classes are
Commute
andMerge
, currently both faithful copies of their darcs versions. They will however need to be changed to allow these operations access to aLoadMonad
, for the benefit of delayed, on-demand loading of extensive text data (applies to hunks at the moment). Currently, theCommute
class allows this, by substituting the Maybe monad for aCommuteMonad
constraint, when compared to the original darcs definition. - Load, store and convert (to/from legacy format)
PatchInfo
objects that are used to track metadata inNamed
patches.
About V3 Prims in patchlib
The version 3 primitives in their current incarnation in patchlib have the following traits (based on the list from the proposal):
- File content and file location/existence are tracked separately, tied together through universally-unique identifiers (of 256 bits). This avoids a number of conflict scenarios and may allow further novel features, like non-history-disruptive project splits and merges or subtree “checkouts”.
- Hunk content is detached from the hunk patch itself through a hash (unless shorter than twice the length of the hash representation). This makes the patch files themselves very compact and allows most commute rules to avoid loading the hunk content altogether (improving speed and reducing memory footprint). It also makes handling of binary files vastly more efficient.
- The hunks are byte-based, making it substantially more efficient to apply a patch to a file, since no newline scanning needs to happen.
- The same hunk format that is used for text files is reused for binary hunks, basically encoding a range substitution operation on the binary file. A binary delta algorithm can be plugged in to compute more efficient binary hunks, although even full content replace is much cheaper than the binary patch type available in V1 prims.
- A set of primitive patches implementing a hunk “move” operation is implemented, and is passing the generic commutation / application tests. Unfortunately, at this point there is no diffing algorithm to detect such moves, although one is planned for the future.
- Hunk and hunk move patches are the only content-editing patches available to date. Further patch types can be added to the library without restriction as long as the format is not frozen (at least token replace and indentation patch types are planned before any such freeze). New object types and accompanying patch types may be added in later backward-compatible revisions of the format.
About gorsvet
Gorsvet is a toy implementation of a repository layer that uses V3 primitive
patches. (Un)fortunately, the V3 prims violate fundamental assumptions made by
the repository and command layers of darcs, which means that an integration is
substantially more expensive than fits a single SoC project. However, as
outlined above, it is useful to be able to play around with the V3 prim patches
in a realistic environment. Therefore, gorsvet. I made it into a rather thin
user shell based on cmdlib
and a prototype repository layer. The UI more or
less resembles darcs (without interactivity, since that’d be superfluous for a
tool of this scope). You are of course welcome to try the experimental tool
out: the online help should give you an idea how to use it.
One thing I would like to discuss a bit here is the repository format. Since the patch types are incompatible anyway, we are fully liberated from backward compatibility considerations. The next darcs repository format can be designed from scratch, keeping in mind the shortcomings of the previous two formats. The implementation in gorsvet is a peek at what the result might look like. Anyway, we still need a better “composite” patch layer, which represents conflicts (and sits one level above primitive patches), since the current (version 2) composite patches in darcs are quite unsatisfactory. That also means we have plenty of time to play around with the repository format, which is more or less independent of the composite patch format.
As for the format itself, I went for as simple as possible (but no simpler). So
far, I have 2 files and a hashed store: .gorsvet/hashed
is a sha256-based,
uncompressed “dumping grounds” for stuff of all kinds. The files are (not
implemented as of this writing) .gorsvet/index
(with the same purpose as
_darcs/index
— fast, efficient working copy access) and .gorsvet/meta
— a
small set of root pointers. This has two very beneficial side effects: very
atomic oupdates and transactional semantics (free transaction
rollback). Compression and garbage collection of the hashed store can then be
sorted out separately (and does not affect semnatics of the repository either
way). There are currently 4 root pointers in meta
: shadow
, pristine
,
inventory
and order
. The pristine
is the same thing as in darcs, shadow
is a similar thing, but at any time reflects the current state of the working
copy (it is automatically updated from working every time, before anything else
happens with the repository). The reason is mainly that the working copy has
entirely wrong semantics for diffing V3 primitive patches: most importantly,
UUID tracking is implemented through shadow
.
The remaining two root pointers represent two new data structures (and
inventory
is different from what darcs calls an inventory). The order
simply lists patchinfo hashes (a handle for each patch that does not change
through commutation) in the application order for the repository: in this
sense, order
replaces hashed_inventory
known from darcs. It is pretty
compact, but on its own also useless (since it give us no way to get the
patches themselves). The inventory
then, on the other hand, is an efficient
map (currently a sorted pair list, written and read as a Data.Map) from
patchinfo hashes to the current patch storage hashes. Moreover, the patchinfo
objects are, unlike in darcs, stored in the hashed store as separate entities,
and the patchinfo hash can be used to efficiently fetch the patchinfo
itself. Therefore, the named patch can be assembled from inventory
and
order
puts the patch into correct context.
Apart from storing and showing things, I have also implemented a “pull” command for gorsvet, but it’s currently fairly unusable, since any conflict automatically means failure (there is no layer to handle conflicts; we could use version 2 darcs patches, but I think it would constitute a dangerous slippery slope: we definitely want a more solid implementation, and also want to avoid a double transition).
Future Work
The obvious future work lies in the conflict handling. There are two main options in this regard: either re-engineer a patch-level, commute-based representation of conflicts (in the spirit of mergers and conflictors), as V3 “composite” patches, or alternatively, use a non-patch based mechanism for tracking conflicts and resolutions. It’s still somewhat early to decide which is a better choice, and they come with different trade-offs. Nevertheless, the decision, and the implementation, constitute a major step towards darcs 3.
The other major piece of work that remains is the repository format: in this area, I have done some research in both the previous and this year’s project, but there are no definitive answers, even less an implementation. I think we now have a number of good ideas on how to approach this. We do need to sort out a few issues though, and the decision on the conflict layer also influences the shape of the repository.
Each of these two open problems is probably about the size of an ambitious SoC project. On top of that, a lot of integration work needs to happen to actually make real use of the advancements. We shall see how much time and resources can be found for advancing this cause, but I am relatively optimistic: the primitive level has turned out fairly well, and to me it seems that shedding the shackles of legacy code sprawl can boost the project as a whole significantly forward.
(PS: While I agree, on the theoretical level, that nuking significant amounts of legacy code carries non-trivial risks, for a small volunteer project like darcs it is imperative to be fun. And a trench war against legacy code is not fun. Writing new things and exploring possibilities, on the other hand, is fun. Which means we need a bit more of the latter and a bit less of the former, even though the project sits in the more conservative camp — afterall, we handle rather precious data… Even when taking natural conservativism into account, a well-motivated, honest subsystem rewrite is better than a half-hearted, someone-has-to-do-it maintenance of a piece of code that everyone hates…)