soc reloaded: outcomes

This blog post is a kind of final report for the summer. I had a progress report sitting in draft for about two weeks, so let me paste that here just for the record. To read about the actual results, please skip to the “State of the Code” section below.

The Last Progress Report

This is the second (and last) progress update for this summer of code project. It was written about a week before the pencils-down date, but I got disheartened and instead of finishing and posting, I went on to code some more. Here it goes…

Since my last report, I have decided to turn somewhat more radical again. The original plan was to stick with the darcs codebase and do most (all) of the work within that, based primarily on writing tests for the testsuite and not exposing anything of the new functionality in a user-visible fashion. I changed my mind about this. The main reason was that the test environment, as it is, makes certain properties hard to express: a typical test-suite works with assertions (HUnit) and invariants (QC). In this environment, expressing ideas like “the displayed patches are aesthetically pleasing” or “the files in the repository have reasonable shape” is impractical at best.

An alternative would have been to make myself a playground using the darcs library to expose the new code. But the fact is, our current codebase is entrenched in all kinds of legacy issues, like handling filenames and duplicated code. It makes the experimenter’s life harder than necessary, and it also involves rebuilding a whole lot of code that I never use, over and over.

All in all, I made a somewhat bold decision to cut everything that lived under Darcs.Patch (plus a few dependencies, as few as possible) into a new library, which I named patchlib, in the best tradition of cmdlib, pathlib and fslib. At that point, I also removed custom file path handling from that portion of code, removed the use of a custom Printer (a pretty-printer implementation) module and made a few other incompatible changes.

Of course, the testing code went along. The net result, at least for me, was an ability to build and test a much smaller piece of self-contained code. It also allowed me to experiment with APIs a bit. Many of those APIs are used all over darcs, so within the big codebase it was impossible to advance without expending a disproportionate amount of work on every change. Of course, part of that will be paid back when we decide to port darcs over to use patchlib.

I originally planned this report for the start of this week, but then I got caught in a big refactor of the ApplyMonad/Apply classes (again). The refactor was triggered by the need to pretty-print patches, which is not a completely easy task (made more complicated by the fact that UUIDs are meaningless for the user, so formatting patches without context is essentially useless now). Anyway, I am now much happier with how the ApplyMonad class looks (the ApplyMonadBase thing was genuinely hideous… good riddance). As a net result the ApplyMonad class and, even more importantly, the ApplyMonad transformer (used in applyTo{Tree,State} among other things) is substantially easier to use on the client side (while maybe very slightly harder on the provider side, it is also much clearer, IMHO). Overall win.

As for formatting and summarising patches, I have created a new Display class (I plan to more or less nuke the existing viewing classes once I manage to make meaningful Display instances for V1 Prim). The API lets you format patches based on their ending or starting context. Since bits of the patches need to be fetched from the hashed store, the display needs to run in a LoadMonad. Any reasonable patch formatter also needs to pass state around, and the actual type of that state differs between Prim implementations, so we hide the state in a type family of monad transformers; this also opens the option to use something other than StateT when appropriate.

(This is the end of the “progress” post. The remainder more or less describes the end state of the project.)

State of the Code

You can look at the code in my darcs repositories. Specifically, to play around with a “demo”, you need pathlib, fslib, patchlib, cmdlib and gorsvet. A bit more about the individual libraries:

  • pathlib is a small library for handling file paths; it uses a strict bytestring representation, and adds a certain amount of type safety and a few form invariants
  • fslib is the successor of hashed-storage; it deals with accessing the filesystem in an efficient manner, with good support for hashing files and using the hashes for efficient comparisons
  • patchlib, as outlined above, came into existence as a “fork” of the Darcs.Patch hierarchy, give or take a few pieces; the code is self-contained and has a set of QC/HUnit tests, although these definitely need extending to cover more of the functionality (and some of the functionality needs to be removed)
  • cmdlib is my commandline parsing library, and is needed by the test client gorsvet (see next point)
  • gorsvet is an experimental client using version 3 primitive patches and a very simple hashed repository format; its main purpose is to demo the implementation of V3 prims, in addition to the existing QC coverage

About patchlib

One of the main problems of the darcs codebase today is insufficient modularity, and patchlib is an experimental attempt to address that concern. Efforts to bring about significant improvement of the situation from “inside” (by restructuring existing code) have to date failed. Even though there have been local improvements, the overall problem stubbornly persists.

Hand in hand with modularity problems come issues with unclear and underspecified (both internal and external) APIs. Since the separation between different components of darcs is blurry at best, the pressure to introduce clean, testable interfaces is minimal. An external library, on the other hand, is forced to put up a presentable façade. Luckily, the Patch subhierarchy in darcs is, compared to the remainder of darcs, in fairly good shape in this respect.

Eventually, patchlib should provide leverage to work with darcs(-style) patches, including at least:

  • Implementation of primitive patches, both version 1 (as used by darcs 1 and 2), compatible with existing repositories, and version 3 (with better semantics and more efficient representation; subject of this SoC project).
  • Implementation of the “real” patches, in version 1 (as used by darcs 1) and version 2 (as used by darcs 2), and at some point, version 3 (of as yet unspecified properties).
  • Implementation of PatchInfo and “named” patches, which implement changesets in the darcs sense, and allow tracking their metadata. With PatchInfo (and to a limited extent, patch implementations themselves) goes a set of matchers and support code for interactive dependency-aware patch selection.

With a homogeneous set of APIs, mostly mediated by type classes and appropriate instances, to:

  • Create primitive patches (mainly through diffing, but also by version-specific direct construction). The most important type class is Diff, defined in Data.Patch.Diff. The interface allows multiple equivalent representations of a single change, intended to provide an interface for heuristic detection of high-level patch types like hunk move or token replace.
  • Create real and named patches by building them from constituent primitives.
  • Store and load all types of patches. This role used to be filled by the ShowPatchBasic and ShowPatch classes, but is being superseded by Store and Load instances (the classes are provided by fslib for generic hash-based stores).
  • Format (and summarise) patches for user-friendly display, using relevant context to improve the legibility and usefulness of the output. The new class to achieve this is Display. Previously, this role was filled by ShowPatch, with its showContextPatch method (which is, however, significantly under-used by darcs). Since V3 primitive patches are substantially harder to read without context, the interface for user-level rendering of patches mediated by Display mandates context use.
  • Commute, invert (all patch types) and merge (“real” and “named” patches). The primary classes are Commute and Merge, currently both largely faithful copies of their darcs versions. They will however need to be changed to allow these operations access to a LoadMonad, for the benefit of delayed, on-demand loading of extensive text data (applies to hunks at the moment). The Commute class already allows this: compared to the original darcs definition, it replaces the Maybe monad with a CommuteMonad constraint.
  • Load, store and convert (to/from legacy format) PatchInfo objects that are used to track metadata in Named patches.
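
To make the commute and invert operations from the list above a bit more concrete, here is a minimal sketch on a simplified byte-range hunk. It is written in Python for brevity rather than in patchlib's Haskell, and the names Hunk, apply_hunk, invert and commute are hypothetical, not patchlib's API. The rule used here (hunks that do not touch commute with an offset shift, everything else fails) is a deliberately conservative assumption and not the actual patchlib rule set; returning None stands in for the failure case that Maybe expresses in the darcs definition.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Hunk:
    offset: int   # byte offset where the change starts
    old: bytes    # bytes removed at that offset
    new: bytes    # bytes inserted in their place


def apply_hunk(data: bytes, h: Hunk) -> bytes:
    """Replace h.old with h.new at h.offset, checking the context first."""
    end = h.offset + len(h.old)
    if data[h.offset:end] != h.old:
        raise ValueError("hunk does not apply")
    return data[:h.offset] + h.new + data[end:]


def invert(h: Hunk) -> Hunk:
    """The inverse hunk swaps the removed and inserted bytes."""
    return Hunk(h.offset, h.new, h.old)


def commute(p: Hunk, q: Hunk) -> Optional[Tuple[Hunk, Hunk]]:
    """Given p followed by q, return (q2, p2) with the same effect, or None."""
    shift_p = len(p.new) - len(p.old)
    shift_q = len(q.new) - len(q.old)
    if q.offset > p.offset + len(p.new):
        # q lies strictly after p: undo p's length change in q's offset.
        return Hunk(q.offset - shift_p, q.old, q.new), p
    if q.offset + len(q.old) < p.offset:
        # q lies strictly before p: p moves by q's length change.
        return q, Hunk(p.offset + shift_q, p.old, p.new)
    # Touching or overlapping hunks: refuse to commute (assumed rule).
    return None
```

Applying p and then q to a file gives the same bytes as applying the commuted pair q2 and then p2, which is the property the generic commutation tests check for.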

About V3 Prims in patchlib

The version 3 primitives in their current incarnation in patchlib have the following traits (based on the list from the proposal):

  • File content and file location/existence are tracked separately, tied together through universally-unique identifiers (of 256 bits). This avoids a number of conflict scenarios and may allow further novel features, like non-history-disruptive project splits and merges or subtree “checkouts”.
  • Hunk content is detached from the hunk patch itself through a hash (unless shorter than twice the length of the hash representation). This makes the patch files themselves very compact and allows most commute rules to avoid loading the hunk content altogether (improving speed and reducing memory footprint). It also makes handling of binary files vastly more efficient.
  • The hunks are byte-based, making it substantially more efficient to apply a patch to a file, since no newline scanning needs to happen.
  • The same hunk format that is used for text files is reused for binary hunks, basically encoding a range substitution operation on the binary file. A binary delta algorithm can be plugged in to compute more efficient binary hunks, although even full content replace is much cheaper than the binary patch type available in V1 prims.
  • A set of primitive patches implementing a hunk “move” operation is implemented, and is passing the generic commutation / application tests. Unfortunately, at this point there is no diffing algorithm to detect such moves, although one is planned for the future.
  • Hunk and hunk move patches are the only content-editing patches available to date. Further patch types can be added to the library without restriction as long as the format is not frozen (at least token replace and indentation patch types are planned before any such freeze). New object types and accompanying patch types may be added in later backward-compatible revisions of the format.
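
The hunk representation described in the list above can be sketched as follows. This is an illustration in Python, not patchlib's Haskell code: the names detach, fetch, DetachedHunk, make_hunk and apply_hunk are hypothetical, the dict stands in for the hashed store, and the inline threshold assumes that the "hash representation" is a 64-character hex SHA-256 digest, which the text above does not specify.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Tuple

HASH_REPR_LEN = 64  # assumption: hex-encoded SHA-256 digest

# Hypothetical in-memory stand-in for the hashed store.
store: Dict[str, bytes] = {}

Ref = Tuple[str, object]  # ("inline", bytes) or ("hash", hex digest)


def detach(content: bytes) -> Ref:
    """Keep short content inline; move longer content into the store."""
    if len(content) < 2 * HASH_REPR_LEN:
        return ("inline", content)
    digest = hashlib.sha256(content).hexdigest()
    store[digest] = content
    return ("hash", digest)


def fetch(ref: Ref) -> bytes:
    """Resolve a reference, loading from the store only when needed."""
    kind, value = ref
    return value if kind == "inline" else store[value]


@dataclass(frozen=True)
class DetachedHunk:
    offset: int  # byte offset into the file, no line counting involved
    old: Ref     # reference to the bytes being removed
    new: Ref     # reference to the bytes being inserted


def make_hunk(offset: int, old: bytes, new: bytes) -> DetachedHunk:
    return DetachedHunk(offset, detach(old), detach(new))


def apply_hunk(data: bytes, h: DetachedHunk) -> bytes:
    """Byte-range substitution; works the same for text and binary files."""
    old, new = fetch(h.old), fetch(h.new)
    end = h.offset + len(old)
    if data[h.offset:end] != old:
        raise ValueError("hunk does not apply")
    return data[:h.offset] + new + data[end:]
```

Only apply_hunk needs the actual bytes. Code that only inspects offsets and references never has to call fetch, which is the property that lets most commute rules skip loading hunk content.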

About gorsvet

Gorsvet is a toy implementation of a repository layer that uses V3 primitive patches. (Un)fortunately, the V3 prims violate fundamental assumptions made by the repository and command layers of darcs, which means that an integration is substantially more expensive than fits a single SoC project. However, as outlined above, it is useful to be able to play around with the V3 prim patches in a realistic environment. Therefore, gorsvet. I made it into a rather thin user shell based on cmdlib and a prototype repository layer. The UI more or less resembles darcs (without interactivity, since that’d be superfluous for a tool of this scope). You are of course welcome to try the experimental tool out: the online help should give you an idea how to use it.

One thing I would like to discuss a bit here is the repository format. Since the patch types are incompatible anyway, we are fully liberated from backward compatibility considerations. The next darcs repository format can be designed from scratch, keeping in mind the shortcomings of the previous two formats. The implementation in gorsvet is a peek at what the result might look like. Anyway, we still need a better “composite” patch layer, which represents conflicts (and sits one level above primitive patches), since the current (version 2) composite patches in darcs are quite unsatisfactory. That also means we have plenty of time to play around with the repository format, which is more or less independent of the composite patch format.

As for the format itself, I went for as simple as possible (but no simpler). So far, I have 2 files and a hashed store: .gorsvet/hashed is a sha256-based, uncompressed “dumping grounds” for stuff of all kinds. The files are .gorsvet/index (not implemented as of this writing; it has the same purpose as _darcs/index, namely fast, efficient working copy access) and .gorsvet/meta, a small set of root pointers. This has two very beneficial side effects: atomic updates and transactional semantics (free transaction rollback). Compression and garbage collection of the hashed store can then be sorted out separately (and do not affect the semantics of the repository either way). There are currently 4 root pointers in meta: shadow, pristine, inventory and order. The pristine is the same thing as in darcs; the shadow is a similar thing, but at any time reflects the current state of the working copy (it is automatically updated from the working copy every time, before anything else happens with the repository). The reason is mainly that the working copy has entirely wrong semantics for diffing V3 primitive patches: most importantly, UUID tracking is implemented through shadow.
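
A rough sketch of why a hashed store plus a small file of root pointers gives atomic updates and free rollback is shown below. It is in Python rather than gorsvet's Haskell, and everything beyond the directory layout named above is an assumption: the function names are hypothetical, and encoding meta as JSON is chosen only for the illustration (the real on-disk encoding is not described here).

```python
import hashlib
import json
import os


def init_repo(root: str) -> str:
    """Create the .gorsvet directory with an empty hashed store."""
    repo = os.path.join(root, ".gorsvet")
    os.makedirs(os.path.join(repo, "hashed"), exist_ok=True)
    return repo


def put(repo: str, data: bytes) -> str:
    """Store an object under its SHA-256 hash; existing objects are reused."""
    digest = hashlib.sha256(data).hexdigest()
    path = os.path.join(repo, "hashed", digest)
    if not os.path.exists(path):
        with open(path, "wb") as f:
            f.write(data)
    return digest


def get(repo: str, digest: str) -> bytes:
    with open(os.path.join(repo, "hashed", digest), "rb") as f:
        return f.read()


def write_meta(repo: str, roots: dict) -> None:
    """Commit a transaction by atomically swapping in new root pointers."""
    tmp = os.path.join(repo, "meta.tmp")
    with open(tmp, "w") as f:
        json.dump(roots, f, sort_keys=True)  # JSON is an assumed encoding
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, os.path.join(repo, "meta"))


def read_meta(repo: str) -> dict:
    with open(os.path.join(repo, "meta")) as f:
        return json.load(f)
```

New objects are only ever added to the store, so nothing a reader can reach changes until write_meta replaces the meta file in one step. A transaction that is abandoned simply never calls write_meta: the old roots stay valid and the orphaned objects are left for a later garbage collection.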

The remaining two root pointers represent two new data structures (and inventory is different from what darcs calls an inventory). The order simply lists patchinfo hashes (a handle for each patch that does not change through commutation) in the application order for the repository: in this sense, order replaces hashed_inventory known from darcs. It is pretty compact, but on its own also useless (since it gives us no way to get the patches themselves). The inventory, on the other hand, is an efficient map (currently a sorted pair list, written and read as a Data.Map) from patchinfo hashes to the current patch storage hashes. Moreover, the patchinfo objects are, unlike in darcs, stored in the hashed store as separate entities, and the patchinfo hash can be used to efficiently fetch the patchinfo itself. Therefore, a named patch can be assembled from the inventory, and the order puts the patch into its correct context.
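
The interplay of order and inventory can be sketched like this. Again this is Python for illustration only, with hypothetical names (put, record, assemble), a dict standing in for the hashed store, and opaque byte strings standing in for real patchinfo and patch encodings.

```python
import hashlib
from typing import Dict, List, Tuple


def put(store: Dict[str, bytes], data: bytes) -> str:
    """Store an object under its SHA-256 hash and return the hash."""
    digest = hashlib.sha256(data).hexdigest()
    store[digest] = data
    return digest


def record(store: Dict[str, bytes], order: List[str],
           inventory: Dict[str, str], info: bytes, patch: bytes) -> str:
    """Add a named patch: the patchinfo hash is its stable handle."""
    info_hash = put(store, info)
    inventory[info_hash] = put(store, patch)  # handle -> current storage hash
    order.append(info_hash)                   # position in application order
    return info_hash


def assemble(store: Dict[str, bytes], order: List[str],
             inventory: Dict[str, str]) -> List[Tuple[bytes, bytes]]:
    """Rebuild (patchinfo, patch) pairs in application order."""
    return [(store[h], store[inventory[h]]) for h in order]
```

When two patches are commuted, their handles swap places in order and the inventory entries are pointed at the newly stored representations. The patchinfo hashes themselves stay the same, so anything that refers to a patch by its handle remains valid.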

Apart from storing and showing things, I have also implemented a “pull” command for gorsvet, but it’s currently fairly unusable, since any conflict automatically means failure (there is no layer to handle conflicts; we could use version 2 darcs patches, but I think it would constitute a dangerous slippery slope: we definitely want a more solid implementation, and also want to avoid a double transition).

Future Work

The obvious future work lies in the conflict handling. There are two main options in this regard: either re-engineer a patch-level, commute-based representation of conflicts (in the spirit of mergers and conflictors), as V3 “composite” patches, or alternatively, use a non-patch based mechanism for tracking conflicts and resolutions. It’s still somewhat early to decide which is a better choice, and they come with different trade-offs. Nevertheless, the decision, and the implementation, constitute a major step towards darcs 3.

The other major piece of work that remains is the repository format: in this area, I have done some research in both the previous and this year’s project, but there are no definitive answers, let alone an implementation. I think we now have a number of good ideas on how to approach this. We do need to sort out a few issues though, and the decision on the conflict layer also influences the shape of the repository.

Each of these two open problems is probably about the size of an ambitious SoC project. On top of that, a lot of integration work needs to happen to actually make real use of the advancements. We shall see how much time and resources can be found for advancing this cause, but I am relatively optimistic: the primitive level has turned out fairly well, and to me it seems that shedding the shackles of legacy code sprawl can boost the project as a whole significantly forward.

(PS: While I agree, on the theoretical level, that nuking significant amounts of legacy code carries non-trivial risks, for a small volunteer project like darcs it is imperative to be fun. And a trench war against legacy code is not fun. Writing new things and exploring possibilities, on the other hand, is fun. Which means we need a bit more of the latter and a bit less of the former, even though the project sits in the more conservative camp; after all, we handle rather precious data… Even when taking natural conservatism into account, a well-motivated, honest subsystem rewrite is better than a half-hearted, someone-has-to-do-it maintenance of a piece of code that everyone hates…)

