
soc reloaded: progress 1

Oops! There has been no update from me for a long while, although I have been busy with code and patches. So far, I have tackled two areas. The first was generalising and cleaning up the existing Patch testsuite, so that I could later apply it to the in-progress V3 Prim code. This has been quite successful, although it took a little longer than I would have liked. With the new structure, QuickCheck properties for single patches, commutes and merges can be applied to any concrete patch type that supports the respective operations. As a result, I now have coverage of V3 Prims both as standalone patches (for single-patch and commute properties) and when used with RealPatch (the implementation of non-primitive darcs 2 patches).
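
As an illustration (not the actual testsuite code; the class and property names here are hypothetical, and darcs's witnessed pairs are elided), a property written this way is stated once against a small interface, and each concrete patch type just supplies an instance plus an Arbitrary generator:

import Data.Maybe (isJust)
import Test.QuickCheck

-- Hypothetical, simplified interface: the real darcs commute works on
-- witnessed pairs (a :> b), which we ignore here.
class Commutable p where
  commute :: (p, p) -> Maybe (p, p)

-- Commuting a pair twice should give back the original pair; written once,
-- usable for V1 prims, V3 prims, RealPatch, ... given Eq/Show/Arbitrary.
prop_commuteTwice :: (Commutable p, Eq p, Show p) => (p, p) -> Property
prop_commuteTwice pair =
  isJust (commute pair) ==>
    (commute =<< commute pair) === Just pair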

The second was then to make all these tests pass. Since I finished the respective Read/Show instances yesterday, all tests now pass; commute, apply and friends were done a couple of days before that. So the next step is to write more tests that can demonstrate where the code needs to be augmented.

Hunk storage

I have slightly changed the actual (on-disk) hunk format. For now, “detached” hunk-text storage is not supported; I am keeping that for after the midterms. The format, however, already anticipates it being possible. We do need a new monad (class) for writing patches, though, since the Show instance is somewhat inadequate: the detached storage needs to be handled somehow.

Anyway, the format now looks like this:

hunk NNN .whitespace_encoded_old .new_text

(we use the same method for encoding whitespace here as we do in file names). We might want to change to a format that is faster to produce and parse, since hunks typically do contain whitespace; on the other hand, only very short hunks will be encoded in this form. Also, an empty string is encoded as “!”. So the hunk text (old and new) can take one of the following forms: .whitespace_encoded, ! (empty) or “@”. The last form, @, takes two parameters, a hash and a length, e.g. @123456789ABCD<...>:65000. That means that most of the time we can ignore the hash and only fool around with the numbers (offsets and lengths) when commuting patches. Of course, applying them is a different matter; nevertheless, we still have a substantial advantage over V1 prims there, since each hunk application is a simple splice/concatenate operation on bytestrings. With V1 prims, we had to chop up the hunk at newlines and remove the +/- signs.
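
To make the three forms concrete, here is a sketch (not the real darcs code: the type and function names are mine, and the whitespace encoding below is only an approximation of the one darcs uses for file names):

import qualified Data.ByteString.Char8 as BC

-- the three on-disk representations of a hunk's old/new text
data HunkText
  = Inline BC.ByteString            -- ".whitespace_encoded_text"
  | Empty                           -- "!"
  | Detached BC.ByteString Int      -- "@hash:length", text stored elsewhere

renderHunkText :: HunkText -> BC.ByteString
renderHunkText (Inline t)       = BC.cons '.' (encodeWhite t)
renderHunkText Empty            = BC.pack "!"
renderHunkText (Detached h len) =
  BC.concat [BC.pack "@", h, BC.pack ":", BC.pack (show len)]

-- crude stand-in for the file-name whitespace encoding mentioned above:
-- escape whitespace (and the escape character) as "\ord\"
encodeWhite :: BC.ByteString -> BC.ByteString
encodeWhite = BC.concatMap escape
  where
    escape c
      | c <= ' ' || c == '\\' = BC.pack ('\\' : show (fromEnum c) ++ "\\")
      | otherwise             = BC.singleton c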

As for the implementation, this means we need to abstract commute over a monad class which, besides commutation failure, can express a “fetch text for this hash” operation. This might be simpler than it was in the Apply case, where a lot of code had to be modified to accommodate V3 Prims: the existing commute already runs in the Maybe monad, and we can make Maybe an instance of CommuteMonad. Nevertheless, to make actual use of this, top-level code that runs commutes will definitely need to be modified, and in effect all the intermediate code that relies on Maybe for commutation will need to be modified as well. This part could actually turn out hairier than Apply was.

The question of coalesce

Apart from apply and commute, there is currently one more “core” operation in Darcs patches: coalesce. This operation takes two primitive patches and decides whether they can be merged into a single primitive patch. This can only happen if the patches do not commute. Unfortunately, coalesce does not preserve commutation behaviour: move a b :> move b c gets coalesced into move a c, which modifies its commutation behaviour with (add, remove, move) patches mentioning b. On the other hand, coalesce is normally only used during operations like amend-record, rebase and when handling the “pending” patch. All these operations modify the identity of a patch, and therefore it shouldn’t matter much that coalesce fails to commute with commute.

On the downside, coalesce is currently a first-class operation that cannot be derived from the remaining ones. Most importantly, it is “redundant” with the diff operation, which constructs patches from two states. The problem is that with the current (V1) prims, there is no diff operation for some patch types. If we had a reliable diff for all patch types, we could implement coalesce in terms of diff, commute and apply (pseudocode):

coalesce context (a :> b)
  | isNothing (commute (a :> b)) = diff context (apply context $ a :> b)
  | otherwise                    = a :> b

This would work as long as there was a deterministic diff operation, i.e. one with the property that diff ctx (apply ctx a) == a for every primitive patch a. This diff operation does not need to be (and indeed cannot be made) universal over different patch types, but fortunately that does not matter: we can always call it with a specific patch type as one of its inputs:

diff TextHunk ctx1 ctx2
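
The determinism requirement, stated as a (hypothetical) property over a simplified interface that ignores darcs's context witnesses:

{-# LANGUAGE TypeFamilies #-}

-- Hypothetical, simplified signatures; State, applyTo and diffOf stand in
-- for the real context/apply/diff machinery.
class Diffable prim where
  type State prim
  applyTo :: State prim -> prim -> State prim
  diffOf  :: State prim -> State prim -> Maybe prim  -- Nothing: no single prim fits

-- determinism: diffing a state against the result of applying a prim
-- must reconstruct exactly that prim
prop_diffDeterministic :: (Diffable prim, Eq prim) => State prim -> prim -> Bool
prop_diffDeterministic ctx a = diffOf ctx (applyTo ctx a) == Just a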

I believe it should be possible to have this operation, and it would also allow significant improvements in the “record” user experience: darcs could try various differs on a pair of states and offer “better” patches than just hunks (e.g. replace, move, etc.).

The ability to implement coalesce in terms of other operations is important because even more than commute, the implementation of coalesce is O(n^2), with n being the number of different primitive patch types: it needs to take into account any pair of types. With the above approach, since coalesce always yields a single resulting patch, we can implement it as follows:

  • try to commute a with b; if this works, there’s nothing to coalesce
  • if it fails, take a context in which (a :> b) can be applied, ideally as small as possible (since diff is somewhat expensive)
  • try to diff the original and the post-a/b context, using both the “a” differ and the “b” differ: if either succeeds (i.e. produces a single primitive patch), then we have a winner; otherwise, we cannot coalesce this pair either (a rough sketch of this procedure follows the list)
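
In (hypothetical) Haskell-ish pseudocode, with simplified signatures and all the required machinery passed in as arguments, the procedure might read like this:

import Data.Maybe (catMaybes, listToMaybe)

-- A sketch only: real darcs patches carry context witnesses, which are
-- elided here, and the differs passed in would be the ones matching
-- a's and b's patch types.
coalesceViaDiff
  :: (prim -> prim -> Maybe (prim, prim))  -- commute
  -> (state -> prim -> state)              -- apply a single prim
  -> [state -> state -> Maybe prim]        -- candidate differs (e.g. hunk, move)
  -> state                                 -- a context in which a, then b, applies
  -> prim -> prim
  -> Maybe prim
coalesceViaDiff commute apply differs ctx a b
  | Just _ <- commute a b = Nothing        -- they commute: nothing to coalesce
  | otherwise =
      let ctx' = apply (apply ctx a) b     -- state after applying a, then b
      in  listToMaybe (catMaybes [ differ ctx ctx' | differ <- differs ])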

A UI digression

This is not GSoC related: you can skip this section if all you care about is GSoC…

Anyway, while I am on the topic of user interface: I think a substantial improvement in the way “semantic” patches work would come with the ability to infer those patches “automatically”, in the spirit of the diff operation above that is available for every patch type. However, it is hard to do a semantic “diff” on significantly divergent repository states, simply because further changes obscure the relationships of the entities in question. When the only thing that changed is a “mv”, it is easy to detect. But when you also edit the file, it is no longer possible to tell for sure whether this is a move or a new object.

What would help substantially, then, would be the ability to run diff much more often. This prompts a workflow change. Of course, we cannot ask people to commit every time they make a small change. However, we could ask them to “amend” an “in-progress” patch when they do. This would be especially useful if they can be coached into stashing their changes before and after things like applying “sed” to the whole codebase, moving files around, etc. This would basically be a supercharged version of “darcs mv”: it would say “I did something, you figure out the right way to represent it”. It adds the burden of having to call this both before and after the “contentious” operation, but it also adds significant benefits (IMHO).

The other thing that this kind of “open” patch (one that you keep adding things to until you are satisfied, and then you commit) would allow is progressive (time-sensitive) revert. This is something I would be completely sold on: if I kept telling the VCS to note down my changes reasonably often, I could get, in exchange, a whole-repository (but still granular) undo operation.

(It is not hard to imagine that you could also have more than one “open” patch at a time, sorting changes into different buckets for semi-related changes. The UI for that one would be more tricky though.)

The story of Apply

To get back to GSoC, though: for what it’s worth, the test-suite part of the work and a sketch of the V3 prim implementation are already in the screened branch. The changes to the Apply class are almost ready to go into screened as well; they are currently available as patch635.

The basic challenge with Apply was that V1 prims and V3 prims apply to different kinds of things. While V1 applies to a filesystem tree (basically your working copy), this is no longer the case with V3: the V3 state is modelled as a map from UUIDs to Objects. However, it would be extremely uncool to have two different “Apply” type classes for applying different kinds of patches: it would also mean that all higher layers of patch code would need to duplicate their apply implementations. Not cool.

Therefore, associated types to the rescue: I have added an ApplyState associated type to the Apply class. The Prim level patches then decide what they can be applied to (currently a Tree, from hashed-storage, for V1 and an ObjectMap, which still needs to be fleshed out in more detail, for V3). Any higher levels inherit their apply state from the prims. Cool.
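
Roughly, the idea looks like this (a simplified sketch, not the real darcs class: the actual ApplyState has a more involved kind, since hashed-storage’s Tree is parameterised by a monad, and apply carries context witnesses):

{-# LANGUAGE TypeFamilies, MultiParamTypeClasses, FlexibleContexts #-}

class Apply p where
  type ApplyState p            -- e.g. Tree for V1 prims, ObjectMap for V3 prims
  apply :: ApplyMonad m (ApplyState p) => p -> m ()

-- declared empty here only so the sketch stands alone; a fuller sketch of
-- what this class might contain follows below
class Monad m => ApplyMonad m state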

Of course, it’s not quite that simple, since we actually have to implement that cool apply method. This has traditionally (well, since I merged my new annotate code; not that traditional, I guess) been done through the ApplyMonad class. Now, ApplyMonad used to have operations like “create a file”, “create a directory” or “write this bytestring to this file”. That’s cool for V1 prims, but not very useful for V3 prims. So ApplyMonad needed to be generalised over multiple apply states. This forced a multi-parameter type class, since there are no functional dependencies to save us (and therefore associated types do not apply either): we expect some monads (e.g. IO) to be instances of both the ObjectMap- and the Tree-based ApplyMonad. In general, I didn’t want to limit this to special monads, although we might go for that option later if it turns out to be superior.

Anyway, the ApplyMonad class is a bit of a meta-class, since the actually useful set of operations differs between apply-state representations. But since the methods can carry their own constraints, I have added a couple of fully generic methods (get current state, set current state and the like), plus a set that only applies to ObjectMap and one that only applies to Tree. This doesn’t seem to pose significant trouble. Haskell doesn’t seem to have higher-order type classes, which would perhaps solve this slightly more elegantly. (I don’t even know if they are possible. Don’t crucify me if they aren’t.)
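
A rough sketch of the shape this takes (the method names and the placeholder state types are mine, not darcs’s; one way to let a method “carry its own constraint” is an equality constraint on the state parameter):

{-# LANGUAGE MultiParamTypeClasses, TypeFamilies #-}

import qualified Data.ByteString as B
import qualified Data.Map as M

newtype UUID      = UUID B.ByteString deriving (Eq, Ord)
newtype Tree      = Tree (M.Map FilePath B.ByteString)    -- stand-in for hashed-storage's Tree
newtype ObjectMap = ObjectMap (M.Map UUID B.ByteString)   -- stand-in for the V3 object map

class Monad m => ApplyMonad m state where
  -- fully generic methods, available for any state representation
  getCurrentState :: m state
  putCurrentState :: state -> m ()
  -- Tree-only operation: the equality constraint makes it unusable
  -- (but harmless) at other state types
  writeFilePS :: state ~ Tree => FilePath -> B.ByteString -> m ()
  -- ObjectMap-only operation
  writeObject :: state ~ ObjectMap => UUID -> B.ByteString -> m ()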

Anyway, long story short, we now have a single apply method in a single Apply class that works on both V1 and V3 prims, as witnessed by running the same set of tests (which sometimes do invoke apply) on both implementations.

The story of Commute

There isn’t much of a story behind this one so far. As I outlined above, there are changes coming here as well, but they are not required for V3 prims per se: they are only needed for the detached-storage optimisation. Commute as currently implemented works, at least as far as the tests are concerned. This goes hand in hand with some kind of StoreMonad and LoadMonad abstraction that will actually allow us to implement the detached storage. The CommuteMonad can then be a

class (LoadMonad m) => CommuteMonad m where
    commuteFail :: m a

kind of deal. A LoadMonad superclass constraint can (should) appear on ApplyMonad as well. For now, the instances don’t need to be too elaborate (they can simply fail to fetch anything at all, which will work just fine for V1 prims).
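
For illustration, the trivial instances might look like this (a self-contained sketch; fetchByHash is a hypothetical LoadMonad method standing in for the “fetch text for this hash” operation):

import qualified Data.ByteString as B

newtype Hash = Hash B.ByteString

class Monad m => LoadMonad m where
  fetchByHash :: Hash -> m B.ByteString

class LoadMonad m => CommuteMonad m where
  commuteFail :: m a

-- For V1 prims nothing ever needs to be fetched, so Maybe will do:
instance LoadMonad Maybe where
  fetchByHash _ = Nothing          -- no detached text is ever available

instance CommuteMonad Maybe where
  commuteFail = Nothing            -- commutation failure is plain Nothing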

The story of unsafeInterleaveIO

I don’t like unsafeInterleaveIO. At all. My last summer of code was, after a fashion, about removing a significant source of unsafe, ugly and outright dangerous lazy IO from darcs, and I believe it was a significant success. Now the LoadMonad / StoreMonad abstraction has the potential to rid us of another source of lazy IO in darcs: currently, the reading of patch content is unsafeInterleaveIO’d, and the rest of the code treats the repository as a kind-of-pure data structure built of patches. Of course, unsafeInterleaveIO is unsafe because it breaks the type system: since darcs uses it a lot, there is no telling which values are actually pure. Arguing about the runtime behaviour of lazy code is hard enough when it is actually pure; random IO thunks lurking inside pure values (kind of like the alien in Sigourney’s stomach) turn it into a nightmare.

This will involve a lot of code being lifted into a monad, fortunately a significantly restricted one. In practice it’ll be IO more often than not (although in the testsuite it’ll probably be the likes of Maybe, or StateT over Maybe). What matters is not that the code will execute in IO, but that it is statically well protected: it cannot access the IO monad directly and it has to be explicit about its side effects (like loading things from disk). Therefore, we give things types that actually reflect their impurity, without allowing them to spin out of control with side effects (like they could if they were simply in the IO monad). A static type system win.
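
A sketch of the kind of restricted monad I mean (names are hypothetical; the point is that the constructor stays unexported, so client code can load patch data but cannot perform arbitrary IO):

{-# LANGUAGE GeneralizedNewtypeDeriving #-}

import qualified Data.ByteString as B

-- IO underneath, but the PatchLoad constructor is not exported, so the only
-- effects available to callers are the ones we choose to expose below.
newtype PatchLoad a = PatchLoad (IO a)
  deriving (Functor, Applicative, Monad)

-- the one sanctioned side effect: read (hypothetically hashed) patch data
loadPatchFile :: FilePath -> PatchLoad B.ByteString
loadPatchFile path = PatchLoad (B.readFile path)

-- only the top level gets to run it inside real IO
runPatchLoad :: PatchLoad a -> IO a
runPatchLoad (PatchLoad io) = io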

Conclusion

Ok, I guess this is more than enough for today. I’ll try to keep you folks more informed about the progress in the second half of the endeavour. On the other hand, I am a bit worried that these posts are more useful as a note-to-self resource than for general reading by others. Well, let’s hope, dear reader, that you found at least something of this post useful and/or interesting.

