Using Git as a versioned data store in Python

Git has sometimes been described as a versioning file-system which happens to support the underlying notions of version control. And while most people do simply use Git as a version control system, it remains true that it can be used for other tasks as well.

For example, if you ever need to store mutating data in a series of snapshots, Git may be just what you need. It’s fast, efficient, and offers a large array of command-line tools for examining and mutating the resulting data store.

To support this kind of usage – with the eventual goal of maintaining issue tracking data in a Git repository – I’ve created a Python class that wraps Git as a basic shelve object.

Here is how you normally use the standard shelve module:

import shelve

data = shelve.open('data.db')

# data.db may or may not have existed on disk before now.  If not,
# we're manipulating an empty dictionary.  If so, we can examine or
# modify the previous run's state data.  In both cases, the database
# is manipulated like a standard Python dictionary.

key = 'greeting'   # any string key will do
data[key] = "Hello, world!"
data.sync()        # Write out changes to the dictionary

del data[key]
data.close()       # Close and clean up, sync'ing only if necessary

This provides the simplest kind of database, without any query language or notion of whether previous state did or did not exist. Both of those are services you’d have to layer on top of the shelve object if you wanted them.
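As a rough sketch of what layering the "did previous state exist?" service on top of shelve could look like, the helper below records a sentinel key the first time a shelf is opened. The names `open_tracked` and `MARKER` are hypothetical, invented for this illustration; they are not part of the shelve module.

```python
import shelve

# Hypothetical sentinel key used to remember that the shelf was
# initialized by an earlier run.  Not part of the shelve module.
MARKER = "__initialized__"


def open_tracked(path):
    """Open a shelf and report whether an earlier run had initialized it.

    Returns a (shelf, existed_before) tuple.
    """
    db = shelve.open(path)
    existed = MARKER in db
    if not existed:
        db[MARKER] = True   # remember that this shelf has now been seen
        db.sync()
    return db, existed
```

A caller would use `db, existed = open_tracked('data.db')` and branch on `existed` to decide whether to seed initial values.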

Now consider gitshelve. Whereas the Python shelve module stores your data by pickling all of the dictionary values, I pass whatever data you place in the dictionary straight on to Git’s standard input. In the default mode, this means you work strictly with string data:

import gitshelve

data = gitshelve.open(repository = '/tmp/data.git')

key = 'greeting.txt'         # any string key will do
data[key] = "Hello, world!"
data.sync()                  # Repository is created if it doesn't exist

del data[key]
data.close()

The interface is identical, but with the Git version you can now examine the resulting repository yourself, using regular Git commands:

$ GIT_DIR=/tmp/data.git git log
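To get a feel for what happens to each value once it reaches Git, here is a small sketch of how Git derives the ID of a blob from its contents. It assumes the repository uses Git's default SHA-1 object format and that each dictionary value ends up as one blob; the function name `git_blob_id` is made up for this illustration and is not part of gitshelve.

```python
import hashlib


def git_blob_id(data):
    """Return the object ID Git assigns to a blob holding these bytes.

    Git hashes a short header ("blob <size>" followed by a NUL byte)
    and then the raw contents.  Assumes the SHA-1 object format.
    """
    header = ("blob %d\0" % len(data)).encode("ascii")
    return hashlib.sha1(header + data).hexdigest()
```

The result should match what `git hash-object --stdin` prints when fed the same bytes, which is a handy way to locate a stored value by hand.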

By default, the commits have no associated comment text, and the sync method doesn’t accept parameters. If you wish to add transaction notes, use the commit method instead:

data.commit("This is a comment")

You can store data this way either in a separate repository, or in named branches within any repository. If the repository argument is not given, the named branch within the current Git repository is used. An exception will be raised, however, if you do this and there is no Git repository related to the current directory.

# I'm expecting to use the 'data' branch of the current repository, but
# I ran the script in a directory unknown to Git!
data = gitshelve.open(branch = 'data')

# It appears to work, because no Git commands are run until the last
# possible moment
data['foo/bar/hello.txt'] = "Hello!"

# This raises an exception, because there is no current repository.  To fix
# it, either run "git init", or use a specific 'repository' argument above.
data.commit("I just said hello")
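If you would rather find out early that there is no repository, a script can check before it opens the shelve. The sketch below walks upward from a starting directory looking for a `.git` entry. It is a simplification: it ignores the GIT_DIR environment variable and bare repositories, and `find_git_root` is a hypothetical helper, not something gitshelve provides.

```python
import os


def find_git_root(start="."):
    """Return the nearest directory at or above `start` that contains
    a .git entry, or None if no such directory exists."""
    path = os.path.abspath(start)
    while True:
        if os.path.exists(os.path.join(path, ".git")):
            return path
        parent = os.path.dirname(path)
        if parent == path:      # reached the filesystem root
            return None
        path = parent
```

Git itself remains the authority here; running `git rev-parse --git-dir` answers the same question while honoring Git's full lookup rules.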

The really nice thing about using Git this way is that you get its best features for free: full history, diffs between snapshots, branching, and merging.

Adding non-text values

If you have a need to store non-textual values, you’ll have to let gitshelve know how to deal with them. I don’t do any such handling by default, because of the big chance of doing the wrong thing, and having you not find out about it until it’s much too late. Just pickling data like shelve does isn’t very smart, for example, because it will wreak havoc on Git’s merge algorithms should you ever need to incorporate new data from another source.

So, let’s see how to add a custom data translator. First, you need to create a subclass of gitbook, which is the wrapper used to interface with the blobs in the Git repository. There are only two methods you need to override:

class my_gitbook(gitshelve.gitbook):
    def serialize_data(self, data):
        return object_to_string(data)

    def deserialize_data(self, data):
        return object_from_string(data)

Now you must define object_to_string and object_from_string, which should examine the types of the objects passed and turn them into a merge-friendly string as appropriate. Certain forms of XML work well for this job, as do ini-style configuration files in some cases. It’s up to you and what works best for your usage.
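As one possible starting point, here is a minimal sketch of the two helpers using JSON rather than XML. This is an illustration under assumptions, not the format used by git-issues: it only handles values that the json module can encode (dicts, lists, strings, numbers, booleans, None). Sorting the keys and putting each on its own line keeps the output stable, so Git's line-based diff and merge have something sensible to work with.

```python
import json


def object_to_string(obj):
    """Serialize obj to a stable, line-oriented JSON string."""
    # sort_keys gives the same text regardless of dict ordering;
    # indent puts each entry on its own line for line-based merges.
    return json.dumps(obj, sort_keys=True, indent=2, ensure_ascii=False) + "\n"


def object_from_string(text):
    """Inverse of object_to_string."""
    return json.loads(text)
```

With this in place, two branches that each change a different field of the same record will usually touch different lines, which is what lets Git merge them without conflict.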

Once you have this new class type, you must pass it to the gitshelve.open function:

data = gitshelve.open(repository = '/tmp/foo', book_type = my_gitbook)

Making things even faster

Every time you open a gitshelve, it must walk through the associated branch and determine its contents in order to build the key/value relationships in the dictionary. If you find that this ever gets slow, what you can do is just pickle the gitshelve! The only caveat is that you must take care to delete it if the HEAD you created it from is different from the current HEAD. Here’s an example:

import gitshelve
import cPickle
import os

data = None
if os.path.isfile('data.cache'):
    fd = open('data.cache', 'rb')
    data = cPickle.load(fd)
    fd.close()

    # I'm using an arbitrary file name here, __HEAD__
    if data['__HEAD__'] != data.current_head():
        data = None       # Out of date, we can't use it

if data is None:
    data = gitshelve.open(branch = 'data')
    data['__HEAD__'] = data.current_head()

# ... for data sets with enormous quantities of tiny files, this
#     could really speed things up ...

# Before exiting, pickle the gitshelve so the next run can reuse it
fd = open('data.cache', 'wb')
cPickle.dump(data, fd)
fd.close()

Where can you get it?

The gitshelve module is being maintained as part of the git-issues project, which is yet another attempt to bring distributed bug tracking to Git. Actually, I intend to support multiple repositories as data backends, but right now Git is my initial focus. You can clone the project and test it out like this:

git clone git://github.com/jwiegley/git-issues.git
cd git-issues
python t_gitshelve.py

If you see “OK” at the end of the unit tests, you’re good to go! There isn’t much documentation on gitshelve.py itself right now, beyond this blog entry, but then again the shelve-like interface is simple enough that you really shouldn’t need much more.

Or if you prefer, you can just browse the project at the GitHub project page.

18 Comments

It would be interesting to make it a backend for Shove.

For example, if you ever need to store mutating data in a series of snapshots, Git may be just what you need.

I’ve thought about this… do you think Git could be used to support real-time collaborative editing (like Gobby) from within Emacs or other editors?

It could be used for collaborative editing by automatically taking a snapshot every time someone saves. The only time it would seem to run slowly is if you exceed the “loose object” threshold, at which point Git will auto-garbage collect your repository. You can turn this off by running “git config --global gc.auto 0” in the repo before you start things.

With that said, in Emacs you would add a function to `after-save-hook’ that calls out to Git and runs “git-commit” and auto-pushes. You would have to handle merge conflicts on pull if several of you push changes at the same time.

Interesting. I’ve envisioned it as something that would run on an idle hook, but on save might be easier to implement.

The advantage of the more immediate version is that you could tell git to ignore merge conflicts and always take the newer version since it would be more visible to the user.

Ya know, this is much easier to do in bzr – its complete functionality is exposed via a built-in Python API.

I once wrote a decorator (as part of a certificate management system) that effectively did a “bzr commit” with the method name and its arguments as the commit message if the wrapped operation succeeded, and did a rollback on the filesystem if an exception was passing through. SCM integration with Python is indeed a shiny thing at times.

If you are using Python why not use something better?

http://piranha.org.ua/blog/2008/05/19/hgshelve/

Why store the issues in a separate branch? Storing them alongside the normal source files has the benefit that the issues’ status is consistent with the source. So if I fixed an issue in ‘newver’ branch (some changes to source files and change the issue’s status), but not yet merged it to HEAD, then in HEAD the issue is still open. After the merge the issue will be closed.

Very high level design (we can formalize more via email if you’d like):

Each issue has a list of properties (e.g., assignee, creator, status, comments). They can be simple or lists. Each can also be either ‘global’ or ‘branched’. ‘global’ properties are stored in a dedicated branch (where the issue’s unique id is also stored). This branch is not used for normal development. ‘branched’ properties are stored in the same branches used for development, probably in a subdir under the top dir (something like .issues).

This means that merging of branches only affects some properties and that some properties can be the same regardless of what branch you’re in (so creator, comments etc. are global, but status is branched)

When viewing an issue, it is composed from the branched and global properties.

I think that with careful granularity of branched properties (put each per file, not in one file, or make one file, but sorted), merging of two branches should not create conflicts, unless when called for (e.g., someone changed the priority of an issue in two branches)

Hope I am clear enough, Ittay

Ittay, I’m also interested in the approach you’re presenting. Please let me know if you get something going.

Would it be a good idea to use Python’s pickle module for serialising and restoring objects in object_to_string and object_from_string? Since it is native to Python it should work quite well unless it does not satisfy the criteria of being merge-friendly.

Sure, you can do that, you just couldn’t use git diff or git log -p anymore, since the contained data would be binary. I left it open the way that I did so that others could use XML, JSON, Pickle, etc.

Not trying to say that you should have used pickle. It is good that users have the option to choose their own serialisation.

Just wondering why pickle as a possible serialisation was not mentioned and if there are some reasons to avoid using it in gitshelve.

Oh, in that case there was no reason at all that I avoided pickle; it just so happened that I needed XML and so made the design more abstract to accommodate that from the beginning.

Hi,

I wrote something similar for Ruby. Actually it was part of my blog engine and some day I realized, that your library is basically the same in Python.

http://matthias-georgi.de/2008/12/git-store-using-git-as-versioned-data-store-in-ruby

Hi Matthias, but your link comes up 404?

About this Entry

This page contains a single entry by John Wiegley published on May 14, 2008 9:00 PM.
