Wednesday, August 20, 2008

File Management

I need to access my personal files from different locations: at home, at the office, or when travelling. I want to be able to move between a number of trusted computers, rather than being tied to a specific laptop, as I like to cycle to and from work. I also don't want the hassle of running my own server at home (I have done this in the past).

So my first step in solving this is to store all my files in a Distributed Version Control System. I get all the benefits of a common centralised VCS, such as version history and version management between multiple machines, as well as being able to work normally when I don't have an internet connection (e.g. when travelling). I have chosen Git, although Mercurial would probably be a fine choice too.

The second step is having a location for a master repository that is accessible over the internet. I could have purchased a hosted Linux virtual machine, but I didn't want to deal with setting it up, security, software upgrades, etc. Git can synchronise repositories located at different points on the same file system, so I thought I would try a locally mounted, encrypted virtual file system over Amazon S3. I chose JungleDisk for this purpose.

As it is only me using these Git repositories, only one machine writes to the master at any one time, so I don't have to worry about concurrency issues. Also, whenever I clone a repository from the master, I use the --no-hardlinks option, although I am not sure if that is necessary.
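The clone step could be sketched roughly as follows in Python. The mount point /mnt/jungledisk/git and the repository name used in the usage note are made up for illustration, and the helper names are hypothetical; only `git clone --no-hardlinks` itself comes from the setup described above. According to the Git documentation, the option forces a clone from a repository on a local file system to copy the object files instead of hard-linking them.

```python
import subprocess

# Hypothetical mount point of the JungleDisk volume; adjust to the real one.
MASTER_ROOT = "/mnt/jungledisk/git"


def clone_cmd(name, dest, root=MASTER_ROOT):
    """Build the git command that clones the master repository `name`
    from the mounted volume into the local directory `dest`."""
    master = f"{root}/{name}.git"
    # --no-hardlinks asks Git to copy object files rather than hard-link them.
    return ["git", "clone", "--no-hardlinks", master, dest]


def run(cmd):
    """Run a command and raise CalledProcessError if it fails."""
    subprocess.run(cmd, check=True)
```

Calling `run(clone_cmd("documents", "/home/me/documents"))` would then perform the clone, assuming Git is installed and the volume is mounted at that path.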

In principle the ideas have worked out pretty well. I have run into some issues though. From minor to major:

  • In the last three months, S3 has been unavailable on two occasions when I tried to access it.

  • Sometimes I have had errors pulling (synchronising) from the master. Recreating the local repository by cloning it again from the master has solved these issues. This may be related to the next issue.

  • I have had a case where I don't get any errors pulling from the master, but I don't get the latest commits pushed from another machine either. This one has been a real pain. In the process of getting everything back to a stable state, I updated to Git 1.6.0 and JungleDisk 2.10a, deleted my local JungleDisk caches, and reduced the cache size to the minimum (I would have liked to turn caching off altogether). I suspect the JungleDisk caching was the issue, but that is only a guess. I will see how things go over the next few weeks.
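One way to notice the last problem sooner would be to check whether the master, as seen through the mount, actually shows the commit that was pushed from the other machine. The following is only a sketch under assumptions: it relies on noting the full commit hash on the pushing machine (for example with `git rev-parse HEAD`) and on the branch being called master; the function names are hypothetical. It uses `git ls-remote`, which prints one "hash, tab, ref name" line per ref.

```python
import subprocess


def parse_ls_remote(output):
    """Turn `git ls-remote` output (one 'hash<TAB>ref' pair per line)
    into a dictionary mapping ref name to commit hash."""
    refs = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        sha, ref = line.split(None, 1)
        refs[ref.strip()] = sha
    return refs


def master_shows_commit(ls_remote_output, expected_sha, branch="master"):
    """Return True if the branch head listed by the master equals the
    full hash noted on the machine that pushed."""
    refs = parse_ls_remote(ls_remote_output)
    return refs.get(f"refs/heads/{branch}") == expected_sha


def read_master_refs(master_path):
    """Ask Git for the refs of the repository at master_path (the mounted volume)."""
    result = subprocess.run(
        ["git", "ls-remote", master_path],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

If `master_shows_commit(read_master_refs(path), noted_hash)` returns False right after a push from another machine, the mounted file system is probably serving stale data, and pulling would silently miss the new commits.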

I no longer need backups to protect against accidental file deletion, as the VCS takes care of that (I am not using any of the Git features that modify history). I also keep a subset of the machines synchronised on a daily basis, so I don't need backups to protect against hardware failure, loss or theft either.
