Tuesday, July 8, 2008

Music Management

After going to all the trouble of ripping and encoding my CDs to a lossless format, I want to:

  • Ensure integrity of the music library, i.e. at any point be able to validate that all the files exist, their contents haven't changed and that there are no extra files.

  • Have a recovery strategy should there be a problem with the files.

Ideally, I would satisfy these requirements by placing all the music in a Distributed VCS and storing a master copy somewhere like S3. Unfortunately there are a couple of problems:
  • I tried out Git, but after the initial commit of a music file, the repository storage on the filesystem took up twice the size of the music file. Furthermore, changing metadata, such as fixing a spelling mistake in a track name, and then committing grows the repository by the full size of the file again. I assume this is because the files are binary and already compressed. I didn't try out Mercurial, but I expect it would be the same.

  • The music files are already large, even without the extra overhead of the previous point, and the data transfer costs here in Australia are just too high.

My current solution:
  • Store the music library on a removable drive on the Mac at home.

  • Keep a copy of the music library on my computer at work, updated periodically either by taking in the removable drive and using rsync or, when physical space is at a premium (such as when cycling), by copying newer music onto a USB drive.

  • Put checksums of the files in a Git repository stored on both machines. I can then verify the integrity of a music library at any time. Currently I use md5deep because it can recursively process a directory tree and is available for both Linux and Mac OS X. The default md5 program on the Mac does not seem to have the same feature set as md5sum on Linux.

  • I also store FLAC fingerprints in the Git repository. FLAC files store a checksum of the uncompressed audio in their metadata, and various tools, such as xAct on the Mac, can verify a file against it. I am not sure how useful storing the fingerprints is, but I can think of a few unlikely situations where it might be helpful, plus they are small and easy to generate anyway.
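
As a rough illustration of the fingerprint step, here is one way it could be done from the command line with metaflac, the metadata tool that ships with the FLAC reference tools, rather than with xAct. This is only a sketch: record_flac_fingerprints is a hypothetical helper name, and the output file name is whatever you choose to pass in.

```shell
# Hypothetical helper: record the audio MD5 stored in every FLAC file.
# Usage: record_flac_fingerprints LIBRARY_DIR OUTPUT_FILE
record_flac_fingerprints() {
    library=$1
    out=$2
    (
        # Work from inside the library so the recorded paths are relative
        # and therefore identical on both machines.
        cd "$library" || exit 1
        # --show-md5sum prints the MD5 of the decoded audio kept in the
        # STREAMINFO block; --with-filename prefixes each line with the path.
        find . -type f -name '*.flac' \
            -exec metaflac --with-filename --show-md5sum {} + | sort
    ) > "$out"
}
```

A file can later be checked against its own embedded fingerprint with `flac -t`, which decodes the audio and compares the result with the stored MD5.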

To verify a music library, I do:

$ cd "$MUSIC_LIBRARY"
$ md5deep -rl * | sort | diff "$GIT_REPO/md5deep.txt" -


where $MUSIC_LIBRARY and $GIT_REPO represent appropriate file paths.

I originally tried the matching feature of md5deep instead:

$ cd "$MUSIC_LIBRARY"
$ md5deep -rX "$GIT_REPO/md5deep.txt" *


However, this does not catch the case where a file has been deleted in the music library but is still present in the Git checksum file.
