I recently went through the exercise of restoring a medium-sized Subversion repository from an older backup and wanted to share my experience with everyone. First, the problem:
After you restore the older backup, if any work was performed between the last backup and when the repository crashed, the repository will be “older” than what developers have on their boxes. Here is an example:
- Let’s say you have a backup from Monday evening
- The repository crashed sometime Tuesday
- There was work done during the day Tuesday, before the crash
Now, let’s say there is a project A that looks like this:
- The last change to the project was at revision 100 just before the repository crashed
- The last change to the project from the backup was 80
- Frank’s computer contains a checkout of the project at revision 85
- Mary’s computer contains a checkout of the project at revision 100
This means that both Frank’s and Mary’s computers contain newer code than the repository, but neither is guaranteed to hold the latest code. Mary’s checkout is the newest, yet it might not be a complete snapshot of revision 100 as it existed before the repository was corrupted. The reason is that Mary might have committed files to the repository without performing an “svn update” first.
Subversion is like CVS in that each file in a checkout carries its own revision number. So, you might have a local checkout that contains one file at revision 100 and another at revision 90. If that second file was changed in revision 93, the checkout is missing the updated version from that commit.
Okay, now onto the fix:
Each developer’s computer must be analyzed before anything new is put into the repository. You must have a complete picture of the entire company, otherwise you might miss some changes. These changes can be merged in by hand from each developer’s machine, but that is an error-prone and lengthy process. It is usually better to script out as much as possible.
In order to determine the local “revision” of a project on a developer’s computer, you will need to look at each file in the checkout. You can run ‘svn stat -v’ on the checkout to see the revision number of each file (the verbose flag is what makes the revisions show up). Write a script to output a file like this for each locally checked-out project on every developer’s machine:
The first part is the file and the second part is the revision of that file.
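As a rough sketch of such a script, the Python below parses the XML form of the status output (‘svn status --verbose --xml’) into a file-to-revision mapping. The function names, the sample XML, and the choice to record the last-committed revision of each file (rather than the working revision) are assumptions made for illustration, and the sample is a hand-written approximation of the XML layout, not real output.

```python
import subprocess
import xml.etree.ElementTree as ET


def run_svn_status(checkout_dir):
    """Run svn against a checkout and return the raw XML (needs svn on PATH)."""
    result = subprocess.run(
        ["svn", "status", "--verbose", "--xml", checkout_dir],
        capture_output=True, check=True,
    )
    return result.stdout


def parse_status_xml(xml_text):
    """Map each versioned path to (last_committed_rev, locally_modified)."""
    report = {}
    for entry in ET.fromstring(xml_text).iter("entry"):
        wc = entry.find("wc-status")
        commit = wc.find("commit") if wc is not None else None
        if commit is None:
            continue  # unversioned or never committed: nothing to compare
        modified = wc.get("item") != "normal" or wc.get("props") == "modified"
        report[entry.get("path")] = (int(commit.get("revision")), modified)
    return report


# Hypothetical sample resembling what Mary's checkout might produce.
SAMPLE_STATUS_XML = """<status>
<target path=".">
<entry path="build.xml">
<wc-status item="normal" props="none" revision="16410">
<commit revision="16001"><author>frank</author></commit>
</wc-status>
</entry>
<entry path="foo.java">
<wc-status item="normal" props="none" revision="16410">
<commit revision="16410"><author>mary</author></commit>
</wc-status>
</entry>
<entry path="logging.properties">
<wc-status item="modified" props="none" revision="16390">
<commit revision="16390"><author>frank</author></commit>
</wc-status>
</entry>
<entry path="notes.txt">
<wc-status item="unversioned" props="none"></wc-status>
</entry>
</target>
</status>"""

if __name__ == "__main__":
    for path, (rev, modified) in sorted(parse_status_xml(SAMPLE_STATUS_XML).items()):
        print(path, rev, "modified" if modified else "")
```

Keeping the locally-modified flag next to the revision makes it easy to skip uncommitted edits in the later comparison.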
Next, once you have the complete list of revisions for all projects on all developers’ computers in the entire company, you can compare each file with its current revision in the restored repository to determine whether the developer has a later version of the file than the repository. This comparison should ignore all files that the developer has modified locally but not committed.
This comparison will look like this:
Mary has a later version of foo.java!
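The comparison itself can be sketched in a few lines of Python. Everything here is illustrative: the function name, the dictionary shapes, and the revision numbers are assumptions, and files that are missing from the restored repository are treated as revision 0 so that they show up as newer.

```python
def files_newer_than_repo(dev_report, locally_modified, repo_report):
    """Return {path: (developer_rev, repo_rev)} for files the developer holds
    at a later revision than the restored repository.

    dev_report       -- {path: revision} from one developer's checkout
    locally_modified -- set of paths with uncommitted edits (ignored here)
    repo_report      -- {path: revision} from the restored repository
    """
    newer = {}
    for path, dev_rev in dev_report.items():
        if path in locally_modified:
            continue  # uncommitted work is handled separately
        repo_rev = repo_report.get(path, 0)  # 0: file unknown to the backup
        if dev_rev > repo_rev:
            newer[path] = (dev_rev, repo_rev)
    return newer


if __name__ == "__main__":
    # Hypothetical numbers based on the project A example.
    repo = {"foo.java": 80, "build.xml": 60}
    mary = {"foo.java": 100, "build.xml": 60, "bar.java": 95}
    for path, (dev_rev, repo_rev) in files_newer_than_repo(mary, set(), repo).items():
        print(f"Mary has a later version of {path}! ({dev_rev} > {repo_rev})")
```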
You should script all of these comparisons out. If a developer doesn’t have later revisions than the repository or any locally modified files, they can safely take these steps:
- Make a backup of the local checkout
- Delete the local checkout
- Re-checkout the project
- Don’t do anything until the restore is complete
The next step is to make a list of the revision in which each file in the project was last changed in the restored repository. This report will look like this:
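One possible way to build this repository-side report is to parse the XML form of a recursive listing (‘svn list --recursive --xml’ against the project URL). The sketch below is an assumption about how one might do it, not the script used in the original restore; the function name, the sample XML, and the revision numbers in it are made up for illustration.

```python
import xml.etree.ElementTree as ET


def parse_list_xml(xml_text):
    """Map each file path in a listing to the revision it last changed in."""
    report = {}
    for entry in ET.fromstring(xml_text).iter("entry"):
        if entry.get("kind") != "file":
            continue  # directories carry revisions too, but only files matter
        commit = entry.find("commit")
        if commit is None:
            continue
        report[entry.findtext("name")] = int(commit.get("revision"))
    return report


# Hand-written approximation of the listing XML; the URL is a placeholder.
SAMPLE_LIST_XML = """<lists>
<list path="http://svn.example.com/repos/project-a/trunk">
<entry kind="dir">
<name>conf</name>
<commit revision="16390"><author>frank</author></commit>
</entry>
<entry kind="file">
<name>build.xml</name>
<size>2048</size>
<commit revision="16001"><author>frank</author></commit>
</entry>
<entry kind="file">
<name>conf/logging.properties</name>
<size>512</size>
<commit revision="16390"><author>frank</author></commit>
</entry>
</list>
</lists>"""

if __name__ == "__main__":
    for path, rev in sorted(parse_list_xml(SAMPLE_LIST_XML).items()):
        print(path, rev)
```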
Next, for each project, collect all of the revision reports from the previous step into a single location. These reports look like this:
Combine the developer reports into a global report that gives you the revision number for each file on every developer’s computer. Next, use this global developer report to determine whose computer contains the latest version of each file in the project. Based on the examples above, you can see that build.xml didn’t change recently enough to matter, since the version in the repository is the same as the version on Frank’s and Mary’s computers. However, Frank made the most recent commit to logging.properties in revision 16405, and Mary made the most recent commit to foo.java in revision 16410. Both of these files are more current than the repository and therefore need to be re-committed.
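The selection described above can be sketched as a small Python function. The revisions for Frank’s and Mary’s commits come from the example; the repository-side revisions (16001 and 16390) and the function name are hypothetical values chosen only to make the sketch runnable.

```python
def latest_holders(repo_report, dev_reports):
    """Return {path: (developer, revision)} for every file that some developer
    holds at a later revision than the restored repository.

    repo_report -- {path: revision} from the restored repository
    dev_reports -- {developer: {path: revision}} collected from each machine
    """
    winners = {}
    for developer, report in dev_reports.items():
        for path, rev in report.items():
            if path in winners:
                best_so_far = winners[path][1]
            else:
                best_so_far = repo_report.get(path, 0)
            if rev > best_so_far:
                winners[path] = (developer, rev)
    return winners


if __name__ == "__main__":
    repo = {"build.xml": 16001, "logging.properties": 16390, "foo.java": 16390}
    developers = {
        "frank": {"build.xml": 16001, "logging.properties": 16405, "foo.java": 16390},
        "mary": {"build.xml": 16001, "logging.properties": 16390, "foo.java": 16410},
    }
    for path, (who, rev) in sorted(latest_holders(repo, developers).items()):
        print(f"{path}: take revision {rev} from {who}")
```

Files that nobody holds at a later revision, such as build.xml here, simply drop out of the result.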
Finally, set up a staging area that contains a copy of every checked-out project from every developer’s computer that has a later revision than the repository (as identified in the first comparison). This will be pretty large, but it is necessary. Based on the global report, copy the latest version of each file over to a clean checkout of the project from the restored repository. Once you have all of the changes copied over for a single project, commit all of those files for that project back to trunk.
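A minimal sketch of the copy step follows, assuming the staging area is laid out as one directory per developer (staging/frank/..., staging/mary/...) and that the global report has been reduced to a mapping of file to (developer, revision). The layout and the function name are assumptions for illustration. The script only copies files; reviewing the result and committing is left as a manual step.

```python
import shutil
from pathlib import Path


def stage_latest_files(winners, staging_root, clean_checkout):
    """Copy each winning file from its developer's staged copy into the
    clean checkout, returning the list of relative paths that were copied.

    winners        -- {relative_path: (developer, revision)}
    staging_root   -- directory holding one sub-directory per developer
    clean_checkout -- fresh checkout of the project from the restored repo
    """
    copied = []
    for rel_path, (developer, _revision) in sorted(winners.items()):
        source = Path(staging_root) / developer / rel_path
        target = Path(clean_checkout) / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)  # overwrite the older repository version
        copied.append(rel_path)
    return copied
```

After the copy, running ‘svn status’ in the clean checkout shows what is about to be committed. Files that did not exist in the restored repository will appear as unversioned and need an ‘svn add’ first.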
You should now have a fully restored repository based on the files from the various developers’ computers.