Could use some guidance on troubleshooting a failed SVN-to-Git migration with svn-all-fast-export
Nicholas Williams
nicholas.williams at zaxiom.com
Fri Jul 1 05:01:30 BST 2022
tl;dr of the below: We’re migrating a large, single Subversion repository into 857 Git repositories, and it went fine with some warnings [3], but some branches of some repositories are missing some individual files, and investigating that revealed some commits have a blank diff / no changed files even though revision metadata in the commit message points to revisions containing diffs and changed files.
*Background*
We have a large (~30GB) single Subversion repository with about 78,000 revisions. It is not OSS/publicly available. Its contents started out as a monolith in 2003, and over the years 857 “components” were extracted from the monolith and put in their own locations in the same Subversion repository, with their own version/branch tracking. We’re beyond ready to make the switch to Git (each of the 857 components will get its own Git repository) and plan on using svn2git/svn-all-fast-export [1] (we did initial testing with three different tools and this tool was the only one that was even close to being up to the task).
One of the first things we did was spend about two weeks writing and running various Python scripts to help us fully understand the revision history of the repository, primarily using `svn log --stop-on-copy`. With this, we were able to figure out what mapping rules we needed to write.
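As a rough illustration of that kind of history analysis (a sketch only, not the actual scripts used here), the XML output of `svn log` can be parsed to see where a path was copied from. The function names and the URL argument are hypothetical; the `copyfrom-path` and `copyfrom-rev` attributes are what `svn log --xml -v` emits for copied paths.

```python
import subprocess
import xml.etree.ElementTree as ET


def parse_copies(log_xml):
    """Return (revision, copied_to, copied_from, from_rev) tuples
    found in the output of `svn log --xml -v`."""
    copies = []
    for entry in ET.fromstring(log_xml).iter("logentry"):
        rev = int(entry.get("revision"))
        for path in entry.iter("path"):
            src = path.get("copyfrom-path")
            if src is not None:
                copies.append(
                    (rev, path.text, src, int(path.get("copyfrom-rev")))
                )
    return copies


def branch_origin(url):
    """Ask svn where the branch at `url` (hypothetical) was copied from.

    --stop-on-copy ends the log at the revision that created the branch,
    so the copy records returned here describe the branch point.
    """
    out = subprocess.run(
        ["svn", "log", "--xml", "-v", "--stop-on-copy", url],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_copies(out)
```

Calling `branch_origin` once per component branch would give the raw material for deciding which source paths each mapping rule has to match.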
We spent another week writing and proofreading about 7,000 lines of rules, and determined that, because some of the components were extracted from other components, we actually had to run the migration in four passes: one with the rules for 854 of the components/repositories, and then three more with the rules for the three oldest components, from which the other components trace their lineage. This is to prevent rules for some of the 854 repositories from short-circuiting the rules for those other three repositories. And, by the way, this is the full command we are running:
/path/to/svn-all-fast-export --debug-rules --rules /path/to/rules.config --identity-map /path/to/authors.config --identity-domain "example.org" --add-metadata --stats /path/to/svn_repo
We then spent the last week debugging the rules with dry-run after dry-run and then wet-run after wet-run until we had a fully-completed “dress rehearsal” of sorts, giving us all of the migrated repositories to push to Bitbucket so that we could begin analyzing the results and look for any problems before we schedule the real/final migration. And boy did we find problems…
The first thing we did was perform recursive diffs on some (not all) strategic repositories and their more modern branches. We still have more to inspect, but what the first four showed us set off some alarm bells. The first indicator of problems was that some individual files were missing from some branches on these repositories, and we’re having a really hard time figuring out how.
So, here’s the list of (obscured) filenames in question from one example repository:
README.txt
Foo/Somefile
Foo/Bar/Somefile
Foo/Baz/Somefile
Foo/Baz/Qux/Somefile
In Subversion, those Somefiles exist in branches 6.0.5, 6.0.6, and 6.0.7, but they do not exist in branches 6.0.8, 6.0.9, and 6.1.0 (they were deleted in a revision in 6.0.8). In Git, those Somefiles do not exist in any of those six branches.
In Subversion, README.txt exists in all six branches. In Git, README.txt exists only in 6.1.0 and does not exist in any of the other five branches.
Those are the only differences. All other files exist and all file contents match.
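For reference, a recursive comparison of this kind can be sketched as below. This is an illustrative sketch, not the exact procedure used here: it assumes one directory holding an SVN checkout of a branch and another holding a Git checkout of the same branch, and the function names are hypothetical.

```python
import filecmp
import os


def list_files(root, ignore=(".git", ".svn")):
    """Return the set of file paths under root, relative to root,
    skipping version-control metadata directories."""
    found = set()
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune ignored directories in place so os.walk does not descend.
        dirnames[:] = [d for d in dirnames if d not in ignore]
        for name in filenames:
            found.add(os.path.relpath(os.path.join(dirpath, name), root))
    return found


def compare_trees(svn_root, git_root):
    """Return (only_in_svn, only_in_git, content_differs) as sorted lists."""
    svn_files = list_files(svn_root)
    git_files = list_files(git_root)
    differs = sorted(
        p for p in svn_files & git_files
        if not filecmp.cmp(
            os.path.join(svn_root, p), os.path.join(git_root, p), shallow=False
        )
    )
    return sorted(svn_files - git_files), sorted(git_files - svn_files), differs
```

With the example repository above, the first list would be expected to contain README.txt and the four Somefiles for the affected branches, and the other two lists would be empty.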
In trying to track down the issue, the first thing we discovered is that there was a revision about 11 years ago in branch 6.0.2 that did a *replacement* of README.txt. It wasn’t an M (modification), and it wasn’t a D (delete) in one revision followed by an A (add) in another revision. It was an R (replacement) in a single revision. So looking at even older branches in Git, README.txt exists in 6.0.1 through 6.0.4 and disappears in 6.0.5 only to re-appear in 6.1.0. But that file has never been deleted in Subversion in any of those branches, yet it’s mysteriously missing in those branches on Git. We tried searching Git history for the deleting commit using every single technique suggested on this StackOverflow question [2], but none of them gave us any results.
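Since a replacement is recorded as a single R action, other revisions of the same kind can be listed from `svn log --xml -v` output. The following is a minimal sketch under that assumption; `find_replacements` is a hypothetical helper, and whether R actions are actually related to the missing files is not established.

```python
import xml.etree.ElementTree as ET


def find_replacements(log_xml):
    """Return (revision, path, copyfrom_path) for every path with action "R"
    in `svn log --xml -v` output. copyfrom_path is None when the
    replacement was not a copy from another location."""
    hits = []
    for entry in ET.fromstring(log_xml).iter("logentry"):
        rev = int(entry.get("revision"))
        for path in entry.iter("path"):
            if path.get("action") == "R":
                hits.append((rev, path.text, path.get("copyfrom-path")))
    return hits
```

Running this over the full log would show whether the other affected repositories also contain replaced paths, which could help narrow down whether the two symptoms share a cause.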
So we moved on to tracking down the issue with the four Somefiles. Grep-ing the svn2git gitlog file for the repository yielded the following (obscured) commit:
commit refs/heads/main-6.0.8
mark :15516
committer Some Person <some.person at example.org> 1624637813 +0000
data 96
MW-0 Prune cruft.
svn path=/all/source/Library/ComponentName/trunk-6.0.8/; revision=73604
D Foo/Somefile
D Foo/Bar/Somefile
D Foo/Baz/Somefile
D Foo/Baz/Qux/Somefile
progress SVN r73604 branch main-6.0.8 = :15516
This wasn’t a surprise. We knew these files had been deleted in branch 6.0.8, so it made sense to find that in the gitlog. A sanity check with `svn log -r` confirms the committer, message, and deleted files match the gitlog. But, again, all of the techniques for finding when a file was deleted in Git failed, even on branch 6.0.8 where we knew it had been deleted. So we used `git log --grep=73604` to find the commit with that revision metadata, and then `git show` for that commit and … nothing. The commit diff is blank, and the commit stats show no files changed. (Which, I’m sure, is why the Git commands for finding when a file was deleted weren’t working.)
So we're at a point where we don’t know where else to look or what else to try. Keep in mind that it takes several hours to run the migration again, so if you have several suggestions that I can reasonably try simultaneously, that could ease things along.
Thanks,
Nick Williams
[1] https://github.com/svn-all-fast-export/svn2git
[2] https://stackoverflow.com/questions/6839398/find-when-a-file-was-deleted-in-git
[3] There were some warnings during the migration. The vast majority looked like this:
WARN: Branch "foo" in repository "bar" doesn't exist at revision 12345 -- did you resume from the wrong revision?
But we were not stopping/resuming. I’m unsure if this could be a problem.
A couple other warnings I saw:
WARN: repository foo branch bar has some files copied from baz at 12345
WARN: backing up branch