#!/usr/bin/python
#
# This tool takes a repository containing monorepo history, rewritten
# subproject fork histories (done by migrate-downstream-fork.py) along
# with the revmap produced by migrate-downstream-fork.py, and an
# "umbrella" history consisting of submodule updates from subprojects.
# It rewrites the umbrella history so that the submodule updates are
# "inlined" directly from the rewritten subproject histories. The
# result is a history that interleaves rewritten subproject commits
# (zips them) according to the submodule updates, making it appear as
# if the commits were originally against the monorepo in the order
# implied by the umbrella history.
#
# Any non-LLVM submodules will be retained in their directories as
# they appear in the umbrella history.
#
# Usage:
#
# First, prepare a repository by following the instructions in
# migrate-downstream-fork.py. Pass --revmap-out=$file to create a
# mapping from old downstream commits to new downstream commits.
#
# Then add umbrella history:
#   git remote add umbrella https://...
#
# Be sure to add the history from any non-llvm submodules:
#
#   for submodule in ${my_non_llvm_submodule_list[@]}; do
#     git remote add ${submodule} $(my_submodule_url ${submodule})
#   done
#
# Pull it all down:
#   git fetch --all
#
# Then, run this script:
#   zip-downstream-fork.py refs/remotes/umbrella --revmap-in=$file
#
# With --revmap-out=$outfile the tool will dump a map from original
# umbrella commit hash to rewritten umbrella commit hash.
#
# TODO/Limitations:
#
# - The script requires a history with submodule updates. It should
# be fairly straightforward to enhance the script to take a revlist
# directly, ordering the commits according to the revlist. Such a
# revlist could be generated from an umbrella repository or via
# site-specific mechanisms. This would be passed to
# fast_filter_branch.py directly, rather than generating a list via
# expand_ref_pattern(self.reflist) in Zipper.run as is currently
# done. Changes would need to be made to fast_filter_branch.py to
# accept a revlist to process directly, bypassing its invocation of
# git rev-list within do_filter.
#
# - Submodule removal is not handled at all. The subproject will
# continue to exist though no updates to it will be made. This
# could be added by judicious use of fast_filter_branch.py's
# TreeEntry.remove_entry.
#
# - The script assumes that any commits in the umbrella history that
# do not update submodules should be discarded. It is not clear
# what should happen if such a commit happens to touch files with
# the same name as those in the monorepo (README files are typical).
# Adding support to keep these commits should be straightforward,
# but because decisions are likely to vary based on particular
# setups, we just punt for now.
#
# - Subproject tags are not rewritten. Because the subproject commits
# themselves are not rewritten, any downstream tags pointing to them
# won't be updated to point to the zipped history. We could provide
# this capability if we updated the revmap entry for subproject
# commits to point to the corresponding zipped commit during
# filtering.
#
# - If a downstream commit merged in an upstream commit, parents for
# the "inlined" submodule update are rewritten correctly, though the
# history can look a bit strange as updates to multiple submodules
# can create parents to history that is "already merged." For
# example:
#
#     * (HEAD -> zip/master) Merge commit FOO from clang
#     |\
#     * \ Merge commit BAR from llvm
#     |\ \
#     | \ \
#     | * | Do commit BAR in llvm
#     | |/
#     | * Do commit FOO in clang
#     | |
#     * | Downstream llvm work
#     | |
#     Monorepo
#
# There's no real harm in this, it just looks strange. A possible
# enhancement for this script is to collapse submodule updates that
# merge from upstream and have the result point to the most recent
# upstream commit merged in. However, this is difficult to do in
# general, because subprojects might have been updated from upstream
# at very different times and detecting a related set of submodule
# updates is not straightforward. Even a simple heuristic of
# "collapse all submodule upstream updates between downstream
# commits" won't always work, because it's possible that a
# downstream commit was submodule-updated in the middle of someone
# else updating all the subprojects from upstream.
#
import argparse
import fast_filter_branch
import os
import re
import subprocess
import sys


def expand_ref_pattern(patterns):
    return subprocess.check_output(
        ["git", "for-each-ref", "--format=%(refname)"] + patterns
    ).split("\n")[:-1]


class Zipper:
    """Destructively zip a submodule umbrella repository."""

    def __init__(self, new_upstream_prefix, revmap_in_file, revmap_out_file,
                 reflist, debug, abort_bad_submodule, no_rewrite_commit_msg,
                 skipped_pick_first):
        if not new_upstream_prefix.endswith('/'):
            new_upstream_prefix = new_upstream_prefix + '/'

        self.new_upstream_prefix = new_upstream_prefix
        self.revmap_in_file = revmap_in_file
        self.revmap_out_file = revmap_out_file
        self.reflist = reflist
        self.new_upstream_hashes = set()
        self.added_submodules = set()
        self.merged_upstream_parents = set()
        # Maps old downstream commit hashes to rewritten ones; filled in by run().
        self.revmap = {}
        self.dbg = debug
        self.prev_submodules = []
        self.abort_bad_submodule = abort_bad_submodule
        self.no_rewrite_commit_msg = no_rewrite_commit_msg
        self.skipped_pick_first = skipped_pick_first

    def debug(self, msg):
        if self.dbg:
            print msg
            sys.stdout.flush()

    def get_user_yes_no(self, msg):
        sys.stdout.flush()
        # Keep prompting until the answer is exactly "y" or "n".  Compare by
        # value; identity ("is") comparisons on strings are unreliable.
        answer = None
        while answer not in ("y", "n"):
            answer = raw_input(msg + " (y/n) ").strip()
        return answer

    def gather_upstream_commits(self):
        """Walk all refs under new_upstream_prefix and record hashes."""
        refs = expand_ref_pattern([self.new_upstream_prefix])

        if not refs:
            raise Exception("No refs matched new upstream prefix %s" %
                            self.new_upstream_prefix)

        # Save the set of git hashes for the new monorepo.
        self.new_upstream_hashes = set(
            subprocess.check_output(['git', 'rev-list'] + refs).split('\n')[:-1])

    def find_submodules_in_entry(self, githash, tree):
        """Figure out which submodules/submodule commits an existing tree references.

        Returns [(submodule name, hash)], or [] if there are no submodule
        updates to submodules we care about.  Recurses on subentries.
        """
        subentries = tree.get_subentries(self.fm)

        submodules = []

        for name, e in subentries.iteritems():
            if e.mode == '160000':
                # A commit; this is a submodule gitlink.
                try:
                    commit = self.fm.get_commit(e.githash)
                except Exception:
                    # It can happen that a submodule update refers to a commit
                    # that no longer exists.  This is usually the result of
                    # user error with a submodule update to a commit not
                    # reachable by any branch in the subproject.  We almost
                    # always want to skip these, so by default we warn and
                    # move on.  With --abort-bad-submodule we raise instead,
                    # and the user will have to fix things up and try again.
                    print 'WARNING: No commit %s for submodule %s in commit %s' % (e.githash, name, githash)
                    if self.abort_bad_submodule:
                        raise Exception('No commit %s for submodule %s in commit %s' % (e.githash, name, githash))
                    continue

                submodule_entry = (name, e.githash)
                submodules.append(submodule_entry)

            elif e.mode == '40000':
                # A subdirectory; look for gitlinks below it.
                submodules.extend(self.find_submodules_in_entry(githash, e))

        return submodules

    def find_submodules(self, commit, githash):
        """Figure out which submodules/submodule commits an existing commit references.

        Returns [(submodule name, hash)], or [] if there are no submodule
        updates to submodules we care about.  Recurses the tree structure.
        """
        return self.find_submodules_in_entry(githash, commit.get_tree_entry())

    def prompt_for_parent(self, commit, githash):
        # We have a commit that we've decided we want to skip.  We need to
        # return a different commit to take its place.  If the commit has
        # a single parent we just return that but if it has multiple
        # parents we need to ask the user what to do.  Returns the index
        # of the chosen parent in commit.parents.
        parents = commit.parents
        subject = commit.msg.splitlines()[0]
        print 'Multiple parents for skipped commit %s %s' % (githash, subject)
        print 'Pick a parent to use as a substitute:'
        for i, parent in enumerate(parents):
            parent_commit = self.fm.get_commit(parent)
            parent_subject = parent_commit.msg.splitlines()[0]
            print '[%d] %s %s' % (i, parent, parent_subject)
        sys.stdout.flush()

        # Read the selection as text and convert it explicitly; Python 2's
        # input() would evaluate whatever the user types as an expression.
        while True:
            try:
                answer = int(raw_input('Selection: '))
            except ValueError:
                continue
            if 0 <= answer < len(parents):
                return answer

    def substitute_commit(self, commit, githash):
        parent_choice = 0
        if len(commit.parents) != 1:
            if self.skipped_pick_first:
                subject = commit.msg.splitlines()[0]
                first_parent = commit.parents[0]
                first_parent_commit = self.fm.get_commit(first_parent)
                first_parent_subject = first_parent_commit.msg.splitlines()[0]
                print 'WARNING: Multiple parents for skipped commit (%s %s), picking first (%s %s)' % (githash, subject, first_parent, first_parent_subject)
            else:
                parent_choice = self.prompt_for_parent(commit, githash)

        # Map this to the parent commit to skip it.
        return commit.parents[parent_choice]

    def zip_filter(self, fm, githash, commit, oldparents):
        """Rewrite an umbrella branch with interleaved commits.

        These commits are assumed to be from an 'umbrella' repository
        which has a linear ordering of commits that update submodule
        links.  This routine rewrites such commits so that their content
        is that of the submodule commit(s).

        Each rewritten commit has a first parent of the previous rewritten
        umbrella commit.  If the commit added submodules, the parent list
        includes the rewritten commits of the added submodules.

        Given a revmap of rewritten commits and a ref to a linear order of
        commits that update submodule references to rewritten commits (an
        "umbrella" repository branch), create a map from each rewritten
        downstream commit to a list of new parents it should have to make
        it appear as if the commits had been interleaved in the monorepo
        as in the umbrella branch.  Any parent references to upstream
        commits will be left alone.  References to downstream commits will
        be changed to reflect the interleaved linear ordering in the
        umbrella history.
        """
        self.debug('--- commit %s' % githash)

        submodules = self.find_submodules(commit, githash)

        if not submodules:
            self.debug('No submodules')
            return self.substitute_commit(commit, githash)

        # self.debug('Previous submodules: [%s]' % ', '.join(map(str, self.prev_submodules)))
        # self.debug('Current submodules: [%s]' % ', '.join(map(str, submodules)))

        if self.prev_submodules == submodules:
            # This is a commit that modified some file in the umbrella and
            # didn't update any submodules.  Assume we don't want it.
            self.debug('No submodule updates')
            return self.substitute_commit(commit, githash)

        prev_submodules_map = {}
        if not self.no_rewrite_commit_msg:
            # Track the old hashes for submodules so we know which
            # submodules this commit updated below.
            for prev_submodule_name, prev_submodule_hash in self.prev_submodules:
                prev_submodules_map[prev_submodule_name] = prev_submodule_hash

        self.prev_submodules = submodules

        # The content of the commit should be the combination of the
        # content from the submodules.
        newtree = None
        upstream_parents = []
        submodule_add_parents = []

        new_commit_msg = ''
        if self.no_rewrite_commit_msg:
            new_commit_msg = commit.msg

        for name, oldhash in submodules:
            self.debug('Found submodule (%s, %s)' % (name, oldhash))
            newhash = self.revmap.get(oldhash, oldhash)
            newcommit = self.fm.get_commit(newhash)
            self.debug('New hash: %s' % newhash)

            if not newtree:
                self.debug('First submodule %s' % name)
                newtree = newcommit.get_tree_entry()
                submodule_tree = newcommit.get_tree_entry().get_path(self.fm, name.split('/'))
                if not submodule_tree:
                    raise Exception('Initial found submodule %s not in monorepo' % name)
            else:
                self.debug('Next submodule %s' % name)
                submodule_tree = newcommit.get_tree_entry().get_path(self.fm, name.split('/'))
                if not submodule_tree:
                    # This submodule doesn't exist in the monorepo, add the
                    # entire contents of the commit's tree.
                    submodule_tree = newcommit.get_tree_entry()

                newtree = newtree.add_path(self.fm, name.split('/'), submodule_tree)

            if not self.no_rewrite_commit_msg:
                if name not in prev_submodules_map or prev_submodules_map[name] != oldhash:
                    if not new_commit_msg:
                        new_commit_msg = newcommit.msg
                    else:
                        new_commit_msg += '\n' + newcommit.msg

            # Rewrite parents.  If this commit added a new submodule, add a
            # parent to the corresponding commit.  If one of the submodule
            # commits merged from upstream, add the upstream commit.
            if name not in self.added_submodules:
                self.debug('Merge new submodule %s' % name)
                submodule_add_parents.append(newhash)
                self.added_submodules.add(name)

            for parent in newcommit.parents:
                self.debug('Checking parent %s' % parent)
                if parent in self.new_upstream_hashes and parent not in self.merged_upstream_parents:
                    self.debug('Merge upstream commit %s' % parent)
                    upstream_parents.append(parent)
                    self.merged_upstream_parents.add(parent)

        for name, e in newtree.get_subentries(fm).iteritems():
            self.debug('NEWTREE: %s %s' % (name, str(e)))

        newtree.write_subentries(fm)

        commit.treehash = newtree.githash
        commit.parents.extend(submodule_add_parents)
        commit.parents.extend(upstream_parents)
        commit.msg = new_commit_msg

        return commit

    def run(self):
        if not self.revmap_in_file:
            raise Exception("No revmap specified, use --revmap-in")

        if self.revmap_out_file:
            # Only supports output, not input
            try:
                os.remove(self.revmap_out_file)
            except OSError:
                pass

        print "Mapping commits..."
        # Each revmap line is "<old hash> <new hash>"; close the file when done.
        with open(self.revmap_in_file) as revmap_file:
            self.revmap = dict(line.strip().split(' ') for line in revmap_file)

        self.fm = fast_filter_branch.FilterManager()

        print "Getting upstream commits..."
        self.gather_upstream_commits()
        print "Done."

        print "Zipping commits..."
        fast_filter_branch.do_filter(commit_filter=self.zip_filter,
                                     filter_manager=self.fm,
                                     revmap_filename=self.revmap_out_file,
                                     reflist=expand_ref_pattern(self.reflist))
        self.fm.close()
        print "Done -- refs updated in-place."


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="""
This tool zips up downstream commits created by migrate-downstream-fork.py
according to a set of commits assumed to be from an 'umbrella' repository.
The umbrella history is a series of commits that do submodule updates from
split-project git repositories.  Any commits without submodule modifications
are skipped.

The umbrella history is rewritten so that each commit appears to have
been done directly to the umbrella, instead of via a submodule update.
Merges from upstream monorepo commits are preserved.  The commit
message is replaced by the commit message(s) from the updated
submodule(s), unless --no-rewrite-commit-msg is given.

This tool DESTRUCTIVELY MODIFIES the umbrella branch it is run on!

Typical usage:
  # First, prepare a repository by following the instructions in
  # migrate-downstream-fork.py.  Pass --revmap-out=$file to create
  # a mapping from old downstream commits to new downstream commits.

  # Then add umbrella history:
  git remote add umbrella https://...

  # Be sure to add the history from any non-llvm submodules:
  for submodule in ${my_non_llvm_submodule_list[@]}; do
    git remote add ${submodule} $(my_submodule_url ${submodule})
  done

  # Pull it all down:
  git fetch --all

  # Then, run this script:
  zip-downstream-fork.py refs/remotes/umbrella --revmap-in=$file
""",
                                     formatter_class=argparse.RawDescriptionHelpFormatter)
    parser.add_argument("--new-repo-prefix", metavar="REFNAME",
                        default="refs/remotes/new",
                        help="The prefix for all the refs of the new repository (default: %(default)s).")
    parser.add_argument("reflist", metavar="REFPATTERN",
                        help="Patterns of the references to convert.", nargs='*')
    parser.add_argument("--revmap-in", metavar="FILE", default=None,
                        help="Revmap produced by migrate-downstream-fork.py (required).")
    parser.add_argument("--revmap-out", metavar="FILE", default=None,
                        help="Dump a map from original umbrella commit hash to rewritten umbrella commit hash.")
    parser.add_argument("--debug", help="Turn on debug output.", action="store_true")
    parser.add_argument("--abort-bad-submodule",
                        help="Abort on bad submodule updates.", action="store_true")
    parser.add_argument("--no-rewrite-commit-msg",
                        help="Don't rewrite the submodule update commit message with the merged commit message.", action="store_true")
    parser.add_argument("--skipped-pick-first",
                        help="If a skipped commit has multiple parents, pick the first one as a replacement.", action="store_true")
    args = parser.parse_args()
    Zipper(args.new_repo_prefix, args.revmap_in, args.revmap_out, args.reflist,
           args.debug, args.abort_bad_submodule, args.no_rewrite_commit_msg,
           args.skipped_pick_first).run()