Patching a text file
I'm trying to successively build up a text file from diff patches.
Starting from an empty text file, I need to apply 600+ patches to end up with the final document (a text I have written, with the changes tracked in Mercurial). Extra information needs to be added to each change in the file, so I can't simply use diff and patch on the command line.
I have spent all day writing (and rewriting) a tool that parses the diff files and changes the text file accordingly, but one of the diff files makes my program behave in a way that I can't make any sense of.
This function gets called for each of the diff files:
# filename = name of the diff file
# date = extra information to be added as a prefix to each added line
def process_diff(filename, date):
    # that's the file all the patches will be applied to
    merge_file = open("thesis_merged.txt", "r")
    # map its content to a list to manipulate it in memory
    merge_file_lines = []
    for line in merge_file:
        line = line.rstrip()
        merge_file_lines.append(line)
    merge_file.close()

    # open for writing:
    merge_file = open("thesis_merged.txt", "w")

    # that's the diff file, containing all the changes
    diff_file = open(filename, "r")
    print "-", filename, "-" * 20

    # also map it to a list
    diff_file_lines = []
    for line in diff_file:
        line = line.rstrip()
        if not line.startswith("\\ No newline at end of file"):  # useless information ... or not?
            diff_file_lines.append(line)

    # ignore header:
    #--- thesis_words_0.txt 2010-12-04 18:16:26.020000000 +0100
    #+++ thesis_words_1.txt 2010-12-04 18:16:26.197000000 +0100
    diff_file_lines = diff_file_lines[2:]

    hunks = []
    for i, line in enumerate(diff_file_lines):
        if line.startswith("@@"):
            hunks.append( get_hunk(diff_file_lines, i) )

    for hunk in hunks:
        head = hunk[0]
        # @@ -252,10 +251,9 @@
        tmp = head[3:-3].split(" ")  # [-252,10] [+251,9]
        line_nr_minus = tmp[0].split(",")[0]
        line_nr_minus = int(line_nr_minus[1:])  # 252
        line_nr_plus = tmp[1].split(",")[0]
        line_nr_plus = int(line_nr_plus[1:])  # 251

        for j, line in enumerate(hunk[1:]):
            if line.startswith("-"):
                # delete line from the file in memory
                del merge_file_lines[line_nr_minus-1]

        plus_counter = 0  # counts the number of added lines
        for k, line in enumerate(hunk[1:]):
            if line.startswith("+"):
                # insert line, one after another
                merge_file_lines.insert((line_nr_plus-1)+plus_counter, line[1:])
                plus_counter += 1

    for line in merge_file_lines:
        # write the updated file back to the disk
        merge_file.write(line.rstrip() + "\n")
    merge_file.close()
    diff_file.close()
    print "\n\n"


def get_hunk(lines, i):
    hunk = []
    hunk.append(lines[i])
    # @@ -252,10 +251,9 @@
    lines = lines[i+1:]
    for line in lines:
        if line.startswith("@@"):
            # next hunk begins, so stop here
            break
        else:
            hunk.append(line)
    return hunk
The diff files look like this; here is the troublemaker:
--- thesis_words_12.txt 2011-01-17 20:35:50.804000000 +0100
+++ thesis_words_13.txt 2011-01-17 20:35:51.057000000 +0100
@@ -245 +245,2 @@
-As
+Per
+definition
@@ -248,3 +249 @@
-already
-proposes,
-"generative"
+generative
@@ -252,10 +251,9 @@
-that
-something
-is
-created
-based
-on
-a
-set
-of
-rules.
+"having
+the
+ability
+to
+originate,
+produce,
+or
+procreate."
+<http://www.thefreedictionary.com/generative>
Output:
[...]
Per
definition
the
"generative"
generative
means
"having
the
ability
to
originate,
produce,
or
procreate."
<http://www.thefreedictionary.com/generative>
that
[...]
All the previous patches reproduce the text just as expected. I have rewritten this many times, but the buggy behavior persists, so right now I am clueless.
I'd be very thankful for hints and tips on how to do this differently. Thanks a lot in advance!
EDIT:
- In the end each line is supposed to look like this: {date_and_time_of_text_change}word
It's basically about keeping track of the date and time at which each word was added to the text.
Comments (2)
There was indeed a bug in the code: I wasn't interpreting the diff files correctly. I hadn't realized that the line numbers need to be shifted when a diff file contains more than one hunk.
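For anyone hitting the same problem, here is a minimal sketch of one way to handle that shift. It is not the code I ended up with. The names `apply_unified_diff`, `HUNK_HEADER` and `prefix` are made up for this illustration. The idea: once the earlier hunks have been applied, everything above the current hunk already matches the new file, so both the deletions and the insertions can be done at the "+" position from the hunk header, and the "-" position is not needed at all. It assumes the hunks appear in ascending order (as diff writes them) and that the in-memory file really is the old version the diff was made against.

```python
import re

# "@@ -252,10 +251,9 @@": a missing count (as in "@@ -245 +245,2 @@") means 1
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def apply_unified_diff(lines, diff_lines, prefix=""):
    """Return a patched copy of `lines`; `prefix` is prepended to added lines."""
    result = list(lines)
    pos = None  # 0-based cursor into result; None until the first hunk header
    for raw in diff_lines:
        match = HUNK_HEADER.match(raw)
        if match:
            new_start = int(match.group(3))
            new_count = 1 if match.group(4) is None else int(match.group(4))
            # With a count of 0 the start number names the line *before* the
            # change, so the cursor sits one line further down.
            pos = new_start - 1 if new_count else new_start
        elif pos is None or raw.startswith("\\"):
            continue  # "---"/"+++" file header or "\ No newline at end of file"
        elif raw.startswith("-"):
            del result[pos]  # following lines move up; cursor stays put
        elif raw.startswith("+"):
            result.insert(pos, prefix + raw[1:])
            pos += 1
        else:
            pos += 1  # context line: keep it and step over it
    return result
```

Called as `apply_unified_diff(merge_file_lines, diff_file_lines, prefix="{" + date + "}")`, this also adds the date prefix to every inserted word. Deletions go purely by position, so lines that already carry a prefix from an earlier patch are not a problem.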
Try the parser from python-patch; at least you'll be able to apply the hunks one by one manually to see which one fails. The API is not stable, but the parser is, so you can just copy patch.py from trunk/ into your project. It would be nice to get some proposals for the desired API, though.
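If you'd rather not pull in another module just to find the failing hunk, the same kind of check can be sketched in plain Python. This does not use python-patch or its API. `find_bad_hunks` is a hypothetical helper written for this answer. It compares the "-" and context lines of every hunk against the old version of the file (without any prefixes added) and returns the headers of the hunks that don't match:

```python
import re

# A missing count in the header (e.g. "@@ -245 +245,2 @@") means 1
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def find_bad_hunks(old_lines, diff_lines):
    """Return headers of hunks whose removed/context lines don't match old_lines."""
    bad = []
    header = None  # header of the hunk being checked; None before the first hunk
    pos = 0        # 0-based index into old_lines
    for raw in diff_lines:
        match = HUNK_HEADER.match(raw)
        if match:
            header = raw
            pos = int(match.group(1)) - 1  # old-file start line
        elif header is None or raw.startswith(("+", "\\")):
            continue  # file header, added line, or "\ No newline" marker
        else:
            # "-" line or context line: it has to be present in the old file
            if not (0 <= pos < len(old_lines)) or old_lines[pos] != raw[1:]:
                if header not in bad:
                    bad.append(header)
            pos += 1
    return bad
```

An empty list means every hunk lines up with the old file. Otherwise the returned headers tell you where to start looking.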