Patching a text file

Posted 2024-10-12 14:51:32


I'm trying to successively build up a text file with diff patches. Starting from an empty text file, I need to apply 600+ patches to end up with the final document (a text I have written, with the changes tracked in Mercurial). Extra information needs to be added to each change in the file, so I can't simply use diff and patch on the command line.

I have spent all day writing (and rewriting) a tool that parses the diff files and changes the text file accordingly, but one of the diff files makes my program behave in a way that I can't make any sense of.

This function gets called for each of the diff files:

# filename = name of the diff file
# date = extra information to be added as a prefix to each added line
def process_diff(filename, date):
    # that's the file all the patches will be applied to
    merge_file = open("thesis_merged.txt", "r")
    # map its content to a list to manipulate it in memory
    merge_file_lines = []
    for line in merge_file:
        line = line.rstrip()
        merge_file_lines.append(line)
    merge_file.close()

    # open for writing:
    merge_file = open("thesis_merged.txt", "w")

    # that's the diff file, containing all the changes
    diff_file = open(filename, "r")
    print "-", filename, "-" * 20

    # also map it to a list
    diff_file_lines = []
    for line in diff_file:
        line = line.rstrip()

        if not line.startswith("\\ No newline at end of file"): # useless information ... or not?
            diff_file_lines.append(line)

    # ignore header:
    #--- thesis_words_0.txt 2010-12-04 18:16:26.020000000 +0100
    #+++ thesis_words_1.txt 2010-12-04 18:16:26.197000000 +0100
    diff_file_lines = diff_file_lines[2:]

    hunks = []
    for i, line in enumerate(diff_file_lines):
        if line.startswith("@@"):
            hunks.append( get_hunk(diff_file_lines, i) )

    for hunk in hunks:
        head = hunk[0]
        # @@ -252,10 +251,9 @@
        tmp = head[3:-3].split(" ") # [-252,10] [+251,9]
        line_nr_minus = tmp[0].split(",")[0]
        line_nr_minus = int(line_nr_minus[1:]) # 252
        line_nr_plus = tmp[1].split(",")[0]
        line_nr_plus = int(line_nr_plus[1:]) # 251

        for j, line in enumerate(hunk[1:]):
            if line.startswith("-"):
                # delete line from the file in memory
                del merge_file_lines[line_nr_minus-1]

        plus_counter = 0 # counts the number of added lines
        for k, line in enumerate(hunk[1:]):
            if line.startswith("+"):
                # insert line, one after another
                merge_file_lines.insert((line_nr_plus-1)+plus_counter, line[1:])
                plus_counter += 1

    for line in merge_file_lines:
        # write the updated file back to the disk
        merge_file.write(line.rstrip() + "\n")

    merge_file.close()
    diff_file.close()
    print "\n\n"


def get_hunk(lines, i):
    hunk = []
    hunk.append(lines[i])
    # @@ -252,10 +251,9 @@

    lines = lines[i+1:]

    for line in lines:
        if line.startswith("@@"):
            # next hunk begins, so stop here
            break
        else:
            hunk.append(line)

    return hunk

The diff files look like this. Here is the troublemaker:

--- thesis_words_12.txt 2011-01-17 20:35:50.804000000 +0100
+++ thesis_words_13.txt 2011-01-17 20:35:51.057000000 +0100
@@ -245 +245,2 @@
-As
+Per
+definition
@@ -248,3 +249 @@
-already
-proposes,
-"generative"
+generative
@@ -252,10 +251,9 @@
-that
-something
-is
-created
-based
-on
-a
-set
-of
-rules.
+"having
+the
+ability
+to
+originate,
+produce,
+or
+procreate."
+<http://www.thefreedictionary.com/generative>

Output:

[...]

Per
definition
the
"generative"
generative
means
"having
the
ability
to
originate,
produce,
or
procreate."
<http://www.thefreedictionary.com/generative>
that

[...]

All the previous patches reproduce the text just as expected. I have rewritten this many times, but the buggy behavior persists, so right now I am clueless.

I'd be very thankful for hints and tips on how to do this differently. Thanks a lot in advance!

EDIT:
- In the end, each line is supposed to look like this: {date_and_time_of_text_change}word

It's basically about keeping track of the date and time at which a word was added to the text.


我们的影子 2024-10-19 14:51:32

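There was indeed a bug in the code: I wasn't interpreting the diff files correctly. I didn't realize that a line shift is needed when one diff file contains multiple hunks. The line numbers on the "-" side of each hunk header refer to the original file, so they have to be adjusted by the net number of lines that the earlier hunks added or removed.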

def process_diff(filename, date, step_nr):
    merge_file = open("thesis_merged.txt", "r")
    merge_file_lines = [line.rstrip() for line in merge_file]
    merge_file.close()

    diff_file = open(filename, "r")
    print "-", filename, "-"*2, step_nr, "-"*2, date

    diff_file_lines = [line.rstrip() for line in diff_file]
    hunks = []
    for i, line in enumerate(diff_file_lines):
        if line.startswith("@@"):
            hunks.append( get_hunk(diff_file_lines, i) )
    diff_file.close()

    line_shift = 0
    for hunk in hunks:
        head = hunk[0]
        # @@ -252,10 +251,9 @@
        tmp = head[3:-3].split(" ") # [-252,10] [+251,9]

        line_nr_minus = tmp[0].split(",")[0]
        minusses = 1
        if len( tmp[0].split(",") ) > 1:
            minusses = int( tmp[0].split(",")[1] )
        line_nr_minus = int(line_nr_minus[1:]) # 252

        line_nr_plus = tmp[1].split(",")[0]
        plusses = 1
        if len( tmp[1].split(",") ) > 1:
            plusses = int( tmp[1].split(",")[1] )
        line_nr_plus = int(line_nr_plus[1:]) # 251

        line_nr_minus += line_shift

        #@@ -248,3 +249 @@
        #-already
        #-proposes,
        #-"generative"
        #+generative

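        # hunk[1] / hunk[2] are the removed and added lines of the hunk; this
        # presumably relies on a get_hunk that returns them separately (not shown here)
        if hunk[1]: # -
            for line in hunk[1]:
                del merge_file_lines[line_nr_minus-1]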

        plus_counter = 0
        if hunk[2]: # +
            for line in hunk[2]:
                prefix = ""
                if len(line) > 1:
                    prefix = "{" + date + "}"
                merge_file_lines.insert((line_nr_plus-1)+plus_counter, prefix + line[1:])
                plus_counter += 1

        line_shift += plusses - minusses
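
The same idea can be shown in isolation. What follows is only a minimal sketch, written for Python 3 (the code above is Python 2), and not the code that was actually run. It assumes zero-context diffs like the one in the question (for example from `diff -U0`), so context lines are not handled. The names `parse_hunks` and `apply_hunks`, the dictionary keys, and the prefix-stripping check are made up for illustration. Instead of tracking the "-" and "+" positions separately, it keeps one running `shift` and replaces a slice of the list per hunk.

```python
import re

# "@@ -252,10 +251,9 @@"; a missing count (as in "@@ -245 +245,2 @@") means 1
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")
# Assumed prefix format from the question: "{date}word"
DATE_PREFIX = re.compile(r"^\{[^}]*\}")


def parse_hunks(diff_lines):
    """Collect the removed and added lines of every hunk in a zero-context diff."""
    hunks = []
    for raw in diff_lines:
        line = raw.rstrip("\n")
        match = HUNK_HEADER.match(line)
        if match:
            old_start, old_count = match.group(1), match.group(2)
            hunks.append({
                "old_start": int(old_start),
                "old_count": 1 if old_count is None else int(old_count),
                "removed": [],
                "added": [],
            })
        elif not hunks or line.startswith("\\"):
            continue  # "---"/"+++" file header or "\ No newline at end of file"
        elif line.startswith("-"):
            hunks[-1]["removed"].append(line[1:])
        elif line.startswith("+"):
            hunks[-1]["added"].append(line[1:])
    return hunks


def apply_hunks(text_lines, hunks, date):
    """Return a patched copy of text_lines; added non-empty lines get a {date} prefix."""
    result = list(text_lines)
    shift = 0  # net lines added so far by earlier hunks of this diff
    for hunk in hunks:
        # Old line numbers refer to the unpatched file, so move them by the shift.
        start = hunk["old_start"] + shift - 1
        if hunk["old_count"] == 0:
            start += 1  # pure insertion: the header names the line *before* it
        end = start + len(hunk["removed"])

        # Sanity check: the lines about to be removed must match the diff.
        existing = [DATE_PREFIX.sub("", line) for line in result[start:end]]
        if existing != hunk["removed"]:
            raise ValueError("hunk at old line %d does not match" % hunk["old_start"])

        added = [("{%s}%s" % (date, word)) if word else "" for word in hunk["added"]]
        result[start:end] = added
        shift += len(hunk["added"]) - len(hunk["removed"])
    return result


if __name__ == "__main__":
    diff = ["@@ -1 +1,2 @@", "-As", "+Per", "+definition",
            "@@ -4,2 +5 @@", "-already", "-proposes,", "+suggests"]
    text = ["As", "the", "word", "already", "proposes,"]
    print(apply_hunks(text, parse_hunks(diff), "2011-01-17"))
    # ['{2011-01-17}Per', '{2011-01-17}definition', 'the', 'word', '{2011-01-17}suggests']
```

The check that raises `ValueError` is optional, but it points at the first hunk that lands on the wrong lines, which is the kind of problem the question ran into.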
