Python Difflib Deltas and Compare Ndiff

Posted on 2024-10-13 17:40:00

I was looking to do something like what I believe change control systems do: they compare two files and save a small diff each time the file changes.
I've been reading this page: http://docs.python.org/library/difflib.html and it's not sinking in to my head apparently.

I was trying to recreate this in a somewhat simple program shown below,
but the thing that I seem to be missing is that the deltas contain at least as much as the original file, and more.

Is it not possible to get to just the pure changes?
The reason I ask is hopefully obvious - to save disk space.
I could just save the entire chunk of code each time, but it would be better to save current code once, then small diffs of the changes.

I'm also still trying to figure out why many difflib functions return a generator instead of a list. What's the advantage there?

Will difflib work for me, or do I need to find a more professional package with more features?

# Python Difflib demo 
# Author: Neal Walters 
# loosely based on http://ahlawat.net/wordpress/?p=371
# 01/17/2011 

# build the files here - later we will just read the files probably 
file1Contents="""
for j = 1 to 10: 
   print "ABC"
   print "DEF" 
   print "HIJ"
   print "JKL"
   print "Hello World"
   print "j=" + j 
   print "XYZ"
"""

file2Contents = """
for j = 1 to 10: 
   print "ABC"
   print "DEF" 
   print "HIJ"
   print "JKL"
   print "Hello World"
   print "XYZ"
print "The end"
"""

filename1 = "diff_file1.txt" 
filename2 = "diff_file2.txt" 

file1 = open(filename1,"w") 
file2 = open(filename2,"w") 

file1.write(file1Contents) 
file2.write(file2Contents) 

file1.close()
file2.close() 
#end of file build 

lines1 = open(filename1, "r").readlines()
lines2 = open(filename2, "r").readlines()

import difflib

print "\n FILE 1 \n" 
for line in lines1:
  print line 

print "\n FILE 2 \n" 
for line in lines2: 
  print line 

diffSequence = difflib.ndiff(lines1, lines2) 

print "\n ----- SHOW DIFF ----- \n" 
for i, line in enumerate(diffSequence):
    print line

diffObj = difflib.Differ() 
deltaSequence = diffObj.compare(lines1, lines2) 
deltaList = list(deltaSequence) 

print "\n ----- SHOW DELTALIST ----- \n" 
for i, line in enumerate(deltaList):
    print line



#let's suppose we store just the diffSequence in the database 
#then we want to take the current file (file2) and recreate the original (file1) from it
#by backward applying the diff 

restoredFile1Lines = difflib.restore(diffSequence,1)  # 1 indicates file1 of 2 used to create the diff 

restoreFileList = list(restoredFile1Lines)

print "\n ----- SHOW REBUILD OF FILE1 ----- \n" 
# this is not showing anything! 
for i, line in enumerate(restoreFileList): 
    print line

Thanks!

UPDATE:

contextDiffSeq = difflib.context_diff(lines1, lines2) 
contextDiffList = list(contextDiffSeq) 

print "\n ----- SHOW CONTEXTDIFF ----- \n" 
for i, line in enumerate(contextDiffList):
    print line

----- SHOW CONTEXTDIFF -----

***
---
***************
*** 5,9 ****
     print "HIJ"
     print "JKL"
     print "Hello World"
-    print "j=" + j
     print "XYZ"
--- 5,9 ----
     print "HIJ"
     print "JKL"
     print "Hello World"
     print "XYZ"
+ print "The end"

Another update:

In the old days of Panvalet and Librarian, source management tools for the mainframe, you could create a changeset like this:

++ADD 9
   print "j=" + j 

This simply means: add a line (or lines) after line 9.
Then there were words like ++REPLACE or ++UPDATE.
http://www4.hawaii.gov/dags/icsd/ppmo/Stds_Web_Pages/pdf/it110401.pdf

Comments (3)

对风讲故事 2024-10-20 17:40:00

"I'm also still trying to figure out why many difflib functions return a generator instead of a list, what's the advantage there?"

Well, think about it for a second: if you compare files, those files can in theory (and will in practice) be quite large. Returning the delta as a list, for example, means reading the complete data into memory, which is not a smart thing to do.

As for only returning the difference, well, there is another advantage in using a generator - just iterate over the delta and keep whatever lines you are interested in.

If you read the difflib documentation for Differ-style deltas, you will see a paragraph that reads:

Each line of a Differ delta begins with a two-letter code:
Code    Meaning
'- '    line unique to sequence 1
'+ '    line unique to sequence 2
'  '    line common to both sequences
'? '    line not present in either input sequence

So, if you only want the differences, you can easily pick them out by using str.startswith.
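
As a minimal sketch of that filtering (Python 3 syntax; the sample lists `old` and `new` and the helper name `changes_only` are made up for illustration), keep only the lines whose code is '- ' or '+ ':

```python
import difflib

# Hypothetical sample data, not the files from the question.
old = ["a\n", "b\n", "c\n"]
new = ["a\n", "c\n", "d\n"]

def changes_only(a, b):
    """Return only the '- ' and '+ ' lines of an ndiff delta."""
    return [line for line in difflib.ndiff(a, b)
            if line.startswith(("- ", "+ "))]

changed = changes_only(old, new)
print("".join(changed), end="")
```

Note that the filtered list no longer records where each change belongs, so it is useful for display but cannot be passed to difflib.restore to rebuild either file.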

You can also use difflib.context_diff to obtain a compact delta which shows only the changes.

贪恋 2024-10-20 17:40:00

Diffs must contain enough information to make it possible to patch one version into another, so yes, for your experiment of a single-line change to a very small document, storing the whole document could be cheaper.

Library functions return iterators to make it easier on clients that are tight on memory or only need to look at part of the resulting sequence. It's ok in Python because every iterator can be converted to a list with a very short list(an_iterator) expression.
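
One consequence is worth spelling out: a generator can be consumed only once. In the question's script the result of difflib.ndiff is iterated in a for loop and only afterwards passed to difflib.restore, which would explain why the rebuild shows nothing. A minimal sketch in Python 3 syntax, with made-up sample lines:

```python
import difflib

# Hypothetical sample data.
lines1 = ["one\n", "two\n", "three\n"]
lines2 = ["one\n", "three\n", "four\n"]

# Materialize the generator so the delta can be used more than once.
delta = list(difflib.ndiff(lines1, lines2))

shown = "".join(delta)                        # first use: display
restored1 = list(difflib.restore(delta, 1))   # second use: rebuild file 1
restored2 = list(difflib.restore(delta, 2))   # third use: rebuild file 2

# For contrast, a generator is empty after one full pass.
gen = difflib.ndiff(lines1, lines2)
first_pass = list(gen)
second_pass = list(gen)   # nothing left to yield
```

The cost of calling list() is that the whole delta sits in memory, which is fine for small files and is the trade-off the generator interface lets you choose.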

Most differencing is done on lines of text, but it is possible to go down to the character level, and difflib does it. Take a look at the Differ class in difflib.
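
A rough illustration of character-level comparison, assuming two plain strings are handed to difflib.SequenceMatcher (a string is a sequence of characters, so the same matching applies); the sample strings are only illustrative:

```python
import difflib

old = "c= 0"
new = "c= 1"

matcher = difflib.SequenceMatcher(None, old, new)
# Each opcode is (tag, i1, i2, j1, j2): old[i1:i2] becomes new[j1:j2].
char_changes = [(tag, old[i1:i2], new[j1:j2])
                for tag, i1, i2, j1, j2 in matcher.get_opcodes()
                if tag != "equal"]
print(char_changes)
```

Differ uses the same kind of matching within similar lines to produce its '? ' hint lines.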

The examples all over the place use human-friendly output, but the diffs are managed internally in a much more compact, computer-friendly way. Also, diffs usually contain redundant information (like the text of a line to delete) to make patching and merging changes safe. The redundancy can be removed by your own code, if you feel comfortable with that.
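
To make the idea of a compact, machine-oriented delta concrete, here is a minimal sketch built on difflib.SequenceMatcher.get_opcodes(). The helper names `make_delta` and `apply_delta` are hypothetical and not part of difflib, and the format deliberately drops the text of deleted lines, so it gives up the safety check that the redundancy provides.

```python
import difflib

def make_delta(a, b):
    """Record only the non-equal opcodes plus the lines taken from b."""
    matcher = difflib.SequenceMatcher(None, a, b)
    return [(i1, i2, b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

def apply_delta(a, delta):
    """Rebuild b from a and a delta produced by make_delta(a, b)."""
    result, pos = [], 0
    for i1, i2, new_lines in delta:
        result.extend(a[pos:i1])   # unchanged lines before this change
        result.extend(new_lines)   # replacement or inserted lines
        pos = i2                   # skip the lines that were removed
    result.extend(a[pos:])         # unchanged tail
    return result

# Hypothetical sample data.
old = ["a\n", "b\n", "c\n", "d\n"]
new = ["a\n", "c\n", "x\n", "d\n", "e\n"]
delta = make_delta(old, new)
rebuilt = apply_delta(old, delta)
```

To keep the current file and recover the previous one, as the question describes, the arguments would be swapped: store make_delta(new, old) and apply it to new.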

I just read that difflib opts for least surprise over optimality, which is something I won't argue against. There are well-known algorithms that are fast at producing a minimal set of changes.

I once coded a generic diffing engine along with one of the optimum algorithms in about 1250 lines of Java (JRCS). It works for any sequence of elements that can be compared for equality. If you want to build your own solution, I think that a translation/reimplementation of JRCS should take no more than 300 lines of Python.

Processing the output produced by difflib to make it more compact is also an option. This is an example from a small file with three changes (an addition, a change, and a deletion):

---  
+++  
@@ -7,0 +7,1 @@
+aaaaa
@@ -9,1 +10,1 @@
-c= 0
+c= 1
@@ -15,1 +16,0 @@
-    m = re.match(code_re, text)

What the patch says can be easily condensed to:

+7,1 
aaaaa
-9,1 
+10,1
c= 1
-15,1

For your own example the condensed output would be:

-8,1
+9,1
print "The end"

For safety, leaving in a leading marker ('>') for lines that must be inserted might be a good idea.

-8,1
+9,1
>print "The end"

Is that closer to what you need?

This is a simple function to do the compacting. You'll have to write your own code to apply the patch in that format, but it should be straightforward.

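def compact_a_unidiff(s):
    # Keep only added lines and hunk headers; context and deleted text are dropped.
    s = [l for l in s if l[:1] in ('+', '@')]
    result = []
    for l in s:
        if l.startswith('++'):
            # Skip the '+++' file header (this also skips added lines whose text starts with '+').
            continue
        elif l.startswith('+'):
            # Added line: mark it with '>' so it cannot be mistaken for a command.
            result.append('>' + l[1:])
        else:
            # Hunk header such as '@@ -9,1 +10,1 @@'.
            for cmd in l.split('@@')[1].split():
                start, _, count = cmd.partition(',')
                # Newer Python versions omit a count of 1 ('@@ -9 +10 @@').
                count = count or '1'
                if count != '0':
                    result.append(start + ',' + count)
    return result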
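
One possible way to apply a patch in that compact format is sketched below in Python 3 syntax. The function name `apply_compact_patch` is hypothetical. The sketch assumes the commands arrive in order, line numbers are 1-based, '-s,n' refers to line numbers in the old file, '+s,n' refers to line numbers in the new file and is followed by n lines marked with '>'. It does no validation.

```python
def apply_compact_patch(old_lines, patch):
    """Apply '-s,n' / '+s,n' / '>text' items to old_lines; return the new lines."""
    result = []
    pos = 0  # index of the next unread line in old_lines
    for item in patch:
        if item.startswith('>'):
            # Text of an inserted line.
            result.append(item[1:])
            continue
        start, count = (int(x) for x in item[1:].split(','))
        if item.startswith('-'):
            # Copy the old lines before the deletion, then skip the deleted ones.
            result.extend(old_lines[pos:start - 1])
            pos = start - 1 + count
        else:
            # '+': copy old lines until the output reaches new line start - 1;
            # the '>' items that follow supply the inserted text.
            needed = (start - 1) - len(result)
            result.extend(old_lines[pos:pos + needed])
            pos += needed
    result.extend(old_lines[pos:])  # unchanged tail
    return result

# The three-change example from above, applied to a made-up 16-line file.
old = ["line %d\n" % i for i in range(1, 17)]
patch = ["+7,1", ">aaaaa\n", "-9,1", "+10,1", ">c= 1\n", "-15,1"]
new = apply_compact_patch(old, patch)
```

Because the text of deleted lines is not stored, the function cannot detect that a patch is being applied to the wrong version of the file; that is the price of dropping the redundancy.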
悸初 2024-10-20 17:40:00

You want to use the unified or context diff if you just want the changes. You're seeing bigger files because it includes the lines they have in common.
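
As a sketch of that suggestion (Python 3 syntax, made-up sample lines), difflib.unified_diff accepts an `n` argument for the number of context lines; with n=0 only the changed lines and the hunk headers remain:

```python
import difflib

# Hypothetical sample data.
old = ["a\n", "b\n", "c\n", "d\n"]
new = ["a\n", "b\n", "d\n", "e\n"]

compact = list(difflib.unified_diff(old, new, n=0))

# Lines common to both files would start with a space; none are emitted.
context_lines = [line for line in compact if line.startswith(" ")]
print("".join(compact), end="")
```

For files this small the headers can outweigh the savings, so the benefit only shows up on larger files with few changes.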

The advantage of returning a generator is that the entire thing doesn't need to be held in memory at once. This can be useful for diffing very large files.
