Word-by-word diff comparison of two strings in .NET
I need to do a word-by-word comparison of two strings.
Something like diff, but for words rather than lines.
Like it is done on Wikipedia:
http://en.wikipedia.org/w/index.php?title=Horapollo&action=historysubmit&diff=21895647&oldid=21893459
As a result, I want to get two arrays of indexes of the words that differ between the two strings.
Are there any libraries, frameworks, or standalone methods for .NET that can do this?
P.S. I want to compare several kilobytes of text.
Comments (7)
Actually, you probably want to implement a variation of the local alignment/global alignment algorithms we use in DNA sequence alignment. This is because you probably cannot do a word-by-word comparison of the two strings. I.e.:
In other words, if you cannot identify insertions and deletions of whole words, your comparison algorithm can become very sc(r)ewed. Take a look at the Smith-Waterman algorithm and the Needleman-Wunsch algorithm and find a way to adapt them to your needs. Since the search space can become very large if the strings are long, you could also check out BLAST, a very common heuristic algorithm that is pretty much the standard in genetic searches.
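To make the alignment idea concrete, here is a minimal sketch of Needleman-Wunsch global alignment applied to word lists instead of nucleotides. It is written in Python for brevity; the loops should port to C# more or less line for line. The function name `align_words` and the scoring values (match = 1, mismatch = -1, gap = -1) are assumptions chosen for illustration, not something the answer prescribes.

```python
def align_words(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two word lists (Needleman-Wunsch).

    Returns two lists: indexes of words in `a` and in `b` that are
    substituted, deleted, or inserted in the best-scoring alignment.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + step,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # a[i-1] has no partner (deleted)
                score[i][j - 1] + gap,       # b[j-1] has no partner (inserted)
            )

    # Walk back from the bottom-right corner to recover one best alignment.
    diff_a, diff_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                if a[i - 1] != b[j - 1]:
                    diff_a.append(i - 1)
                    diff_b.append(j - 1)
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            diff_a.append(i - 1)
            i -= 1
        else:
            diff_b.append(j - 1)
            j -= 1
    return diff_a[::-1], diff_b[::-1]


old_words = "the quick brown fox".split()
new_words = "the quick red fox jumps".split()
changed_old, changed_new = align_words(old_words, new_words)
```

The table is O(n·m) in time and memory, which should be acceptable for a few kilobytes of text (a few hundred to a couple of thousand words per side) but grows quickly beyond that.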
It seems I've found the solution I needed:
DiffPlex is a combination of a .NET diffing library with both a Silverlight and an HTML diff viewer.
http://diffplex.codeplex.com/
But it has one bug. Given the lines "Hello-Kitty" and "Hello - Kitty", the word "Hello" is marked as a difference, although the only difference is the space character.
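The "Hello-Kitty" case comes down to how the text is split into tokens before diffing. The following is a rough sketch of a tokenizer that treats words, whitespace runs, and punctuation as separate tokens, so that "Hello" compares equal in both lines. It does not show how DiffPlex works internally; the pattern and the `tokenize` helper are assumptions for illustration (Python here, but the same pattern should work with `Regex.Matches` in .NET).

```python
import re

# A word, a run of whitespace, or a single punctuation character.
TOKEN = re.compile(r"\w+|\s+|[^\w\s]")


def tokenize(text, keep_whitespace=False):
    """Split text into word, punctuation, and (optionally) whitespace tokens."""
    tokens = TOKEN.findall(text)
    if keep_whitespace:
        return tokens
    return [t for t in tokens if not t.isspace()]


left = tokenize("Hello-Kitty")
right = tokenize("Hello - Kitty")
```

With whitespace dropped, both lines produce the same three tokens; with `keep_whitespace=True`, the two spaces show up as the only differing tokens and "Hello" still matches.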
Use regular expressions.
Like in the example:
You can replace all the words in your two texts with unique numbers, take some ready-made code for edit distance computation, replace its character-to-character comparison with a number-to-number comparison, and you are done!
I am not sure whether any library exists for exactly what you want, but you will surely find lots of code for edit distance.
Further, depending on whether or not you actually want to allow substitutions in the edit distance computation, you can change the conditions in the dynamic programming code.
See this: http://en.wikipedia.org/wiki/Levenshtein_distance
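As a sketch of what this could look like (Python for brevity, and the names `word_ids`, `edit_distance`, and `allow_substitution` are made up for illustration): each distinct word gets an integer id, and a standard Levenshtein dynamic program then compares ids instead of characters. The flag shows the one condition that changes when substitutions are disallowed.

```python
def word_ids(words_a, words_b):
    """Map every distinct word to a small integer shared by both texts."""
    table = {}

    def encode(words):
        return [table.setdefault(w, len(table)) for w in words]

    return encode(words_a), encode(words_b)


def edit_distance(a, b, allow_substitution=True):
    """Levenshtein distance between two id sequences (two-row version)."""
    prev = list(range(len(b) + 1))  # distance from empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            if x == y:
                best = prev[j - 1]                      # ids match: no cost
            else:
                best = min(prev[j], cur[j - 1]) + 1     # delete or insert
                if allow_substitution:
                    best = min(best, prev[j - 1] + 1)   # substitute
            cur.append(best)
        prev = cur
    return prev[-1]


ids_a, ids_b = word_ids("the quick brown fox".split(),
                        "the quick red fox jumps".split())
with_subst = edit_distance(ids_a, ids_b)
without_subst = edit_distance(ids_a, ids_b, allow_substitution=False)
```

This version only returns the distance. To get the indexes of the differing words, the full table has to be kept and walked backwards from the last cell.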
You might try this, though I am not sure it's what you are looking for: StringUtils.difference() (http://commons.apache.org/lang/api-release/org/apache/commons/lang/StringUtils.html#difference%28java.lang.String,%20java.lang.String%29)
Alternatively, the Eclipse (eclipse.org) project has a diff comparison feature, which means it must also have code to determine the differences. You might browse through its API or source to see what you can find.
Good luck.
One more library for C# is diff-match-patch: http://code.google.com/p/google-diff-match-patch/.
The bad thing is that it finds differences at the character level. The good thing is that there are instructions on what you have to add to diff by words.
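As far as I understand those instructions, the idea is to encode each distinct word as a single character, run the ordinary character diff on the encoded strings, and then decode the result back into words. The sketch below illustrates that round trip in Python. Note the assumptions: `difflib.SequenceMatcher` from the standard library stands in for the diff-match-patch character diff, and `words_to_chars` and `word_diff` are hypothetical helper names, not part of that library's API.

```python
import difflib


def words_to_chars(text_a, text_b):
    """Encode each distinct word as one character; return both encodings and the vocabulary."""
    vocab = []   # vocab[k] is the word encoded as chr(k)
    index = {}   # word -> position in vocab

    def encode(text):
        out = []
        for word in text.split():
            if word not in index:
                index[word] = len(vocab)
                vocab.append(word)
            out.append(chr(index[word]))
        return "".join(out)

    return encode(text_a), encode(text_b), vocab


def word_diff(text_a, text_b):
    """Character-diff the encoded strings, then decode each span back to words."""
    chars_a, chars_b, vocab = words_to_chars(text_a, text_b)
    matcher = difflib.SequenceMatcher(None, chars_a, chars_b, autojunk=False)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        old = [vocab[ord(c)] for c in chars_a[i1:i2]]
        new = [vocab[ord(c)] for c in chars_b[j1:j2]]
        result.append((tag, old, new))
    return result


changes = word_diff("the quick brown fox", "the quick red fox jumps")
```

Because the opcodes carry the `i1:i2` and `j1:j2` ranges, and each encoded character corresponds to exactly one word, those ranges are already word indexes into the two texts.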
It seems I will use/port the algorithm used here:
http://www.google.com/codesearch/p?hl=en&sa=N&cd=6&ct=rc#Jc4aufN53J8/src/main/net/killingar/WordDiff.java&q=worddiff