Word-by-word diff comparison of two strings in .NET

Posted on 2024-08-12 01:54:01

I need to do a word-by-word comparison of two strings: something like diff, but for words rather than lines.

This is what Wikipedia does in its revision diffs:
http://en.wikipedia.org/w/index.php?title=Horapollo&action=historysubmit&diff=21895647&oldid=21893459

As a result, I want to get two arrays of word indexes: the words that differ in each of the two strings.

Are there any libraries, frameworks, or standalone methods for .NET that can do this?

P.S. I want to compare several kilobytes of text.

Comments (7)

花开浅夏 2024-08-19 01:54:01

Actually, you probably want to implement a variation of the local/global alignment algorithms used for DNA sequence alignment, because a naive position-by-position comparison of the words in the two strings will not work. For example:

The quick brown fox jumps over the
lazy dog
The quick fox jumps over the
lazy dog

In other words, if you cannot identify insertions and deletions of whole words, your comparison can become badly skewed: after the deleted word "brown", every following word sits at a different index. Take a look at the Smith-Waterman algorithm (local alignment) and the Needleman-Wunsch algorithm (global alignment) and find a way to adapt them to your needs. Since the search space can become very large for long strings, you could also check out BLAST, a very common heuristic algorithm that is pretty much the standard in genetic sequence searches.
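To illustrate how an alignment finds whole-word insertions and deletions, here is a minimal sketch based on the longest common subsequence (LCS). LCS is closely related to global alignment, but it is not Smith-Waterman or Needleman-Wunsch as such. The names `WordDiff` and `Compare` are hypothetical, and the sketch assumes the texts have already been split into word arrays.

```csharp
using System;
using System.Collections.Generic;

public static class WordDiff
{
    // Returns the indexes of the words in 'a' and in 'b' that are NOT part of
    // a longest common subsequence of the two word sequences.
    public static (int[] OnlyInA, int[] OnlyInB) Compare(string[] a, string[] b)
    {
        int n = a.Length, m = b.Length;

        // lcs[i, j] = length of the LCS of a[i..] and b[j..]
        var lcs = new int[n + 1, m + 1];
        for (int i = n - 1; i >= 0; i--)
            for (int j = m - 1; j >= 0; j--)
                lcs[i, j] = a[i] == b[j]
                    ? lcs[i + 1, j + 1] + 1
                    : Math.Max(lcs[i + 1, j], lcs[i, j + 1]);

        // Walk the table from the start, collecting unmatched indexes.
        var onlyA = new List<int>();
        var onlyB = new List<int>();
        int x = 0, y = 0;
        while (x < n && y < m)
        {
            if (a[x] == b[y]) { x++; y++; }                          // common word
            else if (lcs[x + 1, y] >= lcs[x, y + 1]) onlyA.Add(x++); // word only in a
            else onlyB.Add(y++);                                     // word only in b
        }
        while (x < n) onlyA.Add(x++);
        while (y < m) onlyB.Add(y++);

        return (onlyA.ToArray(), onlyB.ToArray());
    }
}
```

For the fox example, calling `WordDiff.Compare` on the two sentences split on spaces reports index 2 ("brown") for the first text and nothing for the second. The table has one cell per pair of words, so for a few kilobytes of text (on the order of a thousand words per side) it should still fit comfortably in memory; much longer inputs would call for a linear-space diff algorithm instead.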

遥远的绿洲 2024-08-19 01:54:01

It seems I've found the solution I needed:

DiffPlex is a combination of a .NET diffing library with both a Silverlight and an HTML diff viewer.
http://diffplex.codeplex.com/

But it has one bug. Given the lines "Hello-Kitty" and "Hello - Kitty", the word "Hello" is marked as a difference, although the only difference is whitespace.
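One way to keep whitespace-only changes from showing up as word differences, independent of DiffPlex, is to tokenize both texts yourself before comparing them. The following is a minimal sketch under that assumption: `WordTokenizer` and `Tokenize` are hypothetical names, and the regular expression treats every run of word characters as one token and every punctuation character as its own token, discarding whitespace.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class WordTokenizer
{
    // Splits text into word tokens (\w+) and single punctuation tokens
    // ([^\w\s]); whitespace never becomes a token.
    public static string[] Tokenize(string text)
    {
        return Regex.Matches(text, @"\w+|[^\w\s]")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }
}
```

With this tokenizer, "Hello-Kitty" and "Hello - Kitty" both become the three tokens `Hello`, `-`, `Kitty`, so a word-level comparison of the token arrays sees no difference. The trade-off is that genuine whitespace changes are no longer reported at all.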

往事随风而去 2024-08-19 01:54:01

Use regular expressions to split both strings into words.

As in this example:

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Text;
using System.Windows.Forms;
using System.Collections.Specialized;

namespace WindowsApplication10
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void button2_Click(object sender, EventArgs e)
        {
            // Minimum fraction of matched words required to call it a match.
            decimal discrimination = 0.75M;
            string formHeading = "The brown dog jumped over the red lazy river, and then took a little nap! Fun!";
            string userSearch = "The brown dog jumped over the red lazy river, and then took a little ";
            //string userSearch = "brown dog nap fun";
            decimal res = CompareText(formHeading, userSearch);

            if (res >= discrimination)
            {
                MessageBox.Show("MATCH! " + res.ToString());
            }
            else
            {
                MessageBox.Show("does not match! " + res.ToString());
            }
        }


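        /// <summary>
        /// Returns the number of words from userSearch that occur in formHeading,
        /// divided by the number of words in formHeading (a value between 0 and 1
        /// when userSearch has no more words than formHeading).
        /// </summary>
        /// <returns>The matched fraction, or 0 if formHeading contains no words.</returns>
        private decimal CompareText(string formHeading, string userSearch)
        {
            StringCollection formHeadingWords = new StringCollection();
            StringCollection userSearchWords = new StringCollection();
            // Split on runs of non-word characters and drop the empty entries that
            // Regex.Split produces at the start or end of the input.
            formHeadingWords.AddRange(Array.FindAll(
                System.Text.RegularExpressions.Regex.Split(formHeading, @"\W+"), w => w.Length > 0));
            userSearchWords.AddRange(Array.FindAll(
                System.Text.RegularExpressions.Regex.Split(userSearch, @"\W+"), w => w.Length > 0));

            // Avoid dividing by zero when formHeading contains no words.
            if (formHeadingWords.Count == 0)
                return 0M;

            int wordsFound = 0;
            for (int i1 = 0; i1 < userSearchWords.Count; i1++)
            {
                // Contains is case-sensitive, so "Fun" and "fun" do not match.
                if (formHeadingWords.Contains(userSearchWords[i1]))
                    wordsFound += 1;
            }
            return (Convert.ToDecimal(wordsFound) / Convert.ToDecimal(formHeadingWords.Count));
        }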
    }
}

ㄟ。诗瑗 2024-08-19 01:54:01

You can replace all the words in your two texts with unique numbers, take some ready-made code for edit distance computation, replace its character-to-character comparison with a number-to-number comparison, and you are done!

I am not sure whether any library exists for exactly what you want, but you will surely find lots of code for edit distance.

Further, depending on whether or not you want to allow substitutions in the edit distance computation, you can change the conditions in the dynamic programming code.

See http://en.wikipedia.org/wiki/Levenshtein_distance
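The idea above can be sketched roughly as follows. This is only an illustration, not code from any particular library: `WordEditDistance`, `ToIds`, `Distance`, and the `allowSubstitution` flag are hypothetical names. Both texts must share one dictionary so that the same word gets the same id in each.

```csharp
using System;
using System.Collections.Generic;

public static class WordEditDistance
{
    // Maps each word to an integer id; a word seen before reuses its id.
    public static int[] ToIds(string[] words, Dictionary<string, int> ids)
    {
        var result = new int[words.Length];
        for (int i = 0; i < words.Length; i++)
        {
            if (!ids.TryGetValue(words[i], out int id))
            {
                id = ids.Count;
                ids.Add(words[i], id);
            }
            result[i] = id;
        }
        return result;
    }

    // Levenshtein distance over id sequences, keeping only two rows of the
    // dynamic programming table. With allowSubstitution = false, a changed
    // word costs one deletion plus one insertion.
    public static int Distance(int[] a, int[] b, bool allowSubstitution = true)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int best = Math.Min(prev[j], curr[j - 1]) + 1;       // delete or insert
                if (a[i - 1] == b[j - 1])
                    best = Math.Min(best, prev[j - 1]);              // same word, no cost
                else if (allowSubstitution)
                    best = Math.Min(best, prev[j - 1] + 1);          // substitute
                curr[j] = best;
            }
            var tmp = prev; prev = curr; curr = tmp;                 // reuse the rows
        }
        return prev[b.Length];
    }
}
```

Comparing integers is cheaper than comparing strings inside the inner loop, which is the main point of the mapping. Note that this yields only the distance; recovering the indexes of the differing words would require keeping the full table and tracing back through it.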

李白 2024-08-19 01:54:01

You might try this, though I am not sure it's what you are looking for: StringUtils.difference() from Apache Commons Lang, a Java library (http://commons.apache.org/lang/api-release/org/apache/commons/lang/StringUtils.html#difference%28java.lang.String,%20java.lang.String%29)

Alternatively, the Eclipse (eclipse.org) project has a diff comparison feature, which means it must also have code to determine the differences. You might browse through its API or source to see what you can find.

Good luck.

第七度阳光i 2024-08-19 01:54:01

One more library for C# is diff-match-patch: http://code.google.com/p/google-diff-match-patch/

The downside is that it finds differences between characters. The upside is that there are instructions on what you have to add to diff by words.
