Rating the quality of string matches

Posted on 2024-10-01 00:09:00

What would be the best way to compare a pattern with a set of strings, one by one, while rating how closely the pattern matches each string? In my limited experience with regex, matching strings against patterns seems to be a pretty binary operation: no matter how complicated the pattern is, in the end it either matches or it doesn't. I am looking for something more capable than plain matching. Is there a good technique or algorithm for this?

Here's an example:

Let's say I have a pattern foo bar and I want to find the string that most closely matches it out of the following strings:

foo for
foo bax
foo buo
fxx bar

Now, none of these actually match the pattern, but which non-match is the closest to being a match? In this case, foo bax would be the best choice, since it matches 6 out of the 7 characters.
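The "6 out of the 7 characters" count above can be reproduced by comparing the strings position by position. The following is only a minimal sketch of that idea; the class name PositionalMatch and the method matchingPositions are illustrative names, not something from the question, and the sketch assumes that positions past the end of the shorter string simply count as mismatches.

```java
import java.util.Arrays;
import java.util.List;

public class PositionalMatch {

    // Counts the positions at which both strings hold the same character.
    // Assumption: positions beyond the shorter string are treated as mismatches.
    static int matchingPositions(String pattern, String candidate) {
        int limit = Math.min(pattern.length(), candidate.length());
        int matches = 0;
        for (int i = 0; i < limit; i++) {
            if (pattern.charAt(i) == candidate.charAt(i)) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        String pattern = "foo bar";
        List<String> candidates = Arrays.asList("foo for", "foo bax", "foo buo", "fxx bar");
        for (String candidate : candidates) {
            System.out.println(candidate + ": " + matchingPositions(pattern, candidate)
                    + " of " + pattern.length() + " positions match");
        }
    }
}
```

A positional count like this only behaves sensibly when the strings line up; a single inserted or deleted character shifts everything after it, which is one reason an edit distance may be a better fit.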

Apologies if this is a duplicate question; I didn't really know what to search for when I checked whether it had already been asked.

Comments (2)

请恋爱 2024-10-08 00:09:00

This one works. I checked it against the Wikipedia example: the Levenshtein distance between "kitten" and "sitting" is 3.

    import java.util.ArrayList;
    import java.util.List;

    public class LevenshteinDistance {

        public static final String TEST_STRING = "foo bar";

        public static void main(String... args) {
            LevenshteinDistance test = new LevenshteinDistance();
            List<String> testList = new ArrayList<String>();
            testList.add("foo for");
            testList.add("foo bax");
            testList.add("foo buo");
            testList.add("fxx bar");
            for (String string : testList) {
                System.out.println("Levenshtein Distance for " + string + " is "
                        + test.getLevenshteinDistance(TEST_STRING, string));
            }
        }

        public int getLevenshteinDistance(String s, String t) {
            if (s == null || t == null) {
                throw new IllegalArgumentException("Strings must not be null");
            }

            int n = s.length(); // length of s
            int m = t.length(); // length of t

            // if one string is empty, the distance is the length of the other
            if (n == 0) {
                return m;
            } else if (m == 0) {
                return n;
            }

            int[] p = new int[n + 1]; // 'previous' cost array, horizontally
            int[] d = new int[n + 1]; // cost array, horizontally
            int[] swap;               // placeholder to assist in swapping p and d

            // indexes into strings s and t
            int i; // iterates through s
            int j; // iterates through t

            char tj;  // jth character of t
            int cost; // substitution cost: 0 if the characters match, 1 otherwise

            // distance from the empty prefix of t to each prefix of s
            for (i = 0; i <= n; i++) {
                p[i] = i;
            }

            for (j = 1; j <= m; j++) {
                tj = t.charAt(j - 1);
                d[0] = j;

                for (i = 1; i <= n; i++) {
                    cost = s.charAt(i - 1) == tj ? 0 : 1;
                    // minimum of cell to the left + 1, cell above + 1,
                    // and cell diagonally up-left + cost
                    d[i] = Math.min(Math.min(d[i - 1] + 1, p[i] + 1), p[i - 1] + cost);
                }

                // swap the rows so the current costs become the 'previous' row
                swap = p;
                p = d;
                d = swap;
            }

            // the last action in the loop above was to swap d and p,
            // so p now holds the most recent cost counts
            return p[n];
        }
    }
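To get from per-string distances to the "which one is closest" answer the question asks for, the candidates can be sorted by distance. The sketch below is one possible way to do that, not part of the original answer: the class ClosestMatch and its methods distance, similarity and rank are hypothetical names, and the compact distance function is re-implemented here only so the example stands on its own. The similarity score (1 minus the distance divided by the longer length) is an assumed normalization, one of several in common use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ClosestMatch {

    // Compact two-row Levenshtein distance, included so the sketch is self-contained.
    static int distance(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    // Assumed normalization: 1.0 means identical, 0.0 means nothing in common.
    static double similarity(String s, String t) {
        int longest = Math.max(s.length(), t.length());
        return longest == 0 ? 1.0 : 1.0 - (double) distance(s, t) / longest;
    }

    // Returns a copy of the candidates ordered from closest to farthest.
    // The sort is stable, so ties keep their original order.
    static List<String> rank(String pattern, List<String> candidates) {
        List<String> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingInt(c -> distance(pattern, c)));
        return sorted;
    }

    public static void main(String[] args) {
        String pattern = "foo bar";
        List<String> candidates = Arrays.asList("foo for", "foo bax", "foo buo", "fxx bar");
        for (String candidate : rank(pattern, candidates)) {
            System.out.println(candidate + " distance=" + distance(pattern, candidate)
                    + " similarity=" + similarity(pattern, candidate));
        }
    }
}
```

With the strings from the question, "foo bax" comes out first with a distance of 1; the other three each need two edits.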

梦纸 2024-10-08 00:09:00

That's an interesting question! The first thing that came to mind is that regular expressions can be matched by building a DFA. If you had direct access to the DFA built for a given regex (or just built it yourself!), you could run the input through it and measure the distance from the last state you transitioned to to an accept state, using the shortest path as a measure of how close the input was to being accepted. I'm not aware of any libraries that would let you do that easily, though, and even this measure probably wouldn't map exactly onto your intuition in a number of cases.
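As a rough illustration of the DFA idea, here is a minimal hand-built sketch. Everything in it is an assumption made for the example: the class DfaDistance, the helper forLiteral (which builds a simple chain DFA that accepts exactly one literal string rather than compiling a real regex), and the choice to stop at the first character that has no transition. It is not how any particular regex library exposes its automaton.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class DfaDistance {

    // state -> (input character -> next state); missing entries mean "no transition"
    final Map<Integer, Map<Character, Integer>> transitions = new HashMap<>();
    final Set<Integer> acceptStates = new HashSet<>();

    // Hypothetical helper: a chain DFA with states 0..n that accepts exactly the literal.
    static DfaDistance forLiteral(String literal) {
        DfaDistance dfa = new DfaDistance();
        for (int i = 0; i < literal.length(); i++) {
            dfa.transitions.computeIfAbsent(i, k -> new HashMap<>()).put(literal.charAt(i), i + 1);
        }
        dfa.acceptStates.add(literal.length());
        return dfa;
    }

    // Runs the input from state 0 and stops at the first character with no transition.
    // Assumption: any remaining input after that point is ignored.
    int lastLiveState(String input) {
        int state = 0;
        for (char c : input.toCharArray()) {
            Map<Character, Integer> out = transitions.get(state);
            Integer next = (out == null) ? null : out.get(c);
            if (next == null) {
                break;
            }
            state = next;
        }
        return state;
    }

    // Breadth-first search: fewest transitions from start to any accept state, or -1 if none.
    int stepsToAccept(int start) {
        Map<Integer, Integer> dist = new HashMap<>();
        Queue<Integer> queue = new ArrayDeque<>();
        dist.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            int state = queue.remove();
            if (acceptStates.contains(state)) {
                return dist.get(state);
            }
            Map<Character, Integer> out = transitions.get(state);
            if (out == null) {
                continue;
            }
            for (int next : out.values()) {
                if (!dist.containsKey(next)) {
                    dist.put(next, dist.get(state) + 1);
                    queue.add(next);
                }
            }
        }
        return -1;
    }

    int score(String input) {
        return stepsToAccept(lastLiveState(input));
    }

    public static void main(String[] args) {
        DfaDistance dfa = forLiteral("foo bar");
        for (String s : new String[] {"foo for", "foo bax", "foo buo", "fxx bar"}) {
            System.out.println(s + ": " + dfa.score(s) + " step(s) from an accept state");
        }
    }
}
```

Under these assumptions "foo bax" scores 1, but "fxx bar" scores 6 because the run stalls after the first character, even though it is only two edits away from the pattern. That gap is a concrete case of the measure not mapping onto intuition.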
