How to factor out common prefixes (stems) in a regular expression?

Posted 2024-10-02 12:48:12

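I have an array of words that I need to find and replace with a regular expression, and the array can sometimes be thousands of words long. I have tested this and found that factoring out common prefixes is much faster than searching for the words individually. That is, ^where|why$ is slower than ^wh(ere|y)$. Obviously the difference is not noticeable in such a short example, but with thousands of alternatives and a long subject string it is considerably faster.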

So I am looking for a way to do this stemming automatically, for example converting string[] { "what", "why", "where", "when", "which" } into wh(at|y|e(re|n)|i(ch))

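Is there already a recognized algorithm for this? If not, how would you go about it? It seems like it needs to be done recursively, but I can't quite work out how. I have written a method that works to a limited extent, but it is inelegant, 60 lines long, and uses multiple nested foreach loops, so it is a future maintenance nightmare. I'm sure there is a much better way, and if anyone could point me in the right direction I would much appreciate it...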


忆梦 2024-10-09 12:48:12

This code should work:

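using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public static class StemmingUtilities
{
    // One trie node per character; the root node carries no character.
    private class Node
    {
        public char? Value { get; private set; }
        public Node Parent { get; private set; }
        public List<Node> Children { get; private set; }
        // True when a token ends here, so that "a" is not lost next to "ab".
        public bool IsEnd { get; set; }
        public Node(char? c, Node parent)
        {
            this.Value = c;
            this.Parent = parent;
            this.Children = new List<Node>();
        }
    }

    public static string GetRegex(IEnumerable<string> tokens)
    {
        var root = new Node(null, null);
        // Insert every token into the trie, sharing nodes for common prefixes.
        foreach (var token in tokens)
        {
            if (string.IsNullOrEmpty(token))
                continue;
            var current = root;
            foreach (char c in token)
            {
                var node = current.Children.FirstOrDefault(x => x.Value == c);
                if (node == null)
                {
                    node = new Node(c, current);
                    current.Children.Add(node);
                }
                current = node;
            }
            current.IsEnd = true;
        }
        return BuildRexp(root);
    }

    // Recursively emits the alternatives below "root"; siblings are joined with "|".
    private static string BuildRexp(Node root)
    {
        if (root.Children.Count == 0)
            return "";
        var sb = new StringBuilder();
        // A token ends at this node but longer tokens continue, so the remainder is optional.
        bool optional = root.IsEnd;
        bool addBracket = root.Children.Count > 1 || optional;
        // uncomment the following line to avoid the outermost brackets (which occur when the root has multiple children)
        // addBracket = addBracket && (root.Parent != null);
        if (addBracket)
            sb.Append("(");
        for (int i = 0; i < root.Children.Count; i++)
        {
            var child = root.Children[i];
            // Escape regex metacharacters such as '.' or '+'.
            sb.Append(Regex.Escape(child.Value.ToString()));
            sb.Append(BuildRexp(child));
            if (i < root.Children.Count - 1)
                sb.Append("|");
        }
        if (addBracket)
            sb.Append(")");
        if (optional)
            sb.Append("?");
        return sb.ToString();
    }
}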

Usage:

var toStem1 = new[] { "what", "why", "where", "when", "which" };
string reg1 = StemmingUtilities.GetRegex(toStem1);
// reg1 = "wh(at|y|e(re|n)|ich)"

string[] toStem2 = new[] { "why", "abc", "what", "where", "apple", "when" };
string reg2 = StemmingUtilities.GetRegex(toStem2);
// reg2 = "(wh(y|at|e(re|n))|a(bc|pple))"

EDIT:
To get reg2 = "wh(y|at|e(re|n))|a(bc|pple)", i.e. without the outermost wrapping brackets, just uncomment the marked line in the BuildRexp method.
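One thing to watch when using the generated pattern: alternation binds more loosely than anchors, so putting ^ and $ directly around an unwrapped pattern anchors only the first and last alternatives. The following is a minimal sketch of one way to handle that, assuming the pattern string from the reg2 example above. The names StemmedPatternDemo, WholeString and WholeWord are made up for illustration and are not part of the answer's code. It wraps the pattern in a non-capturing group before adding anchors or word boundaries, then uses it for a replace.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StemmedPatternDemo
{
    // Pattern copied from the reg2 example, in its unwrapped form.
    public const string Stemmed = "wh(y|at|e(re|n))|a(bc|pple)";

    // Wrap in a non-capturing group so both anchors apply to every alternative.
    public static Regex WholeString(string stemmed)
    {
        return new Regex("^(?:" + stemmed + ")$", RegexOptions.Compiled);
    }

    // Same idea with word boundaries, for find-and-replace inside longer text.
    public static Regex WholeWord(string stemmed)
    {
        return new Regex(@"\b(?:" + stemmed + @")\b", RegexOptions.Compiled);
    }

    public static void Main()
    {
        // Naive anchoring: parsed as "^wh(...)" OR "a(...)$".
        var naive = new Regex("^" + Stemmed + "$");
        Console.WriteLine(naive.IsMatch("whyever"));                // True
        Console.WriteLine(WholeString(Stemmed).IsMatch("whyever")); // False

        // $0 in the replacement string stands for the whole match.
        Console.WriteLine(WholeWord(Stemmed).Replace("when is the apple ripe", "[$0]"));
        // [when] is the [apple] ripe
    }
}
```

The groups emitted by BuildRexp are capturing groups. If the captures are not needed, emitting "(?:" instead of "(" would avoid tracking them, at the cost of a slightly longer pattern. Whether that makes a measurable difference would need to be benchmarked on the real word list.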
