如何为正则表达式词干制作通用前缀?

发布于 2024-10-02 12:48:12 字数 502 浏览 0 评论 0原文

我有一个单词数组,需要通过正则表达式操作进行查找和替换,有时这个数组可能有数千个单词长。我经过测试发现,使用通用前缀来提取单词比单独搜索它们要快得多。也就是说,^where|why$^wh(ere|y)$ 慢。显然,在如此短的示例中,这并不是一个明显的差异,但在有数千个替代方案并且主题字符串很长的情况下,它的速度要快得多。

所以我正在寻找一种自动执行此词干提取的方法,例如转换 string[] { "what", "why", "where", "when", "which" }进入 wh(at|y|e(re|n)|i(ch))

是否已经有一个公认的算法可以做到这一点?如果没有,你会怎么做?这似乎需要递归完成,但我不太清楚如何做到这一点。我写了一个方法,它的工作范围有限,但它很不优雅,有 60 行长,并且使用多个嵌套的 foreach 循环,所以这是未来维护的噩梦。我确信有更好的方法,如果有人能指出我正确的方向,我将不胜感激......

I have an array of words I need to do a find-and-replace by regex operation on, and sometimes this array can be thousands of words long. I've tested and found that stemming the words using common prefixes is much faster than searching for them individually. That is, ^where|why$ is slower than ^wh(ere|y)$. Obviously it's not a noticeable difference in such a short example, but it's considerably faster where there are thousands of alternatives and the subject string is long.

So I'm looking for a way to do this stemming automatically, for instance to convert a string[] { "what", "why", "where", "when", "which" } into wh(at|y|e(re|n)|i(ch))

Is there already a recognized algorithm out there that does this ? If not, how would you go about it ? It seems to need to be done recursively but I can't quite get my head round how to do it. I have a method I wrote that works to a limited extent, but it's inelegant, 60 lines longs and uses multiple nested foreach loops so it's a future maintenance nightmare. I'm sure there's a much better way, if anyone could point me in the right direction that'd be much appreciated...

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

忆梦 2024-10-09 12:48:12

此代码应该有效:

public static class StemmingUtilities
{
    private class Node
    {
        public char? Value { get; private set; }
        public Node Parent { get; private set; }
        public List<Node> Children { get; private set; }
        public Node(char? c, Node parent)
        {
            this.Value = c;
            this.Parent = parent;
            this.Children = new List<Node>();
        }
    }

    public static string GetRegex(IEnumerable<string> tokens)
    {
        var root = new Node(null,null);
        foreach (var token in tokens)
        {
            var current = root;
            for (int i = 0; i < token.Length; i++)
            {
                char c = token[i];
                var node = current.Children.FirstOrDefault(x => x.Value.Value == c);
                if (node == null)
                {
                    node = new Node(c,current);
                    current.Children.Add(node);
                }
                current = node;
            }   
        }
        return BuildRexp(root);
    }

    private static string BuildRexp(Node root)
    {
        string s = "";
        bool addBracket = root.Children.Count > 1;
        // uncomment the following line to avoid first brakets wrapping (occurring in case of multiple root's children)
        // addBracket = addBracket && (root.Parent != null); 
        if (addBracket)
            s += "(";
        for(int i = 0; i < root.Children.Count; i++)
        {
            var child = root.Children[i];
            s += child.Value;
            s += BuildRexp(child);
            if (i < root.Children.Count - 1)
                s += "|";
        }
        if (addBracket)
            s += ")";
        return s;
    }
}

用法:

var toStem1 = new[] { "what", "why", "where", "when", "which" };
string reg1 = StemmingUtilities.GetRegex(toStem1);
// reg1 = "wh(at|y|e(re|n)|ich)"

string[] toStem2 = new[] { "why", "abc", "what", "where", "apple", "when" };
string reg2 = StemmingUtilities.GetRegex(toStem2);
// reg2 = "(wh(y|at|e(re|n))|a(bc|pple))"

编辑:
要获得 reg2 = "wh(y|at|e(re|n))|a(bc|pple)" 即没有第一个括号,只需取消注释 BuildRexp 中的标记行方法。

This code should work:

public static class StemmingUtilities
{
    private class Node
    {
        public char? Value { get; private set; }
        public Node Parent { get; private set; }
        public List<Node> Children { get; private set; }
        public Node(char? c, Node parent)
        {
            this.Value = c;
            this.Parent = parent;
            this.Children = new List<Node>();
        }
    }

    public static string GetRegex(IEnumerable<string> tokens)
    {
        var root = new Node(null,null);
        foreach (var token in tokens)
        {
            var current = root;
            for (int i = 0; i < token.Length; i++)
            {
                char c = token[i];
                var node = current.Children.FirstOrDefault(x => x.Value.Value == c);
                if (node == null)
                {
                    node = new Node(c,current);
                    current.Children.Add(node);
                }
                current = node;
            }   
        }
        return BuildRexp(root);
    }

    private static string BuildRexp(Node root)
    {
        string s = "";
        bool addBracket = root.Children.Count > 1;
        // uncomment the following line to avoid first brakets wrapping (occurring in case of multiple root's children)
        // addBracket = addBracket && (root.Parent != null); 
        if (addBracket)
            s += "(";
        for(int i = 0; i < root.Children.Count; i++)
        {
            var child = root.Children[i];
            s += child.Value;
            s += BuildRexp(child);
            if (i < root.Children.Count - 1)
                s += "|";
        }
        if (addBracket)
            s += ")";
        return s;
    }
}

Usage:

var toStem1 = new[] { "what", "why", "where", "when", "which" };
string reg1 = StemmingUtilities.GetRegex(toStem1);
// reg1 = "wh(at|y|e(re|n)|ich)"

string[] toStem2 = new[] { "why", "abc", "what", "where", "apple", "when" };
string reg2 = StemmingUtilities.GetRegex(toStem2);
// reg2 = "(wh(y|at|e(re|n))|a(bc|pple))"

EDIT:
to get reg2 = "wh(y|at|e(re|n))|a(bc|pple)" i.e. without the first wrapping brackets, just uncomment the marked line in BuildRexp method.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文