Automatically generating a regex in C# from a set of strings stored in a database

Posted on 2024-09-03 17:18:23

I have about 100,000 strings in a database and I want to know if there is a way to automatically generate a regex pattern from these strings. All of them are alphabetic strings that use a subset of the English alphabet; X, W, and V are not used, for example. Is there any function or library that can help me achieve this in C#? Example strings are

KHTK
RAZ

Given these two strings, my goal is to generate a regex that accepts every prefix of each string (k, kh, kht, khtk, r, ra, raz), case-insensitively of course. I have downloaded and tried some C# applications that help generate regexes, but they are not useful in my scenario: I want a process that reads the strings sequentially from the database and adds rules to the regex as it goes, so that the regex can be reused later in the application or saved to disk.

I'm new to regex patterns and don't know whether what I'm asking is even possible. If it is not, please suggest an alternative approach.

Comments (1)

从此见与不见 2024-09-10 17:18:23

A simple (some might say naive) approach would be to create a regex pattern that concatenates all the search strings, separated by the alternation operator |:

  1. For your example strings, that would get you KHTK|RAZ.
  2. To have the regex capture prefixes, we would include those prefixes in the pattern, e.g. K|KH|KHT|KHTK|R|RA|RAZ.
  3. Finally, to make sure that those strings are matched only as a whole, and not as part of larger strings, we'll put the beginning-of-line and end-of-line anchors (^ and $) at the beginning and end of each alternative, respectively: ^K$|^KH$|^KHT$|^KHTK$|^R$|^RA$|^RAZ$
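
The three steps above can be sketched as a small helper that is fed one string at a time, which is roughly the "read sequentially from the db and add rules" process the question asks for. This is only an illustration under assumptions: the names PrefixPattern and Build are hypothetical, and Regex.Escape is added as a precaution even though it changes nothing for purely alphabetic input.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of the original answer).
static class PrefixPattern
{
    // Builds one anchored alternative (^PREFIX$) per distinct prefix of each word.
    public static string Build(IEnumerable<string> words)
    {
        // Case-insensitive set, so "khtk" and "KHTK" do not produce duplicate alternatives.
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var pattern = new StringBuilder();

        foreach (string word in words)
        {
            for (int length = 1; length <= word.Length; ++length)
            {
                string prefix = word.Substring(0, length);
                if (seen.Add(prefix))   // Add returns false if the prefix was already seen
                {
                    if (pattern.Length > 0) pattern.Append('|');
                    pattern.Append('^').Append(Regex.Escape(prefix)).Append('$');
                }
            }
        }

        return pattern.ToString();
    }
}
```

For the two example strings, PrefixPattern.Build(new[] { "KHTK", "RAZ" }) returns ^K$|^KH$|^KHT$|^KHTK$|^R$|^RA$|^RAZ$. Passing RegexOptions.IgnoreCase to the Regex constructor makes the match case-insensitive, as the question requires. Since the pattern is an ordinary string, it can be written to disk and loaded again later.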

We would expect the Regex class implementation to do the heavy lifting of converting the long regex pattern string to an efficient matcher.

The sample program here generates 10,000 random strings, and a regular expression that matches exactly those strings and all their prefixes. The program then verifies that the regex indeed matches just those strings, and times how long it all takes.

using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

namespace ConsoleApplication
{
    class Program
    {
        private static Random r = new Random();

        // Create a string of randomly chosen uppercase letters, with a randomly
        // chosen length from minLength (inclusive) up to maxLength (exclusive,
        // because the upper bound of Random.Next is exclusive).
        private static string RandomString(int minLength, int maxLength)
        {
            StringBuilder b = new StringBuilder();

            int length = r.Next(minLength, maxLength);
            for (int i = 0; i < length; ++i)
            {
                b.Append(Convert.ToChar(65 + r.Next(26)));
            }

            return b.ToString();
        }

        static void Main(string[] args)
        {
            int             stringCount = 10000;                    // number of random strings to generate
            StringBuilder   pattern     = new StringBuilder();      // our regular expression under construction
            HashSet<String> strings     = new HashSet<string>();    // a set of the random strings (and their
                                                                    // prefixes) we created, for verifying the
                                                                    // regex correctness

            // generate random strings, track their prefixes in the set,
            // and add their prefixes to our regular expression
            for (int i = 0; i < stringCount; ++i)
            {
                // make a random string, 2-4 chars long (the upper bound is exclusive)
                string nextString = RandomString(2, 5);

                // for each prefix of the random string...
                for (int prefixLength = 1; prefixLength <= nextString.Length; ++prefixLength)
                {
                    string prefix = nextString.Substring(0, prefixLength);

                    // ...add it to both the set and our regular expression pattern
                    // (HashSet.Add returns false if the prefix is already present)
                    if (strings.Add(prefix))
                    {
                        if (pattern.Length > 0) pattern.Append('|');
                        pattern.Append('^').Append(prefix).Append('$');
                    }
                }
            }

            // create a regex from the pattern (and time how long that takes)
            var creationTimer = System.Diagnostics.Stopwatch.StartNew();
            Regex regex = new Regex(pattern.ToString());
            creationTimer.Stop();

            // make sure our regex correctly matches all the strings and their
            // prefixes (and time how long that takes as well)
            var matchTimer = System.Diagnostics.Stopwatch.StartNew();
            foreach (string s in strings)
            {
                if (!regex.IsMatch(s))
                {
                    Console.WriteLine("uh oh!");
                }
            }
            matchTimer.Stop();

            // generate some new random strings, and verify that the regex
            // indeed does not match the ones it's not supposed to.
            for (int i = 0; i < 1000; ++i)
            {
                string s = RandomString(2, 5);

                if (!strings.Contains(s) && regex.IsMatch(s))
                {
                    Console.WriteLine("uh oh!");
                }
            }

            // note: the match loop covers every prefix in the set, but the total
            // is averaged over stringCount, i.e. per generated string
            Console.WriteLine("Regex create time: {0} millisec", creationTimer.Elapsed.TotalMilliseconds);
            Console.WriteLine("Average match time: {0} millisec", matchTimer.Elapsed.TotalMilliseconds / stringCount);

            Console.ReadLine();
        }
    }
}

On an Intel Core2 box I'm getting the following numbers for 10,000 strings:

Regex create time: 46 millisec
Average match time: 0.3222 millisec

When increasing the number of strings 10-fold (to 100,000), I'm getting:

Regex create time: 288 millisec
Average match time: 1.25577 millisec

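This is higher, but the growth is less than linear: for 10 times as many strings, creation time went up roughly 6-fold and average match time roughly 4-fold.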

The app's memory consumption (at 10,000 strings) started at ~9MB, peaked at ~23MB (which must have included both the regex and the string set), and dropped to ~16MB towards the end (garbage collection kicking in?). Draw your own conclusions from that: the program makes no attempt to separate the regex's memory consumption from that of the other data structures.
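
If the flat alternation becomes too long, one possible variation (not what the program above does) is to keep the strings in a character trie and render it as nested optional groups, so shared prefixes appear only once in the pattern. The sketch below is an assumption-laden illustration: PrefixTrie, Add and ToPattern are hypothetical names, input is upper-cased on insertion, and no measurements were taken for it.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of the original answer): a trie that can be
// grown one word at a time and rendered as a factored regex pattern.
sealed class PrefixTrie
{
    private sealed class Node
    {
        // Sorted so that the generated pattern is deterministic.
        public readonly SortedDictionary<char, Node> Children =
            new SortedDictionary<char, Node>();
    }

    private readonly Node root = new Node();

    // Adds one word; intended to be called once per row read from the database.
    public void Add(string word)
    {
        Node node = root;
        foreach (char c in word.ToUpperInvariant())
        {
            Node next;
            if (!node.Children.TryGetValue(c, out next))
            {
                next = new Node();
                node.Children.Add(c, next);
            }
            node = next;
        }
    }

    // Renders the trie as ^(?:...)$ where everything after the first
    // character of a branch is optional, so every non-empty prefix matches.
    public string ToPattern()
    {
        var sb = new StringBuilder("^");
        AppendAlternatives(root, sb);
        return sb.Append('$').ToString();
    }

    private static void AppendAlternatives(Node node, StringBuilder sb)
    {
        sb.Append("(?:");
        bool first = true;
        foreach (KeyValuePair<char, Node> pair in node.Children)
        {
            if (!first) sb.Append('|');
            first = false;
            sb.Append(Regex.Escape(pair.Key.ToString()));
            if (pair.Value.Children.Count > 0)
            {
                AppendAlternatives(pair.Value, sb);
                sb.Append('?');   // the rest of the word is optional
            }
        }
        sb.Append(')');
    }
}
```

After adding KHTK and RAZ, ToPattern() returns ^(?:K(?:H(?:T(?:K)?)?)?|R(?:A(?:Z)?)?)$, which accepts the same seven prefixes as the flat pattern when used with RegexOptions.IgnoreCase. Note that an empty trie yields ^(?:)$, which matches only the empty string, so at least one word should be added before the pattern is used.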
