Best way to test an existing string against a large list of comparables

Posted on 2024-07-12 04:43:58

Suppose you have a list of acronyms that each define a value (e.g. AB1, DE2, CC3) and you need to check a string value (e.g. "Happy:DE2|234") to see whether an acronym is found in the string. For a short list of acronyms I would usually create a simple regex that uses a separator (e.g. (AB1|DE2|CC3)) and just look for a match.

But how would I tackle this if there are over 30 acronyms to match against? Would it make sense to use the same technique (ugly), or is there a more efficient and elegant way to accomplish this task?

Keep in mind that the example acronym list and example string are not the actual data format I am working with, just a way to express my challenge.

BTW, I read a related SO question but didn't think it applied to what I was trying to accomplish.

EDIT: I forgot to mention that I need to capture the matched value, hence the choice of regular expressions...

Comments (5)

夏天碎花小短裙 2024-07-19 04:43:58

Personally I don't think 30 is particularly large for a regex so I wouldn't be too quick to rule it out. You can create the regex with a single line of code:

using System;
using System.Text.RegularExpressions;

var acronyms = new[] { "AB", "BC", "CD", "ZZAB" };
var regex = new Regex(string.Join("|", acronyms), RegexOptions.Compiled);
for (var match = regex.Match("ZZZABCDZZZ"); match.Success; match = match.NextMatch())
    Console.WriteLine(match.Value);
// prints ZZAB and CD: the leftmost match wins, and ZZAB starts before AB

So the code is relatively elegant and maintainable. If you know the upper bound for the number of acronyms, I would do some testing; who knows what kind of optimizations are already built into the regex engine. You'll also benefit for free from future regex engine optimizations. Unless you have reason to believe performance will be an issue, keep it simple.

On the other hand, regex may have other limitations. For example, by default, if you have the acronyms AB, BC and CD, it will only return two of them (AB and CD) as matches in "ABCD", because matches cannot overlap. So it's good at telling you that an acronym is present, but you need to be careful about catching multiple matches.
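If overlapping hits matter, one possible workaround is to wrap the alternation in a lookahead so that the engine tries every start position. This is only a sketch: the class and method names (`OverlappingAcronyms`, `FindAll`) are made up for illustration and are not part of the answer above.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not from the original answer.
public static class OverlappingAcronyms
{
    // Returns acronym occurrences in order of start position, including overlapping ones.
    public static List<string> FindAll(string input, IEnumerable<string> acronyms)
    {
        // Escape each acronym so regex metacharacters are matched literally.
        var alternation = string.Join("|", acronyms.Select(a => Regex.Escape(a)));
        // The lookahead consumes no characters, so a match is attempted at every index;
        // the acronym itself is captured in group 1.
        var regex = new Regex("(?=(" + alternation + "))");
        return regex.Matches(input)
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value)
                    .ToList();
    }
}
```

Calling `OverlappingAcronyms.FindAll("ABCD", new[] { "AB", "BC", "CD" })` should yield AB, BC and CD. Note that at any single start position only the first alternative that matches is reported, so two acronyms sharing a prefix (say AB and ABC) would still need extra handling.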

When performance became an issue for me (> 10,000 items) I put the 'acronyms' in a HashSet and then searched each substring of the text (from min acronym length to max acronym length). This was ok for me because the source text was very short. I'd not heard of it before, but at first look the Aho-Corasick algorithm, referred to in the question you reference, seems like a better general solution to this problem.
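The HashSet approach described above might look roughly like the following. It is a minimal sketch under the assumption that ordinal, case-sensitive comparison is wanted; `HashSetScanner` and `FindAll` are illustrative names, not the code I actually used.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not from the original answer.
public static class HashSetScanner
{
    // Checks every substring whose length lies between the shortest and longest acronym.
    public static List<string> FindAll(string text, IEnumerable<string> acronyms)
    {
        var set = new HashSet<string>(acronyms, StringComparer.Ordinal);
        var found = new List<string>();
        if (set.Count == 0) return found;

        int minLen = set.Min(a => a.Length);
        int maxLen = set.Max(a => a.Length);

        for (int start = 0; start < text.Length; start++)
        {
            // Stop growing the window once it would run past the end of the text.
            for (int len = minLen; len <= maxLen && start + len <= text.Length; len++)
            {
                var candidate = text.Substring(start, len);
                if (set.Contains(candidate)) found.Add(candidate);
            }
        }
        return found;
    }
}
```

The work is proportional to the text length times the spread of acronym lengths and does not depend on how many acronyms there are, which is why it held up for a large list and short texts. Each window allocates a substring, so it is less attractive for long inputs.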

悲欢浪云 2024-07-19 04:43:58

If the acronyms have a fixed size (as in the example above), you could calculate a hash for each of them (this can be done once per application lifetime), then split the string into overlapping pieces of that size and calculate hashes for those too. Then all you'd have to do is look up the values from one array in the other.

You could probably build a suffix/prefix tree or something similar from the acronyms and search using that; there are plenty of algorithms on Wikipedia that do just that.

You could also build a deterministic automaton for each of the acronyms, but that is very similar to the previous approach.

蓝天白云 2024-07-19 04:43:58

Why not simply split the string and compare the returned list? It seems like needless overhead to use a REGEX in this case. I know your format may differ, but it would seem that you could:

  • Split the string based on the 'title separator', in your case a colon :
  • Take the 2nd half of the result, the acronym string, and split it based on the acronym separator, in this case a pipe |
  • Finally, iterate over the newly split list of acronyms and compare each to your list of candidates with a nested for loop
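The steps above can be sketched as follows, assuming the example format "Title:ACR|ACR|..." from the question. The names `SplitMatcher` and `FindAll` are invented for illustration, and the sketch swaps the nested for loop for a HashSet lookup, which gives the same result with fewer comparisons.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; assumes the "Title:ACR|ACR" layout from the question's example.
public static class SplitMatcher
{
    public static List<string> FindAll(string input, IEnumerable<string> candidates)
    {
        var known = new HashSet<string>(candidates, StringComparer.Ordinal);

        // Split once on the title separator; parts[1] is everything after the first colon.
        var parts = input.Split(new[] { ':' }, 2);
        if (parts.Length < 2) return new List<string>();

        // Split the remainder on the acronym separator and keep the known tokens.
        return parts[1].Split('|')
                       .Where(token => known.Contains(token))
                       .ToList();
    }
}
```

For "Happy:DE2|234" and the candidates AB1, DE2, CC3 this should return just DE2. It only works when the acronyms are cleanly delimited, which the question says may not hold for the real data.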

EDIT: If you only need to know whether a particular acronym or set of acronyms exists inside a string, use a search-style method instead of .Match() (for example .Search() in libraries that offer it; in .NET, Regex.IsMatch plays that role).

懒猫 2024-07-19 04:43:58

The regex approach seems efficient and elegant enough. Of course, you'll have to watch out for unescaped characters when building the expression, or a failure to compile it because of complexity or size limitations.
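As a rough illustration of the escaping point, the pattern can be built from escaped acronyms so that entries such as "C++" or "A.B" are treated as literal text. The helper names here (`SafeAcronymRegex`, `Build`, `TryFindFirst`) are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not from the original answer.
public static class SafeAcronymRegex
{
    // Builds one alternation with every acronym escaped and wrapped in a capture group.
    public static Regex Build(IEnumerable<string> acronyms)
    {
        var escaped = acronyms.Select(a => Regex.Escape(a));
        return new Regex("(" + string.Join("|", escaped) + ")", RegexOptions.Compiled);
    }

    // Reports the first acronym found in the input, if any.
    public static bool TryFindFirst(Regex regex, string input, out string value)
    {
        var match = regex.Match(input);
        value = match.Success ? match.Groups[1].Value : "";
        return match.Success;
    }
}
```

Without the escaping, "C++" would fail to compile as a pattern and "A.B" would also match text such as "AxB". The Regex constructor throws an ArgumentException for a pattern it cannot parse, so wrapping the call to `Build` in a try/catch is one way to surface that failure.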

Another way to do this would be to construct a trie data structure to represent all the acronyms (this may somewhat duplicate what the regex matcher is doing). As you step through each character in the string, you would create a new pointer to the root of the trie, and advance existing pointers to the appropriate child (if any). You get a match when any pointer reaches a leaf.
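A minimal sketch of that trie walk is below. The type name `AcronymTrie` and its members are illustrative only. One deliberate difference from the description: nodes are flagged as "end of acronym" rather than relying on leaves, so an acronym that is a prefix of another one is still reported.

```csharp
using System.Collections.Generic;

// Hypothetical trie, not from the original answer.
public sealed class AcronymTrie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public string Word; // non-null when an acronym ends at this node
    }

    private readonly Node _root = new Node();

    public AcronymTrie(IEnumerable<string> acronyms)
    {
        foreach (var acronym in acronyms)
        {
            if (acronym.Length == 0) continue; // ignore empty entries
            var node = _root;
            foreach (var c in acronym)
            {
                if (!node.Children.TryGetValue(c, out var child))
                {
                    child = new Node();
                    node.Children[c] = child;
                }
                node = child;
            }
            node.Word = acronym;
        }
    }

    // Returns every occurrence, ordered by the position where each one ends.
    public List<string> FindAll(string text)
    {
        var found = new List<string>();
        var active = new List<Node>();
        foreach (var c in text)
        {
            active.Add(_root); // a new pointer starts at every character
            var next = new List<Node>();
            foreach (var node in active)
            {
                // Pointers with no child for this character are dropped.
                if (node.Children.TryGetValue(c, out var child))
                {
                    if (child.Word != null) found.Add(child.Word);
                    next.Add(child);
                }
            }
            active = next;
        }
        return found;
    }
}
```

Usage would be along the lines of `new AcronymTrie(new[] { "AB", "BC", "CD", "ZZAB" }).FindAll("ZZZABCDZZZ")`, which should report ZZAB, AB, BC and CD. The number of live pointers is bounded by the longest acronym, so the scan stays close to linear in the text length for short acronyms.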

╄→承喏 2024-07-19 04:43:58

Here is what I came up with. I would appreciate any constructive criticism that you could offer...

First, I create an enum that holds each of my acronyms:

enum acronym
{ AB1,DE2,CC3 }

Next I create a string array of the enum:

string[] acronyms = Enum.GetNames(typeof(acronym));

Finally I loop through the string array and perform the Regex.Match method:

foreach (string a in acronyms)
{
    // a is already a string, so no ToString() call is needed
    Match aMatch = Regex.Match(input, a, RegexOptions.None);
    if (aMatch.Success)
    {
        // ... do something with aMatch.Value ...
        break; // stop at the first acronym that matches
    }
}

See anything wrong with that?
