Best way to test a string against a large list of comparables

Posted on 2024-07-12 04:43:58

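Suppose you have a list of acronyms that define a value (e.g. AB1, DE2, CC3) and you need to check a string value (e.g. "Happy:DE2|234") to see whether one of the acronyms occurs in it. For a short list of acronyms I would usually build a simple regular expression using alternation (e.g. (AB1|DE2|CC3)) and just look for a match.

But how should I tackle this if there are more than 30 acronyms to match against? Does it make sense to use the same (ugly) technique, or is there a more efficient and elegant way to accomplish this?

Keep in mind that the example acronym list and example string are not the actual data format I am working with; they are just a way to express my challenge.

By the way, I read a related question but did not think it applied to what I am trying to accomplish.

EDIT: I forgot to mention that I need to capture the matched value, hence the choice of regular expressions...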

夏天碎花小短裙 2024-07-19 04:43:58

Personally I don't think 30 is particularly large for a regex so I wouldn't be too quick to rule it out. You can create the regex with a single line of code:

using System;
using System.Text.RegularExpressions;

var acronyms = new[] { "AB", "BC", "CD", "ZZAB" };
var regex = new Regex(string.Join("|", acronyms), RegexOptions.Compiled);
// Walk through every non-overlapping match, left to right.
for (var match = regex.Match("ZZZABCDZZZ"); match.Success; match = match.NextMatch())
    Console.WriteLine(match.Value);
// prints ZZAB and CD

So the code is relatively elegant and maintainable. If you know the upper bound for the number of acronyms, I would do some testing; who knows what kind of optimizations are already built into the regex engine. You'll also benefit for free from future regex engine optimizations. Unless you have reason to believe performance will be an issue, keep it simple.

On the other hand, regex has other limitations. For example, matches do not overlap by default, so with the acronyms AB, BC and CD it will return only two of them (AB and CD) as matches in "ABCD". So it's good at telling you that an acronym is present, but you need to be careful about catching multiple matches.
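
If overlapping hits matter, one possible workaround is to wrap the alternation in a zero-width lookahead with a capture group. The sketch below is only an illustration: FindOverlapping is a hypothetical helper name, and Regex.Escape is added as a precaution that the original snippet did not include.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: reports acronym occurrences even when they overlap.
static List<string> FindOverlapping(string text, IEnumerable<string> acronyms)
{
    // The lookahead consumes no characters, so the engine retries the
    // alternation at every position instead of skipping past a match.
    var alternation = string.Join("|", acronyms.Select(a => Regex.Escape(a)));
    var pattern = "(?=(" + alternation + "))";

    var found = new List<string>();
    foreach (Match m in Regex.Matches(text, pattern))
        found.Add(m.Groups[1].Value);   // group 1 holds the acronym itself
    return found;
}

Console.WriteLine(string.Join(", ", FindOverlapping("ABCD", new[] { "AB", "BC", "CD" })));
// prints AB, BC, CD
```

Note that this still reports at most one acronym per starting position (the first alternative that matches there), so two acronyms that share a prefix, such as AB and ABC, would need extra handling.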

When performance became an issue for me (> 10,000 items) I put the 'acronyms' in a HashSet and then searched each substring of the text (from min acronym length to max acronym length). This was ok for me because the source text was very short. I'd not heard of it before, but at first look the Aho-Corasick algorithm, referred to in the question you reference, seems like a better general solution to this problem.
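
The HashSet approach described above might look roughly like the following. This is a minimal sketch rather than the code actually used: FindWithHashSet is a hypothetical name, and returning (index, value) pairs is an assumption made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: returns (index, acronym) for every occurrence, overlaps included.
static List<(int Index, string Value)> FindWithHashSet(string text, HashSet<string> acronyms)
{
    var found = new List<(int Index, string Value)>();
    if (acronyms.Count == 0) return found;

    int minLen = acronyms.Min(a => a.Length);
    int maxLen = acronyms.Max(a => a.Length);

    for (int start = 0; start < text.Length; start++)
    {
        // Try every length between the shortest and longest acronym that still fits.
        for (int len = minLen; len <= maxLen && start + len <= text.Length; len++)
        {
            var candidate = text.Substring(start, len);
            if (acronyms.Contains(candidate))
                found.Add((start, candidate));
        }
    }
    return found;
}

var set = new HashSet<string> { "AB", "BC", "CD", "ZZAB" };
foreach (var (index, value) in FindWithHashSet("ZZZABCDZZZ", set))
    Console.WriteLine($"{index}: {value}");
// prints 1: ZZAB, 3: AB, 4: BC, 5: CD (one per line)
```

The work grows with text length times the spread between the shortest and longest acronym, which is why this suits short source texts; the number of acronyms only affects the cost of building the set.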


悲欢浪云 2024-07-19 04:43:58

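If the acronyms have a fixed size (as in the example above), you can compute a hash for each of them (this can be done once per application lifetime), then split the string into overlapping pieces of that size and hash those as well. After that, all you have to do is look up the values from one array in the other.

You could probably build a suffix/prefix tree or something similar from the acronyms and use it for the search; Wikipedia describes plenty of algorithms that do this.

You could also build a deterministic automaton for each acronym, but that is very similar to the previous approach.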


蓝天白云 2024-07-19 04:43:58

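Why not simply split the string and compare the resulting list? Using a regex seems like unnecessary overhead here. I know your format may differ, but it seems you could:

1. Split the string on the "header separator", which in your case is the colon (:).
2. Take the second half of the result, which is the acronym string, and split it on the acronym separator, in this case the pipe (|).
3. Finally, iterate over the newly split list of acronyms and compare each one against the candidate list with a nested for loop.

EDIT: If you only need to know whether a specific acronym or set of acronyms exists in the string, use the .Search() method instead of .Match(). (If you are on .NET, note that its Regex class has no Search method; the existence-only check there is Regex.IsMatch.)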
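
The three steps above can be sketched as follows. The ':' and '|' separators are taken from the question's sample string and are assumptions about the real format; FindBySplitting is a hypothetical name, and a set lookup stands in for the nested loop.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: splits the input and keeps the tokens that are known acronyms.
static List<string> FindBySplitting(string input, ISet<string> acronyms)
{
    // Step 1: split on the header separator and keep everything after it.
    var parts = input.Split(new[] { ':' }, 2);
    if (parts.Length < 2) return new List<string>();

    // Steps 2 and 3: split the remainder on '|' and keep tokens found in the set.
    return parts[1].Split('|')
                   .Where(token => acronyms.Contains(token))
                   .ToList();
}

var known = new HashSet<string> { "AB1", "DE2", "CC3" };
Console.WriteLine(string.Join(", ", FindBySplitting("Happy:DE2|234", known)));
// prints DE2
```

This only works when each acronym appears as a whole token between separators; an acronym embedded in a longer token would be missed, which is where the regex or substring approaches are more forgiving.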


懒猫 2024-07-19 04:43:58

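The regex approach seems efficient and elegant enough. Of course, when building the expression you have to watch out for unescaped characters, and for compilation failures caused by complexity or size limits.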
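
To illustrate the escaping concern, here is a small sketch in C# using Regex.Escape. The acronyms containing metacharacters are made up for the example; nothing in the question says the real data looks like this.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical acronyms that happen to contain regex metacharacters.
var acronyms = new[] { "C++", "A.B", "DE2" };

// Regex.Escape turns each acronym into a literal pattern before joining.
var pattern = string.Join("|", acronyms.Select(a => Regex.Escape(a)));
var regex = new Regex(pattern);

Console.WriteLine(pattern);                             // prints C\+\+|A\.B|DE2
Console.WriteLine(regex.Match("Happy:C++|234").Value);  // prints C++
Console.WriteLine(regex.IsMatch("AxB"));                // prints False
```

Without the escaping step, the dot in "A.B" would match any character (so "AxB" would count as a hit), and the repeated plus signs in "C++" would not be read as literal characters.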

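Another approach is to build a trie data structure representing all the acronyms (this may somewhat duplicate what the regex matcher is already doing). As you step through each character of the string, you create a new pointer to the root of the trie and advance each existing pointer to the appropriate child node (if there is one). Whenever any pointer reaches a leaf, you have a match.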
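
A rough sketch of that pointer-walking idea, in C#, is shown below. FindWithTrie and the list-based node layout are illustrative choices, not something the answer specifies. One deliberate deviation: nodes where an acronym ends are marked explicitly rather than relying on leaves, so an acronym that is a prefix of another is still reported.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper. The trie is kept in a list so no node class is needed:
// children[n] maps a character to a child node id, and terminal[n] holds the
// acronym that ends at node n.
static List<string> FindWithTrie(string text, IEnumerable<string> acronyms)
{
    var children = new List<Dictionary<char, int>> { new Dictionary<char, int>() };
    var terminal = new Dictionary<int, string>();

    foreach (var acronym in acronyms)
    {
        int node = 0;
        foreach (char c in acronym)
        {
            if (!children[node].TryGetValue(c, out int next))
            {
                next = children.Count;
                children.Add(new Dictionary<char, int>());
                children[node][c] = next;
            }
            node = next;
        }
        terminal[node] = acronym;
    }

    var found = new List<string>();
    var active = new List<int>();              // current pointers into the trie
    foreach (char c in text)
    {
        active.Add(0);                         // new pointer at the root for this character
        var survivors = new List<int>();
        foreach (int node in active)
        {
            if (!children[node].TryGetValue(c, out int next)) continue;  // pointer dies
            if (terminal.TryGetValue(next, out var hit)) found.Add(hit);
            survivors.Add(next);
        }
        active = survivors;
    }
    return found;
}

Console.WriteLine(string.Join(", ", FindWithTrie("ZZZABCDZZZ", new[] { "AB", "BC", "CD", "ZZAB" })));
// prints ZZAB, AB, BC, CD
```

Matches come out ordered by where they end in the text, and overlapping acronyms are all reported. The Aho-Corasick algorithm mentioned elsewhere in this thread refines the same structure so that a single pointer suffices.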


╄→承喏 2024-07-19 04:43:58

Here is what I came up with. I would appreciate any constructive criticism that you could offer...

First, create an enum that holds each of my acronyms:

enum acronym
{ AB1,DE2,CC3 }

Next I create a string array of the enum:

string[] acronyms = Enum.GetNames(typeof(acronym));

Finally, I loop through the string array and perform the Regex.Match method:

foreach (string a in acronyms)
{
    Match aMatch = Regex.Match(input, a.ToString(), RegexOptions.None);
    if (aMatch.Success)
    {
        ...<do something>...
        break;
    }
}

See anything wrong with that?
