模糊正则表达式

发布于 2024-08-23 15:07:48 字数 348 浏览 6 评论 0原文

在我的工作中,我使用近似字符串匹配算法(例如 Damerau-Levenshtein 距离)取得了很好的结果,使我的代码不易出现拼写错误。

现在我需要将字符串与简单的正则表达式进行匹配,例如 TV Schedule for \d\d​​ (Jan|Feb|Mar|...)。这意味着字符串 TV Schedule for 10 Jan 应返回 0,而 T Schedule for 10. Jan 应返回 2。

这可以通过在正则表达式中生成所有字符串来完成(在本例中为 100x12) 并找到最佳匹配,但这并不实用。

您对如何有效地做到这一点有什么想法吗?

In my work I have with great results used approximate string matching algorithms such as Damerau–Levenshtein distance to make my code less vulnerable to spelling mistakes.

Now I have a need to match strings against simple regular expressions such TV Schedule for \d\d (Jan|Feb|Mar|...). This means that the string TV Schedule for 10 Jan should return 0 while T Schedule for 10. Jan should return 2.

This could be done by generating all strings in the regex (in this case 100x12) and find the best match, but that doesn't seam practical.

Do you have any ideas how to do this effectively?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

梅倚清风 2024-08-30 15:07:48

我找到了TRE库,它似乎能够精确地进行正则表达式的模糊匹配。示例: http://hackerboss.com/approximate-regex-matching-in-python/
但它只支持插入、删除和替换。没有转置。但我想这工作正常。

我在以下文件上尝试了附带的 agrep 工具和正则表达式:

TV Schedule for 10Jan
TVSchedule for Jan 10
T Schedule for 10 Jan 2010
TV Schedule for 10 March
Tv plan for March

并得到了

$ agrep -s -E 100 '^TV Schedule for \d\d (Jan|Feb|Mar)

非常感谢您的所有建议。

filename 1:TV Schedule for 10Jan 8:TVSchedule for Jan 10 7:T Schedule for 10 Jan 2010 3:TV Schedule for 10 March 15:Tv plan for March

非常感谢您的所有建议。

I found the TRE library, which seems to be able to do exactly fuzzy matching of regular expressions. Example: http://hackerboss.com/approximate-regex-matching-in-python/
It only supports insertion, deletion and substitution though. No transposition. But I guess that works ok.

I tried the accompanying agrep tool with the regexp on the following file:

TV Schedule for 10Jan
TVSchedule for Jan 10
T Schedule for 10 Jan 2010
TV Schedule for 10 March
Tv plan for March

and got

$ agrep -s -E 100 '^TV Schedule for \d\d (Jan|Feb|Mar)

Thanks a lot for all your suggestions.

filename 1:TV Schedule for 10Jan 8:TVSchedule for Jan 10 7:T Schedule for 10 Jan 2010 3:TV Schedule for 10 March 15:Tv plan for March

Thanks a lot for all your suggestions.

凶凌 2024-08-30 15:07:48

另请参阅:Python 正则表达式(较新版本,2014 年 10 月< /a>)(在文档中搜索“fuzzy”)。

如果您不是 Python 爱好者(我也不是),您可以将您的代码编译为 C (exe/dll)。然后你就可以使用你的 dll,甚至可以使用旧的 vb6(等等)。

其他库可供选择:

  • TRE/agrep('classic, good, old and fast)(搜索'agrep Performance'),但你需要编写POSIX兼容的正则表达式(搜索'正则表达式信息POSIX')
    当然,所有使用 TRE 的库/示例都有此限制(搜索“hackerboss approximation regex matches in python”)。对于海量数据:搜索“agrep 算法的快速 CUDA 实现”。
  • FREJ (Java) - 一些(更多)限制(例如,不向前看/向后看)
  • fuzzy-wuzzy (基于 Python) - 值得一看,未经测试...

还搜索:

  • “Comparison_of_regular_expression_engines”
  • “regular-expressions.info 工具”

(抱歉无法发布真实链接)

Also see: The Python regex (newer version, Oct '14) (search for "fuzzy" inside the document).

If you're not a Python guy (neither I am), you could compile your code to C (exe/dll). Then you would be able to use your dll even from good old vb6 (and the like).

Other libraries to choose from:

  • TRE/agrep ('classic, good, old and fast) (search for 'agrep performace'), but you need to write POSIX compatible regex (search for 'regular expressions info posix')
    Of course, all libraries/examples using TRE have this limitation (search for 'hackerboss approximate regex matching in python'). For massive data: search for 'A fast CUDA implementation of agrep algorithm'.
  • FREJ (Java) - some (more) limitations (for instance, no look ahead/look behind)
  • fuzzy-wuzzy (Python-based) - worth looking at, not tested...

Search also for:

  • 'Comparison_of_regular_expression_engines'
  • 'regular-expressions.info tools'

(sorry for not be able to post real links)

撧情箌佬 2024-08-30 15:07:48

我只是使用 regex 模块: '替代正则表达式模块,更换重新。它提供了熟悉的 re 功能,但包含模糊匹配选项,以及对 re 的其他几项改进。

对于 Windows 二进制文件,请参阅此资源

I just use the regex module: 'Alternative regular expression module, to replace re.' It provides the familiarity of re but includes options for fuzzy matching, along with several other improvements on re.

For Windows binaries, see this resource.

桜花祭 2024-08-30 15:07:48

这里是有关您所问问题的资源。对于一家公司来说,这有点像预告片。更有用的可能是这篇论文。我看到了一个受该论文启发的实现,它可以在大型数据集上进行模糊搜索,偏向于特殊语言(例如阿拉伯语与英语)。

一般来说,您将无法执行您所要求的操作。您可以通过用等价类替换字符来使正则表达式搜索变得模糊,也可以在数据库中搜索由编辑距离定义的近似匹配。尝试扩展正则表达式后面的 (n)DFA 以包含按距离排列的近似匹配将很快变得极其复杂。

Here is a resource on the question you are asking. It is a bit of a teaser for a company. More useful might be this paper. I've seen an implementation inspired by the paper that could do a fuzzy search, biased for special language (e.g. Arabic vs. English), on a large dataset.

In general, you won't be able to do what you asked about. You can make a regexp search fuzzy by replacing characters with equivalence classes, or you can search a database for near-matches defined by Levenshtein distance. Trying to expand the (n)DFA behind a regexp to include near-matches by distance would rapidly become impossibly complex.

深陷 2024-08-30 15:07:48

您是否考虑过使用词法分析器

我从未真正使用过,所以我帮不上什么忙,但听起来很合适!

Have you considered using a lexer?

I've never actually used one so i can't be much help, but it sounds like it fits!

樱花落人离去 2024-08-30 15:07:48

我开始实现一个名为 prex 的 Java 工具,用于近似正则表达式匹配。该工具确定字符串 s 与正则表达式 r 的匹配程度,ies 上插入、删除和替换了多少次 至少是必需的(最小成本),使得生成的字符串 s' 可以被 r 接受。如果您有兴趣,可以查看 https://github.com/julianthome/prex 中的代码。我很乐意得到一些反馈。请注意,该方法仍然有点慢,但我目前正在结合一些启发式方法来提高其性能。

I started to implement a Java tool called prex for approximate regular expression matching. The tool determines how far a string s is from matching a regular expression r, i.e. how many insertions, deletions and substitutions on s are at least required (minimum cost) such that the resulting string s' is acceptable by r. If your are interested, you can check out the code from https://github.com/julianthome/prex. I would be very happy to get some feedback. Note, that the approach is still a bit slow, but I am currently incorporating some heuristics for improving its performance.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文