Regular expression search engine

Published 2024-10-10 02:37:31


Comments (7)

夜空下最亮的亮点 2024-10-17 02:37:31

Google Code Search allows you to search using a regular expression.

As far as I am aware, no such search engine exists for general searches.

江城子 2024-10-17 02:37:31

There are a few problems with regular expressions that currently prohibit employing them in real-world search scenarios. The most pressing is that the entire cached Internet would have to be matched against your regex, which would take significant computing resources; indexes seem pretty much useless in a regex context, because regexes can be unbounded (/fo*bar/).
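
To make the resource argument concrete, here is a tiny illustrative Python sketch (the corpus and pattern are made up): a keyword query can be answered with a single inverted-index lookup, whereas an unbounded pattern like /fo*bar/ has no one term to look up, so a naive engine would have to run the regex over every document.

```python
import re

# Toy corpus standing in for "the entire cached Internet".
documents = {
    1: "foobar is a classic metasyntactic variable",
    2: "the quick brown fox",
    3: "fbar appears when the o is optional",
}

# A keyword query can use an inverted index: term -> list of document ids.
inverted_index = {}
for doc_id, text in documents.items():
    for term in set(text.split()):
        inverted_index.setdefault(term, []).append(doc_id)

print(inverted_index.get("foobar"))  # [1] -- a single dictionary lookup

# An unbounded regex such as /fo*bar/ matches infinitely many distinct
# strings (fbar, fobar, foobar, ...), so there is no single term to look
# up; a naive engine has to run the pattern over every document.
pattern = re.compile(r"fo*bar")
matches = [doc_id for doc_id, text in documents.items() if pattern.search(text)]
print(matches)  # [1, 3] -- the work grows with the size of the corpus
```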

难忘№最初的完美 2024-10-17 02:37:31

I don't have a specific engine to suggest.

However, if you could live with a subset of regex syntax, a search engine could store additional tokens to efficiently match rather complex expressions. Solr/Lucene allows for custom tokenization, where the same word can generate multiple tokens under various rule sets.

I'll use my name as an example: "Mark marks the spot."

Case insensitive with stemming: (mark, mark, spot)

Case sensitive with no stemming: (Mark, marks, spot)

Case sensitive with NLP thesaurus expansion: ( [Mark, Marc], [mark, indicate, to-point], [spot, position, location, beacon, coordinate] )

And now evolving towards your question, case insensitive, stemming, dedupe, autocomplete prefix matching: ( [m, ma, mar, mark], [s, sp, spo, spot] )

And if you wanted "substring" style matching it would be: ( [m, ma, mar, mark, a, ar, ark, r, rk, k], [s, sp, spo, spot, p, po, pot, o, ot, t] )
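
As a rough illustration of those last two token forms, here is a plain-Python sketch (not actual Solr/Lucene analyzer code; in Solr this kind of thing is usually done with edge n-gram / n-gram token filters):

```python
def prefix_tokens(word):
    """Edge n-grams for autocomplete-style prefix matching: mark -> m, ma, mar, mark."""
    w = word.lower()
    return [w[:i] for i in range(1, len(w) + 1)]

def substring_tokens(word):
    """All n-grams for "substring" style matching: mark -> m, ma, mar, mark, a, ar, ark, r, rk, k."""
    w = word.lower()
    return [w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1)]

print(prefix_tokens("Mark"))     # ['m', 'ma', 'mar', 'mark']
print(substring_tokens("Mark"))  # ['m', 'ma', 'mar', 'mark', 'a', 'ar', 'ark', 'r', 'rk', 'k']
```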

A single search index can contain all of these different forms of tokens, and the engine chooses which ones to use for each type of search.

Let's try the word "Mississippi" with regex-style literal tokens: [ m, m?, m+, i, i?, i+, s, ss, s+, ss+ ... ] etc.

The actual rules would depend on the regex subset, but hopefully the pattern is becoming clearer. You would extend even further to match other regex fragments, and then use a form of phrase searching to locate matches.
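
One hypothetical way such literal-run tokens could be generated, sketched in Python (my own guess at the rules, which would really depend on the regex subset you choose):

```python
from itertools import groupby

def literal_run_tokens(word):
    """Hypothetical token generator: for each maximal run of a repeated
    character, emit the literal run plus simple regex-fragment forms so a
    query parser could translate fragments like s+ into token lookups."""
    tokens = []
    for ch, group in groupby(word.lower()):
        run = "".join(group)
        tokens += [run, ch, ch + "?", ch + "+"]
    # De-duplicate while preserving order.
    seen = set()
    return [t for t in tokens if not (t in seen or seen.add(t))]

print(literal_run_tokens("Mississippi"))
# ['m', 'm?', 'm+', 'i', 'i?', 'i+', 'ss', 's', 's?', 's+', 'pp', 'p', 'p?', 'p+']
```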

Of course the index would be quite large, BUT it might be worth it, depending on the project's requirements. And you'd also need a query parser and application logic.

I realize if you're looking for a canned engine this doesn't do it, but in terms of theory this is how I'd approach it (assuming it's really a requirement!). If all somebody wanted was substring matching and flexible wildcard matching, you could get away with far fewer tokens in the index.

In terms of canned apps, you might check out OpenGrok, used for source code indexing, which is not full regex, but understands source code pretty well.

冰雪之触 2024-10-17 02:37:31

If regex takes up too many resources, why not charge for its use by CPU time instead of making it completely unavailable? I'm sure some people would pay to use it (and of course you'd offer an explanation for the charge, in terms of carbon footprint and CPU resources). Google does support the expansive * in its searches, e.g. *go, go*, or intitle:"*go"; here it is: http://www.hackcollege.com/blog/2011/11/23/infographic-get-more-out-of-google.html

何以心动 2024-10-17 02:37:31

A very good article on regex search over a trigram index, by Russ Cox:

http://swtch.com/~rsc/regexp/regexp4.html
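
The core idea, in a heavily simplified Python sketch of my own (not Cox's code): index every document by its trigrams, derive the trigrams that any match must contain from a literal in the query, intersect their posting sets to get candidates, and only run the full regex on those candidates.

```python
import re
from collections import defaultdict

def trigrams(text):
    return {text[i:i + 3] for i in range(len(text) - 2)}

documents = {
    1: "google code search indexed source code with trigrams",
    2: "regular expressions are compiled to automata",
    3: "the trigram index narrows the candidate set",
}

# Index: trigram -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in documents.items():
    for t in trigrams(text):
        index[t].add(doc_id)

def search(required_literal, regex):
    """Prune with trigrams of a literal every match must contain, then confirm with the regex."""
    candidates = None
    for t in trigrams(required_literal):
        postings = index.get(t, set())
        candidates = postings if candidates is None else candidates & postings
    candidates = candidates if candidates is not None else set(documents)
    pat = re.compile(regex)
    return [d for d in sorted(candidates) if pat.search(documents[d])]

# Any match of /trigram(s)?/ must contain the literal "trigram".
print(search("trigram", r"trigram(s)?"))  # [1, 3]
```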

一城柳絮吹成雪 2024-10-17 02:37:31

http://www.google.com/codesearch has been shut down...

Regular expression search takes a lot of resources and thus is not affordable for popular search engines.

病女 2024-10-17 02:37:31

Globalogiq has an HTML Source Code Search where you can search with regular expressions. It's not free though.
