一般来说,什么更快,是通过文件进行 grep 还是通过 blob 运行 SQL LIKE %x% 查询?
假设我正在设计一个工具,可以将代码片段保存在 PostgreSQL/MySQL 数据库或文件系统中。我想搜索这些片段。使用像Sphinx这样的搜索引擎似乎不太实用,因为我们在搜索代码时需要代码的精确文本匹配。
grep 和 ack 一直工作得很好,但是将内容存储在数据库中可以使大量内容在某些方面更易于管理。我想知道在目录树上递归运行 grep 与在具有文本 blob 的同等数量的记录上运行 SQL 的 LIKE 或 MySQL 的 REGEXP 函数之类的查询相比,相对性能如何。
Say I'm designing a tool that would save code snippets either in a PostgreSQL/MySQL database or on the file system. I want to search through these snippets. Using a search engine like Sphinx doesn't seem practical because we need exact text matches of code when searching code.
grep
and ack
and has always worked great, but storing stuff in a database makes a large collection of stuff more manageable in certain ways. I'm wonder what the relative performance of running grep
recursively over a tree of directories is compared to running a query like SQL's LIKE or MySQL's REGEXP function over an equivalent number of records with TEXT blobs is.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
如果您有 100 万个文件需要 grep 遍历,您将(据我所知)使用正则表达式遍历每个文件。
出于所有意图和目的,如果您使用 LIKE 运算符或正则表达式对表行进行批量查询,您最终将会对表行执行相同的操作。
然而,我自己使用 grep 的经验是,我很少查找不包含至少一个完整单词的内容,因此您可以利用数据库来减少您正在搜索的集合。
MySQL 具有本机全文搜索功能,但我建议不要使用,因为这意味着您没有使用 InnoDB。
您可以在此处阅读 Postgres 的相关内容:
http://www.postgresql.org/ docs/current/static/textsearch.html
在 tsvector 列上创建索引后,您可以分两步执行“grep”,第一步是立即查找可能模糊地符合条件的行,然后是您的真实行标准:
这将比 grep 所做的任何事情都要快得多。
If you've 1M files to grep through, you will (best I'm aware) go through each one with a regular expression.
For all intents and purposes, you're going to end up doing the same thing over table rows if you mass-query them using a LIKE operator or a regular expression.
My own experience with grep is that I seldom look for something that doesn't contain at least one full word, however, so you might be able to take advantage of a database to reduce the set in which you're searching.
MySQL has native full text search features, but I'd recommend against because they mean you're not using InnoDB.
You can read about those from Postgres here:
http://www.postgresql.org/docs/current/static/textsearch.html
After creating an index on a tsvector column, you can then do your "grep" in two steps, one to immediately find rows that might vaguely qualify, followed by another on your true criteria:
That will be significantly faster than anything grep can do.
我无法比较它们,但两者都需要很长时间。我的猜测是 grep 会更快。
但是MySQL支持全文索引和搜索,这将比 grep 更快——我再次猜测。
另外,我不明白,Sphinx或Lucene有什么问题。无论如何,这里有一个 MySQL、Sphinx 和 Lucene 的基准
I can't compare them but both will take long. My guess is grep will be faster.
But MySQL support full text indexing and searching, which will be faster then grep -- i guess again.
Also, I did not understand, what is the problem with Sphinx or Lucene. Anyway, here's a benchmark for MySQL, Sphinx and Lucene
互联网似乎猜测 grep 使用 Boyer-Moore,这将使查询时间加法(而不是乘法)依赖于查询大小。但这并不那么相关。
我认为这对于一次性搜索来说几乎是最佳的。但在您的情况下,您可以做得更好,因为您进行了重复搜索,您可以利用其结构(例如,通过在查询中索引某些常见子字符串),正如 bpgergo 所暗示的那样。
另外,我不确定您正在考虑使用的正则表达式引擎是否针对非特殊查询进行了优化,您可以尝试一下并查看。
您可能希望将正在搜索的所有文件保留在内存中,以避免基于硬盘的速度减慢。除非您正在搜索大量文本,否则这应该有效。
The internet seems guess that
grep
uses Boyer-Moore, which will make the query time depend additively (not multiplicatively) on the query size. This isn't that relevant though.I think it's near-optimal for a one-time search. But in your case you can do better since you have repeated searches, which you can exploit the structure of (e.g. by indexing certain common substrings in your query), as bpgergo hints at.
Also I'm not sure the regular expression engine you are thinking of using is optimized for a non-special query, you could try it and see.
You may wish to keep all the files you're searching through in memory to avoid harddisk-based slowdown. This should work unless you are searching a staggering amount of text.
如果您想要代码的全文索引,我会推荐 Russ Cox 的代码搜索工具
https://code.google.com/p/codesearch/
这就是 Google 代码搜索的方式工作过
http://swtch.com/~rsc/regexp/regexp4.html
If you want a full-text index on code I would recommend Russ Cox’ codesearch tools
https://code.google.com/p/codesearch/
This is How Google Code Search Worked
http://swtch.com/~rsc/regexp/regexp4.html