Searching for many expressions in many documents using Python
I often have to search for many words (1,000+) in many documents (1,000,000+), and I need the position of each word that matches.

A slow, naive version of the code is:

    for text in documents:
        for word in words:
            position = text.find(word)
            if position != -1:  # find() returns -1 when the word is absent
                print(word, position)

Is there a fast Python module for doing this, or should I implement something myself?
For fast exact-text, multi-keyword search, try acora: http://pypi.python.org/pypi/acora/1.4

If you want a few extras (result relevancy, near-matches, word-rooting, etc.), Whoosh might be better: http://pypi.python.org/pypi/Whoosh/1.4.1

I don't know how well either scales to millions of docs, but it wouldn't take long to find out!
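If pulling in a dependency is not an option, the same single-pass, multi-keyword idea can be roughly approximated with the standard-library re module by compiling all the words into one alternation pattern (a sketch only; the sample words and documents are made up, and this won't match acora's speed on very large keyword sets):

    import re

    words = ["fox", "dog", "lazy"]
    documents = ["the quick brown fox jumps over the lazy dog"]

    # One pattern matching any of the words; longest words first so that
    # overlapping alternatives prefer the longest match. Each document is
    # then scanned once, regardless of how many keywords there are.
    pattern = re.compile(
        "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))
    )

    for text in documents:
        for match in pattern.finditer(text):
            print(match.group(), match.start())  # prints: fox 16, lazy 35, dog 40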
What's wrong with grep?

So you have to use Python? How about:

which is insane. But hey! You are using Python ;-)
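For what it's worth, driving grep from Python is straightforward with subprocess (an illustration, not necessarily the snippet the answerer had in mind; the corpus file and keywords are made up):

    import subprocess

    # Write a tiny sample corpus standing in for the real documents.
    with open("corpus.txt", "w") as f:
        f.write("the quick brown fox jumps over the lazy dog")

    # grep -b prefixes each match with its byte offset, -o prints only the
    # matched text, and -F treats the patterns as fixed strings, not regexes.
    result = subprocess.run(
        ["grep", "-obF", "-e", "fox", "-e", "dog", "corpus.txt"],
        capture_output=True, text=True,
    )
    for line in result.stdout.splitlines():
        offset, word = line.split(":", 1)
        print(word, offset)  # prints: fox 16, dog 40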
Assuming documents is a list of strings, you can use text.index(word) to find the first occurrence and text.count(word) to find the total number of occurrences. Your pseudocode seems to assume words will only occur once, so text.count(word) may be unnecessary.
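A minimal illustration of that approach (using str.find instead of str.index so a miss returns -1 rather than raising ValueError; that substitution and the sample data are mine, not the answerer's):

    documents = ["the quick brown fox", "no matches here"]
    words = ["fox", "dog"]

    for text in documents:
        for word in words:
            position = text.find(word)  # -1 when the word is absent
            if position != -1:
                print(word, position, "count:", text.count(word))  # prints: fox 16 count: 1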