How do I efficiently filter a string against a long list of words in Python/Django?
Stackoverflow implemented its "Related Questions" feature by taking the title of the current question being asked and removing from it the 10,000 most common English words according to Google. The remaining words are then submitted as a fulltext search to find related questions.
I want to do something similar in my Django site. What is the best way to filter a string (the question title in this case) against a long list of words in Python? Any libraries that would enable me to do that efficiently?
You could do this very simply using the set and string functionality in Python and see how it performs (premature optimisation being the root of all evil!):
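A minimal sketch of that set-based approach (the word list here is a tiny stand-in; in practice you would load the 10,000-word list from a file):

```python
# Tiny sample standing in for the 10,000 most common English words.
common_words = {"how", "do", "i", "a", "the", "in", "to", "is"}

def filter_title(title):
    """Return the words of the title that are not in the common-word set."""
    return [w for w in title.lower().split() if w not in common_words]

print(filter_title("How do I filter a string in Django"))
# ['filter', 'string', 'django']
```

Set membership tests are O(1) on average, so even a 10,000-word set is checked in time proportional to the title length.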
I think a much simpler solution that is still reasonably fast is to use SQLite and regular expressions.

Put the long list of words in an SQLite table and build a B-tree index. This gives you log(n)-time EXISTS queries. Split the shorter string with a regular expression and loop over the words, running an EXISTS query for each of them.

You can stem the words first with the Porter stemmer from NLTK.
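A hedged sketch of the SQLite approach described above; the table and column names are illustrative, and an in-memory database with a handful of words stands in for the real list:

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
# PRIMARY KEY on a TEXT column gives the B-tree index for log(n) lookups.
conn.execute("CREATE TABLE common_words (word TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO common_words VALUES (?)",
                 [("how",), ("do",), ("i",), ("a",), ("the",), ("in",)])

def uncommon_words(title):
    words = re.findall(r"[a-z']+", title.lower())  # split with a regex
    result = []
    for w in words:
        row = conn.execute(
            "SELECT 1 FROM common_words WHERE word = ?", (w,)).fetchone()
        if row is None:                            # not a common word
            result.append(w)
    return result

print(uncommon_words("How do I filter a string"))
# ['filter', 'string']
```

A Porter-stemming pass with NLTK's `PorterStemmer().stem(w)` could be applied to each word before the lookup, as the answer suggests.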
While I hate to discourage the use of something cool like a trie, have you thought about doing it in straight Python? I wrote a simple benchmark using the corncob wordlist and performance wasn't that bad.

Note that I'm testing the worst case, where the word isn't in the file. I'm also including the time to load and process the file; if you were to cache it in memory, that cost would disappear. One further thing to note is that my machine is basically crap. C is fast, but most of the code involved in searching a list is written in C anyway. Finally, this test covers pretty much every word in the English language. If you just want 10,000, I think that's cake.
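The original benchmark code isn't shown; here is a hedged reconstruction of the kind of test described: worst-case misses against a plain Python set, with a synthetic wordlist of roughly corncob size standing in for the real file:

```python
import time

# Synthetic wordlist of ~58,000 entries, standing in for the corncob list.
wordlist = "\n".join("word%d" % i for i in range(58000))

start = time.perf_counter()
words = set(wordlist.split())       # "load and process the file"
for _ in range(10000):
    "zzzznotaword" in words         # worst case: the word is never present
elapsed = time.perf_counter() - start
print("10,000 misses in %.4f seconds" % elapsed)
```

On any modern machine the lookups themselves are a negligible fraction of the total; building the set dominates, which is the cost that caching would remove.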
I don't know which method is used by SO, but:
I suppose a fast (and very simplistic) way of doing this is by going back to C, and checking them one by one, maybe with a KMP algorithm.
Another (not so simple) way of doing this is by keeping a trie with those 10,000 words and searching the text using that. This would be super-fast, but fairly hard to implement. If you are interested, I have a dummy implementation in C++.
EDIT
Looking back at it, I see I've only used fstream, so this could be modified easily for C, and you'd then be able to integrate it with Python easily. This is the source:
It takes input from trie.in, text formatted with one operation per line, and produces the corresponding output:
0 w - add the word w in the list (could be multiple times)
1 w - delete one record of the word w from the list (could be multiple times)
2 w - print how many w words are there in the list
3 w - print the length of the longest common prefix of w with any other word in the list
Oh, and sorry for the poor formatting, this was done for training.
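Since the C++ source isn't shown, here is a hedged Python sketch of a trie supporting the four operations listed above; the node layout (per-node end-of-word and pass-through counters) is an assumption:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.words = 0     # words ending exactly at this node
        self.through = 0   # words passing through this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, word):                    # operation "0 w"
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
            node.through += 1
        node.words += 1

    def delete(self, word):                 # operation "1 w"
        node = self.root
        for ch in word:
            node = node.children[ch]
            node.through -= 1
        node.words -= 1

    def count(self, word):                  # operation "2 w"
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return 0
        return node.words

    def longest_common_prefix(self, word):  # operation "3 w"
        node, depth = self.root, 0
        for ch in word:
            node = node.children.get(ch)
            if node is None or node.through == 0:
                break
            depth += 1
        return depth

t = Trie()
t.add("apple")
t.add("apply")
print(t.count("apple"), t.longest_common_prefix("applesauce"))  # 1 5
```

For the filtering use case only `add` and a membership check are needed; each lookup costs O(length of the word), independent of how many words are stored.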
If some false positives are OK (a Bloom filter can report false positives but never false negatives), search for Bloom filter on Wikipedia.

If not, look at CDB (yum install tinycdb on Fedora; no Python API at the moment).
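A minimal Bloom filter sketch, using Python's built-in hash with integer salts as the k hash functions; the bit-array size and hash count here are illustrative, not tuned:

```python
class BloomFilter:
    def __init__(self, size=100000, hashes=4):
        self.size = size
        self.hashes = hashes
        self.bits = bytearray(size)

    def _positions(self, word):
        # Derive k hash positions by salting the built-in hash.
        for salt in range(self.hashes):
            yield hash((salt, word)) % self.size

    def add(self, word):
        for p in self._positions(word):
            self.bits[p] = 1

    def __contains__(self, word):
        # All bits set => "probably present"; any bit clear => definitely absent.
        return all(self.bits[p] for p in self._positions(word))

bf = BloomFilter()
for w in ("the", "of", "and"):
    bf.add(w)
print("the" in bf)        # True
print("xylophone" in bf)  # almost certainly False (false positives possible)
```

For a fixed 10,000-word list the filter can be sized so the false-positive rate is well under 1%, at a fraction of the memory of a full set.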
How about using Python's very nice filter function:
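The answer's code isn't shown; a plausible sketch using filter, with a tiny sample standing in for the real common-word set:

```python
# Sample stand-in for the 10,000-word common-word set.
common_words = {"how", "to", "a", "the", "in"}

title = "How to filter a string in Django"
uncommon = list(filter(lambda w: w not in common_words,
                       title.lower().split()))
print(uncommon)  # ['filter', 'string', 'django']
```

This is equivalent to the list-comprehension version; which reads better is largely a matter of taste.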