Most efficient way to find partial string matches in a large file of strings (Python)
I downloaded the Wikipedia article titles file which contains the name of every Wikipedia article. I need to search for all the article titles that may be a possible match. For example, I might have the word "hockey", but the Wikipedia article for hockey that I would want is "Ice_hockey". It should be a case-insensitive search too.
I'm using Python. Is there a more efficient way than a line-by-line search? Ideally I'll be performing this search 500 to 1,000 times per minute. If line-by-line is my only option, are there any optimizations I can make?
I think there are several million lines in the file.
Any ideas?
Thanks.
Comments (3)
If you've got a fixed data set and variable queries, then the usual technique is to reorganise the data set into something that can be searched more easily. At an abstract level, you could break up each article title into individual lowercase words, and add each of them to a Python dictionary data structure. Then, whenever you get a query, convert the query word to lower case and look it up in the dictionary. If each dictionary entry value is a list of titles, then you can easily find all the titles that match a given query word.
This works for straightforward words, but you will have to consider whether you want to do matching on similar words, such as finding "smoking" when the query is "smoke".
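A minimal sketch of that word index, assuming titles use underscores between words as in "Ice_hockey". The names `build_index` and `search` are made up for illustration, and the short `titles` list stands in for the real file:

```python
import re
from collections import defaultdict


def build_index(titles):
    """Map each lowercase word to the set of titles that contain it."""
    index = defaultdict(set)
    for title in titles:
        # Split on underscores and any other non-alphanumeric characters.
        for word in re.split(r"[\W_]+", title.lower()):
            if word:
                index[word].add(title)
    return index


def search(index, query):
    """Return all titles containing the query as a whole word, ignoring case."""
    return sorted(index.get(query.lower(), ()))


# Stand-in data; in practice the titles would be read from the dump file once.
titles = ["Ice_hockey", "Field_hockey", "Hockey", "Smoking"]
index = build_index(titles)
print(search(index, "Hockey"))  # ['Field_hockey', 'Hockey', 'Ice_hockey']
print(search(index, "smoke"))   # [] because only whole words are indexed
```

Building the index costs one pass over the file, after which each query is a single dictionary lookup. The empty result for "smoke" shows the limitation mentioned above: matching "smoking" would need stemming or some other normalization on top of this.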
Greg's answer is good if you want to match on individual words. If you want to match on substrings you'll need something a bit more complicated, like a suffix tree (http://en.wikipedia.org/wiki/Suffix_tree). Once constructed, a suffix tree can efficiently answer queries for arbitrary substrings, so in your example it could match "Ice_Hockey" when someone searched for "hock".
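A real suffix tree is fairly involved to implement, so the sketch below swaps in a simpler relative: a sorted list of every suffix of every title, searched with binary search. It is not a suffix tree and uses much more memory than one would, but it shows the same idea, namely that any substring of a title is a prefix of one of its suffixes. The function names are hypothetical:

```python
import bisect


def build_suffix_index(titles):
    """Return a sorted list of (lowercase suffix, title position) pairs."""
    suffixes = []
    for i, title in enumerate(titles):
        lowered = title.lower()
        for start in range(len(lowered)):
            suffixes.append((lowered[start:], i))
    suffixes.sort()
    return suffixes


def find_substring(suffixes, titles, query):
    """Return all titles containing the query as a substring, ignoring case."""
    q = query.lower()
    # The 1-tuple (q,) sorts before every (q, i), so this is the first
    # suffix that is >= q. All suffixes starting with q follow contiguously.
    k = bisect.bisect_left(suffixes, (q,))
    matches = set()
    while k < len(suffixes) and suffixes[k][0].startswith(q):
        matches.add(titles[suffixes[k][1]])
        k += 1
    return sorted(matches)


# Stand-in data for illustration only.
titles = ["Ice_Hockey", "Field_hockey", "Smoking"]
suffixes = build_suffix_index(titles)
print(find_substring(suffixes, titles, "hock"))  # ['Field_hockey', 'Ice_Hockey']
```

With several million titles this list would hold one entry per character of input, so treat it as a way to understand the approach rather than something to run on the full file as-is.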
I'd suggest you put your data into an SQLite database and use the SQL LIKE operator for your searches.
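A rough sketch of what that could look like with Python's built-in `sqlite3` module. The table name `titles`, the helper names, and the in-memory database are assumptions for the example; a real setup would load the dump into a file-backed database once. Note that `_` is itself a LIKE wildcard, so the query is escaped before use:

```python
import sqlite3


def build_db(titles):
    """Load titles into an in-memory SQLite table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE titles (title TEXT)")
    conn.executemany("INSERT INTO titles VALUES (?)", ((t,) for t in titles))
    conn.commit()
    return conn


def search_like(conn, query):
    """Return titles containing the query; LIKE ignores case for ASCII letters."""
    # Escape the LIKE wildcards so "_" and "%" in the query match literally.
    escaped = (
        query.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    rows = conn.execute(
        "SELECT title FROM titles WHERE title LIKE ? ESCAPE '\\' ORDER BY title",
        ("%" + escaped + "%",),
    )
    return [row[0] for row in rows]


# Stand-in data for illustration only.
conn = build_db(["Ice_hockey", "Field_hockey", "Hockey", "Smoking"])
print(search_like(conn, "HOCKEY"))  # ['Field_hockey', 'Hockey', 'Ice_hockey']
```

One caveat: a pattern that begins with `%` cannot use an ordinary index, so each query still scans the whole table. That may or may not be fast enough for several million rows at the stated query rate; it would be worth timing before committing to it.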