Testing whether a string contains one of several thousand substrings
I'm going to be running through live Twitter data and attempting to pull out tweets that mention, for example, movie titles. Assuming I have a list of ~7000 hard-coded movie titles I'd like to match against, what's the best way to select the relevant tweets? This project is in its infancy, so I'm open to looking into any solution (i.e. language agnostic). Any help would be greatly appreciated.
Update: I'd be curious if anyone has any insight into how the Yahoo! Placemaker API solves this problem. It can take a text string and return a geocoded JSON result of all the locations mentioned in it.
Comments (5)
You could try Wu and Manber's "A Fast Algorithm for Multi-Pattern Searching".
The multi-pattern matching problem lies at the heart of virus scanning, so you might look to scanner implementations for inspiration. ClamAV, for example, is open source and some papers have been published describing its algorithms:
Lin, Lin and Lai: A Hybrid Algorithm of Backward Hashing and Automaton Tracking for Virus Scanning (a variant of Wu-Manber; the paper is behind the IEEE paywall).
Cha, Moraru, et al: SplitScreen: Enabling Efficient, Distributed Malware Detection
If you use compiled regular expressions, it should be pretty fast, perhaps especially if you combine many titles into a single expression.
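A minimal Python sketch of the single-expression idea, assuming case-insensitive matching on whole titles. The `TITLES` sample and the names `build_title_pattern` and `find_titles` are made up for illustration; the real list would hold the ~7000 titles. Python's `re` engine tries alternatives one by one, so whether this is fast enough for a live stream is worth measuring rather than assuming.

```python
import re

# Hypothetical sample; the real list would hold ~7000 titles.
TITLES = ["Alien", "Aliens", "The Matrix", "Up"]

def build_title_pattern(titles):
    """Compile one case-insensitive alternation covering every title."""
    # Longer titles first, so a longer title is tried before any shorter
    # title that happens to be its prefix.
    escaped = [re.escape(t) for t in sorted(titles, key=len, reverse=True)]
    # Lookarounds instead of \b so titles that start or end with
    # punctuation still match on "word" edges.
    return re.compile(r"(?<!\w)(?:" + "|".join(escaped) + r")(?!\w)",
                      re.IGNORECASE)

TITLE_RE = build_title_pattern(TITLES)  # compile once, reuse per tweet

def find_titles(tweet):
    """Return every title mention in the tweet, as written in the tweet."""
    return [m.group(0) for m in TITLE_RE.finditer(tweet)]
```

Short, common-word titles such as "Up" will match ordinary prose, so some extra filtering is likely needed on top of this.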
Efficiently searching for many terms in a long character sequence would require a specialized algorithm to avoid testing for every term at every position.
But since it sounds like you have short strings with a known pattern, you should be able to use something fairly simple. Store the set of titles you care about in a hash table or tree. Parse out "string1" and "string2" from each tweet using a regex, and test whether they are contained in the set.
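Roughly, in Python, and assuming the known pattern is "string1 is better than string2" (the phrase mentioned in another answer here). The titles, `tokenize`, and `compare_titles` are hypothetical names for this sketch. Because a tweet may carry extra words around the two titles, the sketch tries word suffixes on the left side and word prefixes on the right side against the set.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[\w']+", text.lower())

# Hypothetical sample; each title is stored in normalized form.
TITLES = {" ".join(tokenize(t)) for t in ["Alien", "The Matrix", "Up"]}

PHRASE_RE = re.compile(r"\bis better than\b", re.IGNORECASE)

def compare_titles(tweet):
    """Return (title1, title2) if the tweet compares two known titles."""
    m = PHRASE_RE.search(tweet)
    if m is None:
        return None
    left = tokenize(tweet[:m.start()])
    right = tokenize(tweet[m.end():])
    # Longest run of words ending right before the phrase that is a title.
    first = next((c for c in (" ".join(left[i:]) for i in range(len(left)))
                  if c in TITLES), None)
    # Longest run of words starting right after the phrase that is a title.
    second = next((c for c in (" ".join(right[:j])
                               for j in range(len(right), 0, -1))
                   if c in TITLES), None)
    if first is None or second is None:
        return None
    return first, second
```

Each set lookup is a constant-time hash probe on average, so the cost per tweet depends on the tweet's length, not on the number of titles.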
Working off what erickson suggested, the most feasible approach is to search for the fixed phrase ("is better than" in your example), then check the results for one of the 7,000 terms. You could instead narrow the set by creating 7,000 searches for "[movie] is better than" and then filtering manually on the second movie, but you'll probably hit the search rate limit pretty quickly.
You could speed up the searching by using a dedicated search service like Solr instead of using text parsing. You might be able to pull out titles quickly using some natural language processing service (OpenCalais?), but that would be better suited to batch processing.
For simultaneously searching for a large number of possible targets, the Rabin-Karp algorithm can often be useful.
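Here is a sketch of one common way to use Rabin-Karp for many patterns, in Python: group the patterns by length, slide a rolling hash over the text once per distinct length, and confirm every hash hit with a real string comparison. `BASE`, `MOD`, `build_index`, and `find_all` are illustrative choices, not taken from any particular library.

```python
from collections import defaultdict

BASE = 257
MOD = (1 << 61) - 1  # large prime modulus to keep hash collisions rare

def poly_hash(s):
    """Polynomial hash of a whole string."""
    h = 0
    for ch in s:
        h = (h * BASE + ord(ch)) % MOD
    return h

def build_index(patterns):
    """Map pattern length -> hash -> set of patterns with that hash."""
    index = defaultdict(lambda: defaultdict(set))
    for p in patterns:
        if p:
            index[len(p)][poly_hash(p)].add(p)
    return index

def find_all(text, index):
    """Return sorted (position, pattern) pairs for every occurrence."""
    hits = []
    n = len(text)
    for m, by_hash in index.items():
        if m > n:
            continue
        high = pow(BASE, m - 1, MOD)  # weight of the outgoing character
        h = poly_hash(text[:m])
        for i in range(n - m + 1):
            # Hash hit: verify with a real comparison to rule out collisions.
            if h in by_hash and text[i:i + m] in by_hash[h]:
                hits.append((i, text[i:i + m]))
            if i + m < n:
                # Roll the window: drop text[i], append text[i + m].
                h = ((h - ord(text[i]) * high) * BASE + ord(text[i + m])) % MOD
    return sorted(hits)
```

This matches raw substrings and is case-sensitive, so the caller would lowercase both the titles and the tweet and check word boundaries separately. The text is scanned once per distinct title length, not once per title.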