Tagging names using lucene/java
I have the names of all the employees of my company (5000+). I want to write an engine which can, on the fly, find names in online articles (blogs/wikis/help documents) and tag them with a "mailto" link to the user's email.
As of now I am planning to remove all the stop words from the article and then search for each remaining word in a Lucene index. But even in that case I see a lot of queries hitting the index; for example, if an article has 2000 words and only two references to people's names, there will most probably be around 1000 Lucene queries.
Is there a way to reduce these queries? Or a completely different way of achieving the same thing?
Thanks in advance
Comments (2)
If you have only 5000 names, I would just stick them into a hash table in memory instead of bothering with Lucene. You can hash them several ways (e.g., nicknames, first-last or last-first, etc.) and still have a relatively small memory footprint and really efficient performance.
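A minimal sketch of that idea, assuming names are simple "First Last" pairs and glossing over punctuation; the `NameLinker` class and the sample names/emails are made up for illustration:

```java
import java.util.HashMap;
import java.util.Map;

public class NameLinker {
    // Hypothetical data: lowercased "first last" name -> email address.
    private static final Map<String, String> EMAILS = new HashMap<>();
    static {
        EMAILS.put("john doe", "john.doe@example.com");
        EMAILS.put("jane roe", "jane.roe@example.com");
    }

    // Slide a two-token window over the article and wrap known
    // first-last name pairs in a mailto link. Punctuation handling
    // (e.g. "Doe,") is omitted for brevity.
    public static String linkify(String article) {
        String[] tokens = article.split("\\s+");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i + 1 < tokens.length) {
                String email = EMAILS.get((tokens[i] + " " + tokens[i + 1]).toLowerCase());
                if (email != null) {
                    out.append("<a href=\"mailto:").append(email).append("\">")
                       .append(tokens[i]).append(' ').append(tokens[i + 1])
                       .append("</a> ");
                    i++; // skip the second token of the matched name
                    continue;
                }
            }
            out.append(tokens[i]).append(' ');
        }
        return out.toString().trim();
    }
}
```

With variants (nicknames, last-first order) added as extra keys, lookups stay O(1) per window and the whole article is handled in one pass, with no index queries at all.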
http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
This algorithm might be of use to you. The way it works is that you first compile the entire list of names into a giant finite state machine (which may take a while), but once that state machine is built, you can run as many documents as you want through it and detect names very efficiently.
It looks at every character of each document only once, so it should be much more efficient than tokenizing the document and comparing each word against a list of known names.
There are a bunch of implementations available for different languages on the web. Check it out.
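For a concrete starting point, here is a minimal sketch using the open-source `org.ahocorasick:ahocorasick` Java library (my assumption; any Aho–Corasick implementation would do). The names, emails, and the `buildTrie`/`linkify` helpers are made up for illustration:

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

import org.ahocorasick.trie.Emit;
import org.ahocorasick.trie.Trie;

public class AhoCorasickLinker {

    // Build the automaton once at startup; after that, each article is
    // matched in a single pass over its characters.
    static Trie buildTrie(Map<String, String> emailsByName) {
        Trie.TrieBuilder builder = Trie.builder().ignoreCase().onlyWholeWords();
        for (String name : emailsByName.keySet()) {
            builder.addKeyword(name);
        }
        return builder.build();
    }

    static String linkify(String article, Trie trie, Map<String, String> emailsByName) {
        Collection<Emit> emits = trie.parseText(article);
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (Emit emit : emits) {
            if (emit.getStart() < last) {
                continue; // skip a match overlapping one we already linked
            }
            String email = emailsByName.get(emit.getKeyword());
            out.append(article, last, emit.getStart())
               .append("<a href=\"mailto:").append(email).append("\">")
               .append(article, emit.getStart(), emit.getEnd() + 1) // getEnd() is inclusive
               .append("</a>");
            last = emit.getEnd() + 1;
        }
        out.append(article.substring(last));
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up sample data: lowercased names -> email addresses.
        Map<String, String> emails = new HashMap<>();
        emails.put("john doe", "john.doe@example.com");
        Trie trie = buildTrie(emails);
        System.out.println(linkify("Ask John Doe about the build.", trie, emails));
    }
}
```

The trie is built once for all 5000+ names, so tagging a 2000-word article costs one scan of the text rather than hundreds of index queries.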