The correct way to write a Tokenizer in Lucene
I'm trying to analyze the content of a Drupal database for collective intelligence purposes.
So far I've been able to work out a simple example that tokenizes the various contents (mainly forum posts) and counts the tokens after removing stop words.
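For reference, the counting setup is roughly the sketch below (a minimal version assuming a recent Lucene release and a hypothetical field name "body"; note that recent StandardAnalyzer instances ship with no stop words unless you pass a CharArraySet):

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class TokenCounter {
        // Run the analyzer over one post and count each surviving term.
        public static Map<String, Integer> count(String text) throws IOException {
            Map<String, Integer> counts = new HashMap<>();
            try (Analyzer analyzer = new StandardAnalyzer();
                 TokenStream ts = analyzer.tokenStream("body", text)) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    counts.merge(term.toString(), 1, Integer::sum);
                }
                ts.end();
            }
            return counts;
        }
    }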
The StandardTokenizer supplied with Lucene should be able to tokenize hostnames and emails, but the content can also have embedded HTML, e.g.:
Pubblichiamo la presentazione di IBM riguardante DB2 per i vari sistemi operativi
Linux, UNIX e Windows.\r\n\r\nQuesto documento sta sulla piattaforma KM e lo potete
scaricare a questo <a href=\'https://sfkm.griffon.local/sites/BSF%20KM/BSF/CC%20T/Specifiche/Eventi2008/IBM%20DB2%20for%20Linux,%20UNIX%20e%20Windows.pdf\' target=blank>link</a>.
This gets tokenized badly, in this way:
pubblichiamo -> 1
presentazione -> 1
ibm -> 1
riguardante -> 1
db2 -> 1
vari -> 1
sistemi -> 1
operativi -> 1
linux -> 1
unix -> 1
windows -> 1
documento -> 1
piattaforma -> 1
km -> 1
potete -> 1
scaricare -> 1
href -> 1
https -> 1
sfkm.griffon.local -> 1
sites -> 1
bsf -> 1
20km/bsf -> 1
cc -> 1
20t/specifiche/eventi2008/ibm -> 1
20db2 -> 1
20for -> 1
20linux -> 1
20unix -> 1
20e -> 1
20windows.pdf -> 1
target -> 1
blank -> 1
link -> 1
What I would like is to keep the links together and to strip the HTML tags (like <pre> or <strong>) that are useless.
Should I write a Filter or a different Tokenizer? Should the Tokenizer replace the standard one, or can I mix them together? The hardest way would be to take StandardTokenizerImpl, copy it into a new file, and then add custom behaviour, but I wouldn't like to go too deep into the Lucene implementation for now (learning gradually).
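For illustration, this is roughly how a Tokenizer and TokenFilters compose inside an Analyzer: the Tokenizer emits the raw stream and each filter wraps the one below it, so mixing them is the normal pattern. A minimal sketch, assuming a recent Lucene where createComponents takes only the field name (package names have moved between versions):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    // "Mixing" is just nesting the constructors around the Tokenizer.
    Analyzer analyzer = new Analyzer() {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new StandardTokenizer();
            TokenStream result = new LowerCaseFilter(source);
            result = new StopFilter(result, EnglishAnalyzer.getDefaultStopSet());
            return new TokenStreamComponents(source, result);
        }
    };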
Maybe there is already something similar implemented, but I've been unable to find it.
EDIT: Looking at StandardTokenizerImpl makes me think that if I have to extend it by modifying the actual implementation, it is not much more convenient than using lex or flex and doing it myself.
This is most easily achieved by preprocessing the text before giving it to Lucene to tokenize. Use an HTML parser, like Jericho, to convert your content into text with no HTML by stripping out the tags you don't care about and extracting the text from the ones you do. Jericho's TextExtractor is perfect for this, and easy to use.
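The code sample that originally accompanied this answer was lost; a minimal sketch of the idea, using Jericho's Source and TextExtractor classes (the html variable stands in for the post body above):

    import net.htmlparser.jericho.Source;

    // Parse the post body and keep only its textual content;
    // all tags, including the anchor, are stripped out.
    String html = "... the forum post above, markup and all ...";
    Source source = new Source(html);
    String plainText = source.getTextExtractor().toString();
    System.out.println(plainText);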
Running that over the sample post outputs something like:
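    Pubblichiamo la presentazione di IBM riguardante DB2 per i vari sistemi operativi Linux, UNIX e Windows. Questo documento sta sulla piattaforma KM e lo potete scaricare a questo link.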
You could use a custom Lucene Tokenizer with an HTML filter, but it's not the easiest solution - using Jericho will definitely save you development time for this task. The existing HTML analyzers for Lucene probably won't do exactly what you want, as they will keep all the text on the page. The only caveat is that you will end up processing the text twice rather than all in one stream, but unless you are handling terabytes of data you won't care about this performance consideration, and dealing with performance is something best left until you have fleshed out your app and identified it as an actual issue anyway.
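If you later do want everything in one Lucene stream, one hedged option (my own suggestion, not something this answer prescribes) is the HTMLStripCharFilter that recent Lucene versions ship in the analyzers-common module; it can be wired in ahead of the tokenizer like this:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    Analyzer htmlAware = new Analyzer() {
        @Override
        protected Reader initReader(String fieldName, Reader reader) {
            // Strip markup before any characters reach the tokenizer.
            return new HTMLStripCharFilter(reader);
        }

        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            return new TokenStreamComponents(new StandardTokenizer());
        }
    };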
Generally, when indexing documents that contain HTML markup with Lucene, you should first parse the HTML into a textual representation holding only the parts you want to keep, and only then feed that to the Tokenizer to be indexed.
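Any HTML parser can handle that first step. As a hypothetical illustration (this answer names no particular library), the jsoup parser reduces it to one line:

    import org.jsoup.Jsoup;

    // Reduce the HTML to its visible text, then index that string.
    String plain = Jsoup.parse(html).text();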
See jGuru: How can I index HTML documents? for an FAQ entry explaining more about how to do this.