Lucene: adapting StandardTokenizer to Twitter data
I need to adapt Lucene's StandardTokenizer for some special purposes regarding Twitter data. At the moment I use StandardTokenizer to tokenize some tweets that I want to work with. It has worked quite well, but now I want to extend its behaviour (e.g. also handle #hashtags and @somebody, handle smileys such as :), remove URLs, ...).
Can somebody tell me, or at least point me in a direction, how I can do this easily? I know that writing my own Tokenizer would be the best choice, but I'm quite new to Lucene and I don't know how to start...
I hope somebody can help me :)
Best,
Michael
Comments (1)
You can extend the tokenization of StandardTokenizer a great deal by using Lucene's CharFilter APIs (and possibly TokenFilters too, depending on how you want the search to work).
Ultimately, if StandardTokenizer is completely different from what you want, then it's the wrong tokenizer, but if it's close, this can be much easier.
A CharFilter is essentially a FilterReader that lets you modify the text before the Tokenizer runs. It also tracks offset adjustments so that highlighting will still work!
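To make the "FilterReader" idea concrete, here is a minimal plain-Java sketch of a reader that rewrites characters as they stream through. It uses only java.io, not Lucene; the class name CharSwapReader and the '#' to '_' swap are made up for illustration. Because it replaces one character with exactly one character, positions in the text do not shift, so no offset correction is needed. As far as I understand it, the offset tracking mentioned above is what a real CharFilter adds for edits that change the length of the text.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Illustration only: CharSwapReader is a hypothetical name, not a Lucene class.
class CharSwapReader extends FilterReader {
    private final char from;
    private final char to;

    CharSwapReader(Reader in, char from, char to) {
        super(in);
        this.from = from;
        this.to = to;
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return c == from ? to : c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // n is -1 at end of stream, in which case the loop does not run.
        for (int i = off; i < off + n; i++) {
            if (buf[i] == from) {
                buf[i] = to;
            }
        }
        return n;
    }

    // Convenience helper: run a string through the reader and collect the result.
    static String swap(String text, char from, char to) {
        try (Reader r = new CharSwapReader(new StringReader(text), from, to)) {
            StringBuilder out = new StringBuilder();
            char[] buf = new char[256];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(swap("#lucene rocks", '#', '_')); // prints "_lucene rocks"
    }
}
```

A tokenizer reading from such a wrapped reader never sees the original characters, which is the whole point: the tokenizer itself stays untouched.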
To add CharFilters, the easiest way is to extend ReusableAnalyzerBase and override its initReader method, wrapping the incoming reader with the CharFilters you want.
You might want to look at MappingCharFilter as a start; it lets you define some mappings up front to handle your special Twitter syntax. There are some examples/ideas here:
http://markmail.org/message/abo2hysvfy2clxed
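As a rough illustration of the mapping idea, here is a plain-Java sketch that rewrites Twitter syntax before a very naive tokenizer sees the text. It deliberately uses no Lucene classes, so it shows the concept rather than the MappingCharFilter API. Everything in it is an assumption made for the example: the class and method names, the placeholder strings HASHTAG_, MENTION_ and SMILEY_HAPPY, and the toy tokenizer that keeps only letters, digits and underscores. Whether a particular Lucene tokenizer keeps such placeholders as a single token is something you would need to verify yourself.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: none of these names are Lucene APIs.
class TweetMappingSketch {

    // Hypothetical mappings; the replacement strings are arbitrary placeholders.
    static Map<String, String> twitterMappings() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put(":)", " SMILEY_HAPPY ");
        m.put("#", "HASHTAG_");
        m.put("@", "MENTION_");
        return m;
    }

    // Plays the role of the char filter: at each position, replace the longest
    // matching key, otherwise copy the character through unchanged.
    static String applyMappings(String text, Map<String, String> mappings) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            String best = null;
            for (String key : mappings.keySet()) {
                if (!key.isEmpty() && text.startsWith(key, i)
                        && (best == null || key.length() > best.length())) {
                    best = key;
                }
            }
            if (best != null) {
                out.append(mappings.get(best));
                i += best.length();
            } else {
                out.append(text.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    // Toy tokenizer: keeps runs of letters, digits and '_', drops everything else.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c) || c == '_') {
                cur.append(c);
            } else if (cur.length() > 0) {
                tokens.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) {
            tokens.add(cur.toString());
        }
        return tokens;
    }

    // Filter first, then tokenize.
    static List<String> analyze(String text, Map<String, String> mappings) {
        return tokenize(applyMappings(text, mappings));
    }

    public static void main(String[] args) {
        String tweet = "Loving #lucene with @michael :)";
        // Without mappings the markers are lost:
        System.out.println(analyze(tweet, new LinkedHashMap<>()));
        // prints [Loving, lucene, with, michael]
        // With mappings they survive as part of the tokens:
        System.out.println(analyze(tweet, twitterMappings()));
        // prints [Loving, HASHTAG_lucene, with, MENTION_michael, SMILEY_HAPPY]
    }
}
```

In this sketch the tokenizer is unchanged between the two calls; only the text it receives differs. Removing URLs does not fit a fixed string mapping like this and would need a pattern-based step instead.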