Example of using WikipediaTokenizer in Lucene
I want to use WikipediaTokenizer in a Lucene project (http://lucene.apache.org/java/3_0_2/api/contrib-wikipedia/org/apache/lucene/wikipedia/analysis/WikipediaTokenizer.html), but I have never used Lucene. I just want to convert a Wikipedia string into a list of tokens. However, I see only four methods in this class: end, incrementToken, reset, and reset(reader). Can someone point me to an example of how to use it?
Thank you.
Comments (3)
In Lucene 3.0, the next() method was removed. You should now use incrementToken() to iterate through the tokens; it returns false when you reach the end of the input stream. To read each token, use the methods of the AttributeSource class. Depending on which attributes you want (term, type, payload, etc.), register the corresponding attribute class with your tokenizer using the addAttribute method.
The following partial code sample is from the test class of WikipediaTokenizer, which you can find if you download the Lucene source code.
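As a minimal sketch of the incrementToken/addAttribute pattern described above (this is not the code from the Lucene test class), assuming Lucene 3.0.x with the core and contrib wikipedia jars on the classpath. The class name WikipediaTokenizerExample and the helper tokenize are hypothetical names chosen for illustration; the input string is the LINK_PHRASES constant quoted elsewhere in this thread.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.wikipedia.analysis.WikipediaTokenizer;

// Hypothetical example class; assumes the Lucene 3.0.x API.
public class WikipediaTokenizerExample {

    // Returns the term text of every token found in a wiki-markup string.
    static List<String> tokenize(String wikiText) throws IOException {
        WikipediaTokenizer tokenizer =
                new WikipediaTokenizer(new StringReader(wikiText));

        // Register the attributes to read; the tokenizer refreshes
        // these same objects on every call to incrementToken().
        TermAttribute termAtt = tokenizer.addAttribute(TermAttribute.class);
        TypeAttribute typeAtt = tokenizer.addAttribute(TypeAttribute.class);

        List<String> tokens = new ArrayList<String>();
        try {
            // incrementToken() returns false at the end of the input.
            while (tokenizer.incrementToken()) {
                tokens.add(termAtt.term());
                // The type attribute carries the token's wiki markup category.
                System.out.println(termAtt.term() + " -> " + typeAtt.type());
            }
            tokenizer.end();
        } finally {
            tokenizer.close();
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        String text = "click [[link here again]] click "
                + "[http://lucene.apache.org here again] [[Category:a b c d]]";
        System.out.println(tokenize(text));
    }
}
```

Note that this sketch targets 3.0.x only. In later Lucene releases TermAttribute was replaced by CharTermAttribute, and from 4.0 onward reset() must be called before the first incrementToken(), so the code would need adjusting there.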
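// Pre-3.0 API: next(Token) was removed in Lucene 3.0.
// `test` is the wiki-markup String to tokenize.
WikipediaTokenizer tf = new WikipediaTokenizer(new StringReader(test));
Token token = new Token();
token = tf.next(token);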
http://www.javadocexamples.com/java_source/org/apache/lucene/wikipedia/analysis/WikipediaTokenizerTest.java.html
Regards
public class WikipediaTokenizerTest {
static Logger logger = Logger.getLogger(WikipediaTokenizerTest.class);
protected static final String LINK_PHRASES = "click [[link here again]] click [http://lucene.apache.org here again] [[Category:a b c d]]";