Comments (4)
Actually, that `filter` is the stop-word dictionary..... my Lucene plugin probably has a bug in it.. I'll patch it up shortly.
1. Compiling the code requires a dependency on solr 4.0.0; with Maven:
<dependency>
<groupId>org.apache.solr</groupId>
<artifactId>solr-core</artifactId>
<version>4.0.0</version>
</dependency>
2. Add the following to Solr's schema.xml:
<fieldType name="text_cn" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="org.ansj.solr.ANSJTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="org.ansj.solr.ANSJTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
For the stopwords.txt referenced by StopFilterFactory, you can use library/stopwords.txt from the ansj-seg distribution.
<field name="your own field" type="text_cn" indexed="true" stored="false" required="false"/>
This was sent to me by a fellow user..
Our team is planning to adopt ansj-seg in a project. To integrate it with Solr, we adapted the TokenizerFactory provided by mmseg4j, a segmentation system of the same kind, to work with ansj-seg. Since someone on GitHub asked whether Solr could be supported, we are sending you our Solr implementation, for reference only. Solr version: solr-4.0.0. This version has been tested in a real solr-cloud deployment, with excellent performance! Thanks again!
I'm using your AnsjTokenizer from GitHub:
package org.ansj.lucene.util;
import java.io.IOException;
import java.io.Reader;
import java.util.Set;
import org.ansj.domain.Term;
import org.ansj.domain.TermNature;
import org.ansj.splitWord.analysis.ToAnalysis;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
public class AnsjTokenizer extends Tokenizer {

    private CharTermAttribute termAtt;
    private OffsetAttribute offsetAtt;
    private PositionIncrementAttribute positionAttr;
    private ToAnalysis ta = null;
    // Optional stop-word set; terms whose names appear here are skipped.
    private Set<String> filter;
    private final PorterStemmer stemmer = new PorterStemmer();

    public AnsjTokenizer(Reader input, Set<String> filter) {
        super(input);
        this.ta = new ToAnalysis(input);
        this.termAtt = addAttribute(CharTermAttribute.class);
        this.offsetAtt = addAttribute(OffsetAttribute.class);
        this.positionAttr = addAttribute(PositionIncrementAttribute.class);
        this.filter = filter;
    }

    @Override
    public boolean incrementToken() throws IOException {
        clearAttributes();
        int position = 0;
        Term term = null;
        String name = null;
        int length = 0;
        // Pull the next term from ansj, skipping stop words; `position` counts
        // every consumed term, so the position increment reflects skipped stop words.
        do {
            term = this.ta.next();
            if (term == null) {
                break;
            }
            length = term.getName().length();
            // Apply Porter stemming to English terms only.
            if (term.getTermNatures().termNatures[0] == TermNature.EN) {
                name = this.stemmer.stem(term.getName());
                term.setName(name);
            }
            position++;
        } while ((this.filter != null) && (this.filter.contains(term.getName())));
        if (term != null) {
            this.positionAttr.setPositionIncrement(position);
            this.termAtt.copyBuffer(term.getName().toCharArray(), 0, term.getName().length());
            this.offsetAtt.setOffset(term.getOffe(), term.getOffe() + length);
            return true;
        }
        end();
        return false;
    }
}
I couldn't find an AnsjTokenizerFactory class, so I wrote one myself:
package org.ansj.lucene.util;
import java.io.Reader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.solr.analysis.BaseTokenizerFactory;
public class AnsjTokenizerFactory extends BaseTokenizerFactory {

    @Override
    public Tokenizer create(Reader paramReader) {
        // No stop-word set is wired up here, so the tokenizer's filter is null.
        return new AnsjTokenizer(paramReader, null);
    }
}
Since I didn't know what the `filter` in your AnsjTokenizer was, I passed in null. After configuring it in schema.xml, I ran tokenized searches: the first search worked correctly, but the second search could no longer obtain a Query object. I traced through a lot of code and found that the first search does go through AnsjTokenizer's constructor, but the second one does not, so AnsjTokenizer's `ta` field is null and no Query object can be obtained.
I'll try the solution you provided. Thanks.
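The symptom described above matches the Lucene/Solr analyzer-reuse lifecycle: Solr constructs a Tokenizer once per thread and then reuses it, swapping in a new Reader for each request instead of calling the constructor again, so any per-stream state built only in the constructor goes stale on the second request. A minimal, self-contained model of the problem and the fix (the class and method names below are illustrative stand-ins, not Lucene's API):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Stand-in for AnsjTokenizer: per-stream state is built in the constructor.
class ReusableTokenizerModel {
    private String state; // plays the role of AnsjTokenizer's `ta` field

    ReusableTokenizerModel(Reader input) throws IOException {
        this.state = readAll(input); // built once, like `new ToAnalysis(input)`
    }

    // The fix: rebuild the per-stream state whenever a new Reader arrives,
    // not only in the constructor.
    void reset(Reader input) throws IOException {
        this.state = readAll(input);
    }

    String currentState() {
        return state;
    }

    private static String readAll(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = r.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }
}

public class ReuseDemo {
    public static void main(String[] args) throws IOException {
        // Request 1: Solr constructs the tokenizer.
        ReusableTokenizerModel t = new ReusableTokenizerModel(new StringReader("first query"));
        System.out.println(t.currentState());

        // Request 2: Solr reuses the same instance with a new Reader.
        // Without the reset() call, `state` would still hold "first query".
        t.reset(new StringReader("second query"));
        System.out.println(t.currentState());
    }
}
```

The corresponding fix in the real tokenizer would be to rebuild the ToAnalysis instance whenever the reader is replaced, in the reset hook of the Lucene 4.x Tokenizer lifecycle, rather than only in the constructor.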