Integrating ansj with Solr

Posted on 2021-11-14 18:17:48 · 160 characters · 813 views

@ansj Hi, I'd like to ask you a question: does the schema.xml file need to be configured when integrating ansj with Solr? If so, how should it be configured? Thanks.


Comments

甜柠檬 2021-11-19 02:32:23

Actually, that filter is the stopword dictionary... I suspect there is a problem in how I wrote the Lucene plugin; I'll patch it up shortly.
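
For reference, a minimal sketch (my own assumption, based on the AnsjTokenizer(Reader, Set<String>) constructor quoted further down in this thread) of supplying that stopword dictionary:

import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;

import org.ansj.lucene.util.AnsjTokenizer;
import org.apache.lucene.analysis.Tokenizer;

// Sketch only: the Set<String> constructor argument is the stopword
// dictionary; any segmented term contained in it is silently skipped.
public class StopwordFilterDemo {
    public static void main(String[] args) throws Exception {
        Set<String> stopwords = new HashSet<String>();
        stopwords.add("的");
        stopwords.add("了");
        Tokenizer t = new AnsjTokenizer(new StringReader("我的测试文本"), stopwords);
        // consume tokens as usual; "的" and "了" are never emitted
        // (see the do-while loop in incrementToken() below)
        t.close();
    }
}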

筱武穆 2021-11-19 02:07:31

package org.ansj.solr;

import java.io.IOException;
import java.io.Reader;

import org.ansj.domain.Term;
import org.ansj.splitWord.Analysis;
import org.ansj.splitWord.analysis.ToAnalysis;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class ANSJTokenizer extends Tokenizer {

    private Analysis udf;
    private final CharTermAttribute termAtt;
    private final OffsetAttribute offsetAtt;
    private final TypeAttribute typeAtt;

    protected ANSJTokenizer(Reader input) {
        super(input);
        termAtt = addAttribute(CharTermAttribute.class);
        offsetAtt = addAttribute(OffsetAttribute.class);
        typeAtt = addAttribute(TypeAttribute.class);
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        // Rebuild the analysis on every reset so a reused tokenizer
        // picks up the Reader supplied via setReader().
        udf = new ToAnalysis(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        clearAttributes();
        Term term = udf.next();
        if (term != null) {
            termAtt.copyBuffer(term.getName().toCharArray(), 0, term.getName().length());
            // End offset is the start offset plus the term length; the original
            // term.getTo().getOffe() points at the *next* term's start and can
            // over-shoot when characters are skipped between terms.
            offsetAtt.setOffset(term.getOffe(), term.getOffe() + term.getName().length());
            typeAtt.setType("word");
            return true;
        } else {
            end();
            return false;
        }
    }
}

package org.ansj.solr;

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.util.ResourceLoader;
import org.apache.lucene.analysis.util.ResourceLoaderAware;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * @author chenlb
 * @author WEIFENG.YAO
 */
public class ANSJTokenizerFactory extends TokenizerFactory implements ResourceLoaderAware {

    static final Logger logger = LoggerFactory.getLogger(ANSJTokenizerFactory.class);

    // Cache one tokenizer per thread and reuse it across requests.
    private ThreadLocal<ANSJTokenizer> tokenizerLocal = new ThreadLocal<ANSJTokenizer>();

    public void inform(ResourceLoader loader) throws IOException {
        // No external resources to load.
    }

    @Override
    public Tokenizer create(Reader input) {
        ANSJTokenizer tokenizer = tokenizerLocal.get();
        if (tokenizer == null) {
            tokenizer = newTokenizer(input);
        } else {
            try {
                // Reuse the cached tokenizer with the new Reader.
                tokenizer.setReader(input);
                tokenizer.reset();
            } catch (IOException e) {
                tokenizer = newTokenizer(input);
            }
        }
        return tokenizer;
    }

    private ANSJTokenizer newTokenizer(Reader input) {
        ANSJTokenizer tokenizer = new ANSJTokenizer(input);
        tokenizerLocal.set(tokenizer);
        return tokenizer;
    }
}

1. Building the code requires a dependency on Solr 4.0.0; under Maven:

<dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr-core</artifactId>
    <version>4.0.0</version>
</dependency>

2. Add the following to Solr's schema.xml:

   

<fieldType name="text_cn" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="org.ansj.solr.ANSJTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="org.ansj.solr.ANSJTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

The stopwords.txt referenced by StopFilterFactory can be taken from library/stopwords.txt in the ansj-seg repository. Then declare your own field with the new type:

<field name="your own field" type="text_cn" indexed="true" stored="false" required="false"/>

This was sent to me by a user:

Our team is planning to adopt ansj in a project. To integrate it with Solr, we adapted the TokenizerFactory provided by mmseg4j, a segmenter of the same kind, to work with ansj-seg. Since we saw someone on GitHub asking whether Solr could be supported, we are sending you our Solr implementation, for reference only. Solr version: solr-4.0.0. It has been tested in a real SolrCloud deployment and performs very well. Thanks again!

伴我心暖 2021-11-18 22:48:38

I'm using your AnsjTokenizer from GitHub:

package org.ansj.lucene.util;

import java.io.IOException;
import java.io.Reader;
import java.util.Set;

import org.ansj.domain.Term;
import org.ansj.domain.TermNature;
import org.ansj.splitWord.analysis.ToAnalysis;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class AnsjTokenizer extends Tokenizer {
    private CharTermAttribute termAtt;
    private OffsetAttribute offsetAtt;
    private PositionIncrementAttribute positionAttr;
    private ToAnalysis ta = null;
    // The "filter" is the stopword set: any term contained in it is skipped.
    private Set<String> filter;
    private final PorterStemmer stemmer = new PorterStemmer();

    public AnsjTokenizer(Reader input, Set<String> filter) {
        super(input);
        this.ta = new ToAnalysis(input);
        this.termAtt = addAttribute(CharTermAttribute.class);
        this.offsetAtt = addAttribute(OffsetAttribute.class);
        this.positionAttr = addAttribute(PositionIncrementAttribute.class);
        this.filter = filter;
    }

    public boolean incrementToken() throws IOException {
        clearAttributes();
        int position = 0;
        Term term = null;
        String name = null;
        int length = 0;
        // Pull terms until one survives the stopword filter; `position` counts
        // how many terms were consumed, so the emitted position increment
        // still accounts for the skipped stopwords.
        do {
            term = this.ta.next();
            if (term == null) {
                break;
            }
            length = term.getName().length();
            // Stem English terms before emitting them.
            if (term.getTermNatures().termNatures[0] == TermNature.EN) {
                name = this.stemmer.stem(term.getName());
                term.setName(name);
            }
            position++;
        } while ((this.filter != null) && (term != null) && (this.filter.contains(term.getName())));

        if (term != null) {
            this.positionAttr.setPositionIncrement(position);
            this.termAtt.copyBuffer(term.getName().toCharArray(), 0, term.getName().length());
            this.offsetAtt.setOffset(term.getOffe(), term.getOffe() + length);
            return true;
        }
        end();
        return false;
    }
}

I couldn't find an AnsjTokenizerFactory class, so I wrote one myself:

package org.ansj.lucene.util;

import java.io.Reader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.solr.analysis.BaseTokenizerFactory;

public class AnsjTokenizerFactory extends BaseTokenizerFactory {

    @Override
    public Tokenizer create(Reader paramReader) {
        // No stopword set is passed, so nothing is filtered out.
        return new AnsjTokenizer(paramReader, null);
    }
}

Because I didn't know what the filter in your AnsjTokenizer was for, I passed in null. After configuring it in schema.xml, the first segmented search worked correctly, but the second search could no longer obtain a Query object. After tracing through a lot of code, I found that the first search does go through AnsjTokenizer's constructor, but the second one doesn't, so AnsjTokenizer's ta field is null and no Query object can be obtained.
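
For reference, a minimal sketch (my assumption, mirroring the reset() override in the ANSJTokenizer posted in the comment above) of one way to fix this: rebuild the ToAnalysis whenever the tokenizer is reused, so ta is re-initialized for every search. On Lucene/Solr 3.x the reuse hook is reset(Reader); on 4.x it is setReader() followed by reset().

// Sketch only (Lucene/Solr 3.x reuse path): rebuild the analysis when the
// tokenizer is handed a new Reader, so `ta` is never stale on reuse.
@Override
public void reset(Reader input) throws IOException {
    super.reset(input); // updates the protected `input` field on Tokenizer
    this.ta = new ToAnalysis(input);
}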

I'll try the solution you provided. Thanks.
