Getting stemmed terms in Lucene
In Lucene, I use the SnowballAnalyzer for both indexing and searching.
Once the index is built, I run queries against it. For example, I query the field 'body' for 'specialized'.
IndexSearcher returns documents containing 'specialize', 'specialized', etc., because of the stemming done by the SnowballAnalyzer.
Now that I have the top documents, I want to get a text snippet from the body field. This snippet should contain the word that matches the stemmed version of the query word.
For example, one of the returned documents has this body field: "Unfortunately, in some states, blind people only have access to general rehabilitation agencies, which serve people with a variety of disabilities. In these cases, specialized services for visually impaired people are not always available."
I would then like to get the part 'In these cases, specialized services for visually' as the snippet.
Additionally, I want to get the terms from this snippet. The code below would do that, except for the one spot marked with '?', which is where my question lies:
This is how I want to do it:
    IndexReader ir = IndexReader.open(fsDir);
    TermPositionVector tv = (TermPositionVector) ir.getTermFreqVector(hits.scoreDocs[i].doc, "body");
    // ? - here: `query` has to be the indexed term. So if the real query was
    // 'specialized', then query should be 'specialize', which is what the
    // Snowball analyzer normally produces. How can I get the term as analyzed
    // by the analyzer, for a single word or for a phrase, since the query can
    // contain a phrase such as "specialized machines"?
    int idx = tv.indexOf(query);
    int[] idxs = tv.getTermPositions(idx);
    for (String t : tv.getTerms()) {
        int iidx = tv.indexOf(t);
        int[] iidxs = tv.getTermPositions(iidx);
        for (int ni : idxs) {
            tmpValue = 0.0f;
            for (int nni : iidxs) {
                if (Math.abs(nni - ni) <= Settings.termWindowSize) {
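The loop above is cut off, but its visible part compares the positions of every term with the positions of the query term and checks whether they lie within Settings.termWindowSize of each other. A minimal, Lucene-free sketch of that proximity check might look like the following. The class TermWindowSketch, its methods, and the plain Map that stands in for the term position vector are all hypothetical, and the stems and positions in main are illustrative values, not real analyzer output.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the truncated loop; no Lucene types are used.
class TermWindowSketch {

    /**
     * Returns every term (other than queryTerm itself) that has at least one
     * position within windowSize positions of some position of queryTerm.
     * The map plays the role of a term vector with positions.
     */
    static List<String> termsNear(Map<String, int[]> positions, String queryTerm, int windowSize) {
        List<String> near = new ArrayList<>();
        int[] queryPositions = positions.get(queryTerm);
        if (queryPositions == null) {
            return near; // the (stemmed) query term does not occur in this document
        }
        for (Map.Entry<String, int[]> entry : positions.entrySet()) {
            if (entry.getKey().equals(queryTerm)) {
                continue; // skip the query term itself
            }
            if (anyWithin(queryPositions, entry.getValue(), windowSize)) {
                near.add(entry.getKey());
            }
        }
        return near;
    }

    /** True if some position in a is at most windowSize away from some position in b. */
    static boolean anyWithin(int[] a, int[] b, int windowSize) {
        for (int x : a) {
            for (int y : b) {
                if (Math.abs(x - y) <= windowSize) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Illustrative stems and positions for "In these cases, specialized services for visually".
        Map<String, int[]> positions = new LinkedHashMap<>();
        positions.put("in", new int[] {0});
        positions.put("these", new int[] {1});
        positions.put("case", new int[] {2});
        positions.put("special", new int[] {3});
        positions.put("servic", new int[] {4});
        positions.put("for", new int[] {5});
        positions.put("visual", new int[] {6});
        System.out.println(termsNear(positions, "special", 2)); // prints [these, case, servic, for]
    }
}
```

The lookup key has to be the stemmed form, since a null result from the map here corresponds to the query term not being found in the document's term vector.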
Edit:
I found a way to get the stemmed term:
    Query q = queryParser.parse("some text to be parsed");
    String parsedQuery = q.toString();
The Query object also has a toString(String fieldName) method.
I believe you are mixing several questions.
First, to see the stemmed version of your query, and other useful information, you can use the IndexSearcher's explain() method. Please see my answer to this question.
The Lucene solution for getting snippets is the Highlighter. Another option is the FastVectorHighlighter. I believe you can customize both to get the stemmed term rather than the full one.
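As a rough illustration of what a stem-aware snippet amounts to, here is a plain-Java sketch. It is not Lucene's Highlighter or FastVectorHighlighter and does not show how either is implemented. It finds the first token whose stem equals the stem of the query word and returns a few tokens on either side. The class SnippetSketch, its methods, and the toyStem suffix stripper are hypothetical stand-ins. In a real setup the stems would presumably come from the same analyzer that was used at index time.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical illustration only; not part of Lucene.
class SnippetSketch {

    /** Lowercases a raw token, strips everything but letters and digits, then stems it. */
    static String normalize(String token, UnaryOperator<String> stemmer) {
        String bare = token.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{N}]", "");
        return stemmer.apply(bare);
    }

    /**
     * Returns a window of whitespace-separated tokens around the first token
     * whose stem equals the stem of queryWord, or null if no token matches.
     */
    static String snippet(String text, String queryWord, int before, int after,
                          UnaryOperator<String> stemmer) {
        String[] tokens = text.trim().split("\\s+");
        String target = normalize(queryWord, stemmer);
        for (int i = 0; i < tokens.length; i++) {
            if (normalize(tokens[i], stemmer).equals(target)) {
                int from = Math.max(0, i - before);
                int to = Math.min(tokens.length, i + after + 1);
                return String.join(" ", Arrays.copyOfRange(tokens, from, to));
            }
        }
        return null;
    }

    /** Toy stand-in for a real stemmer: strips a handful of suffixes and nothing else. */
    static String toyStem(String word) {
        for (String suffix : new String[] {"ized", "izes", "izing", "ize"}) {
            if (word.endsWith(suffix) && word.length() > suffix.length() + 2) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    public static void main(String[] args) {
        String body = "Unfortunately, in some states, blind people only have access to general "
                + "rehabilitation agencies, which serve people with a variety of disabilities. "
                + "In these cases, specialized services for visually impaired people are not "
                + "always available.";
        // prints: In these cases, specialized services for visually
        System.out.println(snippet(body, "specialize", 3, 3, SnippetSketch::toyStem));
    }
}
```

With three tokens of context on each side, the sketch reproduces the snippet asked for in the question. Matching on stems rather than surface forms is what lets the query word 'specialize' find 'specialized' in the text.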