Lucene 3.5 custom payloads


Working with a Lucene index, I have a standard document format that looks something like this:

Name: John Doe 
Job: Plumber 
Hobby: Fishing

My goal is to append a payload to the job field that would hold additional information about plumbing, for instance a Wikipedia link to the plumbing article. I do not want to put payloads anywhere else. Initially, I found an example that covered what I'd like to do, but it used Lucene 2.2 and has not been updated to reflect the changes in the token stream API. After some more research, I came up with this little monstrosity to build a custom token stream for that field:

public static TokenStream tokenStream(final String fieldName, Reader reader, Analyzer analyzer, final String item) {
    final TokenStream ts = analyzer.tokenStream(fieldName, reader);
    TokenStream res = new TokenStream() {
        CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        PayloadAttribute payAtt = addAttribute(PayloadAttribute.class);

        @Override
        public boolean incrementToken() throws IOException {
            // pull the next token from the wrapped stream; while there is one,
            // append a marker term and attach the payload bytes
            boolean hasNext = ts.incrementToken();
            if (hasNext) {
                termAtt.append("test");
                payAtt.setPayload(new Payload(item.getBytes()));
            }
            return hasNext;
        }
    };
    return res;
}

When I take the token stream and iterate over all the results, prior to adding it to a field, I see that it successfully pairs each term with the payload. After calling reset() on the stream, I add it to a document field and index the document. However, when I print out the document and inspect the index with Luke, my custom token stream didn't make the cut: the field name appears correctly, but the term values from the token stream do not appear, and nothing indicates that a payload was attached.

This leads me to two questions. First, did I use the token stream correctly, and if so, why doesn't it tokenize when I add it to the field? Second, if I didn't use the stream correctly, do I need to write my own analyzer? This example was cobbled together using the Lucene standard analyzer to generate the token stream and write the document. I'd like to avoid writing my own analyzer if possible, because I only want to append the payload to one field!
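
For reference, the indexing side looks roughly like this (a sketch of my setup; writer is an open IndexWriter and wikiLink stands in for the payload string):

Analyzer a = new StandardAnalyzer(Version.LUCENE_35);
Document doc = new Document();
// the Field(String, TokenStream) constructor builds a tokenized, un-stored
// field directly from a pre-built stream
doc.add(new Field("job", tokenStream("job", new StringReader("Plumber"), a, wikiLink)));
writer.addDocument(doc);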

Edit:

Calling code:

TokenStream ts = tokenStream("field", new StringReader("value"), a, docValue);
CharTermAttribute cta = ts.getAttribute(CharTermAttribute.class);
PayloadAttribute payload = ts.getAttribute(PayloadAttribute.class);
while (ts.incrementToken()) {
    System.out.println("Term = " + cta.toString());
    System.out.println("Payload = " + new String(payload.getPayload().getData()));
}
ts.reset();

Comments (2)

无法言说的痛 2025-01-09 09:07:15


It's very hard to tell why the payloads are not saved; the reason may lie in the code that uses the method you presented.

The most convenient way to set payloads is in a TokenFilter -- I think that taking this approach will give you much cleaner code and in turn make your scenario work correctly. It's most illustrative to take a look at a filter of this type in the Lucene source, e.g. TokenOffsetPayloadTokenFilter. You can find an example of how it should be used in the test for this class.
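
A minimal sketch of such a filter for Lucene 3.5 might look like this (the class name and the idea of passing the link through the constructor are illustrative, not taken from Lucene):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.index.Payload;

// Attaches the same payload to every token of the wrapped stream.
public final class LinkPayloadFilter extends TokenFilter {
    private final PayloadAttribute payAtt = addAttribute(PayloadAttribute.class);
    private final Payload payload;

    public LinkPayloadFilter(TokenStream input, String link) {
        super(input);
        this.payload = new Payload(link.getBytes());
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        payAtt.setPayload(payload);
        return true;
    }
}

You would then wrap the stream for just that one field, e.g. new LinkPayloadFilter(analyzer.tokenStream("job", reader), link), and hand the result to the field.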

Please also consider whether there is a better place to store these hyperlinks than in payloads. Payloads have a very specific application, e.g. boosting certain terms depending on their position or formatting in the original document, their part of speech, and so on. Their main purpose is to affect how the search is performed, so they are normally numeric values, efficiently packed to cut down on index size.
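
For illustration, a numeric payload would be packed roughly like this (a sketch; PayloadHelper ships with Lucene in org.apache.lucene.analysis.payloads, and payAtt is a PayloadAttribute as in the filter above):

// encode a per-term boost into 4 bytes instead of a UTF-8 URL
byte[] data = PayloadHelper.encodeFloat(2.5f);
payAtt.setPayload(new Payload(data));
// query-side code, e.g. a PayloadTermQuery scorer, can decode it again
float boost = PayloadHelper.decodeFloat(data, 0);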

一百个冬季 2025-01-09 09:07:15


I might be missing something, but...
You don't need a custom tokenizer to associate additional information with a Lucene document. Just store it as a non-indexed field:

Document doc = new Document();
doc.add(new Field("fname", "Joe", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("job", "Plumber", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("link", "http://www.example.com", Field.Store.YES, Field.Index.NO));

You can then retrieve the "link" field just like any other stored field.
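
For example (a sketch; searcher is an open IndexSearcher and docId the id of a matching document):

Document hit = searcher.doc(docId);
String link = hit.get("link"); // "http://www.example.com"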

Also, if you really did need a custom tokenizer, then you would definitely need a custom analyzer to plug it in, for both index building and searching.
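
A minimal sketch of what that analyzer could look like in Lucene 3.5, reusing the LinkPayloadFilter sketched in the previous answer (all names are illustrative; since the link differs per document, you would have to construct the analyzer per document or feed the link in some other way):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Applies the payload filter only to the "job" field; other fields get plain tokens.
public final class JobPayloadAnalyzer extends Analyzer {
    private final String link;

    public JobPayloadAnalyzer(String link) {
        this.link = link;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new StandardTokenizer(Version.LUCENE_35, reader);
        ts = new LowerCaseFilter(Version.LUCENE_35, ts);
        if ("job".equals(fieldName)) {
            ts = new LinkPayloadFilter(ts, link);
        }
        return ts;
    }
}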
