Lucene 3.5 custom payloads


Working with a Lucene index, I have a standard document format that looks something like this:

Name: John Doe 
Job: Plumber 
Hobby: Fishing

My goal is to append a payload to the job field that would hold additional information about plumbing, for instance a Wikipedia link to the plumbing article. I do not want to put payloads anywhere else. Initially, I found an example that covered what I'd like to do, but it used Lucene 2.2 and has not been updated to reflect the changes in the token stream API. After some more research, I came up with this little monstrosity to build a custom token stream for that field:

public static TokenStream tokenStream(final String fieldName, Reader reader, Analyzer analyzer, final String item) {
    final TokenStream ts = analyzer.tokenStream(fieldName, reader);
    TokenStream res = new TokenStream() {
        CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        PayloadAttribute payAtt = addAttribute(PayloadAttribute.class);

        @Override
        public boolean incrementToken() throws IOException {
            // pull the next token from the wrapped stream; while there is one,
            // append a marker term and attach the payload bytes
            boolean hasNext = ts.incrementToken();
            if (hasNext) {
                termAtt.append("test");
                payAtt.setPayload(new Payload(item.getBytes()));
            }
            return hasNext;
        }
    };
    return res;
}

When I take the token stream and iterate over all the results, prior to adding it to a field, I see that it successfully pairs each term with the payload. After calling reset() on the stream, I add it to a document field and index the document. However, when I print out the document and inspect the index with Luke, my custom token stream didn't make the cut: the field name appears correctly, but the term values from the token stream do not appear, and nothing indicates that a payload was attached.

This leads me to two questions. First, did I use the token stream correctly, and if so, why doesn't it tokenize when I add it to the field? Second, if I didn't use the stream correctly, do I need to write my own analyzer? This example was cobbled together using the Lucene standard analyzer to generate the token stream and write the document. I'd like to avoid writing my own analyzer if possible, because I only want to append the payload to one field!
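
For reference, the indexing side looks roughly like this (a sketch of my setup; writer is an open IndexWriter and wikiLink stands in for the payload string):

Analyzer a = new StandardAnalyzer(Version.LUCENE_35);
Document doc = new Document();
// the Field(String, TokenStream) constructor builds a tokenized, un-stored
// field directly from a pre-built stream
doc.add(new Field("job", tokenStream("job", new StringReader("Plumber"), a, wikiLink)));
writer.addDocument(doc);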

Edit:

Calling code:

TokenStream ts = tokenStream("field", new StringReader("value"), a, docValue);
CharTermAttribute cta = ts.getAttribute(CharTermAttribute.class);
PayloadAttribute payload = ts.getAttribute(PayloadAttribute.class);
while (ts.incrementToken()) {
    System.out.println("Term = " + cta.toString());
    System.out.println("Payload = " + new String(payload.getPayload().getData()));
}
ts.reset();

Comments (2)

无法言说的痛 2025-01-09 09:07:15


It's very hard to tell why the payloads are not saved; the reason may lie in the code that uses the method you presented.

The most convenient way to set payloads is in a TokenFilter -- I think that taking this approach will give you much cleaner code and in turn make your scenario work correctly. It's most illustrative to take a look at a filter of this type in the Lucene source, e.g. TokenOffsetPayloadTokenFilter. You can find an example of how it should be used in the test for this class.
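
A minimal sketch of such a filter for Lucene 3.5 might look like this (the class name and the idea of passing the link through the constructor are illustrative, not taken from Lucene):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.index.Payload;

// Attaches the same payload to every token of the wrapped stream.
public final class LinkPayloadFilter extends TokenFilter {
    private final PayloadAttribute payAtt = addAttribute(PayloadAttribute.class);
    private final Payload payload;

    public LinkPayloadFilter(TokenStream input, String link) {
        super(input);
        this.payload = new Payload(link.getBytes());
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        payAtt.setPayload(payload);
        return true;
    }
}

You would then wrap the stream for just that one field, e.g. new LinkPayloadFilter(analyzer.tokenStream("job", reader), link), and hand the result to the field.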

Please also consider whether there is a better place to store these hyperlinks than in payloads. Payloads have a very specific application, e.g. boosting certain terms depending on their position or formatting in the original document, their part of speech, and so on. Their main purpose is to affect how the search is performed, so they are normally numeric values, efficiently packed to cut down on index size.
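
For illustration, a numeric payload would be packed roughly like this (a sketch; PayloadHelper ships with Lucene in org.apache.lucene.analysis.payloads, and payAtt is a PayloadAttribute as in the filter above):

// encode a per-term boost into 4 bytes instead of a UTF-8 URL
byte[] data = PayloadHelper.encodeFloat(2.5f);
payAtt.setPayload(new Payload(data));
// query-side code, e.g. a PayloadTermQuery scorer, can decode it again
float boost = PayloadHelper.decodeFloat(data, 0);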

一百个冬季 2025-01-09 09:07:15


I might be missing something, but...
You don't need a custom tokenizer to associate additional information with a Lucene document. Just store it as a non-indexed field:

Document doc = new Document();
doc.add(new Field("fname", "Joe", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("job", "Plumber", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("link", "http://www.example.com", Field.Store.YES, Field.Index.NO));

You can then retrieve the "link" field just like any other stored field.
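
For example (a sketch; searcher is an open IndexSearcher and docId the id of a matching document):

Document hit = searcher.doc(docId);
String link = hit.get("link"); // "http://www.example.com"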

Also, if you really did need a custom tokenizer, then you would definitely need a custom analyzer to plug it in, for both index building and searching.
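
A minimal sketch of what that analyzer could look like in Lucene 3.5, reusing the LinkPayloadFilter sketched in the previous answer (all names are illustrative; since the link differs per document, you would have to construct the analyzer per document or feed the link in some other way):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Applies the payload filter only to the "job" field; other fields get plain tokens.
public final class JobPayloadAnalyzer extends Analyzer {
    private final String link;

    public JobPayloadAnalyzer(String link) {
        this.link = link;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new StandardTokenizer(Version.LUCENE_35, reader);
        ts = new LowerCaseFilter(Version.LUCENE_35, ts);
        if ("job".equals(fieldName)) {
            ts = new LinkPayloadFilter(ts, link);
        }
        return ts;
    }
}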
