Vespa visitor: indexing documents

Posted 2025-01-10 19:33:00 · 1471 characters · 0 views · 0 comments

I want to assign an ID to every document in a Vespa cluster.

But I don't completely understand how visitors work in Vespa.

Can I get a shared field (that is, one shared by all instances of my visitor) that I can atomically increment (using some lock) every time I visit a document?

What I tried obviously doesn't work, but you'll see the general idea:

import java.util.Iterator;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

import com.yahoo.docproc.DocumentProcessor;
import com.yahoo.docproc.Processing;
import com.yahoo.document.Document;
import com.yahoo.document.DocumentOperation;
import com.yahoo.document.DocumentPut;

public class MyVisitor extends DocumentProcessor {

    // where should I put this?
    private int document_id;

    private final Lock lock = new ReentrantLock();

    @Override
    public Progress process(Processing processing) {
        Iterator<DocumentOperation> it = processing.getDocumentOperations().iterator();
        while (it.hasNext()) {

            DocumentOperation op = it.next();
            if (op instanceof DocumentPut) {

                Document doc = ((DocumentPut) op).getDocument();
                /*
                 * Remove the PUT operation from the iterator so that it is not indexed back in
                 * the document cluster
                 */
                it.remove();

                // Acquire the lock before the try block so unlock() only runs after a successful lock()
                lock.lock();
                try {
                    document_id += 1;
                } finally {
                    lock.unlock();
                }
            }
        }
        return Progress.DONE;
    }
}

Another idea is to get the number of buckets and the ID of the bucket I'm currently dealing with, and to increment using this pattern:

document_id = bucket_id
document_id += bucket_count

This would work (if I can ensure my visitor operates on a single bucket at a time), but I don't know how to get this information from my visitor.
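
To make the arithmetic of that pattern concrete, here is a minimal sketch in plain Java. It only illustrates the striding: each allocator starts at its bucket ID and steps by the bucket count, so every value it returns is congruent to the bucket ID modulo the bucket count, and allocators for different buckets never collide. The class name `StridedIdAllocator` is hypothetical, and the sketch assumes bucket IDs are numbered 0 to bucket_count - 1; it does not show how to obtain either value from Vespa.

```java
// Hypothetical helper, not part of Vespa: hands out IDs for one bucket.
final class StridedIdAllocator {
    private final long stride; // total number of buckets (bucket_count)
    private long next;         // next ID to hand out, starts at bucket_id

    StridedIdAllocator(long bucketId, long bucketCount) {
        if (bucketCount <= 0 || bucketId < 0 || bucketId >= bucketCount) {
            throw new IllegalArgumentException("need 0 <= bucketId < bucketCount");
        }
        this.next = bucketId;
        this.stride = bucketCount;
    }

    // Returns the current ID, then advances by the stride.
    long nextId() {
        long id = next;
        next += stride;
        return id;
    }

    public static void main(String[] args) {
        StridedIdAllocator bucket1 = new StridedIdAllocator(1, 4);
        StridedIdAllocator bucket2 = new StridedIdAllocator(2, 4);
        for (int i = 0; i < 3; i++) {
            System.out.println("bucket 1: " + bucket1.nextId() + ", bucket 2: " + bucket2.nextId());
        }
    }
}
```

The IDs are unique but not contiguous: buckets holding fewer documents leave gaps in the sequence.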


Comments (1)

滥情稳全场 2025-01-17 19:33:00

Document processors operate on incoming document writes, so they cannot be applied to the result of visiting (not without a bit more setup anyway).

Instead, you can visit the documents by fetching them all over HTTP/2: https://docs.vespa.ai/en/reference/document-v1-api-reference.html#visit

Then use the same API to issue an update operation for each document, setting the field: https://docs.vespa.ai/en/reference/document-v1-api-reference.html#put

Since this is done by a single process, you can keep a document_id counter in that process and have it assign unique values.
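
As a rough sketch of the update half of that loop, the Java below keeps one counter in a single process and builds a /document/v1 partial update (HTTP PUT with an `assign`) for each visited document ID. Several things here are assumptions rather than anything stated above: the class name `SequentialIdAssigner`, the field name `document_id`, the sample document IDs, the `http://localhost:8080` endpoint, and that the IDs use the plain `id:namespace:doctype::key` form with no number or group modifier. The visit step itself (GET with a continuation token) is left out, so the IDs are hardcoded.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper: assigns sequential values from a single process.
final class SequentialIdAssigner {
    // No lock needed as long as only this one thread/process hands out values.
    private long nextId = 0;

    // Maps "id:namespace:doctype::key" to its /document/v1 path.
    static String documentPath(String documentId) {
        String[] parts = documentId.split(":", 5);
        if (parts.length != 5 || !parts[0].equals("id") || !parts[3].isEmpty()) {
            throw new IllegalArgumentException("Unsupported document ID: " + documentId);
        }
        // URLEncoder targets form encoding, so turn '+' (space) into %20 for a path segment.
        String key = URLEncoder.encode(parts[4], StandardCharsets.UTF_8).replace("+", "%20");
        return "/document/v1/" + parts[1] + "/" + parts[2] + "/docid/" + key;
    }

    // Partial-update body that assigns the next counter value to the assumed field.
    String nextUpdateBody() {
        return "{\"fields\":{\"document_id\":{\"assign\":" + (nextId++) + "}}}";
    }

    // Builds the PUT (update) request for one visited document.
    HttpRequest updateRequest(String endpoint, String documentId) {
        return HttpRequest.newBuilder(URI.create(endpoint + documentPath(documentId)))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(nextUpdateBody()))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder IDs; in practice these come from the "id" of each visited document.
        List<String> visited = List.of("id:mynamespace:music::a", "id:mynamespace:music::b");
        String endpoint = args.length > 0 ? args[0] : "http://localhost:8080";
        SequentialIdAssigner assigner = new SequentialIdAssigner();
        HttpClient client = HttpClient.newHttpClient();
        for (String id : visited) {
            HttpRequest request = assigner.updateRequest(endpoint, id);
            System.out.println(request.method() + " " + request.uri());
            if (args.length > 0) { // only send when an endpoint was passed explicitly
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                System.out.println(response.statusCode());
            }
        }
    }
}
```

Run without arguments, it only prints the requests it would send; pass an endpoint as the first argument to actually issue them.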

As an aside, a common trick to avoid that requirement is to generate a UUID for each document.
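
A minimal sketch of the UUID approach, using only `java.util.UUID` from the standard library. The class name `UuidDocumentIds`, the namespace and document type values, and the choice to put the UUID in the user-specified part of the document ID are all illustrative assumptions. The second method derives a name-based UUID from an existing natural key, which may help if the same source record must always map to the same ID.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical helper: builds Vespa-style document IDs around a UUID.
final class UuidDocumentIds {
    // Random (version 4) UUID: unique without any shared counter or coordination.
    static String randomDocumentId(String namespace, String docType) {
        return "id:" + namespace + ":" + docType + "::" + UUID.randomUUID();
    }

    // Name-based (version 3) UUID: the same key always yields the same ID.
    static String stableDocumentId(String namespace, String docType, String naturalKey) {
        UUID uuid = UUID.nameUUIDFromBytes(naturalKey.getBytes(StandardCharsets.UTF_8));
        return "id:" + namespace + ":" + docType + "::" + uuid;
    }

    public static void main(String[] args) {
        System.out.println(randomDocumentId("mynamespace", "music"));
        System.out.println(stableDocumentId("mynamespace", "music", "artist/album/track-01"));
    }
}
```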
