Querying Lucene tokens without indexing

Posted 2024-08-07 15:36:50

I am using Lucene (or more specifically Compass) to log threads in a forum, and I need a way to extract the keywords behind the discussion. That said, I don't want to index every entry someone makes. Instead, I'd keep a list of 'keywords' that are relevant to a certain context, and if an entry matches a keyword and is above a threshold, I'd add that entry to the index.

I want to be able to use the power of an analyser to strip out things and do its magic, but then get the tokens back from the analyser in order to match them against the keywords, and also count how often certain words are mentioned.

Is there a way to get the tokens from an analyser without having the overhead of indexing every entry made?

I was thinking I'd have to maintain a RAMDirectory to hold all entries, and then perform searches using my list of keywords, then merge the relevant Documents to the persistence manager to actually store the relevant entries.

Comments (2)

往日情怀 2024-08-14 15:36:50

You should be able to skip using the RAMDirectory entirely. You can call the StandardAnalyzer directly and get it to pass back a list of tokens to you (aka keywords).

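import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// Lucene 2.x-era API. The field name is arbitrary because nothing is indexed.
StandardAnalyzer analyzer = new StandardAnalyzer();
TokenStream stream = analyzer.tokenStream("meaningless", new StringReader("<text>"));
while (true) {
    Token token = stream.next();   // returns null once the stream is exhausted
    if (token == null) break;

    System.out.println(token.termText());
}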

Better yet, write your own Analyzer (they're not hard, have a look at the source code for the existing ones) that uses your own filter to watch for your keywords.
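
Once you have the tokens, the keyword matching and counting from the question is plain Java. Here is a minimal sketch of that step, not a definitive implementation. `KeywordGate`, `tokenize`, `countKeywords` and `shouldIndex` are hypothetical names, not part of Lucene or Compass. The `tokenize` method is only a crude stand-in so the example runs without Lucene; in practice the tokens would come from the analyzer loop above. I'm also assuming "above a threshold" means the total number of keyword hits in an entry reaches some minimum.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of Lucene or Compass.
class KeywordGate {

    // Stand-in for the analyzer: lowercases the text and splits on anything
    // that is not a letter or digit. Real tokens would come from a TokenStream.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) tokens.add(part);
        }
        return tokens;
    }

    // Counts how often each keyword occurs among the tokens.
    // Keywords that never occur are left out of the result.
    static Map<String, Integer> countKeywords(List<String> tokens, Set<String> keywords) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : tokens) {
            if (keywords.contains(token)) counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // Assumed rule: index the entry when the total keyword hits reach the threshold.
    static boolean shouldIndex(Map<String, Integer> counts, int threshold) {
        int total = 0;
        for (int c : counts.values()) total += c;
        return total >= threshold;
    }

    public static void main(String[] args) {
        Set<String> keywords = new HashSet<>(Arrays.asList("lucene", "analyzer"));
        List<String> tokens = tokenize("Lucene's analyzer strips noise; Lucene then indexes.");
        Map<String, Integer> counts = countKeywords(tokens, keywords);
        System.out.println(counts + " -> index? " + shouldIndex(counts, 2));
    }
}
```

Only entries for which `shouldIndex` returns true would then be handed to the real index, so nothing else ever touches a Directory. Note that the keywords have to be in the same form the analyzer emits (lowercased, stemmed if you use a stemmer), otherwise they will never match.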

拥抱影子 2024-08-14 15:36:50

You are on the right path. You can create an index of each document using a RAMDirectory and then search it to check whether that document contains a relevant keyword. If not, discard the document. Otherwise, add it to the persistent/main index.

You don't need to hold all the documents in memory. Doing so would consume a lot of memory unnecessarily.
