Java full-text search solution?

Posted on 2024-12-06 06:42:20

There is a large set of entities of different kinds:

interface Entity {
}

interface Entity1 extends Entity {
  String field1();
  String field2();
}

interface Entity2 extends Entity {
  String field1();
  String field2();
  String field3();
}

interface Entity3 extends Entity {
  String field12();
  String field23();
  String field34();
}

Set<Entity> entities = ...

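The task is to implement full-text search over this set. By full-text search I mean only that I need to get the entities that contain the substring I'm looking for (I don't need to know the exact property, the exact offset of the substring, and so on). In the current implementation, the Entity interface has a method matches(String):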

interface Entity {
  boolean matches(String text);
}

Each entity class implements it according to its internals:

class Entity1Impl implements Entity1 {
  public String field1() {...}
  public String field2() {...}

  public boolean matches(String text) {
    return field1().toLowerCase().contains(text.toLowerCase()) ||
           field2().toLowerCase().contains(text.toLowerCase());
  }
}
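
For reference, the same case-insensitive check can be written without creating lowercase copies of every field on each call. This is only a sketch under stated assumptions: `SubstringMatch`, `containsIgnoreCase`, and `anyContains` are hypothetical names, not part of the code above, and `String.regionMatches` with `ignoreCase = true` compares character by character rather than applying locale-aware case folding, so results may differ from `toLowerCase()` for some non-ASCII text.

```java
final class SubstringMatch {
    private SubstringMatch() {}

    /** Returns true if haystack contains needle, ignoring case (per-character comparison). */
    static boolean containsIgnoreCase(String haystack, String needle) {
        int n = needle.length();
        int last = haystack.length() - n; // last start offset where needle still fits
        for (int i = 0; i <= last; i++) {
            if (haystack.regionMatches(true, i, needle, 0, n)) {
                return true;
            }
        }
        return false;
    }

    /** Hypothetical helper: true if any non-null field value contains the text. */
    static boolean anyContains(String text, String... fields) {
        for (String f : fields) {
            if (f != null && containsIgnoreCase(f, text)) {
                return true;
            }
        }
        return false;
    }
}
```

With such a helper, matches(String) in Entity1Impl could be reduced to `return SubstringMatch.anyContains(text, field1(), field2());`. It is still a linear scan of every field.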

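I believe this approach is really awful (though it works). I'm considering using Lucene to build an index every time I get a new set. By index I mean content -> id mappings. The content is just a trivial "sum" of all the fields I'm considering, so for Entity1 the content would be the concatenation of field1() and field2(). I have some doubts about performance: building an index is often quite an expensive operation, so I'm not sure it would help.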
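
To make the content -> id idea concrete, here is a minimal plain-Java sketch that does not use Lucene at all. It concatenates each entity's fields into one lowercased string once per set, keyed by id, and scans those strings for each query. `ContentIndex`, `add`, and `search` are hypothetical names, ids are assumed to be strings, and the newline separator is an assumption chosen so that a match cannot straddle two adjacent fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class ContentIndex {
    // Assumed separator: keeps a match from spanning two adjacent fields
    private static final char SEP = '\n';

    // id -> lowercased concatenation of the entity's searchable fields
    private final Map<String, String> contentById = new LinkedHashMap<>();

    /** Stores the concatenated, lowercased fields of one entity under its id. */
    void add(String id, String... fields) {
        StringBuilder sb = new StringBuilder();
        for (String f : fields) {
            if (f != null) {
                sb.append(f.toLowerCase(Locale.ROOT)).append(SEP);
            }
        }
        contentById.put(id, sb.toString());
    }

    /** Returns the ids whose content contains the text, ignoring case, in insertion order. */
    List<String> search(String text) {
        String needle = text.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : contentById.entrySet()) {
            if (e.getValue().contains(needle)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}
```

This only moves the lowercasing and concatenation to build time; each query is still a linear scan over all contents, so it is a baseline to measure against rather than a real index.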

Do you have any other suggestions?

To clarify the details:

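Set<Entity> entities = ... has ~10000 items.
Set<Entity> entities = ... is not read from a DB, so I can't just add a where ... condition. The data source is quite non-trivial, so I can't solve the problem on its side.
Entities should be thought of as short articles, so some fields may be up to 10KB, while others may be ~10 bytes.
I need to perform this search quite often, but both the query string and the original set are different every time, so it looks like I can't just build the index once (because the set of entities is different every time).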
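
Given these constraints (a fresh set and a fresh query each time), one baseline to compare any index against is a single pass over the set using the existing matches(String) method. The sketch below is illustrative only: `LinearSearch` is a hypothetical name, the `Entity` interface is repeated so the block compiles on its own, and using a parallel stream assumes that matches(String) is safe to call concurrently.

```java
import java.util.Set;
import java.util.stream.Collectors;

// Repeated from the question so that the sketch is self-contained
interface Entity {
    boolean matches(String text);
}

final class LinearSearch {
    private LinearSearch() {}

    /** Returns the entities whose matches(text) is true, scanning the set once. */
    static Set<Entity> search(Set<Entity> entities, String text) {
        return entities.parallelStream()          // assumes matches() is thread-safe
                .filter(e -> e.matches(text))
                .collect(Collectors.toSet());
    }
}
```

Whether the parallel version actually beats a sequential loop for ~10000 items depends on the field sizes and the machine, so it would need to be measured.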

