Java full-text search solution?

Posted on 2024-12-06 06:42:20, 1552 characters, 0 views, 0 comments

There's a large set of entities of different kinds:

interface Entity {
}

interface Entity1 extends Entity {
  String field1();
  String field2();
}

interface Entity2 extends Entity {
  String field1();
  String field2();
  String field3();
}

interface Entity3 extends Entity {
  String field12();
  String field23();
  String field34();
}

Set<Entity> entities = ...

The task is to implement full-text search for this set. By full-text search I mean that I just need to get the entities that contain the substring I'm looking for (I don't need to know the exact property, the exact offset of the substring, etc.). In the current implementation, the Entity interface has a method matches(String):

interface Entity {
  boolean matches(String text);
}

Each entity class implements it depending on its internals:

class Entity1Impl implements Entity1 {
  public String field1() {...}
  public String field2() {...}

  public boolean matches(String text) {
    return field1().toLowerCase().contains(text.toLowerCase()) ||
           field2().toLowerCase().contains(text.toLowerCase());
  }
}

I believe this approach is really awful (though it works). I'm considering using Lucene to build an index every time I have a new set. By index I mean content -> id mappings. The content is just a trivial "sum" of all the fields I'm considering, so for Entity1 the content would be the concatenation of field1() and field2(). I have some doubts about performance: building an index is often quite an expensive operation, so I'm not sure it would help.
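To make the content -> id idea concrete, here is a minimal plain-JDK sketch. It is not Lucene and not the existing code: the class name ContentIndex, its methods, and the use of string ids are all assumptions made for illustration. It lowercases the concatenated fields once per entity, so each query only pays for one contains call per entity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not part of the original code): keeps, for every entity id,
// the pre-lowercased concatenation of the fields that should be searchable.
final class ContentIndex {
    private final Map<String, String> contentById = new LinkedHashMap<>();

    // Fields are joined with '\n' so that a match cannot span two adjacent fields
    // (assumption: queries never contain a newline).
    void add(String id, String... fields) {
        contentById.put(id, String.join("\n", fields).toLowerCase(Locale.ROOT));
    }

    // Returns the ids whose content contains the query, ignoring case,
    // in the order the entities were added.
    List<String> search(String query) {
        String needle = query.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : contentById.entrySet()) {
            if (entry.getValue().contains(needle)) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }
}
```

This is still a linear scan, so it only pays off when the same set is queried more than once. Note that it uses Locale.ROOT, whereas the matches implementation above relies on the default locale.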

Do you have any other suggestions?

To clarify the details:

  1. Set<Entity> entities = ... has ~10000 items.
  2. Set<Entity> entities = ... is not read from a DB, so I can't just add a where ... condition. The data source is quite non-trivial, so I can't solve the problem on its side.
  3. Entities should be thought of as short articles, so some fields may be up to 10KB, while others may be ~10 bytes.
  4. I need to perform this search quite often, but both the query string and the original set are different every time, so it looks like I can't just build the index once (because the set of entities is different every time).
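Given points 3 and 4 (fields up to 10KB, and a fresh set for every query), one cheap thing to try before reaching for an index is to stop allocating a lowercased copy of every field on every call. The sketch below is one possible way to do that with String.regionMatches. The class TextMatch and its method names are hypothetical, and its case folding is per character, so for some non-ASCII text it may not agree exactly with toLowerCase().contains(...).

```java
// Hypothetical utility (names are illustrative, not from the original code).
final class TextMatch {
    private TextMatch() {}

    // Case-insensitive substring test that does not create lowercased copies
    // of the haystack or the needle.
    static boolean containsIgnoreCase(String haystack, String needle) {
        int lastStart = haystack.length() - needle.length();
        for (int i = 0; i <= lastStart; i++) {
            // regionMatches(true, ...) compares the two regions ignoring case.
            if (haystack.regionMatches(true, i, needle, 0, needle.length())) {
                return true;
            }
        }
        return false;
    }

    // True if at least one of the given fields contains the query.
    static boolean anyFieldContains(String query, String... fields) {
        for (String field : fields) {
            if (containsIgnoreCase(field, query)) {
                return true;
            }
        }
        return false;
    }
}
```

With this helper, Entity1Impl.matches(text) could be reduced to return TextMatch.anyFieldContains(text, field1(), field2());. Whether this is fast enough for ~10000 entities is something only a measurement on real data can tell.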

Comments (2)

蓝天 2024-12-13 06:42:20

For such a complex object domain, you can use a Lucene wrapper such as Compass, which lets you quickly map your object graph to a Lucene index using the same approach as an ORM (like Hibernate).

梦罢 2024-12-13 06:42:20

I would strongly consider using Lucene with SOLR. http://lucene.apache.org/java/docs/index.html
