Full-text search solution for Java?
There's a large set of entities of different kinds:
interface Entity {
}

interface Entity1 extends Entity {
    String field1();
    String field2();
}

interface Entity2 extends Entity {
    String field1();
    String field2();
    String field3();
}

interface Entity3 extends Entity {
    String field12();
    String field23();
    String field34();
}

Set<Entity> entities = ...
The task is to implement full-text search for this set. By full-text search I mean that I just need to get the entities that contain the substring I'm looking for (I don't need to know the exact property, the exact offset of the substring, etc.). In the current implementation, the Entity interface has a method matches(String):
interface Entity {
    boolean matches(String text);
}
Each entity class implements it depending on its internals:
class Entity1Impl implements Entity1 {
    public String field1() {...}
    public String field2() {...}

    public boolean matches(String text) {
        return field1().toLowerCase().contains(text.toLowerCase()) ||
               field2().toLowerCase().contains(text.toLowerCase());
    }
}
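The matches(String) implementation above creates a lowercased copy of every field (and of the query) on each call, which adds up when some fields are around 10KB. A minimal sketch of one way to avoid those copies is shown here, assuming a hypothetical helper class TextMatch that is not part of the original code. It relies on String.regionMatches with ignoreCase set to true; for typical text this behaves like the toLowerCase().contains(...) check, though it may differ for some Unicode edge cases.

```java
/** Hypothetical helper for illustration; not part of the original code. */
final class TextMatch {
    private TextMatch() {}

    /** Returns true if haystack contains needle, ignoring case, without copying haystack. */
    static boolean containsIgnoreCase(String haystack, String needle) {
        if (haystack == null || needle == null) {
            return false;
        }
        int last = haystack.length() - needle.length();
        for (int i = 0; i <= last; i++) {
            // regionMatches(true, ...) compares the two regions case-insensitively in place.
            if (haystack.regionMatches(true, i, needle, 0, needle.length())) {
                return true;
            }
        }
        return false;
    }

    /** Returns true if at least one of the given fields contains the query, ignoring case. */
    static boolean anyContains(String query, String... fields) {
        for (String field : fields) {
            if (containsIgnoreCase(field, query)) {
                return true;
            }
        }
        return false;
    }
}
```

With such a helper, each entity's matches(String text) could shrink to a single line such as return TextMatch.anyContains(text, field1(), field2());, so the matching logic lives in one place instead of being repeated per class.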
I believe this approach is really awful (though it works). I'm considering using Lucene to build an index every time I have a new set. By index I mean content -> id mappings. The content is just a trivial "sum" of all the fields I'm considering. So, for Entity1 the content would be the concatenation of field1() and field2(). I have some doubts about the performance: building the index is often quite an expensive operation, so I'm not really sure it helps.
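To make the "content -> id" idea concrete, here is a rough sketch using plain Java collections rather than Lucene. The class name ContentIndex, the string ids, and the separator character are all assumptions made for illustration. One detail worth noting: a plain concatenation of field1() = "ab" and field2() = "cd" would wrongly match the query "bc", so the sketch joins fields with a separator that is assumed never to appear in a query.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of the "content -> id" idea with plain collections, not Lucene. */
final class ContentIndex {
    // Placed between fields so a match cannot span two adjacent fields.
    // Assumption: queries never contain this character.
    private static final String FIELD_SEPARATOR = "\u0000";

    // id -> lowercased, concatenated content of the searchable fields.
    private final Map<String, String> contentById = new LinkedHashMap<>();

    /** Registers an entity under a caller-chosen id together with its searchable fields. */
    void add(String id, String... fields) {
        String content = String.join(FIELD_SEPARATOR, fields);
        contentById.put(id, content.toLowerCase(Locale.ROOT));
    }

    /** Returns the ids whose content contains the query, ignoring case, in insertion order. */
    List<String> search(String query) {
        String needle = query.toLowerCase(Locale.ROOT);
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, String> entry : contentById.entrySet()) {
            if (entry.getValue().contains(needle)) {
                ids.add(entry.getKey());
            }
        }
        return ids;
    }
}
```

This still scans every entry per query, so it only pays off when the same set is searched more than once: the lowercasing and concatenation happen once in add rather than on every call. It uses Locale.ROOT, whereas the original code uses the default locale.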
Do you have any other suggestions?
To clarify the details:
- Set<Entity> entities = ... is of ~10000 items.
- Set<Entity> entities = ... is not read from a DB, so I can't just add a where ... condition. The data source is quite non-trivial, so I can't solve the problem on its side.
- Entities should be thought of as short articles, so some fields may be up to 10KB, while others may be ~10 bytes.
- I need to perform this search quite often, but both the query string and the original set are different every time, so it looks like I can't just build the index once (because the set of entities is different every time).
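Given these constraints (about 10000 entities, with both the set and the query changing on every search), one option that needs no index at all is a single pass over the set, optionally spread across cores. The sketch below is only an illustration: the class name ParallelScan is hypothetical, the nested Entity interface is a stand-in for the one in the question, and it assumes matches(String) is free of side effects and safe to call from several threads. Whether the parallel version is actually faster than a sequential loop would need to be measured.

```java
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical sketch: one-pass scan over a fresh entity set, no index involved. */
final class ParallelScan {
    private ParallelScan() {}

    /** Minimal stand-in for the Entity interface from the question. */
    interface Entity {
        boolean matches(String text);
    }

    /** Returns the entities whose matches(query) is true, evaluated on a parallel stream. */
    static Set<Entity> search(Set<Entity> entities, String query) {
        return entities.parallelStream()
                .filter(entity -> entity.matches(query))
                .collect(Collectors.toSet());
    }
}
```

Replacing parallelStream() with stream() gives the sequential equivalent, which makes it easy to compare the two on real data.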
Comments (2)
For such a complex object domain, you can use a Lucene wrapper tool like Compass, which lets you quickly map your object graph to a Lucene index using the same approach as an ORM (like Hibernate).
I would strongly consider using Lucene with Solr. http://lucene.apache.org/java/docs/index.html