Which Lucene analyzer should I use for special characters and punctuation?
I have a Lucene index that has several documents in it. Each document has multiple fields such as:
Id
Project
Name
Description
The Id field is a unique identifier such as a GUID. Project is a user's ProjectID, and a user can only view documents for their own project. Name and Description contain text that can include special characters.
When a user searches on the Name field, I want to match as broadly as I can. For example, a search for:
First
should return both:
First.Last
and
First.Middle.Last
Name can also be something like:
Test (NameTest)
In that case, if a user types 'Test', 'Name', or '(NameTest)', they should find the result.
However, if I say that Project is 'ProjectA', then that needs to be an exact match (with a case-insensitive search). The same goes for the Id field.
Which fields should I set up as Tokenized and which as Untokenized? Also, is there a good Analyzer I should consider to make this happen?
I am stuck trying to decide the best route to implement the desired searching.
Comments (1)
Your Id field should be untokenized, for the simple reason that it does not appear to be tokenizable (tokenization being whitespace-based) unless you write your own tokenizer. You can tokenize all your other fields.
Perform a phrase query on the project name: look up PhraseQuery, or enclose the project name in double quotes (which makes it match exactly). Example: "\"My Fancy Project\""
For the Name field, a simple query against the tokenized field should work fine.
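To see why a tokenized Name field can match a search for First against First.Last, here is a minimal plain-Java sketch of letter-based tokenization with lowercasing. It does not use Lucene; it only imitates, in spirit, an analyzer that splits at non-letters and lowercases (Lucene's SimpleAnalyzer works along those lines). The class name NameTokenSketch and its methods are hypothetical. Which tokens a real analyzer emits for periods and parentheses depends on the analyzer, so it is worth inspecting its output before relying on it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class NameTokenSketch {

    // Split on every run of non-letter characters and lowercase each piece.
    // This is an assumption about the analyzer, not Lucene's own code.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Term-style match: some query token equals some name token.
    static boolean matchesTerm(String name, String query) {
        List<String> nameTokens = tokenize(name);
        for (String q : tokenize(query)) {
            if (nameTokens.contains(q)) {
                return true;
            }
        }
        return false;
    }

    // Prefix-style match: some name token starts with some query token.
    static boolean matchesPrefix(String name, String query) {
        for (String q : tokenize(query)) {
            for (String t : tokenize(name)) {
                if (t.startsWith(q)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("First.Middle.Last"));
        System.out.println(tokenize("Test (NameTest)"));
        System.out.println(matchesTerm("First.Last", "First"));
        System.out.println(matchesTerm("Test (NameTest)", "Name"));
        System.out.println(matchesPrefix("Test (NameTest)", "Name"));
    }
}
```

Under these assumptions, First, Test and (NameTest) all match as whole tokens, but Name does not: NameTest is indexed as the single token nametest, so finding it from Name would need something like a prefix match (name*) or an analyzer that also splits on case changes.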
I am unsure whether there are situations where you want to combine fields. If there are, look up BooleanQuery, which lets you combine different queries with Boolean logic.
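The combination described in the question (Project must match exactly, ignoring case, while Name matches on a token) can be modelled in plain Java as two required conditions, which is roughly what a BooleanQuery with two MUST clauses expresses. This is only a sketch of the matching logic, not Lucene code; the Doc class, the sample data, and the method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class CombinedSearchSketch {

    // Hypothetical document holding only the two fields used here.
    static final class Doc {
        final String project;
        final String name;

        Doc(String project, String name) {
            this.project = project;
            this.name = name;
        }
    }

    // Untokenized-style check: the whole value must match, ignoring case.
    static boolean projectMatches(Doc doc, String project) {
        return doc.project.equalsIgnoreCase(project);
    }

    // Tokenized-style check: some run of letters in the name equals the word.
    static boolean nameMatches(Doc doc, String word) {
        String wanted = word.toLowerCase(Locale.ROOT);
        for (String piece : doc.name.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (piece.equals(wanted)) {
                return true;
            }
        }
        return false;
    }

    // Both conditions are required, like two MUST clauses.
    static List<String> search(List<Doc> docs, String project, String word) {
        List<String> hits = new ArrayList<>();
        for (Doc doc : docs) {
            if (projectMatches(doc, project) && nameMatches(doc, word)) {
                hits.add(doc.name);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
                new Doc("ProjectA", "First.Last"),
                new Doc("ProjectA", "First.Middle.Last"),
                new Doc("ProjectB", "First.Other"),
                new Doc("ProjectAB", "First.Extra"));
        System.out.println(search(docs, "projecta", "first"));
    }
}
```

In this model, ProjectAB is excluded even though it begins with ProjectA, because the project value is compared as a whole rather than token by token.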