Is it possible to add custom metadata to a Lucene field?
I've come to the point where I need to store some additional data about where a particular field comes from in my Lucene.Net index. Specifically, I want to attach a guid to certain fields of a document when the field is added to the document, and retrieve it again when I get the document from a search result.
Is this possible?
Edit:
Okay, let me clarify a bit by giving an example.
Let's say I have an object that I want to allow the user to tag with custom tags like "personal", "favorite", "some-project". I do this by adding multiple "tag" fields to the document, like so:
doc.Add( new Field( "tag", "personal", Field.Store.YES, Field.Index.NOT_ANALYZED ) );
doc.Add( new Field( "tag", "favorite", Field.Store.YES, Field.Index.NOT_ANALYZED ) );
The problem is that I now need to record some metadata about each individual tag itself, specifically a guid representing where that tag came from (imagine it as a user id). Each tag could potentially have a different guid, so I can't simply create a "tag-guid" field (unless the order of the values is preserved; see edit 2 below). I don't need this metadata to be indexed (in fact I'd prefer it not to be, to avoid getting hits on metadata); I just need to be able to retrieve it again from the document/field.
doc.GetFields( "tag" )[0].Metadata...
(I'm making up syntax here, but I hope my point is clear now.)
Edit 2:
Since this is a completely different question, I've posted a new question for this approach: Is the order of multi-valued fields in Lucene stable?
Okay, let's try another approach. The key problem is the indeterminacy of the order of multiple values stored under the same field name (e.g. "tag"). If I could introduce or obtain some kind of determinacy here, I might be able to store the metadata in another field.
For example, if I could rely on the order of the values of the field never changing, I could use an index in the set of values to identify exactly which tag I am referring to.
Is there any guarantee that the order I add the values to a field will remain the same when I retrieve the document at a later time?
Comments (2)
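Depending on your search requirements for this index, this may be possible if you store all the tags as one delimited value in a single field, with the matching guids in a second field in the same order. That way you control the order of the values yourself. It would require updating both fields whenever the tag list changes, of course, but the overhead may be worth it.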
Note: using the {} allows you to qualify your search for uniqueness where similar values exist.
Example: If values were stored as "person|personal|personage" searching for "person" would return a document that has any one of person, personal or personage. By qualifying in curly brackets like so: "{person}|{personal}|{personage}", I can search for "{person}" and be sure it won't return false positives. Of course, this assumes you don't use curly brackets in your values.
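The brace-delimited layout described above can be sketched independently of Lucene. The following is a minimal illustration in Python, not Lucene.Net code: the helper names (encode_values, decode_values, guid_for_tag) and the sample guids are made up for this example, and it assumes the tags and their guids are kept in two parallel delimited strings in the same order. Whether a search for "{person}" actually matches as intended also depends on how the field is analyzed, which this sketch does not model.

```python
# Hypothetical helpers for illustration only; none of these names come from Lucene.

def encode_values(values):
    """Join values as '{a}|{b}|{c}', rejecting the reserved delimiter characters."""
    for v in values:
        if any(ch in v for ch in "{}|"):
            raise ValueError(f"value contains a reserved character: {v!r}")
    return "|".join("{" + v + "}" for v in values)


def decode_values(encoded):
    """Split '{a}|{b}|{c}' back into ['a', 'b', 'c']."""
    if not encoded:
        return []
    return [part[1:-1] for part in encoded.split("|")]


def guid_for_tag(tags_field, guids_field, tag):
    """Return the guid stored at the same position as the given tag."""
    tags = decode_values(tags_field)
    guids = decode_values(guids_field)
    if len(tags) != len(guids):
        raise ValueError("tag and guid fields are out of sync")
    return guids[tags.index(tag)]


# Made-up sample data: three similar tags and a placeholder guid for each.
tags = ["person", "personal", "personage"]
guids = [
    "00000000-0000-0000-0000-000000000001",
    "00000000-0000-0000-0000-000000000002",
    "00000000-0000-0000-0000-000000000003",
]

tags_field = encode_values(tags)    # value for the single "tags" field
guids_field = encode_values(guids)  # value for the parallel guid field

# A bare substring occurs in all three tags; the braced form occurs once.
bare_hits = tags_field.count("person")
braced_hits = tags_field.count("{person}")
```

Because the order is fixed by the string itself rather than by the index, the two fields stay aligned as long as both are rewritten together whenever the tag list changes.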
I think you're asking about payloads.
Edit: From your use case, it sounds like you have no desire to use this metadata in your search, you just want it there. (Basically, you want to use Lucene as a database system.)
So, why can't you use a binary field?
Then you can deserialize it on retrieval.
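As a rough illustration of the serialize/deserialize round trip suggested here, the sketch below packs (tag, guid) pairs into a byte array and reads them back. It is written in Python and does not call any Lucene API. The JSON encoding, the function names and the sample guids are all assumptions made for the example; the answer does not say which serialization format to use, and the resulting bytes would simply be whatever you hand to the stored binary field.

```python
import json

# Hypothetical helpers for illustration only; the JSON format is an assumption.

def serialize_tag_metadata(pairs):
    """Pack (tag, guid) pairs into bytes suitable for storing as a binary value."""
    payload = [{"tag": tag, "guid": guid} for tag, guid in pairs]
    return json.dumps(payload).encode("utf-8")


def deserialize_tag_metadata(blob):
    """Unpack bytes produced by serialize_tag_metadata into (tag, guid) pairs."""
    return [(item["tag"], item["guid"]) for item in json.loads(blob.decode("utf-8"))]


# Made-up sample data: each tag carries the guid of whoever added it.
pairs = [
    ("personal", "00000000-0000-0000-0000-000000000001"),
    ("favorite", "00000000-0000-0000-0000-000000000002"),
]

blob = serialize_tag_metadata(pairs)       # store this in the binary field
restored = deserialize_tag_metadata(blob)  # run this after retrieving the document
```

Keeping each tag and its guid together in one serialized value means the pairing does not depend on the order in which multi-valued fields are returned.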