Indexing file content and custom metadata separately with Solr 3.3
I am doing a POC on content/text search using Solr 3.3.
I have a requirement where documents, along with their content and custom metadata, are indexed initially. After the documents are indexed and made available for searching, users can change a document's custom metadata. However, once a document has been added to the index, its content cannot be updated. When a user updates the custom metadata, the document's index entry has to be updated so that the metadata changes are reflected in search results.
During such an index update, however, the content of the file is re-indexed even though it has not changed, which delays the metadata update.
So I wanted to check whether there is a way to skip content indexing and update just the metadata.
Or do I have to store the content and metadata in separate indexes, i.e. documentId and content in one index, and documentId and custom metadata in another? In that case, how can I query across these two indexes and return a combined result?
2 Answers
"if there is a way to avoid content indexing and update just the metadata" This has been covered in solr indexing and reindexing and the answer is no.
Do remember that Solr uses a very loose schema. It's like a database where everything is put into a single table; think sparse matrices, think Amazon SimpleDB. If you had DB-like joins in mind, note that two Solr indexes are treated as two databases, not two tables. I just answered a related question in How to start and Stop SOLR from A user created windows service.
I would enter each file as two documents (one Solr document = one DB row). Hence, for a file on "watson", the contents document would be:
and the metadata as
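The example documents from the original answer did not survive the page extraction. As a sketch, the two Solr add documents for the "watson" file might look like the following — the field names `documentId`, `text`, `author`, and `year` are assumptions for illustration, not from the original; only the `type` field is taken from the answer's queries:

```xml
<add>
  <!-- Document 1: the file contents (hypothetical field names) -->
  <doc>
    <field name="documentId">watson-001</field>
    <field name="type">contents</field>
    <field name="text">on a dark lonely night ...</field>
  </doc>
  <!-- Document 2: the custom metadata for the same file -->
  <doc>
    <field name="documentId">watson-001</field>
    <field name="type">metadata</field>
    <field name="author">Watson</field>
    <field name="year">1984</field>
  </doc>
</add>
```

With this layout, a metadata change only requires re-feeding the small metadata document; the large contents document is untouched.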
To search the contents of a document:
http://localhost:8080/app/select?q=type:contents&text:"on a dark lonely night"
To do metadata searches:
http://localhost:8080/app/select?q=type:metadata&year:1984
Note the type:xx.
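Since both document types live in the same index, the asker's question about combining results reduces to issuing the two queries and joining on `documentId`. A minimal sketch, assuming Solr's standard JSON response format and the hypothetical URL and field names used above:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical Solr endpoint from the answer's example queries.
SOLR_URL = "http://localhost:8080/app/select"

def solr_query(doc_type, query):
    """Query one document type; returns the parsed 'docs' list."""
    params = urlencode({"q": f"type:{doc_type} AND {query}", "wt": "json"})
    with urlopen(f"{SOLR_URL}?{params}") as resp:
        return json.load(resp)["response"]["docs"]

def merge_by_document_id(contents_docs, metadata_docs):
    """Inner-join the two result lists on documentId."""
    meta_by_id = {d["documentId"]: d for d in metadata_docs}
    merged = []
    for doc in contents_docs:
        meta = meta_by_id.get(doc["documentId"])
        if meta is not None:
            merged.append({**doc, **meta})
    return merged

# The join logic can be exercised without a running Solr instance:
contents = [{"documentId": "watson-001", "text": "on a dark lonely night"}]
metadata = [{"documentId": "watson-001", "author": "Watson", "year": 1984}]
print(merge_by_document_id(contents, metadata))
```

The join happens client-side because Solr 3.3 has no join support; each merged result carries both the content hit and its current metadata.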
This may be a kludge (an implementation that can cause headaches in the long run). Fellow SO'ers, please critique it.
We did try this and it should work. Take a snapshot of what you have (basically the SolrInputDocument object) before you send it to Lucene. Serialize and compress the object, then assign it to one more field in your schema, defined as a binary field.
So when you want to update one of the fields, just fetch the binary field, decompress and deserialize it, append/update the values in the fields you are interested in, and re-feed the document to Lucene.
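The answer describes this in Java/SolrJ terms, but the snapshot technique itself is language-neutral. A minimal sketch of the round trip in Python, with the document represented as a plain dict and all field names made up for illustration:

```python
import pickle
import zlib

def snapshot(doc_fields):
    """Serialize and compress the document's fields into a binary blob,
    suitable for storing in a Solr binary field."""
    return zlib.compress(pickle.dumps(doc_fields))

def restore_and_update(blob, **updates):
    """Decompress and deserialize the stored snapshot, apply the field
    updates, and return the document ready to be re-fed to the index."""
    doc_fields = pickle.loads(zlib.decompress(blob))
    doc_fields.update(updates)
    return doc_fields

# Initial indexing: store the snapshot alongside the document.
original = {"documentId": "watson-001", "author": "Watson", "year": 1984}
blob = snapshot(original)

# Later metadata update: no re-extraction of the file contents needed.
updated = restore_and_update(blob, year=1985)
print(updated)  # {'documentId': 'watson-001', 'author': 'Watson', 'year': 1985}
```

The point of the snapshot is that the expensive Tika extraction never has to be repeated: the full field set, including the extracted text, is recovered from the blob.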
Never forget to store the XML containing the Tika-extracted text (the text used for search/indexing) as one of the fields inside the SolrInputDocument.
The only negative: your index size will grow a little, but you get what you want without having to re-feed the data.