What is a good way to index Solr records whose data comes from multiple sources?
I have multiple sources of data from which I want to produce Solr documents. One source is a filesystem, so I plan to iterate through a set of (potentially many) files to collect one portion of the data in each resulting Solr doc. The second source is another Solr index, from which I'd like to pull just a few fields. This second source also could have many (~millions) of records. If it matters, source 1 provides the bulk of the content (the size of each record there is several orders of magnitude greater than that from source 2).
Source 1:
- /file/band1 -> id="xyz1" name="beatles" era="60s"
- /file/band2 -> id="xyz2" name="u2" era="80s"
- ...
- /file/band4000 -> id="xyz4000" name="clash" era="70s"
Source 2:
- solr record 1 -> id="xyz2" guitar="edge"
- solr record 2 -> id="xyz4000" guitar="jones"
- solr record 3 -> id="xyz1" guitar="george"
My issue is how best to design this workflow. A few high-level choices include:
- Fully index the data from source 1 (the filesystem). Next, index the data from source 2 and update the already-indexed records. With Solr, I believe you still can't just add a single field to a record; you replace the entire old record with a new one.
- Do the reverse of (1), indexing first the data from the Solr source, followed by the data from the filesystem.
- Somehow integrate the data before indexing into Solr. In general, we don't know much about the order of traversal in each source--which is to say, I don't see an easy way to iterate the two sources together, in which xyz1 gets processed from both sources, then xyz2, etc.
So some of the factors affecting the decision include the size of the data (can't afford to be too inefficient in terms of computational time or memory) and the performance of Solr when replacing records (does the original size matter much?).
Any ideas would be greatly appreciated.
Answers (2)
I would say if you're not concerned about the data that is stored in two sources being merged first then option 1 or 2 would work fine. I would probably index the larger source first, then "update" with the second.
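Given the question's premise that a Solr "update" replaces the whole record, the update step here amounts to fetching the stored document, merging in the new field, and re-posting the full document. A minimal sketch of that replace-on-update semantics (field names taken from the question's examples):

```python
def apply_update(stored_doc, patch):
    """Simulate Solr's replace-on-update: the replacement document must
    carry every field, so merge the patch into the full stored record."""
    full = dict(stored_doc)
    full.update(patch)
    return full

# Record already indexed from source 1 (the larger source)...
stored = {"id": "xyz2", "name": "u2", "era": "80s"}
# ...then "updated" with the one field from source 2.
print(apply_update(stored, {"guitar": "edge"}))
# {'id': 'xyz2', 'name': 'u2', 'era': '80s', 'guitar': 'edge'}
```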
Go with option 3: combine the records before updating.
Presumably you would be using a script to iterate over the files and process them before sending them to your final Solr index. Within that script, query the alternate Solr index to fetch any supplemental field information that it might have, using your shared identifier. Combine that as appropriate with the contents of your file, then send the resulting record to Solr for indexing.
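A minimal sketch of such a script in Python, using only the standard library. The core and field names (`alt_core`, `dest_core`, `guitar`) and URLs are assumptions for illustration; the file parsing would depend on your actual format:

```python
import json
import urllib.parse
import urllib.request

SOLR_ALT = "http://localhost:8983/solr/alt_core"    # source 2 (assumed URL)
SOLR_DEST = "http://localhost:8983/solr/dest_core"  # final index (assumed URL)

def fetch_supplement(doc_id):
    """Query the alternate Solr index for extra fields, keyed on the shared id."""
    params = urllib.parse.urlencode({"q": "id:%s" % doc_id,
                                     "fl": "guitar", "wt": "json"})
    with urllib.request.urlopen("%s/select?%s" % (SOLR_ALT, params)) as resp:
        docs = json.load(resp)["response"]["docs"]
    return docs[0] if docs else {}

def merge(file_doc, supplement):
    """Combine both sources; the file data wins on any conflicting field."""
    merged = dict(supplement)
    merged.update(file_doc)  # file fields take priority
    return merged

def index_doc(doc):
    """POST the merged document to the destination index."""
    req = urllib.request.Request(
        "%s/update?commit=false" % SOLR_DEST,
        data=json.dumps([doc]).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

if __name__ == "__main__":
    # Example using the data from the question; in practice you would parse
    # each file under /file/ and call fetch_supplement(file_doc["id"]).
    file_doc = {"id": "xyz1", "name": "beatles", "era": "60s"}
    supplement = {"id": "xyz1", "guitar": "george"}
    print(merge(file_doc, supplement))
```

Batching documents into the update request and committing once at the end, rather than per document, would help at the millions-of-records scale mentioned in the question.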
By combining before you update, you don't have to worry about records overwriting each other. You also maintain more control over which source has priority. Furthermore, so long as you're not querying a server on the other side of the country, the request time to the alternate Solr index should be negligible.