Solr DataImportHandler delta import
I am using the DataImportHandler to index data in Solr. I used full-import to index all the data in my database, which is around 10,000 products. Now I am confused about how delta-import is used. Does it index new data added to the database on an interval basis? That is, will it index the roughly 10 new rows added to my table, or does it only update changes to data that is already indexed?
Can anyone please explain it to me with a simple example as soon as possible?
Comments (3)
The DataImportHandler (DIH) can be a little daunting. Your initial query has loaded 10,000 unique products; that query runs when you call /dataimport?command=full-import.
When this import is done, the DIH stores a variable (${dataimporter.last_index_time}) that holds the date/time of the last import.
To do an update, you specify a deltaQuery. The deltaQuery is meant to identify the records that have changed in your database since the last import. So you specify a query like this:
SELECT product_id
FROM sometable
WHERE [date_update] >= '${dataimporter.last_index_time}'
This retrieves the product_id of every row in your database that has been updated since the last import. The next query you need to specify, the deltaImportQuery, retrieves the full record for each product_id returned by the previous step.
Assuming product_id is your unique key, Solr will work out whether it needs to update an existing document or add a new one if that product_id is not yet in the index.
To execute the deltaQuery and the deltaImportQuery, you call /dataimport?command=delta-import.
This is a great simplification of all the possibilities. Check the Solr wiki page on the DataImportHandler; it is a VERY powerful tool!
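To make the two-step flow concrete, here is a minimal simulation of it, not the DIH itself. It assumes a hypothetical products table with a date_update column, uses an in-memory SQLite database in place of the real one, and uses a hypothetical render helper that only mimics how the DIH substitutes its ${...} variables. The entity attributes (query, deltaQuery, deltaImportQuery) and the ${dataimporter.delta.product_id} variable are what you would put in data-config.xml; everything else is illustrative.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical data-config.xml fragment; table and column names are assumptions.
DATA_CONFIG = """
<dataConfig>
  <document>
    <entity name="product" pk="product_id"
            query="SELECT product_id, name FROM products"
            deltaQuery="SELECT product_id FROM products
                        WHERE date_update &gt;= '${dataimporter.last_index_time}'"
            deltaImportQuery="SELECT product_id, name FROM products
                              WHERE product_id = '${dataimporter.delta.product_id}'"/>
  </document>
</dataConfig>
"""


def render(sql, variables):
    """Replace ${name} placeholders; a stand-in for DIH variable resolution."""
    for name, value in variables.items():
        sql = sql.replace("${" + name + "}", value)
    return sql


entity = ET.fromstring(DATA_CONFIG).find("./document/entity")

# SQLite stands in for the real database.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE products (product_id TEXT PRIMARY KEY, name TEXT, date_update TEXT)"
)
db.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [("p1", "Lamp", "2024-01-01 10:00:00"), ("p2", "Desk", "2024-01-01 10:00:00")],
)

# Step 1: full-import loads every row; a dict keyed by the unique key plays the index.
index = {pid: name for pid, name in db.execute(entity.get("query"))}
last_index_time = "2024-01-01 12:00:00"  # what the DIH would remember afterwards

# The database changes afterwards: one existing row is edited, one row is new.
db.execute(
    "UPDATE products SET name = 'Floor lamp', date_update = '2024-01-02 09:00:00' "
    "WHERE product_id = 'p1'"
)
db.execute("INSERT INTO products VALUES ('p3', 'Chair', '2024-01-02 09:30:00')")

# Step 2: delta-import. deltaQuery finds the changed ids ...
delta_sql = render(
    entity.get("deltaQuery"), {"dataimporter.last_index_time": last_index_time}
)
changed_ids = sorted(row[0] for row in db.execute(delta_sql))

# ... and deltaImportQuery fetches the full row for each id, which is then upserted.
for product_id in changed_ids:
    row_sql = render(
        entity.get("deltaImportQuery"), {"dataimporter.delta.product_id": product_id}
    )
    for pid, name in db.execute(row_sql):
        index[pid] = name

print(changed_ids)
print(index)
```

In this sketch the delta step touches both the edited row (p1) and the newly inserted row (p3), and leaves p2 alone. That matches the question: a delta import covers new rows as well as changed ones, as long as the timestamp column is also set when a row is inserted.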
On another note:
When you use delta-import within a small time window (say, a couple of times in a few seconds) and the database server is on a different machine from the Solr index service, make sure that the system time of both machines matches. The [date_update] timestamp is generated on the database server, while dataimporter.last_index_time is generated on the Solr machine. Otherwise, depending on the time difference, the delta import will update too little of the index, or too much.
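The effect of clock skew can be shown with a few lines of arithmetic. The sketch below is only an illustration: the timestamps are made up, and the safety margin is one possible mitigation that this answer does not itself propose. It mimics the deltaQuery comparison and shows a row being missed when the database clock runs behind the Solr clock.

```python
from datetime import datetime, timedelta


def is_picked_up(date_update, last_index_time, safety_margin=timedelta(0)):
    """Mimic the deltaQuery filter: date_update >= last_index_time - margin."""
    return date_update >= last_index_time - safety_margin


# Made-up example: the Solr machine recorded the last import at 12:00:05.
last_index_time = datetime(2024, 1, 1, 12, 0, 5)

# The database clock is 5 seconds behind. A row really updated 2 seconds after
# the import (12:00:07 on the Solr clock) is stamped 12:00:02 by the database.
row_timestamp = datetime(2024, 1, 1, 12, 0, 2)

# Without a margin the row looks older than the last import and is skipped.
missed = not is_picked_up(row_timestamp, last_index_time)

# With a margin larger than the skew, the row is found again.
recovered = is_picked_up(row_timestamp, last_index_time, timedelta(seconds=10))

print(missed, recovered)
```

A margin makes some unchanged rows get re-imported, which costs time but not correctness, since documents are overwritten by unique key. Keeping both clocks synchronized, as suggested above, remains the primary fix.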
I agree that the Data Import Handler can handle this situation. One important limitation of the DIH is that it does not queue requests. As a result, if the DIH is "busy" indexing, it will ignore all further DIH requests until it is "idle" again. The skipped DIH requests are lost and not executed.
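Because skipped requests are dropped silently, a caller may want to check the handler's state before asking for a delta import. The sketch below is one way to do that, under stated assumptions: the base URL is hypothetical, the handler is registered at /dataimport, and the JSON reply to command=status carries a top-level "status" field reading "idle" or "busy". The fetch parameter exists only so the function can be exercised without a running Solr.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """GET a URL and decode the JSON body."""
    with urlopen(url) as response:
        return json.load(response)


def delta_import_if_idle(base_url, fetch=fetch_json):
    """Trigger a delta-import only when the DIH reports it is idle.

    Returns True if the request was sent, False if the DIH was not idle.
    base_url is the core URL, e.g. a hypothetical
    "http://localhost:8983/solr/products".
    """
    status = fetch(base_url + "/dataimport?command=status&wt=json")
    if status.get("status") != "idle":
        # The DIH would ignore the request anyway; let the caller retry later.
        return False
    fetch(base_url + "/dataimport?command=delta-import&wt=json")
    return True
```

A scheduler could call delta_import_if_idle on each tick and simply try again on the next tick when it returns False. The check and the trigger are two separate requests, so another client can still slip in between them; this narrows the window rather than closing it.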