SOLR - Best approach to import 20 million documents from a CSV file
My current task at hand is to figure out the best approach to load millions of documents into Solr.
The data file is an export from the DB in CSV format.
Currently, I am thinking about splitting the file into smaller files and having a script post these smaller files using curl.
I have noticed that if you post a large amount of data, the request times out most of the time.
I am looking into the Data Import Handler and it seems like a good option.
Any other ideas are highly appreciated.
Thanks
5 Answers
Unless a database is already part of your solution, I wouldn't add that extra complexity. Quoting the SOLR FAQ, it's your servlet container that is issuing the session time-out.
As I see it, you have a couple of options (in my order of preference):
Increase container timeout
Increase the container timeout (the "maxIdleTime" parameter, if you're using the embedded Jetty instance).
I'm assuming you only occasionally index such large files? Increasing the time-out temporarily might just be the simplest option.
Split the file
A simple unix script will do the job: split the file into 500,000-line chunks and post each chunk to Solr with curl.
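A minimal sketch of such a script, assuming the export is data.csv with no header row, Solr is at http://localhost:8983/solr, and the CSV columns map to the fields id, name and description (adjust fieldnames to your schema):

```bash
# Break the CSV into 500,000-line pieces named chunk_aa, chunk_ab, ...
split -l 500000 data.csv chunk_

# Post each piece to the CSV update handler and commit after each one.
# header=false and fieldnames=... because the chunks carry no header row.
for f in chunk_*
do
  curl "http://localhost:8983/solr/update/csv?commit=true&header=false&fieldnames=id,name,description" \
       -H 'Content-type: text/plain; charset=utf-8' \
       --data-binary @"$f"
done
```

If a single chunk still times out, just make the chunks smaller.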
Parse the file and load in chunks
A groovy script can use opencsv and solrj to parse the CSV file, add the documents, and commit the changes to Solr every 500,000 lines.
In SOLR 4.0 (currently in BETA), CSVs from a local directory can be imported directly using the UpdateHandler, modifying the example from the SOLR Wiki:
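Something along these lines, assuming Solr 4.0 at http://localhost:8983/solr and a CSV path that is readable by the Solr process (remote streaming has to be enabled in solrconfig.xml for stream.file to be accepted):

```bash
# Ask the CSV update handler to read the file straight from the Solr host's disk.
curl "http://localhost:8983/solr/update/csv?commit=true&stream.file=/path/to/export.csv&stream.contentType=text/csv;charset=utf-8"
```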
And this streams the file from the local location, so no need to chunk it up and POST it via HTTP.
The above answers have explained the single-machine ingestion strategies really well.
A few more options if you have a big data infrastructure in place and want to implement a distributed data ingestion pipeline: the hive-solr connector and the spark-solr connector.
PS: also look at the StandardDirectoryFactory, and at the autoCommit and autoSoftCommit configuration in the solrconfig.xml file.
Definitely just load these into a normal database first. There are all sorts of tools for dealing with CSVs (for example, postgres' COPY), so it should be easy. Using the Data Import Handler is also pretty simple, so this seems like the most friction-free way to load your data. This method will also be faster since you won't have unnecessary network/HTTP overhead.
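For example, a hypothetical first step with Postgres, assuming a database docs_db and a table docs whose columns match the CSV (both names are placeholders):

```bash
# Bulk-load the CSV into Postgres; the Data Import Handler can then index the table.
psql -d docs_db -c "\copy docs FROM 'export.csv' WITH CSV HEADER"
```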
The reference guide says ConcurrentUpdateSolrServer could/should be used for bulk updates. The Javadocs are somewhat incorrect (v 3.6.2, v 4.7.0: https://lucene.apache.org/solr/4_7_0/solr-solrj/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.html): it doesn't buffer indefinitely, but only up to queueSize, which is a constructor parameter.