SOLR - Best approach to import 20 million documents from a CSV file
My current task at hand is to figure out the best approach to load millions of documents into Solr.
The data file is an export from the DB in CSV format.
Currently, I am thinking about splitting the file into smaller files and having a script post these smaller files using curl.
I have noticed that if you post a large amount of data, the request times out most of the time.
I am looking into the Data Import Handler and it seems like a good option.
Any other ideas are highly appreciated.
Thanks
5 Answers
Unless a database is already part of your solution, I wouldn't add that extra complexity. Quoting the SOLR FAQ, it's your servlet container that is issuing the session time-out.
As I see it, you have a couple of options (in my order of preference):
Increase container timeout
Increase the container timeout (the "maxIdleTime" parameter, if you're using the embedded Jetty instance).
I'm assuming you only occasionally index such large files? Increasing the time-out temporarily might just be the simplest option.
Split the file
A simple unix script will do the job: split the file into 500,000-line chunks and post each chunk to Solr with curl.
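A minimal sketch of such a script, assuming the export is data.csv with no header row, Solr is at http://localhost:8983/solr, and the CSV columns map to the fields id, name and description (adjust fieldnames to your schema):

```bash
# Break the CSV into 500,000-line pieces named chunk_aa, chunk_ab, ...
split -l 500000 data.csv chunk_

# Post each piece to the CSV update handler and commit after each one.
# header=false and fieldnames=... because the chunks carry no header row.
for f in chunk_*
do
  curl "http://localhost:8983/solr/update/csv?commit=true&header=false&fieldnames=id,name,description" \
       -H 'Content-type: text/plain; charset=utf-8' \
       --data-binary @"$f"
done
```

If a single chunk still times out, just make the chunks smaller.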
Parse the file and load in chunks
A groovy script can use opencsv and solrj to parse the CSV file, add the documents, and commit the changes to Solr every 500,000 lines.
In SOLR 4.0 (currently in BETA), CSVs from a local directory can be imported directly using the UpdateHandler, modifying the example from the SOLR Wiki:
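Something along these lines, assuming Solr 4.0 at http://localhost:8983/solr and a CSV path that is readable by the Solr process (remote streaming has to be enabled in solrconfig.xml for stream.file to be accepted):

```bash
# Ask the CSV update handler to read the file straight from the Solr host's disk.
curl "http://localhost:8983/solr/update/csv?commit=true&stream.file=/path/to/export.csv&stream.contentType=text/csv;charset=utf-8"
```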
And this streams the file from the local location, so no need to chunk it up and POST it via HTTP.
The above answers have explained the single-machine ingestion strategies really well.
A few more options if you have a big data infrastructure in place and want to implement a distributed data ingestion pipeline: the hive-solr connector and the spark-solr connector.
PS: also look at the StandardDirectoryFactory, and at the autoCommit and autoSoftCommit configuration in the solrconfig.xml file.
Definitely just load these into a normal database first. There are all sorts of tools for dealing with CSVs (for example, postgres' COPY), so it should be easy. Using the Data Import Handler is also pretty simple, so this seems like the most friction-free way to load your data. This method will also be faster since you won't have unnecessary network/HTTP overhead.
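For example, a hypothetical first step with Postgres, assuming a database docs_db and a table docs whose columns match the CSV (both names are placeholders):

```bash
# Bulk-load the CSV into Postgres; the Data Import Handler can then index the table.
psql -d docs_db -c "\copy docs FROM 'export.csv' WITH CSV HEADER"
```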
The reference guide says ConcurrentUpdateSolrServer could/should be used for bulk updates. The Javadocs are somewhat incorrect (v 3.6.2, v 4.7.0: https://lucene.apache.org/solr/4_7_0/solr-solrj/org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.html): it doesn't buffer indefinitely, but only up to queueSize, which is a constructor parameter.