C# 将大量数据从 CSV 导入数据库

发布于 2024-08-28 17:53:17 字数 190 浏览 9 评论 0原文

将大量数据从 CSV(300 万行以上)加载到数据库的最有效方法是什么。

  • 数据需要格式化(例如名称列需要分为名字和姓氏等)
  • 我需要尽可能高效地完成此操作,即时间限制

我支持读取、转换和加载数据的选项使用 C# 应用程序逐行处理数据?这是理想的吗?如果不是,我有什么选择?我应该使用多线程吗?

What's the most efficient method to load large volumes of data from CSV (3 million + rows) to a database.

  • The data needs to be formatted(e.g. name column needs to be split into first name and last name, etc.)
  • I need to do this in a efficiently as possible i.e. time constraints

I am siding with the option of reading, transforming and loading the data using a C# application row-by-row? Is this ideal, if not, what are my options? Should I use multithreading?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(7

飘过的浮云 2024-09-04 17:53:17

您将受到 I/O 限制,因此多线程不一定会使它运行得更快。

上次我这样做时,大约有十几行C#。在一个线程中,它以与从盘片读取数据的速度一样快地运行硬盘。我一次从源文件中读取一行。

如果您不热衷于自己编写它,可以尝试 FileHelpers 库。您可能还想看看 Sébastien Lorion 的作品。他的 CSV 阅读器是专门为处理性能问题而编写的。

You will be I/O bound, so multithreading will not necessarily make it run any faster.

Last time I did this, it was about a dozen lines of C#. In one thread it ran the hard disk as fast as it could read data from the platters. I read one line at a time from the source file.

If you're not keen on writing it yourself, you could try the FileHelpers libraries. You might also want to have a look at Sébastien Lorion's work. His CSV reader is written specifically to deal with performance issues.

回忆那么伤 2024-09-04 17:53:17

您可以使用 csvreader 快速读取 CSV。

假设您使用的是 SQL Server,您可以使用 csvreader 的 CachedCsvReader 将数据读取到 DataTable 中,您可以将其与 SqlBulkCopy 加载到 SQL Server 中。

You could use the csvreader to quickly read the CSV.

Assuming you're using SQL Server, you use csvreader's CachedCsvReader to read the data into a DataTable which you can use with SqlBulkCopy to load into SQL Server.

噩梦成真你也成魔 2024-09-04 17:53:17

我同意你的解决方案。一次读取文件一行应该避免一次将整个文件读入内存的开销,这应该使应用程序快速有效地运行,主要花时间从文件中读取(相对较快)并解析行。我要提醒您的一点是,请注意您的 CSV 中是否嵌入了换行符。我不知道您使用的特定 CSV 格式实际上是否会在数据中的引号之间输出换行符,但这当然可能会混淆该算法。

另外,我建议在将插入语句(在一个字符串中包含许多插入语句)发送到数据库之前对其进行批处理,如果这在检索您需要用于后续外键的生成键值时不会出现问题(希望您不会出现问题)不需要检索任何生成的键值)。请记住,SQL Server(如果您使用的是 SQL Server)每批只能处理 2200 个参数,因此请限制您的批处理大小以解决这一问题。我建议使用参数化 TSQL 语句来执行插入。我怀疑插入记录所花费的时间比从文件中读取记录所花费的时间要多。

I would agree with your solution. Reading the file one line at a time should avoid the overhead of reading the whole file into memory at once, which should make the application run quickly and efficiently, primarily taking time to read from the file (which is relatively quick) and parse the lines. The one note of caution I have for you is to watch out if you have embedded newlines in your CSV. I don't know if the specific CSV format you're using might actually output newlines between quotes in the data, but that could confuse this algorithm, of course.

Also, I would suggest batching the insert statements (include many insert statements in one string) before sending them to the database if this doesn't present problems in retrieving generated key values that you need to use for subsequent foreign keys (hopefully you don't need to retrieve any generated key values). Keep in mind that SQL Server (if that's what you're using) can only handle 2200 parameters per batch, so limit your batch size to account for that. And I would recommend using parameterized TSQL statements to perform the inserts. I suspect more time will be spent inserting records than reading them from the file.

宁愿没拥抱 2024-09-04 17:53:17

您没有说明您正在使用哪个数据库,但考虑到您提到的语言是 C#,我将假设 SQL Server。

如果无法使用 BCP 导入数据(如果需要大量处理,听起来似乎不能),那么 SSIS 可能是下一个最快的选择。它不是世界上最好的开发平台,但速度非常快。当然比您在任何合理的时间范围内自己编写的任何应用程序都要快。

You don't state which database you're using, but given the language you mention is C# I'm going to assume SQL Server.

If the data can't be imported using BCP (which it sounds like it can't if it needs significant processing) then SSIS is likely to be the next fastest option. It's not the nicest development platform in the world, but it is extremely fast. Certainly faster than any application you could write yourself in any reasonable timeframe.

疏忽 2024-09-04 17:53:17

BCP 非常快,所以我会用它来加载数据。对于字符串操作,一旦数据存在,我会使用 SQL 上的 CLR 函数。在这种情况下,多线程除了增加复杂性和损害性能之外没有任何帮助。

BCP is pretty quick so I'd use that for loading the data. For string manipulation I'd go with a CLR function on SQL once the data is there. Multi-threading won't help in this scenario except to add complexity and hurt performance.

陈甜 2024-09-04 17:53:17

将 CSV 文件的内容逐行读取到内存中的 DataTable 中。您可以在填充数据表时操作数据(即:拆分名字和姓氏)等。

将 CSV 数据加载到内存中后,使用 SqlBulkCopy 将数据发送到数据库。

请参阅 http://msdn.microsoft.com /en-us/library/system.data.sqlclient.sqlbulkcopy.writetoserver.aspx 获取文档。

read the contents of the CSV file line by line into a in memory DataTable. You can manipulate the data (ie: split the first name and last name) etc as the DataTable is being populated.

Once the CSV data has been loaded in memory then use SqlBulkCopy to send the data to the database.

See http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy.writetoserver.aspx for the documentation.

听你说爱我 2024-09-04 17:53:17

如果您确实想在 C# 中执行此操作,请创建 &填充 DataTable,截断目标数据库表,然后使用 System.Data.SqlClient.SqlBulkCopy.WriteToServer(DataTable dt)。

If you really want to do it in C#, create & populate a DataTable, truncate the target db table, then use System.Data.SqlClient.SqlBulkCopy.WriteToServer(DataTable dt).

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文