Processing 46,000 lines of a document in batches of 1,000 using C# and LINQ

Published 2024-10-21 09:17:26


I have the code below. The text file it reads has 46,000 records that I need to process and insert into the database. It takes FOREVER if I just call it directly and loop one record at a time.

I was trying to use LINQ to pull every 1,000 rows or so and throw them onto a thread, so I could process 3,000 rows at once and cut the processing time. I can't figure it out, though, so I need some help.

Any suggestions would be welcome. Thank you in advance.

var reader = ReadAsLines(tbxExtended.Text);
var ds = new DataSet();
var dt = new DataTable();

// Build the columns from the pipe-delimited header list.
string headerNames = "Long|list|of|strings|";
var headers = headerNames.Split('|');
foreach (var header in headers)
    dt.Columns.Add(header);

// Skip the header line, then add one row per record.
var records = reader.Skip(1);
foreach (var record in records)
    dt.Rows.Add(record.Split('|'));

ds.Tables.Add(dt);
ds.AcceptChanges();

ProcessSmallList(ds);


2 Answers

葬シ愛 2024-10-28 09:17:26


If you are looking for high performance and you are using SQL Server, look at SqlBulkCopy (bulk insert). Its performance is significantly better than inserting row by row.

Here is an example using a custom CSVDataReader that I used for a project, but anything IDataReader-compatible (SqlDataReader, OleDbDataReader, etc.), a DataRow[], or a DataTable can be passed as the parameter to WriteToServer.

Dim sr As CSVDataReader
Dim sbc As SqlClient.SqlBulkCopy

' Lock the destination table and preserve identity values during the copy.
sbc = New SqlClient.SqlBulkCopy(mConnectionString, SqlClient.SqlBulkCopyOptions.TableLock Or SqlClient.SqlBulkCopyOptions.KeepIdentity)
sbc.DestinationTableName = "newTable"
'sbc.BulkCopyTimeout = 0    ' uncomment to remove the default timeout

' Stream the CSV file through an IDataReader straight to the server.
sr = New CSVDataReader(parentfileName, theBase64Map, ","c)
sbc.WriteToServer(sr)
sr.Close()

There are quite a number of options available; see the SqlBulkCopyOptions documentation.
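Since the question's code is C# and already builds a DataTable, the same approach in C# would look roughly like the sketch below. This is a minimal sketch rather than the answerer's code: the destination table name (dbo.MyTable), the BulkLoader/BulkInsert names, and the connection string are placeholders, and Microsoft.Data.SqlClient is assumed (use System.Data.SqlClient on .NET Framework).

using System.Data;
using Microsoft.Data.SqlClient;   // System.Data.SqlClient on .NET Framework

static class BulkLoader
{
    // Push an in-memory DataTable to SQL Server in one streaming operation.
    public static void BulkInsert(DataTable dt, string connectionString)
    {
        // TableLock speeds up large loads; KeepIdentity mirrors the VB example above.
        using (var sbc = new SqlBulkCopy(connectionString,
            SqlBulkCopyOptions.TableLock | SqlBulkCopyOptions.KeepIdentity))
        {
            sbc.DestinationTableName = "dbo.MyTable";   // placeholder table name
            sbc.BatchSize = 1000;                       // rows sent per round trip

            // Map source columns to destination columns by name.
            foreach (DataColumn col in dt.Columns)
                sbc.ColumnMappings.Add(col.ColumnName, col.ColumnName);

            sbc.WriteToServer(dt);
        }
    }
}

With BatchSize set to 1,000, the 46,000-row table goes over in a few dozen round trips instead of 46,000 individual INSERT statements.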

小忆控 2024-10-28 09:17:26


To bulk-insert data into a database, you probably should be using that database engine's bulk-insert utility (e.g. bcp in SQL Server). You might want to do the processing first, write the processed data out to a separate text file, and then bulk-insert it into the database in question.

If you really want to do the processing and the inserts on-line, memory is also a (small) factor. For example:

  1. ReadAllLines reads the whole text file into memory, creating 46,000 strings, which occupies a sizable chunk of memory. Use ReadLines instead, which returns an IEnumerable and yields the strings one line at a time.
  2. Your DataTable may end up holding all 46,000 rows, which makes detecting changed rows slow. Clear() the table right after each insert.

I believe the slowness you observed actually comes from the dataset. Datasets issue one INSERT statement per new record, which means you save nothing by calling Update() with 1,000 rows at a time instead of one row at a time. You still send 46,000 INSERT statements to the database, and that is what makes it slow.

To improve performance, I'm afraid LINQ can't help you here, since the bottleneck is the 46,000 INSERT statements. You should:

  1. Forgo the use of datasets
  2. Dynamically build an INSERT statement as a string
  3. Batch the update, say, 100-200 rows per command
  4. Dynamically build the INSERT statement with multiple VALUES clauses
  5. Run that SQL command to insert 100-200 rows per batch (see the sketch below)
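
Here is a minimal sketch of that idea, assuming SQL Server via Microsoft.Data.SqlClient and a hypothetical dbo.MyTable with two columns (Col1, Col2); the table, columns, and the InsertInBatches helper are placeholders, Chunk() requires .NET 6 or later, and parameters are used instead of raw string concatenation to keep the dynamic SQL safe.

using System.Collections.Generic;
using System.Linq;
using System.Text;
using Microsoft.Data.SqlClient;   // System.Data.SqlClient on .NET Framework

static class BatchedInsert
{
    // Insert already-split records 200 at a time with one multi-row,
    // parameterized INSERT per batch.
    public static void InsertInBatches(IEnumerable<string[]> rows, string connectionString)
    {
        const int batchSize = 200;
        using var conn = new SqlConnection(connectionString);
        conn.Open();

        foreach (var batch in rows.Chunk(batchSize))   // Chunk() is .NET 6+
        {
            var sql = new StringBuilder("INSERT INTO dbo.MyTable (Col1, Col2) VALUES ");
            using var cmd = new SqlCommand { Connection = conn };

            for (int i = 0; i < batch.Length; i++)
            {
                if (i > 0) sql.Append(", ");
                sql.Append($"(@p{i}_0, @p{i}_1)");
                cmd.Parameters.AddWithValue($"@p{i}_0", batch[i][0]);
                cmd.Parameters.AddWithValue($"@p{i}_1", batch[i][1]);
            }

            cmd.CommandText = sql.ToString();
            cmd.ExecuteNonQuery();   // one round trip per batch instead of one per row
        }
    }
}

SQL Server allows at most 1,000 row value expressions in a single INSERT ... VALUES statement and 2,100 parameters per command, so batches of 100-200 rows stay comfortably within both limits.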

If you insist on using datasets, you don't have to use LINQ -- LINQ solves a different kind of problem. Do something like this:

// code to create dataset "ds" and datatable "dt" omitted
// code to create data adaptor omitted

int count = 0;

foreach (string line in File.ReadLines(filename)) {
    // Do processing based on line, perhaps split it
    dt.Rows.Add(...);
    count++;

    if (count >= 1000) {
        adaptor.Update(dt);   // push this batch of 1,000 rows to the database
        dt.Clear();           // drop them from memory before reading more
        count = 0;
    }
}

// Don't forget the final partial batch.
if (count > 0)
    adaptor.Update(dt);

This will improve performance somewhat, but you're never going to approach the performance you obtain by using dedicated bulk-insert utilities (or function calls) for your database engine.

Unfortunately, using those bulk-insert facilities will make your code less portable to another database engine. This is the trade-off you'll need to make.
