What is the fastest way to load an XML file into MySQL using C#?

Posted on 2024-08-05 04:12:33

Question

What is the fastest way to dump a large (> 1GB) XML file into a MySQL database?

Data

The data in question is the StackOverflow Creative Commons Data Dump.

Purpose

This will be used in an offline StackOverflow viewer I am building, since I am looking to do some studying/coding in places where I will not have access to the internet.

I would like to release this to the rest of the StackOverflow membership for their own use when the project is finished.

Problem

Originally, I was reading from XML/writing to DB one record at a time. This took about 10 hours to run on my machine. The hacktastic code I'm using now throws 500 records into an array, then creates an insertion query to load all 500 at once (e.g. "INSERT INTO posts VALUES (...), (...), (...) ... ;"). While this is faster, it still takes hours to run. Clearly this is not the best way to go about it, so I'm hoping the big brains on this site will know of a better way.
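
For concreteness, a minimal sketch of the batched multi-row INSERT described above, using parameters rather than string concatenation; the posts (id, title, body) schema here is a placeholder, not the dump's real schema:

    using System.Collections.Generic;
    using System.Text;
    using MySql.Data.MySqlClient;

    static class BatchLoader
    {
        // Build one multi-row INSERT for the whole batch so MySQL sees a single
        // statement instead of one round trip per record.
        public static void InsertBatch(MySqlConnection conn, List<string[]> rows)
        {
            var sql = new StringBuilder("INSERT INTO posts (id, title, body) VALUES ");
            var cmd = new MySqlCommand { Connection = conn };

            for (int i = 0; i < rows.Count; i++)
            {
                if (i > 0) sql.Append(", ");
                sql.AppendFormat("(@id{0}, @title{0}, @body{0})", i);
                cmd.Parameters.AddWithValue("@id" + i, rows[i][0]);
                cmd.Parameters.AddWithValue("@title" + i, rows[i][1]);
                cmd.Parameters.AddWithValue("@body" + i, rows[i][2]);
            }

            cmd.CommandText = sql.ToString();
            cmd.ExecuteNonQuery();  // one round trip for the 500-row batch
        }
    }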

Constraints

  • I am building the application using C# as a desktop application (i.e. WinForms).
  • I am using MySQL 5.1 as my database. This means that features such as "LOAD XML INFILE filename.xml" are not usable in this project, as this feature is only available in MySQL 5.4 and above. This constraint is largely due to my hope that the project would be useful to people other than myself, and I'd rather not force people to use Beta versions of MySQL.
  • I'd like the data load to be built into my application (i.e. no instructions to "Load the dump into MySQL using 'foo' before running this application.").
  • I'm using MySQL Connector/Net, so anything in the MySql.Data namespace is acceptable.

Thanks for any pointers you can provide!


Ideas so far

A stored procedure that loads an entire XML file into a column, then parses it using XPath

  • This didn't work since the file size is subject to the limitations of the max_allowed_packet variable, which is set to 1 MB by default. This is far below the size of the data dump files.

Comments (8)

追星践月 2024-08-12 04:12:33

There are 2 parts to this:

  • reading the xml file
  • writing to the database

For reading the XML file, this link http://csharptutorial.blogspot.com/2006/10/reading-xml-fast.html shows that 1 MB can be read in 2.4 seconds using a stream reader; that would be about 2,400 seconds, or 40 minutes (if my maths is working this late), for a 1 GB file.

From what I have read, the fastest way to get data into MySQL is to use LOAD DATA.

http://dev.mysql.com/doc/refman/5.1/en/load-data.html

Therefore, if you can read the XML data, write it to files that can be used by LOAD DATA, then run LOAD DATA. The total time may be less than the hours that you are experiencing.
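
A rough sketch of that pipeline, assuming the dump stores one <row .../> element per record (the Id/Score/Body attribute names are assumptions) and that LOCAL INFILE is permitted by both the server and the Connector/Net connection:

    using System.IO;
    using System.Xml;
    using MySql.Data.MySqlClient;

    static class LoadDataImporter
    {
        // Stream the dump with XmlReader, write a tab-separated staging file,
        // then hand the whole file to MySQL in a single LOAD DATA statement.
        public static void ImportPosts(string xmlPath, string tsvPath, MySqlConnection conn)
        {
            using (XmlReader reader = XmlReader.Create(xmlPath))
            using (StreamWriter writer = new StreamWriter(tsvPath))
            {
                while (reader.Read())
                {
                    if (reader.NodeType != XmlNodeType.Element || reader.Name != "row")
                        continue;

                    string id = reader.GetAttribute("Id");
                    string score = reader.GetAttribute("Score");
                    string body = reader.GetAttribute("Body") ?? "";

                    // Escape backslashes, tabs and newlines so each record stays on one line.
                    body = body.Replace("\\", "\\\\").Replace("\t", "\\t")
                               .Replace("\r", "\\r").Replace("\n", "\\n");
                    writer.WriteLine(id + "\t" + score + "\t" + body);
                }
            }

            // LOCAL INFILE may need to be enabled on the server and in the connection string.
            string load = "LOAD DATA LOCAL INFILE '" + tsvPath.Replace("\\", "/") + "' " +
                          "INTO TABLE posts FIELDS TERMINATED BY '\\t' " +
                          "LINES TERMINATED BY '\\n' (id, score, body)";
            new MySqlCommand(load, conn) { CommandTimeout = 0 }.ExecuteNonQuery();
        }
    }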

冰雪之触 2024-08-12 04:12:33

Ok, I'm going to be an idiot here and answer your question with a question.

Why put it in a database?

What if ... just a what-if ... you wrote the XML to files on the local drive and, if needed, wrote some indexing information to the database? This should perform significantly faster than trying to load a database and would be much more portable. All you would need on top of it is a way to search and a way to index the relational references. There should be plenty of help available for searching, and the relational aspect should be easy enough to build. You might even consider rewriting the information so that each file contains a single post with all the answers and comments right there.

Anyway, just my two-cents (and that is not worth a dime).
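
A hedged sketch of that idea: stream the dump once, drop each post body into its own file, and keep only a small lookup row per post in MySQL. The posts_index table and the attribute names are placeholders.

    using System.IO;
    using System.Xml;
    using MySql.Data.MySqlClient;

    static class FilePerPostExtractor
    {
        public static void Extract(string xmlPath, string outputDir, MySqlConnection conn)
        {
            Directory.CreateDirectory(outputDir);

            using (XmlReader reader = XmlReader.Create(xmlPath))
            {
                while (reader.Read())
                {
                    if (reader.NodeType != XmlNodeType.Element || reader.Name != "row")
                        continue;

                    string id = reader.GetAttribute("Id");
                    string body = reader.GetAttribute("Body") ?? "";
                    string path = Path.Combine(outputDir, id + ".html");

                    // The post body lives as a plain file on disk...
                    File.WriteAllText(path, body);

                    // ...while MySQL only stores a lightweight index row for searching/joining.
                    var cmd = new MySqlCommand(
                        "INSERT INTO posts_index (post_id, file_path) VALUES (@id, @path)", conn);
                    cmd.Parameters.AddWithValue("@id", id);
                    cmd.Parameters.AddWithValue("@path", path);
                    cmd.ExecuteNonQuery();
                }
            }
        }
    }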

鱼窥荷 2024-08-12 04:12:33

I have a few thoughts to help speed this up...

  1. The size of the query may need to be tweaked; there's often a point where a bigger statement costs more in parsing time and so becomes slower. The 500 may be optimal, but perhaps it is not, and you could tweak that a little (it could be more, it could be less).

  2. Go multithreaded. Assuming your system isn't already flatlined on the processing, you could make some gains by breaking the data up into chunks and having threads process them (see the sketch after this list). Again, it's an experimentation thing to find the optimal number of threads, but a lot of people are using multicore machines and have CPU cycles to spare.

  3. On the database front, make sure that the table is as bare as it can be. Turn off any indexes and load the data before indexing it.
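
A hedged sketch of point 2: one producer feeds batches into a queue while several worker threads insert them on their own connections. It assumes .NET 4's BlockingCollection and reuses the kind of InsertBatch helper sketched under the question above; the batch size and worker count are the knobs to experiment with.

    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.Threading;
    using MySql.Data.MySqlClient;

    static class ParallelLoader
    {
        public static void Load(IEnumerable<List<string[]>> batches, string connString, int workers)
        {
            // A bounded queue keeps the XML parser from racing too far ahead of the inserters.
            var queue = new BlockingCollection<List<string[]>>(boundedCapacity: 32);

            var threads = new List<Thread>();
            for (int i = 0; i < workers; i++)
            {
                var t = new Thread(() =>
                {
                    using (var conn = new MySqlConnection(connString))
                    {
                        conn.Open();
                        foreach (var batch in queue.GetConsumingEnumerable())
                            BatchLoader.InsertBatch(conn, batch);  // multi-row INSERT, as sketched under the question
                    }
                });
                t.Start();
                threads.Add(t);
            }

            foreach (var batch in batches)  // producer: batches come straight from the XML parser
                queue.Add(batch);
            queue.CompleteAdding();

            foreach (var t in threads)
                t.Join();
        }
    }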

格子衫的從容 2024-08-12 04:12:33

SqlBulkCopy ROCKS. I used it to turn a 30 min function into 4 seconds. However, this is applicable only to MS SQL Server.

Might I suggest you look at the constraints on the table you've created? If you drop all keys, constraints, etc. on the database, it will do less work on your insertions and less recursive work.

Secondly, set up the tables with large initial sizes to prevent resizing if you are inserting into a blank database.

Finally, see if there is a bulk-copy-style API for MySQL. SQL Server basically formats the data as it would go down to disk, links that stream up to the disk, and you pump the data in. It then performs one consistency check for all the data instead of one per insert, dramatically improving performance.

Do you need MySQL? SQL Server makes your life easier if you are using Visual Studio and your database is on the low end for performance/size.
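
On the last point: Connector/Net does include a bulk-load helper, MySqlBulkLoader, which wraps LOAD DATA INFILE. A minimal sketch, assuming a tab-separated staging file has already been produced (for example by the XmlReader approach above) and a posts table whose columns match the file's order:

    using MySql.Data.MySqlClient;

    static class BulkLoad
    {
        public static int Run(MySqlConnection conn, string tsvPath)
        {
            var loader = new MySqlBulkLoader(conn)
            {
                TableName = "posts",
                FileName = tsvPath,
                FieldTerminator = "\t",
                LineTerminator = "\n",
                Local = true  // the staging file lives on the client and is streamed to the server
            };
            return loader.Load();  // returns the number of rows inserted
        }
    }

Under the hood this issues the same LOAD DATA statement, so the advice above about dropping keys and constraints first applies equally.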

弥繁 2024-08-12 04:12:33

Does this help at all? It's a stored procedure that loads an entire XML file into a column, then parses it using XPath and creates a table / inserts the data from there. Seems kind of crazy, but it might work.

撞了怀 2024-08-12 04:12:33

Not the answer you want, but the MySQL C API has the mysql_stmt_send_long_data function.

鱼忆七猫命九 2024-08-12 04:12:33

I noticed in one of your comments above that you are considering MSSQL, so I thought I'd post this. SQL Server has a utility called SQLXMLBulkLoad which is designed to import large amounts of XML data into a SQL Server database. Here is the documentation for the SQL Server 2008 version:

http://msdn.microsoft.com/en-us/library/ms171993.aspx

Earlier versions of SQL Server also have this utility.

天生の放荡 2024-08-12 04:12:33

In PostgreSQL, the absolute fastest way to get bulk data in is to drop all indexes and triggers, use the equivalent of MySQL's LOAD DATA and then recreate your indexes/triggers. I use this technique to pull 5 GB of forum data into a PostgreSQL database in roughly 10 minutes.

Granted, this may not apply to MySQL, but it's worth a shot. Also, this SO question's answer suggests that this is in fact a viable strategy for MySQL.

A quick Google search turned up some tips on increasing the performance of MySQL's LOAD DATA.
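
A rough sketch of the same recipe against MySQL from the C# side, assuming a MyISAM posts table (ALTER TABLE ... DISABLE KEYS only affects MyISAM; for InnoDB you would drop and recreate the indexes instead) and a tab-separated staging file like the one sketched earlier:

    using MySql.Data.MySqlClient;

    static class LoadWithoutIndexes
    {
        public static void Run(MySqlConnection conn, string tsvPath)
        {
            // Skip secondary-index maintenance while the bulk load runs.
            new MySqlCommand("ALTER TABLE posts DISABLE KEYS", conn).ExecuteNonQuery();

            string load = "LOAD DATA LOCAL INFILE '" + tsvPath.Replace("\\", "/") + "' " +
                          "INTO TABLE posts FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n'";
            new MySqlCommand(load, conn) { CommandTimeout = 0 }.ExecuteNonQuery();  // 0 = no timeout

            // Rebuild the indexes in one pass at the end.
            new MySqlCommand("ALTER TABLE posts ENABLE KEYS", conn).ExecuteNonQuery();
        }
    }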
