Importing 10 billion rows into mysql
I have a .csv file with 10 billion rows in it. I want to check that each row is unique. Is there an easy way to do this? I was thinking that perhaps importing into mysql would let me find out uniqueness quickly. How can I upload such a huge file to mysql? I have already tried row-by-row insert statements as well as the "LOAD DATA INFILE" command, but both failed.
Thanks
Comments (6)
I wouldn't use a database for this purpose, unless the data needed to end up in a database eventually. Assuming you have the same formatting for each row (so that you don't have "8.230" and "8.23", or extra spaces at the start/end of otherwise equal lines), use a few of the textutils included with most POSIX environments (Linux, Mac OS X), or available for Windows via the GnuWin32 coreutils.
Here is the sequence of steps to run from your system shell. First, sort the file (this step is required):
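Something along these lines, with rows.csv standing in for the actual file name:

    # sort the file; with 10 billion rows, make sure sort has room for its temp files (see the -T option)
    sort rows.csv > rows_sorted.csv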
Then find unique rows from sorted data:
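Continuing with the same assumed file names:

    # collapse adjacent duplicate lines (which is why the sort step is required)
    uniq rows_sorted.csv > rows_unique.csv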
Now you can check to see how many rows there are in the final file:
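For instance, with the rows_unique.csv produced in the previous step:

    # count the lines that survived deduplication
    wc -l rows_unique.csv

If this count matches the line count of the original file, every row was unique.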
Or you can just use pipes to combine the three steps into one command line:
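Roughly, using the same assumed input file:

    # sort, dedupe, and count in one pass without intermediate files
    sort rows.csv | uniq | wc -l

Comparing this number with the output of "wc -l rows.csv" tells you whether any duplicates exist.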
Does the data have a unique identifier? Make this column the primary key in your mysql table, and when you go to import the data, mysql should throw an error if you have duplicates.
As for how to go about doing it... just read in the file row by row and do an insert for each row.
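A rough sketch of that idea from the shell; the database, table, and column names are all made up, and the mysql client is assumed to have credentials configured (e.g. in ~/.my.cnf):

    # a primary key on id makes mysql reject any duplicate identifier
    mysql mydb -e "CREATE TABLE rows_check (id BIGINT PRIMARY KEY, payload TEXT);"
    # read the csv line by line and insert each row; quoting/escaping is glossed over here,
    # and at 10 billion rows this loop would be extremely slow
    while IFS=, read -r id payload; do
        mysql mydb -e "INSERT INTO rows_check (id, payload) VALUES ($id, '$payload');"
    done < rows.csv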
If you are importing from Excel or other such programs, see here for how to cleanse the csv file before importing it into MySQL. Regarding unique rows, as long as your table schema is right, MySQL should be able to take care of it.
EDIT:
Whether the source is Excel or not, LOAD DATA LOCAL INFILE appears to be the way to go.
10bn rows, and LOAD DATA LOCAL gives you an error? Are you sure there is no problem with the csv file?
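A minimal sketch of that route; the file, database, and table names are placeholders, and LOAD DATA LOCAL also needs local_infile enabled on both client and server:

    # check whether the server permits LOCAL loads
    mysql --local-infile=1 mydb -e "SHOW VARIABLES LIKE 'local_infile';"
    # bulk-load the comma-separated file into an existing table
    mysql --local-infile=1 mydb -e "LOAD DATA LOCAL INFILE 'rows.csv' INTO TABLE rows_check FIELDS TERMINATED BY ',';"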
You have to break your import up into separate, small, bite-size chunks. Use BigDump.
http://www.ozerov.de/bigdump.php
If you do have 10 billion rows then you will struggle to work with this data.
You would need to look at partitioning your database (ref here: about mysql partitioning).
However, even with that large a number, you would need some serious hardware to cut through the work involved.
Also, what would you do if a row was found to be non-unique? Would you want to continue importing the data? If you imported the data, would you import the identical row or flag it as a duplicate? Would you stop processing?
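As a rough illustration of what MySQL partitioning looks like (all names here are invented):

    # hash partitioning spreads rows across 16 physical partitions by id;
    # the partitioning column must be part of every unique key, hence the primary key on id
    mysql mydb -e "CREATE TABLE big_rows (id BIGINT NOT NULL, payload TEXT, PRIMARY KEY (id)) PARTITION BY HASH(id) PARTITIONS 16;"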
This is the kind of job Linux is "made for".
First you have to split the file into many smaller files:
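For instance, with the file name and chunk size chosen purely for illustration:

    # cut the big csv into pieces of 100 million lines each, named chunk_aa, chunk_ab, ...
    split -l 100000000 rows.csv chunk_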
After this you have a few options with the two commands sort / uniq. Having timed 8 different variants on a file of 1 million IP addresses from an ad-exchange log file, I found an almost 20x difference between using LC_ALL=C and not. For example:
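Something like this, with ips.txt as a stand-in for the log-file extract:

    # with the C locale, sort compares raw bytes, which is what makes this variant fast
    time LC_ALL=C sort ips.txt | LC_ALL=C uniq > ips_unique.txt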
Whereas the same without LC_ALL=C:
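The same pipeline again, but with the default locale-aware collation doing the comparisons:

    time sort ips.txt | uniq > ips_unique.txt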
Piping the command and using LC_ALL=C was 2x slower than the fastest:
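Assuming the fastest variant was sort -u under the C locale (an assumption, since the timings themselves aren't shown here), the comparison would look roughly like this:

    # deduplicating inside sort itself, without a pipe
    time LC_ALL=C sort -u ips.txt > ips_unique.txt
    # the piped variant, reportedly about 2x slower
    time LC_ALL=C sort ips.txt | LC_ALL=C uniq > ips_unique.txt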
Databases are not useful for one-off jobs like this, and flat files will get you surprisingly far even with more challenging / long-term objectives.