Importing 10 billion rows into mysql
I have a .csv file with 10 billion rows in it. I want to check that each row is unique. Is there an easy way to do this? I was thinking that perhaps importing into mysql would let me find out uniqueness quickly. How can I upload such a huge file to mysql? I have already tried row-by-row insert statements as well as the "LOAD DATA INFILE" command, but both failed.
Thanks
Comments (6)
I wouldn't use a database for this purpose, unless the data needed to end up in a database eventually. Assuming you have the same formatting for each row (so that you don't have "8.230" and "8.23", or extra spaces at the start/end of otherwise equal lines), use a few of the textutils included with most POSIX environments (Linux, Mac OS X), or available for Windows via the GnuWin32 coreutils.
Here is the sequence of steps to run from your system shell. First, sort the file (this step is required):
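Something along these lines, with rows.csv standing in for the actual file name:

    # sort the file; with 10 billion rows, make sure sort has room for its temp files (see the -T option)
    sort rows.csv > rows_sorted.csv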
Then find unique rows from sorted data:
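Continuing with the same assumed file names:

    # collapse adjacent duplicate lines (which is why the sort step is required)
    uniq rows_sorted.csv > rows_unique.csv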
Now you can check to see how many rows there are in the final file:
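For instance, with the rows_unique.csv produced in the previous step:

    # count the lines that survived deduplication
    wc -l rows_unique.csv

If this count matches the line count of the original file, every row was unique.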
Or you can just use pipes to combine the three steps into one command line:
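Roughly, using the same assumed input file:

    # sort, dedupe, and count in one pass without intermediate files
    sort rows.csv | uniq | wc -l

Comparing this number with the output of "wc -l rows.csv" tells you whether any duplicates exist.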
Does the data have a unique identifier? Make this column the primary key in your mysql table, and when you go to import the data, mysql should throw an error if you have duplicates.
As for how to go about doing it... just read in the file row by row and do an insert for each row.
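A rough sketch of that idea from the shell; the database, table, and column names are all made up, and the mysql client is assumed to have credentials configured (e.g. in ~/.my.cnf):

    # a primary key on id makes mysql reject any duplicate identifier
    mysql mydb -e "CREATE TABLE rows_check (id BIGINT PRIMARY KEY, payload TEXT);"
    # read the csv line by line and insert each row; quoting/escaping is glossed over here,
    # and at 10 billion rows this loop would be extremely slow
    while IFS=, read -r id payload; do
        mysql mydb -e "INSERT INTO rows_check (id, payload) VALUES ($id, '$payload');"
    done < rows.csv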
If you are importing from Excel or other such programs, see here for how to cleanse the csv file before importing it into MySQL. Regarding unique rows, as long as your table schema is right, MySQL should be able to take care of it.
EDIT:
Whether the source is Excel or not, LOAD DATA LOCAL INFILE appears to be the way to go.
10bn rows, and LOAD DATA LOCAL gives you an error? Are you sure there is no problem with the csv file?
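A minimal sketch of that route; the file, database, and table names are placeholders, and LOAD DATA LOCAL also needs local_infile enabled on both client and server:

    # check whether the server permits LOCAL loads
    mysql --local-infile=1 mydb -e "SHOW VARIABLES LIKE 'local_infile';"
    # bulk-load the comma-separated file into an existing table
    mysql --local-infile=1 mydb -e "LOAD DATA LOCAL INFILE 'rows.csv' INTO TABLE rows_check FIELDS TERMINATED BY ',';"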
You have to break your import up into separate, small, bite-size chunks. Use BigDump.
http://www.ozerov.de/bigdump.php
If you do have 10 billion rows then you will struggle to work with this data.
You would need to look at partitioning your database (ref here: about mysql partitioning).
However, even with that large a number, you would need some serious hardware to cut through the work involved.
Also, what would you do if a row was found to be non-unique? Would you want to continue importing the data? If you imported the data, would you import the identical row or flag it as a duplicate? Would you stop processing?
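As a rough illustration of what MySQL partitioning looks like (all names here are invented):

    # hash partitioning spreads rows across 16 physical partitions by id;
    # the partitioning column must be part of every unique key, hence the primary key on id
    mysql mydb -e "CREATE TABLE big_rows (id BIGINT NOT NULL, payload TEXT, PRIMARY KEY (id)) PARTITION BY HASH(id) PARTITIONS 16;"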
This is the kind of job Linux is "made for".
First you have to split the file into many smaller files:
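For instance, with the file name and chunk size chosen purely for illustration:

    # cut the big csv into pieces of 100 million lines each, named chunk_aa, chunk_ab, ...
    split -l 100000000 rows.csv chunk_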
After this you have a few options with the two commands sort / uniq. Having timed 8 different variants on a file of 1 million IP addresses from an ad-exchange log file, I found an almost 20x difference between using LC_ALL=C and not. For example:
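Something like this, with ips.txt as a stand-in for the log-file extract:

    # with the C locale, sort compares raw bytes, which is what makes this variant fast
    time LC_ALL=C sort ips.txt | LC_ALL=C uniq > ips_unique.txt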
Whereas the same without LC_ALL=C:
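The same pipeline again, but with the default locale-aware collation doing the comparisons:

    time sort ips.txt | uniq > ips_unique.txt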
Piping the command and using LC_ALL=C was 2x slower than the fastest:
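Assuming the fastest variant was sort -u under the C locale (an assumption, since the timings themselves aren't shown here), the comparison would look roughly like this:

    # deduplicating inside sort itself, without a pipe
    time LC_ALL=C sort -u ips.txt > ips_unique.txt
    # the piped variant, reportedly about 2x slower
    time LC_ALL=C sort ips.txt | LC_ALL=C uniq > ips_unique.txt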
Databases are not useful for one-off jobs like this, and flat files will get you surprisingly far even with more challenging / long-term objectives.