Finding the distinct values of a varchar column in a very large MySQL table

Posted on 2024-11-15 02:09:16

I want to find the distinct values of a varchar column in a very large MySQL table (1 billion rows).
I have the following solutions:

1. select distinct(col_name) from mytable;
2. export this column to a text file incrementally (select col_name from mytable where myid >= x and myid < x+n), then use Linux sort:
sort -u myfile.txt

The problem with the first way (even if the column is indexed) is that the query might crash partway through a long run, and then you would need to start over.
I'm leaning toward the second way. Is there any other, faster way?
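
For the file-based route, one detail is worth checking: `sort -u` (equivalently `sort | uniq`) prints each distinct value once, whereas `uniq -u` prints only the values that occur exactly once in the input. A minimal sketch, using a small made-up sample file in place of the real export:

```shell
# Hypothetical sample standing in for the exported column.
printf '%s\n' apple banana apple cherry banana > myfile.txt

# Each distinct value, printed once.
sort -u myfile.txt

# Only the values that appear exactly once in the file.
sort myfile.txt | uniq -u
```

On a large export, running the first command as `LC_ALL=C sort -u` compares raw bytes and is usually faster, at the cost of locale-aware ordering.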

Comments (2)

陌伤ぢ 2024-11-22 02:09:16
  1. ...
  2. ...
  3. SELECT col_name FROM mytable GROUP BY col_name;

Although DISTINCT and GROUP BY return the same result set, the two queries can use different execution plans, and I have noticed GROUP BY being somewhat faster than DISTINCT in MySQL in some cases.

I support spinning_plate's comment regarding the index. If you already have one, getting your result should be much less painful. What's the cardinality of your index?

魂牵梦绕锁你心扉 2024-11-22 02:09:16

Sadly, I've had to resort to this sort of nonsense before with MySQL as well. If you can't just pull the index, and the GROUP BY doesn't work faster (not sure why this would be, going off @Ben..'s post), you could try segmenting the problem to batch it.

I would still work in MySQL; it's likely going to be faster than anything you write yourself or do on the UNIX command line. Treat it like you would a materialized view or aggregation table in a DW. One simple way would be to create a batch script that runs SELECT DISTINCT queries over small ranges into a second table holding only distinct values (via MERGE or some other mechanism). This is more batchable, but you run into the same performance issues, just spread across jobs. You'll have to experiment with the parameters (batch size). If you're running this in a production environment and folks expect to get all distinct values as if they were querying the DB directly, it would be best to have 3 tables: the original, a temporary one for the current batch, and a live table with the latest values and a date_modified column.
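
The batching idea can be sketched as a small script that only prints the per-range statements, so they can be reviewed before being piped to the `mysql` client. This is one possible reading of the suggestion, not the answerer's actual script. MySQL has no MERGE statement, so the sketch uses `INSERT IGNORE` as the "other mechanism". It assumes a hypothetical table `distinct_vals` with a unique key on `col_name`; the function name `gen_batches` and the range bounds are also made up for illustration.

```shell
# Print one INSERT IGNORE statement per id range [start, start + step).
# distinct_vals is a hypothetical table with a unique key on col_name,
# so values already seen in earlier batches are skipped.
gen_batches() {
    start=$1
    max=$2
    step=$3
    while [ "$start" -le "$max" ]; do
        end=$((start + step))
        printf 'INSERT IGNORE INTO distinct_vals (col_name) SELECT DISTINCT col_name FROM mytable WHERE myid >= %d AND myid < %d;\n' "$start" "$end"
        start=$end
    done
}

# Example: ids 1..250 in batches of 100 yield three statements.
gen_batches 1 250 100
```

Because every statement covers a half-open range and re-inserting an existing value is a no-op, a run that dies midway can be resumed from the first range that did not complete. The output could be fed to the server with something like `gen_batches 1 1000000000 1000000 | mysql mydb`, where `mydb` is a placeholder database name and the batch size is a parameter to experiment with.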
