Fast sorting of 1 billion rows by a varchar column in MySQL
I have 1 billion rows stored in MySQL and need to output them in alphabetical order by a varchar column. What is the most efficient way to go about it? Using other Linux utilities such as sort and awk is allowed.
Comments (3)
MySQL can deal with a billion rows. Efficiency depends on 3 main factors: Buffers, Indexes and Joins.
Some suggestions:
Try to fit the data set you're working with in memory
Processing in memory is much faster, and a whole bunch of problems go away once you do it. Use multiple servers to host portions of the data set. Store the portion of data you're going to work with in a temporary table, etc.
Prefer full table scans to index accesses
For large data sets, full table scans are often faster than range scans and other types of index lookups. Even if you look at 1% of the rows or less, a full table scan may be faster.
Avoid joins to large tables
Joining large data sets using nested loops is very expensive. Try to avoid it. Joins to smaller tables are OK, but you might want to preload them into memory before the join so that no random IO is needed to populate the caches.
Be aware of MySQL limitations, which require you to be extra careful when working with large data sets. In MySQL, a query runs as a single thread (with the exception of MySQL Cluster) and MySQL issues IO requests one by one for query execution, which means that if single-query execution time is your concern, many hard drives and a large number of CPUs will not help.
Sometimes it is a good idea to manually split a query into several, run them in parallel and aggregate the result sets.
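As a rough illustration of that last point, here is a minimal Python sketch of splitting one big ordered read into disjoint key ranges, running them in parallel and stitching the sorted pieces back together. Everything in it is an assumption made for illustration: `TABLE` is a small in-memory stand-in for the real table, and `fetch_sorted_range` and `parallel_sorted` are hypothetical helpers, where each `fetch_sorted_range` call stands for a separate connection running a range query with ORDER BY.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain

# Hypothetical stand-in for the table. In practice each call to
# fetch_sorted_range would be its own MySQL connection and query.
TABLE = ["pear", "apple", "zebra", "mango", "banana", "kiwi", "cherry", "fig"]


def fetch_sorted_range(lo, hi):
    """Mimics: SELECT col FROM t WHERE col >= lo AND col < hi ORDER BY col.

    hi=None means "no upper bound". Note that Python compares strings by
    code point, which may differ from the column's MySQL collation.
    """
    return sorted(v for v in TABLE if v >= lo and (hi is None or v < hi))


def parallel_sorted(boundaries):
    """boundaries: ascending split points, e.g. ["", "g", "n"]."""
    # Pair each lower bound with the next one; the last range is open-ended.
    ranges = list(zip(boundaries, boundaries[1:] + [None]))
    with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
        # map() yields results in submission order, so the parts come back
        # in key order even if the workers finish out of order.
        parts = pool.map(lambda r: fetch_sorted_range(*r), ranges)
        # The ranges are disjoint and ordered, so concatenation is enough.
        return list(chain.from_iterable(parts))


result = parallel_sorted(["", "g", "n"])
print(result)
```

Because the ranges do not overlap, no merge step is needed; the cost is that the split points have to be chosen so that each range holds a similar number of rows.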
You did not give much info on your setup or your dataset, but this should give you a couple of clues on what to watch out for. In my opinion having the (properly tuned) database sort this for you would be faster than doing it programmatically unless you have very specific needs not mentioned in your post.
Have you tried simply indexing the column and dumping the rows out? I'd try that first and see whether the performance is inadequate before going exotic.
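A minimal sketch of the "index it and dump it" idea is shown below. To keep it runnable without a server it uses Python's built-in `sqlite3` module as a stand-in for MySQL, which is an assumption of this example and not what the thread runs on. The table and column names are borrowed from the query quoted elsewhere in this thread, while the index name `idx_varchar` and the helper `dump_sorted` are hypothetical. Against MySQL the same CREATE INDEX and SELECT ... ORDER BY statements would be issued through a MySQL client, preferably one that streams rows instead of buffering the whole result.

```python
import io
import sqlite3

# sqlite3 is only a stand-in here so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE big_table (id INTEGER PRIMARY KEY, the_varchar_column TEXT)"
)
conn.executemany(
    "INSERT INTO big_table (the_varchar_column) VALUES (?)",
    [("pear",), ("apple",), ("mango",)],
)

# Index the sort column so the engine can read rows in index order
# instead of sorting the full result (idx_varchar is a hypothetical name).
conn.execute("CREATE INDEX idx_varchar ON big_table (the_varchar_column)")


def dump_sorted(connection, out):
    """Write the column to `out` in sorted order, one value per line."""
    cur = connection.execute(
        "SELECT the_varchar_column FROM big_table ORDER BY the_varchar_column"
    )
    count = 0
    # Iterating the cursor fetches rows incrementally rather than
    # loading the whole result set into memory at once.
    for (value,) in cur:
        out.write(value + "\n")
        count += 1
    return count


buf = io.StringIO()
n = dump_sorted(conn, buf)
print(n, "rows written")
```

With a billion rows the output target would be a file or a pipe rather than a `StringIO`, and timing this simple version first gives a baseline before trying anything more elaborate.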
It depends on how you define efficient: CPU, memory, IO, time or coding effort. Which of these is important in this case?
"SELECT * FROM big_table ORDER BY the_varchar_column" is probably the most efficient use of developer resources. Adding an index might make it run a lot faster.