Query speed based on order of columns
Does the order of the column types in your database have any effect on the query time?
For example, would a table with mixed ordering (INT, TEXT, VARCHAR, INT, TEXT) be slower to query than a table with consecutive types (INT, INT, VARCHAR, TEXT, TEXT)?
Comments (4)
The answer is yes: column order does matter. It can occasionally matter a great deal, but usually it matters little.
All I/O is done at a page level (typically 2K or 4K depending on your OS). Column data for a row is stored contiguously, except when the page becomes full, in which case the remaining data is written to another (usually the next) page.
The more on-disk space is taken up by the columns that sit between the columns you select (based on the table definition), the greater the chance that the data for the selected columns will sometimes be on different pages. Being on a different page may result in an extra I/O operation (if no other rows are being selected from that page). In the worst case, each column you select could be on a different page.
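Here's an example, comparing the same query against two tables, bad_layout and better_layout, whose num columns are placed differently: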
select num1, num2, num3 from bad_layout;
select num1, num2, num3 from better_layout;
Because each num column in bad_layout is basically going to be on a different page, each row will require 3 I/O operations. Conversely, in better_layout the num columns are usually going to be on the same page.
The bad_layout query is likely to take about 3 times longer to execute.
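The reasoning in this answer can be sketched as a toy model. Everything in it is an assumption made for illustration: the 4096-byte page, the 4-byte num columns, the 5000-byte large columns, and the idea that a row is laid out byte after byte across consecutive pages starting at page 0. Real storage engines add row headers and often keep large values out of line, so this mirrors only the argument, not any particular database.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def pages_touched(layout, selected, page_size=PAGE_SIZE):
    """Return the set of page numbers that hold the selected columns.

    layout is an ordered list of (name, width_in_bytes) pairs for one row
    that is assumed to start at the beginning of page 0.
    """
    pages = set()
    offset = 0
    for name, width in layout:
        if name in selected:
            first = offset // page_size
            last = (offset + width - 1) // page_size
            pages.update(range(first, last + 1))
        offset += width
    return pages


# Hypothetical layouts: the same six columns in two different orders.
bad_layout = [
    ("num1", 4), ("large1", 5000),
    ("num2", 4), ("large2", 5000),
    ("num3", 4), ("large3", 5000),
]
better_layout = [
    ("num1", 4), ("num2", 4), ("num3", 4),
    ("large1", 5000), ("large2", 5000), ("large3", 5000),
]

wanted = {"num1", "num2", "num3"}
print(sorted(pages_touched(bad_layout, wanted)))     # [0, 1, 2]
print(sorted(pages_touched(better_layout, wanted)))  # [0]
```

Under these assumptions the interleaved layout spreads the three num columns over three pages, while the grouped layout keeps them on one, which is where the "about 3 times" estimate comes from.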
Good table layout can make a large difference to query performance. You should try to keep columns that are usually selected together as close as possible to each other in the table layout.
The order is unlikely to matter much. The running time is dominated by things like disk access times, and the number and order of disk accesses are unlikely to change as a result of reordering the data within a row.
The one exception is if you have a very big item in your row (much bigger than a disk block, usually 4K?). If you have one very big column in a table, you might want to put it as the last column so that if you aren't accessing it, it might not need to be fully paged in. But even then, you'd have to work pretty hard to generate a data set and access pattern where the difference would be noticeable.
In PostgreSQL, you will get an advantage if you put fixed-width columns first because that access path is specially optimized. So (INT, INT, VARCHAR, TEXT, TEXT) will be fastest (the relative order of VARCHAR and TEXT doesn't matter).
Additionally, you can save space, which can translate to more throughput and performance, if you manage the alignment requirements of the types correctly. For example, (INT, BOOL, INT, BOOL) will require 13 bytes of space because the third column has to be aligned at a 4-byte boundary, and so there will be 3 bytes of space wasted between the second and the third column. Better here would be (INT, INT, BOOL, BOOL). (Whatever comes after this row will probably also require alignment of at least 4 bytes, so you will waste 2 bytes at the end.)
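The alignment arithmetic above can be checked with a small sketch. It assumes a 4-byte INT aligned to 4 bytes, a 1-byte BOOL with no alignment requirement, and a 4-byte alignment for whatever follows the row; the function names are made up for this example, and the row header is ignored.

```python
def row_data_size(columns):
    """Return (size, padding) in bytes for columns laid out in order.

    Each column is a (size, alignment) pair. A column starts at the next
    offset that is a multiple of its alignment.
    """
    offset = 0
    padding = 0
    for size, align in columns:
        pad = (-offset) % align  # bytes skipped to reach the boundary
        padding += pad
        offset += pad + size
    return offset, padding


def trailing_waste(size, align=4):
    """Bytes lost after the row if the next item must be aligned to `align`."""
    return (-size) % align


INT = (4, 4)   # assumed: 4 bytes, 4-byte alignment
BOOL = (1, 1)  # assumed: 1 byte, no alignment requirement

print(row_data_size([INT, BOOL, INT, BOOL]))  # (13, 3)
print(row_data_size([INT, INT, BOOL, BOOL]))  # (10, 0)
print(trailing_waste(10))                     # 2
```

The mixed order needs 13 bytes, 3 of which are padding before the second INT. The grouped order needs 10 bytes and loses 2 at the end, matching the figures in the answer.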
I would suggest that there is absolutely no [significant] difference no matter how you order the columns.
PostgreSQL: http://social.msdn.microsoft.com/Forums/en-US/sqldatabaseengine/thread/a7ce8a90-22fc-456d-9f56-4956c42a78b0
SQL Server: http://social.msdn.microsoft.com/Forums/en/sqldatabaseengine/thread/36713a82-315d-45ef-b74e-5f342e0f22fa
I suspect the same for MySQL.
All data is read in pages, so if your data fits into a single page it does not matter how you order the columns. If the disk block size is 2K or 4K, it will take multiple block reads to satisfy the "8K page request". If the disk block size is 64K (for large DB systems), you would already be buffering other data.
Not only that, if a record is requested, it will normally retrieve all pages for the record, including the overflow to pages 2 and 3 if the data spans multiple pages. The columns are then worked out from the data retrieved. SQL Server has a limit on in-page data, which is about 8060 bytes. Anything larger is stored off the main data page, similar to TOAST in PostgreSQL, and is not retrieved if the column is not used. It still does not matter where the column is in the order.
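The block arithmetic here is simple enough to sketch. The 8K page size comes from the "8K page request" mentioned above; the function name is made up for this example, and the model ignores caching and read-ahead.

```python
PAGE_SIZE = 8 * 1024  # the "8K page request" from the text


def blocks_for_page(block_size, page_size=PAGE_SIZE):
    """Return (blocks_read, extra_bytes) needed to cover one page."""
    blocks = -(-page_size // block_size)  # ceiling division
    extra = blocks * block_size - page_size
    return blocks, extra


for kb in (2, 4, 64):
    blocks, extra = blocks_for_page(kb * 1024)
    print(f"{kb}K blocks: {blocks} read(s), {extra} extra bytes")
```

With 2K or 4K blocks the page takes 4 or 2 block reads and nothing extra. With 64K blocks a single read already brings in 56K of neighbouring data. In every case the whole page is read, whatever the column order inside it.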
In SQL Server for example, multiple bit fields are stored together in a bit patterned mask - this is irrespective of whether you put the columns next to each other. I would suspect MySQL and PostgreSQL to do much the same to optimize space.
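As a rough illustration of the bit packing described above: SQL Server stores up to 8 bit columns in one byte, 9 to 16 in two bytes, and so on. The sketch below shows that arithmetic together with a generic bitmask. The bit order inside the mask and both function names are assumptions for illustration, not SQL Server's actual on-disk format.

```python
def bit_storage_bytes(bit_columns):
    """Bytes needed to store the given number of bit columns, 8 per byte."""
    return (bit_columns + 7) // 8


def pack_bits(flags):
    """Pack booleans into an integer mask, first flag in the lowest bit."""
    mask = 0
    for position, flag in enumerate(flags):
        if flag:
            mask |= 1 << position
    return mask


for n in (1, 8, 9, 16):
    print(n, "bit column(s) ->", bit_storage_bytes(n), "byte(s)")

print(bin(pack_bits([True, False, True])))  # 0b101
```

Because the flags share a byte wherever they appear in the table definition, placing the bit columns next to each other does not change the space they take.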
Note: [significant] - the only reason for this qualification is that, possibly, when extracting a particular column from a data page, having it in the beginning helps because the low-level assembly calls do not have to seek far in the memory block.