How much can efficient indexes overcome performance problems with very large tables?
It seems to me that a query on a table with 10k records and a query on a table with 10 million records are almost equally fast, provided both fetch roughly the same number of records and make good use of simple indexes (an auto-increment, record-id type indexed field).
My question is, will this extend to a table with close to 4 billion records if it is indexed properly and the database is set up in such a way that queries always use those indexes effectively?
Also, I know that inserting new records into a very large indexed table can be very slow because all the indexes have to be recalculated. If I add new records only to the end of the table, can I avoid that slowdown? Or will that not work because the index is a binary tree and a large chunk of the tree will still have to be recalculated?
Finally, I looked around a bit for an FAQ or list of caveats about working with very large tables but couldn't really find one, so if anyone knows of something like that, a link would be appreciated.
Comments (4)
Here is some good reading about large tables and the effects of indexing on them, including cost/benefit, as you requested:
http://www.dba-oracle.com/t_indexing_power.htm
Indexing very large tables (as with anything database related) depends on many factors, including your access patterns, the ratio of reads to writes, and the amount of available RAM.
If you can fit your 'hot' (i.e. frequently accessed) index pages into memory, then accesses will generally be fast.
The usual strategy for indexing very large tables is to use partitioned tables and partitioned indexes. BUT if your query does not join or filter on the partition key, there will be no improvement in performance over an unpartitioned table, i.e. no partition elimination.
SQL Server Database Partitioning Myths and Truths
Oracle Partitioned Tables and Indexes
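To make the partition elimination point concrete, here is a toy in-memory sketch. It is not how SQL Server or Oracle implement partitioning; the `PartitionedTable` class, the `year` partition key, and the `customer` column are all made up for illustration. It only shows that a query filtering on the partition key touches one partition, while a query that does not must visit every partition.

```python
from collections import defaultdict


class PartitionedTable:
    """Toy table partitioned by the hypothetical 'year' column."""

    def __init__(self):
        # partition key value -> list of rows stored in that partition
        self.partitions = defaultdict(list)

    def insert(self, row):
        self.partitions[row["year"]].append(row)

    def query(self, predicate, year=None):
        """Return (matching rows, number of partitions scanned)."""
        if year is not None:
            # Filter includes the partition key: only one partition is read.
            candidates = [self.partitions.get(year, [])]
        else:
            # No partition key in the filter: every partition is read.
            candidates = list(self.partitions.values())
        rows = [r for part in candidates for r in part if predicate(r)]
        return rows, len(candidates)


table = PartitionedTable()
for year in range(2001, 2011):
    for customer in range(10):
        table.insert({"year": year, "customer": customer})


def wanted(row):
    return row["customer"] == 3


rows_pruned, scanned_pruned = table.query(wanted, year=2005)
rows_full, scanned_full = table.query(wanted)
print("with partition key:   ", scanned_pruned, "partition(s) scanned")
print("without partition key:", scanned_full, "partition(s) scanned")
```

In a real database the same effect shows up in the query plan rather than in a counter, so checking the plan is the way to confirm whether elimination actually happens for a given query.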
It's very important to keep your indexes as narrow as possible.
Kimberly Tripp's The Clustered Index Debate Continues...(SQL Server)
Accessing the data via a unique index lookup will slow down as the table gets very large, but not by much. The index is stored as a B-tree structure in Postgres (not a binary tree, which has only two children per node), so a 10k row table might have 2 levels whereas a 10B row table might have 4 levels (depending on the width of the rows). As the table gets ridiculously large it might go to 5 levels or more, but each extra level only means one extra page read per lookup, so it is probably not noticeable.
When you insert new rows, you can't control where they are inserted in the physical layout of the table, so I assume that by "end of the table" you mean inserting the maximum value of the indexed column. I know Oracle has some optimisations around leaf block splitting in this case, but I don't know about Postgres.
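The level counts above can be checked with a rough back-of-the-envelope calculation. The sketch below assumes every index page holds about 400 entries; that fanout is an illustrative guess, not a Postgres constant, and the real figure depends on key width and page fill. The helper name `btree_levels` is made up for this example.

```python
def btree_levels(rows: int, fanout: int = 400) -> int:
    """Estimate the B-tree levels needed to index `rows` keys,
    assuming every page holds `fanout` entries (an assumption)."""
    levels = 1
    capacity = fanout  # keys reachable through a tree with one level
    while capacity < rows:
        capacity *= fanout  # each extra level multiplies the reach by fanout
        levels += 1
    return levels


for rows in (10_000, 10_000_000, 4_000_000_000, 10_000_000_000):
    print(f"{rows:>14,} rows -> {btree_levels(rows)} levels")
```

Because the reach grows by a factor of the fanout with every level, going from thousands to billions of rows adds only a couple of page reads per lookup under this assumption.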
If it is indexed properly, insert performance may be impacted more than select performance. Indexes in PostgreSQL have a vast number of options, which allow you to index only part of a table or the output of an immutable function on tuples in the table. Also, assuming the index is usable, lookup time grows much more slowly with the size of the index than it does with a full scan of the table. The biggest difference is between searching a tree and scanning a list. Of course, you still have disk I/O and memory overhead that go into index usage, so large indexes don't perform as well as they theoretically could.
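The gap between searching a tree and scanning a list can be illustrated by counting steps. This is only a sketch: a binary search over a sorted sequence stands in for the tree search (a real B-tree has a much higher fanout and needs even fewer steps), and the function names are made up for the example.

```python
def linear_scan_steps(sorted_keys, target):
    """Count keys examined when scanning from the start until target is found."""
    steps = 0
    for key in sorted_keys:
        steps += 1
        if key == target:
            break
    return steps


def binary_search_steps(sorted_keys, target):
    """Count halving steps needed to locate target's position in sorted keys."""
    lo, hi = 0, len(sorted_keys)
    steps = 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return steps


keys = range(1_000_000)  # sorted keys 0..999,999, generated lazily
target = 999_999         # worst case for the linear scan

print("linear scan steps:  ", linear_scan_steps(keys, target))
print("binary search steps:", binary_search_steps(keys, target))
```

The step counts ignore disk I/O and caching, which is exactly the overhead mentioned above, so they show the theoretical gap rather than measured query times.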