Are Azure Table Storage queries indexed when using a partial RowKey?
I understand from the MS PDC presentations that the PartitionKey is used to load balance the table across multiple servers, but nobody seems to give any advice on whether the PartitionKey is used as an index WITHIN a single server.
Likewise, everyone will tell you that specifying the PartitionKey AND the RowKey gets you great performance, but nobody seems to tell you if the RowKey is being used to improve performance WITHIN a PartitionKey.
Here are some sample queries to help me frame the questions. Assume the entire table contains 100,000,000 rows.
1. PartitionKey="123" and OtherField="def"
2. PartitionKey="123" and RowKey >= "aaa" and RowKey < "aac"
Here are my questions:
- If I have only 10 rows in each Partition, would Query 1 be fast?
- If I have 1,000,000 rows in each Partition, would Query 2 be fast?
Comments (6)
Both should be relatively fast.
Query 1 would have to do a full scan within a single partition (a Range scan in ATS lingo), but that would mean iterating through 10 entities.
Query 2 will also result in a range scan, but using the RowKey as an index within the partition, so it should still be fast.
You can get a VERY detailed blog post with all the performance implications of each of the queries, and how to define an optimum key: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx
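To make the two query shapes concrete, here is a minimal sketch that builds them as OData `$filter` strings, the filter syntax the Table service accepts. The helper names (`odata_quote`, `partition_filter`, `row_range_filter`) are made up for illustration and are not part of any SDK.

```python
def odata_quote(value: str) -> str:
    """Quote a string literal for an OData filter (single quotes are doubled)."""
    return "'" + value.replace("'", "''") + "'"


def partition_filter(partition_key: str, other_field: str, other_value: str) -> str:
    """Query 1: one partition plus a non-key property (scans that partition)."""
    return (
        f"PartitionKey eq {odata_quote(partition_key)}"
        f" and {other_field} eq {odata_quote(other_value)}"
    )


def row_range_filter(partition_key: str, low: str, high: str) -> str:
    """Query 2: one partition plus a half-open RowKey range [low, high)."""
    return (
        f"PartitionKey eq {odata_quote(partition_key)}"
        f" and RowKey ge {odata_quote(low)}"
        f" and RowKey lt {odata_quote(high)}"
    )


query_1 = partition_filter("123", "OtherField", "def")
query_2 = row_range_filter("123", "aaa", "aac")
print(query_1)
print(query_2)
```

Both filters pin the PartitionKey with `eq`, so each query stays inside one partition; only the second one also narrows by RowKey.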
In addition to Taylor's answer, analogous statements also hold for range queries, as discussed here.
In other words, Azure Table Storage can indeed be thought of as having just one index consisting of two parts, the partition key and the row key, in that order.
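As a rough mental model only (this is a toy in-memory list, not how the service is implemented), a single two-part index behaves like a list sorted by the tuple (PartitionKey, RowKey). A range query on the second part then needs just two binary searches. The function names are hypothetical.

```python
from bisect import bisect_left


def build_index(rows):
    """Return (PartitionKey, RowKey) tuples sorted the way a composite index would be."""
    return sorted(rows)


def row_key_range(index, partition_key, low, high):
    """Return keys with the given PartitionKey and low <= RowKey < high.

    Two binary searches find the bounds, so only the matching slice is touched.
    """
    start = bisect_left(index, (partition_key, low))
    end = bisect_left(index, (partition_key, high))
    return index[start:end]


index = build_index(
    [
        ("123", "aab"),
        ("122", "zzz"),
        ("123", "aaa"),
        ("123", "aac"),
        ("124", "aaa"),
        ("123", "abc"),
    ]
)
matches = row_key_range(index, "123", "aaa", "aac")
print(matches)  # [('123', 'aaa'), ('123', 'aab')]
```

The cost depends on how many keys fall inside the range, not on how many rows the partition holds overall.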
I think some things might have changed since the WAS paper was written, but if you read it, you can draw some conclusions.
For example, a partition can be moved between nodes/physical servers. If you have many partitions, the table can scale better than with a single partition. If you have one huge partition, you will be limited by the throughput of a single partition.
If you need a partition key to handle more than the 2000 req/sec that is offered, you have to figure out a way to split that partition key into multiple partitions; otherwise, it doesn't matter.
Oh, and you can only do entity group transactions within a single partition key, which might impact your design.
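One possible way to split a hot partition key, sketched under my own assumptions (the bucket count, the `-NN` suffix format and the function names are all invented for illustration), is to append a hash-derived bucket number to the logical key:

```python
import hashlib


def sharded_partition_key(logical_key: str, row_key: str, buckets: int = 8) -> str:
    """Derive a PartitionKey such as '123-05' from a logical key and a RowKey.

    The bucket comes from a stable hash of the RowKey, so the same entity
    always maps to the same partition.
    """
    digest = hashlib.sha256(row_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % buckets
    return f"{logical_key}-{bucket:02d}"


def all_partition_keys(logical_key: str, buckets: int = 8) -> list:
    """List every PartitionKey that may hold entities for the logical key."""
    return [f"{logical_key}-{b:02d}" for b in range(buckets)]


key = sharded_partition_key("123", "aaa")
print(key)
print(all_partition_keys("123"))
```

The trade-off is that reading everything for logical key "123" now means one query per bucket, and entities that land in different buckets can no longer share an entity group transaction.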
So to recap:
Those are the two questions you need to ask yourself.
In ATS, PartitionKey is used as a distribution lookup, not an index. From the level of working with ATS, just consider PartitionKey and "server"/node to share a 1:1 relationship. (Behind the scenes this isn't true, but concepts such as optimizing PartitionKeys that happen to reside on the same physical/virtual node are abstracted several levels away from what a consumer of Azure has to deal with. Those details are purely internal to the overall Azure infrastructure, and in the case of ATS it's best to assume that it is as optimal as it can be, aka "don't worry about it".)
In the context of a DBMS vs ATS, RowKey is the closest thing to an "index" in that it assists in finding data on the same node. To directly answer one of your questions, RowKey is the index within the PartitionKey.
Stepping outside the box a bit, however, PartitionKey can give you perf gains closer to how you think of a traditional index, but only because of the distributed nature of how your data is spread across ATS nodes. You should optimize your layout first for the PartitionKey, then for the RowKey. (That is, if you only have one keyable value, make it the PartitionKey.)
As a general rule, queries perform in this order, from most efficient to least efficient:
1. PartitionKey=x and RowKey=y (and OtherProp = z)
because the lookup gets to the right node and then to an indexed property on the partition
2. PartitionKey=x (and OtherProp = z)
because you get to the proper node, but then to the ATS equivalent of a full table scan
3. OtherProp = z
because you have to do a partition scan, then a table scan
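The ranking above can be written down as a small classifier. This is only a sketch of the three shapes listed, assuming equality filters throughout; `classify_query` and its labels are hypothetical, and a RowKey range inside one partition is deliberately not modeled.

```python
def classify_query(filter_properties):
    """Rank a query by which key properties its filter pins with equality.

    Returns (rank, description); rank 1 is the most efficient shape.
    """
    props = set(filter_properties)
    if {"PartitionKey", "RowKey"} <= props:
        return 1, "point lookup: right node, then the indexed RowKey"
    if "PartitionKey" in props:
        return 2, "partition scan: right node, then every entity in the partition"
    return 3, "table scan: every partition, then every entity"


for props in (
    ["PartitionKey", "RowKey", "OtherProp"],
    ["PartitionKey", "OtherProp"],
    ["OtherProp"],
):
    rank, description = classify_query(props)
    print(rank, description)
```

Extra non-key properties never improve the rank; they only filter whatever the key part of the query has already selected.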
With that, to your direct questions:
1. I don't feel this can be answered. It's subjective (i.e. "what is fast?"). It will always be slower than Query 2, but with 10 rows that "slowness" is likely milliseconds, if even that.
2. (Similar theme.) It will be faster than Query 1. Any time you can do Query 2, you should.
So with that explanation and your questions, the real answer comes down to how you architect your usage of ATS.
Based on your data set (both current and expected growth), you need to determine a proper scheme so that you can get to your Partition AND to your Row in the fastest way possible. Knowing how the lookup occurs, you can make logical decisions as to what path is going to get you there fast enough: more partitions with fewer rows, versus fewer partitions with more rows, etc.
For #1, it's however fast scanning ten entities is.
For #2, it depends on how many entities there are in that RowKey range. (Specifying the partition key and a range for the row key means we'll do an indexed query over just the entities within that range.) You didn't say how many there are, but if, as an example, there are ten, then it should be the same performance as #1.
Tables are indexed by (PartitionKey, RowKey). Rows with the same PartitionKey are guaranteed to be served from the same partition. Rows with different PartitionKeys may or may not be on the same partition. So I don't know how you would know that you have only 10 rows in a partition.
If you have only 10 rows with PartitionKey="123" then the first query will be "fast".
The second query will be "fast".