Everyone warns not to query against anything other than RowKey or PartitionKey in Azure Table Storage (ATS), lest you be forced to table scan. For a while, this paralyzed me into trying to come up with exactly the right PK and RK, and into creating pseudo-secondary indexes in other tables when I needed to query something else.
However, it occurs to me that I commonly table scan in SQL Server when I think it appropriate.
So the question becomes: how fast can I table scan an Azure Table? Is this a constant in entities/second, or does it depend on record size, etc.? Are there rules of thumb for how many records is too many to table scan if you want a responsive application?
The issue with a table scan has to do with crossing partition boundaries. The level of performance you are guaranteed is explicitly set at the partition level. Therefore, when you run a full table scan, it's a) not very efficient, and b) has no guarantee of performance. This is because the partitions themselves sit on separate storage nodes, and when you run a cross-partition scan, you're potentially consuming massive amounts of resources (tying up multiple nodes simultaneously).
I believe that crossing these boundaries also produces continuation tokens, which require additional round-trips to storage to retrieve the results. This reduces performance and increases the transaction count (and, subsequently, cost).
If the number of partitions/nodes you're crossing is fairly small, you likely won't notice any issues.
But please don't quote me on this. I'm not an expert on Azure Storage; it's actually the area of Azure I'm the least knowledgeable about. :P
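To make the round-trip cost concrete, here's a minimal sketch (plain Python arithmetic, not the Azure SDK) of how many storage requests a full scan needs at minimum, assuming the service returns at most 1,000 entities per response and hands back a continuation token whenever it stops early:

```python
import math

MAX_ENTITIES_PER_RESPONSE = 1000  # ATS returns at most 1,000 entities per request


def scan_round_trips(total_entities, partition_boundaries_crossed=0):
    """Lower-bound estimate of storage round-trips for a full table scan.

    Every response holds at most 1,000 entities, and each partition-boundary
    crossing can force an extra continuation even on a partially filled page.
    """
    size_trips = math.ceil(total_entities / MAX_ENTITIES_PER_RESPONSE)
    return size_trips + partition_boundaries_crossed


# Scanning 250,000 entities costs at least 250 round-trips (and 250 billed
# transactions) before any partition-boundary continuations are added.
print(scan_round_trips(250_000))        # 250
print(scan_round_trips(250_000, 100))   # 350 with 100 boundary crossings
```

Each of those round-trips is also a billed transaction, which is why the transaction count (and cost) climbs with table size even when the scan "works".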
I think Brent is 100% on the money, but if you still want to try it, I can only suggest running some tests to find out for yourself. Try including the PartitionKey in your queries to prevent crossing partitions, because at the end of the day that's the performance killer.
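As a concrete illustration of "include the PartitionKey", here's a small sketch that builds the OData `$filter` string Table Storage accepts (the table contents and the `Status` property are made-up examples):

```python
def partition_query(partition_key, extra_filter=None):
    """Build an OData $filter that pins the query to one partition.

    Pinning PartitionKey lets the service route the request to a single
    storage node instead of fanning out across all partitions.
    """
    clause = f"PartitionKey eq '{partition_key}'"
    if extra_filter:
        clause = f"({clause}) and ({extra_filter})"
    return clause


# A full-table scan filters only on a non-key property...
scan_filter = "Status eq 'active'"
# ...while adding PartitionKey keeps the same predicate partition-local.
print(partition_query("customer-42", scan_filter))
# (PartitionKey eq 'customer-42') and (Status eq 'active')
```

The same predicate runs either way; the difference is whether the service can answer it from one partition or has to walk them all.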
Azure tables are not optimized for table scans. Scanning the table might be acceptable for a long-running background job, but I wouldn't do it when a quick response is needed. With a table of any reasonable size you will have to handle continuation tokens as the query reaches a partition boundary or obtains 1k results.
The Azure storage team has a great post which explains the scalability targets. The throughput target for a single table partition is 500 entities/sec. The overall target for a storage account is 5,000 transactions/sec.
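Those targets let you do a back-of-envelope scan-time estimate. The sketch below assumes the quoted ~500 entities/sec per partition and a uniform distribution of entities across partitions, both of which are simplifications:

```python
PARTITION_THROUGHPUT = 500  # entities/sec per partition (target, not a floor)


def scan_seconds(total_entities, partitions=1, parallel=False):
    """Rough time to scan a table at the per-partition target rate.

    A naive serial scan walks partitions one at a time; a parallel scan
    (issuing one range query per partition) divides the work.
    """
    if parallel and partitions > 1:
        per_partition = total_entities / partitions
        return per_partition / PARTITION_THROUGHPUT
    return total_entities / PARTITION_THROUGHPUT


# One million entities: ~33 minutes serially, ~20 seconds across 100
# partitions queried in parallel -- too slow for an interactive request
# either way unless the table is small.
print(scan_seconds(1_000_000))                                  # 2000.0
print(scan_seconds(1_000_000, partitions=100, parallel=True))   # 20.0
```

Even the parallel figure ignores continuation-token latency and the 5,000 tx/sec account-wide cap, so treat these as optimistic lower bounds.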
The answer is pagination. Use `top_size` (the maximum number of results or records in a result) in conjunction with the `next_partition_key` and `next_row_key` continuation tokens. That makes a significant, even dramatic, difference in performance. For one, your results are statistically more likely to come from a single partition. Plain results show that sets are grouped by the partition continuation key, not the row continuation key.
In other words, you also need to think about your UI or system output. Don't bother returning more than 10 to 20 results, 50 at most. The user probably won't utilize or examine any more.
Sounds foolish? Do a Google search for "dog" and notice that the search returns only 10 items, no more. The next records are available if you bother to hit 'continue'. Research has proven that almost no user ventures beyond that first page.
Using `select` (returning a subset of the key-values) may also make a difference; for example, use `select="PartitionKey,RowKey"` or `'Name'` -- whatever minimum you need.
The accepted answer is slightly incorrect: the continuation token is used not because of crossing boundaries but because Azure tables permit no more than 1,000 results; therefore the two continuation tokens are used for the next set. The default `top_size` is essentially 1,000.
For your viewing pleasure, here's the description for querying entities from the Azure Python API; others are much the same.
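The pagination loop itself looks roughly like this. It's a self-contained sketch: `fetch_page` is a stand-in for the real query call (which in the Azure Python API would hand back `next_partition_key` / `next_row_key` when more results remain), and the tuple-list "table" is invented for the demo:

```python
def fetch_page(entities, top_size, next_partition_key=None, next_row_key=None):
    """Stand-in for a table query: returns one page plus continuation keys.

    `entities` is a list of (partition_key, row_key) tuples in storage order;
    the real call would hit Azure Table Storage instead.
    """
    start = 0
    if next_partition_key is not None:
        start = entities.index((next_partition_key, next_row_key))
    page = entities[start:start + top_size]
    if start + top_size >= len(entities):
        return page, None, None          # no more results: no continuation
    nxt = entities[start + top_size]     # first entity of the next page
    return page, nxt[0], nxt[1]


def scan_all(entities, top_size=20):
    """Drain a query by following the continuation tokens page by page."""
    results, pk, rk = [], None, None
    while True:
        page, pk, rk = fetch_page(entities, top_size, pk, rk)
        results.extend(page)
        if pk is None:
            break
    return results


data = [(f"p{i // 10}", f"r{i:03d}") for i in range(45)]
assert scan_all(data, top_size=20) == data   # three round-trips: 20 + 20 + 5
```

For an interactive UI you would stop after the first `fetch_page` call, show those 10-20 rows, and stash the two continuation keys to serve the user's 'next page' click.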