查询 Azure 表中最新项目的最快方法?
我有一个 Azure 表,客户在其中发布消息,单个表中可能有数百万条消息。我想找到获取最近 10 分钟内发布的消息的最快方法(这是我刷新网页的频率)。由于只有分区键被索引,所以我考虑了使用日期和日期的想法。消息作为分区键发布的时间,例如 ISO8601 日期格式的字符串,如“2009-06-15T13:45:30.0900000”
示例伪代码:
var message = "Hello word!";
var messagePartitionKey = DateTime.Now.ToString("o");
var messageEntity = new MessageEntity(messagePartitionKey, message);
dataSource.Insert(messageEntity);
,然后像这样查询过去 10 分钟内发布的消息(再次未经测试的伪代码):
// Get the date and time 10 minutes ago
var tenMinutesAgo = DateTime.Now.Subtract(new TimeSpan(0, 10, 0)).ToString("o");
// Query for the latest messages
var latestMessages = (from t in
context.Messages
where t.PartitionKey.CompareTo(tenMinutesAgo) <= 0
select t
)
但这会被索引很好地接受吗?还是会导致全表扫描?有人有更好的主意吗?我知道每个表项上都有一个时间戳,但它没有索引,所以对于我的目的来说它太慢了。
I have a Azure table where customers post messages, there may be millions of messages in a single table. I want to find the fastest way of getting the messages posted within the last 10 minutes (which is how often I refresh the web page). Since only the partition key is indexed I have played with the idea of using the date & time the message was posted as a partition key, for example a string as a ISO8601 date format like "2009-06-15T13:45:30.0900000"
Example pseudo code:
var message = "Hello word!";
var messagePartitionKey = DateTime.Now.ToString("o");
var messageEntity = new MessageEntity(messagePartitionKey, message);
dataSource.Insert(messageEntity);
, and then query for the messages posted within the last 10 minutes like this (untested pseudo code again):
// Get the date and time 10 minutes ago
var tenMinutesAgo = DateTime.Now.Subtract(new TimeSpan(0, 10, 0)).ToString("o");
// Query for the latest messages
var latestMessages = (from t in
context.Messages
where t.PartitionKey.CompareTo(tenMinutesAgo) <= 0
select t
)
But will this be taken well by the index? Or will it cause a full table scan? Anyone have a better idea of doing this? I know there is a timestamp on each table item, but it is not indexed so it will be too slow for my purpose.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
我认为你的基本想法是正确的。您设计的查询应该与您希望的一样高效。但我可以提供一些改进。
不要使用
DateTime.Now
,而是使用Date.UtcNow
。据我了解,实例无论如何都设置为使用 Utc 时间作为基础,但这只是确保您将苹果与苹果进行比较,并且您可以在显示它们时可靠地将时间转换回您想要的任何时区。与其将时间存储为
.ToString("o")
将时间转换为刻度并存储,您最终会遇到更少的格式问题(有时您会在最后得到时区规范) ,有时不是)。此外,如果您始终希望查看从最新到最旧排序的这些消息,您可以从最大刻度数中减去刻度数,例如指定行键也是一个好主意。虽然两条消息在完全相同的时间发布的可能性极小,但这并非不可能。如果你没有明显的行键,那么只需将其设置为 Guid 即可。
I think you've got the right basic idea. The query you've designed should be about as efficient as you could hope for. But there are some improvements I could offer.
Rather than using
DateTime.Now
, useDate.UtcNow
. From what I understand instances are set to use Utc time as their base anyway, but this just makes sure you're comparing apples with apples and you can reliable convert the time back into whatever timezone you want when displaying them.Rather than storing the time as
.ToString("o")
turn the time into ticks and store that, you'll end up with less formatting problems (sometimes you'll get the timezone specification at the end, sometimes not). Also if you always want to see these messages sorted from most recent to oldest you can subtract the number of ticks from the max number of ticks e.g.It would also be a good idea to specify a row key. While it is highly unlikely that two messages will be posted with exactly the same time, it's not impossible. If you don't have an obvious row key, then just set it to be a Guid.
Table 的主键是 PartitionKey 和 RowKey 的组合(形成聚集索引)。
在您的情况下,只需使用 RowKey 而不是 ParitionKey(为此提供一个常量值)。
您还可以遵循诊断方法,例如每十分钟创建一个新的分区键。但这种方法主要是针对归档/清除等需求,
The Primary key for Table is the combination of PartitionKey and RowKey(which forms a clustered index).
In your case, just go for RowKey instead of ParitionKey(provide a constant value for this).
You can also follow the Diagnostics approach, like for every ten minutes create a new Partition Key. But this approach is mainly for requirements like Archieving/Purging etc.,
我建议执行类似于 Diagnostics API 使用 WADPerformanceCountersTable 执行的操作。 PartitionKey 将多个时间戳分组到单个项目中。即:它将所有时间戳四舍五入为最近的几分钟(例如,最近的 5 分钟)。这样,您就没有有限数量的分区键,但仍然可以对它们进行范围查询。
因此,例如,您可以有一个 PartitionKey 映射到每个时间戳,四舍五入为 00:00、00:05、00:10、00:15 等,然后转换为 Ticks
I would suggest doing something similar to what Diagnostics API is doing with WADPerformanceCountersTable. There PartitionKey groups a number of timestamps into a single item. Ie: it rounds all timestamps into nearest few minutes (say, nearest 5 minutes). This way you do not have a limited amount of partition keys and yet are still able to do ranged queries on them.
So, for example, you can have a PartitionKey that maps to each timestamp that is rounded into 00:00, 00:05, 00:10, 00:15, etc.. and then converted to Ticks