How to quickly download 100 million rows from Azure Table Storage

I have been tasked with downloading around 100 million rows of data from Azure Table Storage. The important thing here is speed.

The process we are using downloads 10,000 rows from Azure Table Storage and processes them into a local instance of SQL Server. While processing the rows, it deletes 100 rows at a time from the Azure table. The process is threaded, with 8 threads each downloading 10,000 rows at a time.
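
Roughly, each worker runs a loop like the sketch below (written against the current Azure.Data.Tables SDK purely for illustration; the connection string, entity shape, and SQL bulk-copy step are placeholders):

// Sketch of one worker's loop: page rows out of the table, load them into SQL Server,
// then delete what was copied. The service returns at most 1,000 entities per request,
// so a 10,000-row batch is really several round trips behind the scenes.
using System.Collections.Generic;
using System.Linq;
using Azure;
using Azure.Data.Tables;

var table = new TableClient("<storage-connection-string>", "CommandLogTable");

foreach (Page<TableEntity> page in table.Query<TableEntity>(maxPerPage: 1000).AsPages())
{
    BulkCopyIntoSqlServer(page.Values);   // hypothetical local-processing step (e.g. SqlBulkCopy)

    // Entity group transactions take at most 100 operations, all with the same PartitionKey.
    foreach (var partitionGroup in page.Values.GroupBy(e => e.PartitionKey))
        foreach (var chunk in partitionGroup.Chunk(100))
            table.SubmitTransaction(
                chunk.Select(e => new TableTransactionAction(TableTransactionActionType.Delete, e)).ToList());
}

static void BulkCopyIntoSqlServer(IReadOnlyList<TableEntity> rows) { /* placeholder */ }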

The only problem is that, according to our calculations, it will take around 40 days to download and process the roughly 100 million rows we have stored. Does anyone know a faster way to accomplish this task?

A side question: during the download process, Azure will send back XML that just does not have any data. It doesn't send back an error, but it sends this:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<feed xml:base="azure-url/" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata" xmlns="http://www.w3.org/2005/Atom">
  <title type="text">CommandLogTable</title>
  <id>azure-url/CommandLogTable</id>
  <updated>2010-07-12T19:50:55Z</updated>
  <link rel="self" title="CommandLogTable" href="CommandLogTable" />
</feed>

Does anyone else have this problem and have a fix for it?

6 Answers

滥情稳全场 2024-09-16 07:15:42

In addition to the suggestions about disabling nagling, there is an extremely nice post on improving the performance of Azure Table Storage. Actually, improving the speed of ADO.NET deserialization provided a 10x speed-up for Sqwarea (a massive online multiplayer game built with the Lokad.Cloud framework).

However, table storage might not be the best solution for huge storage scenarios (more than millions of records). Latency is the killing factor here. To work around that, I've been successfully using file-based database storage, where changes are done locally (without any network latency of CLAP) and are committed to a BLOB by uploading the file back (concurrency and scaling out were enforced here by the Lokad.CQRS App Engine for Windows Azure).

Inserting 10 million records into a SQLite database at once (within a transaction, where each record was indexed by 2 fields and had arbitrary schema-less data serialized via ProtoBuf) took only 200 seconds in total on average. Uploading/downloading the resulting file took roughly 15 seconds on average. Random reads by index are instantaneous (provided the file is cached in local storage and the ETag matches).
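
For context, that kind of transactional bulk insert looks roughly like the sketch below (Microsoft.Data.Sqlite, with a made-up table shape and dummy bytes standing in for the ProtoBuf-serialized payload):

// Sketch: insert millions of rows into a local SQLite file inside a single transaction,
// with two indexed key columns and an opaque BLOB payload column.
using Microsoft.Data.Sqlite;

using var conn = new SqliteConnection("Data Source=local-view.db");
conn.Open();

using (var ddl = conn.CreateCommand())
{
    ddl.CommandText =
        "CREATE TABLE IF NOT EXISTS Records (Key1 TEXT, Key2 TEXT, Payload BLOB);" +
        "CREATE INDEX IF NOT EXISTS IX_Key1 ON Records(Key1);" +
        "CREATE INDEX IF NOT EXISTS IX_Key2 ON Records(Key2);";
    ddl.ExecuteNonQuery();
}

using var tx = conn.BeginTransaction();
using var insert = conn.CreateCommand();
insert.Transaction = tx;
insert.CommandText = "INSERT INTO Records (Key1, Key2, Payload) VALUES ($k1, $k2, $p)";
var k1 = insert.Parameters.Add("$k1", SqliteType.Text);
var k2 = insert.Parameters.Add("$k2", SqliteType.Text);
var p  = insert.Parameters.Add("$p",  SqliteType.Blob);

for (var i = 0; i < 10_000_000; i++)
{
    k1.Value = "partition-" + (i % 1000);
    k2.Value = "row-" + i;
    p.Value  = new byte[] { 0x08, 0x01 };   // stand-in for a ProtoBuf-serialized payload
    insert.ExecuteNonQuery();               // every insert rides the same transaction
}
tx.Commit();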

ぶ宁プ宁ぶ 2024-09-16 07:15:42

As to your side question, I expect you're getting a "continuation token." If you're using the .NET storage client library, try adding .AsTableServiceQuery() to your query.
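
As a rough sketch with the 2010-era Microsoft.WindowsAzure.StorageClient library (the entity class and table name here are placeholders):

// Sketch: AsTableServiceQuery() wraps the LINQ query in a CloudTableQuery<T>, and
// Execute() transparently follows continuation tokens, so an "empty" page that only
// carries a continuation header is no longer mistaken for the end of the table.
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public class CommandLogEntity : TableServiceEntity { /* your properties here */ }

public static class Downloader
{
    public static void DownloadAll(string connectionString)
    {
        var account = CloudStorageAccount.Parse(connectionString);
        var context = account.CreateCloudTableClient().GetDataServiceContext();

        var query = context.CreateQuery<CommandLogEntity>("CommandLogTable")
                           .AsTableServiceQuery();

        foreach (var entity in query.Execute())   // keeps requesting pages until the feed really ends
        {
            // process entity
        }
    }
}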

As to your main question, fanning out the query is the best thing that you can do. It sounds like you're accessing storage from a local machine (not in Windows Azure). If so, I would imagine you can speed things up quite a bit by deploying a small service to Windows Azure which fetches the data from table storage (much faster, since there's higher bandwidth and lower latency within the data center), and then compresses the results and sends them back down to your local machine. There's a lot of overhead to the XML Windows Azure tables send back, so stripping that out and bundling up rows would probably save a lot of transfer time.
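
For the "compress the results" step, a minimal sketch of what the in-datacenter service could do before sending a batch back (the serialization format you turn rows into first is up to you):

// Sketch: gzip a serialized batch of rows before returning it to the on-premises client.
using System.IO;
using System.IO.Compression;

var compressed = CompressBatch(new byte[] { 1, 2, 3 });   // usage example with dummy bytes

static byte[] CompressBatch(byte[] serializedRows)
{
    using var output = new MemoryStream();
    using (var gzip = new GZipStream(output, CompressionLevel.Optimal))
    {
        gzip.Write(serializedRows, 0, serializedRows.Length);
    }                                   // disposing the GZipStream flushes the gzip footer
    return output.ToArray();
}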

帅哥哥的热头脑 2024-09-16 07:15:42

The fastest way to get your data (supported by Amazon but not yet by Azure) is to ship them a USB disk (even a USB stick), have them put the data on the disk, and ship it back to you.

Another option is to use AppFabric Service Bus to get the data out to another system when it is created, instead of waiting to download it all at once.

虚拟世界 2024-09-16 07:15:42

Aside from suggestions about bandwidth limits, you could easily be running into storage account limits, as each table partition is limited to roughly 500 transactions per second.

Further: there's an optimization deployed (Nagle's algorithm) that could actually slow things down for small reads (such as your 1K data reads). Here's a blog post about disabling Nagling, which could potentially speed up your reads considerably, especially if you're running directly in an Azure service without Internet latency in the way.
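
In .NET those knobs live on ServicePointManager and have to be set before the first request goes out. A minimal sketch (the connection-limit line is an extra, commonly paired tweak, not something from the post above):

// Sketch: apply these once at startup, before any storage call is made.
using System.Net;

ServicePointManager.UseNagleAlgorithm = false;   // stop coalescing small TCP writes (Nagle)
ServicePointManager.Expect100Continue = false;   // skip the 100-Continue handshake round trip
ServicePointManager.DefaultConnectionLimit = 64; // raise the default of 2 outbound connections per host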

瞎闹 2024-09-16 07:15:42

Most likely, your limiting factor is network bandwidth, not processing. If that's the case, your only real hope is to expand out: more machines running more threads to download data.

BTW, doesn't Azure expose some "export" mechanism that will remove the need to download all of the rows manually?

过度放纵 2024-09-16 07:15:42

The big factor here is how the data is spread across partitions. A query that spans partition boundaries will return at each boundary, requiring a resubmit - even if the partition in question has 0 rows. If the data is 1 partition = 1 row, then it will be slow, but you could increase the thread count well above 8. If the data is in n partitions = m rows, then the ideas below should speed you up.

Assuming that you have multiple partitions, each with some number of rows, the fastest way to go will be to spin up as many threads as possible (in .NET, that means PLINQ, Parallel.ForEach(partition), or QueueWorkItem()) and have each thread scan its partition for all rows, process them, post to SQL, and delete before returning.
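
A sketch of that per-partition fan-out (modern Azure.Data.Tables SDK; the partition-key list, degree of parallelism, and the process/post/delete step are all placeholders):

// Sketch: one worker per partition, each scanning only its own PartitionKey
// so that no query ever crosses a partition boundary.
using System.Threading.Tasks;
using Azure.Data.Tables;

var table = new TableClient("<storage-connection-string>", "CommandLogTable");
string[] partitionKeys = LoadPartitionKeyList();   // hypothetical: however you enumerate partitions

Parallel.ForEach(partitionKeys,
    new ParallelOptions { MaxDegreeOfParallelism = 32 },
    pk =>
    {
        var rows = table.Query<TableEntity>(filter: $"PartitionKey eq '{pk}'", maxPerPage: 1000);
        foreach (var entity in rows)        // Pageable<T> follows continuation tokens as you enumerate
        {
            ProcessPostAndDelete(entity);   // hypothetical: push to SQL, then delete from the table
        }
    });

static string[] LoadPartitionKeyList() => new[] { "p0", "p1" };   // placeholder
static void ProcessPostAndDelete(TableEntity e) { /* placeholder */ }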

Given the latencies involved (tens of ms) and the multiple round trips, even with 8 threads you are probably not as busy as you might think. Also, you don't mention which VM size you are using, but you may want to profile different sizes.

Alternatively, another way to do this would be to leverage a queue and some 'n' workers. For each partition (or set of partitions) put a message in the queue. Have the workers pull from the queue (multi-threaded) and query/process/post/repeat. You could spin up as many workers as needed and be spread across more of the data center (i.e. more throughput, etc.).
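
A sketch of that queue-driven fan-out using Azure Storage queues (queue name, partition list, and the per-partition export step are placeholders; a real deployment would run the coordinator and workers as separate processes):

// Sketch: the coordinator enqueues one message per partition; each worker instance
// pulls a partition key, processes it, and repeats, so you scale by adding workers.
using System;
using Azure.Storage.Queues;

var queue = new QueueClient("<storage-connection-string>", "partitions-to-export");
queue.CreateIfNotExists();

// Coordinator side: one message per partition key.
foreach (var pk in new[] { "p0", "p1", "p2" })        // placeholder partition list
    queue.SendMessage(pk);

// Worker side (run on as many instances as you like):
while (true)
{
    var msg = queue.ReceiveMessage(visibilityTimeout: TimeSpan.FromMinutes(10));
    if (msg.Value is null) break;                     // queue drained (a real worker would poll or back off)

    ExportPartition(msg.Value.MessageText);           // hypothetical query/process/post/delete step
    queue.DeleteMessage(msg.Value.MessageId, msg.Value.PopReceipt);
}

static void ExportPartition(string partitionKey) { /* placeholder */ }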
