How to quickly download 100 million rows from Azure Table Storage

I have been tasked with downloading around 100 million rows of data from Azure Table Storage. The important thing here is speed.

The process we are using downloads 10,000 rows from Azure Table Storage and processes them into a local instance of SQL Server. While processing the rows, it deletes 100 rows at a time from the Azure table. The process is threaded, with 8 threads each downloading 10,000 rows at a time.
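
Roughly, each worker runs a loop like the sketch below (written against the current Azure.Data.Tables SDK purely for illustration; the connection string, entity shape, and SQL bulk-copy step are placeholders):

// Sketch of one worker's loop: page rows out of the table, load them into SQL Server,
// then delete what was copied. The service returns at most 1,000 entities per request,
// so a 10,000-row batch is really several round trips behind the scenes.
using System.Collections.Generic;
using System.Linq;
using Azure;
using Azure.Data.Tables;

var table = new TableClient("<storage-connection-string>", "CommandLogTable");

foreach (Page<TableEntity> page in table.Query<TableEntity>(maxPerPage: 1000).AsPages())
{
    BulkCopyIntoSqlServer(page.Values);   // hypothetical local-processing step (e.g. SqlBulkCopy)

    // Entity group transactions take at most 100 operations, all with the same PartitionKey.
    foreach (var partitionGroup in page.Values.GroupBy(e => e.PartitionKey))
        foreach (var chunk in partitionGroup.Chunk(100))
            table.SubmitTransaction(
                chunk.Select(e => new TableTransactionAction(TableTransactionActionType.Delete, e)).ToList());
}

static void BulkCopyIntoSqlServer(IReadOnlyList<TableEntity> rows) { /* placeholder */ }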

The only problem is that, according to our calculations, it will take around 40 days to download and process the roughly 100 million rows we have stored. Does anyone know a faster way to accomplish this task?

A side question: during the download process, Azure will send back XML that just does not have any data. It doesn't send back an error, but it sends this:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<feed xml:base="azure-url/" xmlns:d="http://schemas.microsoft.com/ado/2007/08/dataservices" xmlns:m="http://schemas.microsoft.com/ado/2007/08/dataservices/metadata" xmlns="http://www.w3.org/2005/Atom">
  <title type="text">CommandLogTable</title>
  <id>azure-url/CommandLogTable</id>
  <updated>2010-07-12T19:50:55Z</updated>
  <link rel="self" title="CommandLogTable" href="CommandLogTable" />
</feed>

Does anyone else have this problem and have a fix for it?

6 Answers

滥情稳全场 2024-09-16 07:15:42

In addition to the suggestions about disabling nagling, there is an extremely nice post on improving the performance of Azure Table Storage. Actually, improving the speed of ADO.NET deserialization provided a 10x speed-up for Sqwarea (a massive online multiplayer game built with the Lokad.Cloud framework).

However, table storage might not be the best solution for huge storage scenarios (more than millions of records). Latency is the killing factor here. To work around that, I've been successfully using file-based database storage, where changes are done locally (without any network latency of CLAP) and are committed to a BLOB by uploading the file back (concurrency and scaling out were enforced here by the Lokad.CQRS App Engine for Windows Azure).

Inserting 10 million records into a SQLite database at once (within a transaction, where each record was indexed by 2 fields and had arbitrary schema-less data serialized via ProtoBuf) took only 200 seconds in total on average. Uploading/downloading the resulting file took roughly 15 seconds on average. Random reads by index are instantaneous (provided the file is cached in local storage and the ETag matches).
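
For context, that kind of transactional bulk insert looks roughly like the sketch below (Microsoft.Data.Sqlite, with a made-up table shape and dummy bytes standing in for the ProtoBuf-serialized payload):

// Sketch: insert millions of rows into a local SQLite file inside a single transaction,
// with two indexed key columns and an opaque BLOB payload column.
using Microsoft.Data.Sqlite;

using var conn = new SqliteConnection("Data Source=local-view.db");
conn.Open();

using (var ddl = conn.CreateCommand())
{
    ddl.CommandText =
        "CREATE TABLE IF NOT EXISTS Records (Key1 TEXT, Key2 TEXT, Payload BLOB);" +
        "CREATE INDEX IF NOT EXISTS IX_Key1 ON Records(Key1);" +
        "CREATE INDEX IF NOT EXISTS IX_Key2 ON Records(Key2);";
    ddl.ExecuteNonQuery();
}

using var tx = conn.BeginTransaction();
using var insert = conn.CreateCommand();
insert.Transaction = tx;
insert.CommandText = "INSERT INTO Records (Key1, Key2, Payload) VALUES ($k1, $k2, $p)";
var k1 = insert.Parameters.Add("$k1", SqliteType.Text);
var k2 = insert.Parameters.Add("$k2", SqliteType.Text);
var p  = insert.Parameters.Add("$p",  SqliteType.Blob);

for (var i = 0; i < 10_000_000; i++)
{
    k1.Value = "partition-" + (i % 1000);
    k2.Value = "row-" + i;
    p.Value  = new byte[] { 0x08, 0x01 };   // stand-in for a ProtoBuf-serialized payload
    insert.ExecuteNonQuery();               // every insert rides the same transaction
}
tx.Commit();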

ぶ宁プ宁ぶ 2024-09-16 07:15:42

As to your side question, I expect you're getting a "continuation token." If you're using the .NET storage client library, try adding .AsTableServiceQuery() to your query.
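
As a rough sketch with the 2010-era Microsoft.WindowsAzure.StorageClient library (the entity class and table name here are placeholders):

// Sketch: AsTableServiceQuery() wraps the LINQ query in a CloudTableQuery<T>, and
// Execute() transparently follows continuation tokens, so an "empty" page that only
// carries a continuation header is no longer mistaken for the end of the table.
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

public class CommandLogEntity : TableServiceEntity { /* your properties here */ }

public static class Downloader
{
    public static void DownloadAll(string connectionString)
    {
        var account = CloudStorageAccount.Parse(connectionString);
        var context = account.CreateCloudTableClient().GetDataServiceContext();

        var query = context.CreateQuery<CommandLogEntity>("CommandLogTable")
                           .AsTableServiceQuery();

        foreach (var entity in query.Execute())   // keeps requesting pages until the feed really ends
        {
            // process entity
        }
    }
}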

As to your main question, fanning out the query is the best thing that you can do. It sounds like you're accessing storage from a local machine (not in Windows Azure). If so, I would imagine you can speed things up quite a bit by deploying a small service to Windows Azure which fetches the data from table storage (much faster, since there's higher bandwidth and lower latency within the data center), and then compresses the results and sends them back down to your local machine. There's a lot of overhead to the XML Windows Azure tables send back, so stripping that out and bundling up rows would probably save a lot of transfer time.
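
For the "compress the results" step, a minimal sketch of what the in-datacenter service could do before sending a batch back (the serialization format you turn rows into first is up to you):

// Sketch: gzip a serialized batch of rows before returning it to the on-premises client.
using System.IO;
using System.IO.Compression;

var compressed = CompressBatch(new byte[] { 1, 2, 3 });   // usage example with dummy bytes

static byte[] CompressBatch(byte[] serializedRows)
{
    using var output = new MemoryStream();
    using (var gzip = new GZipStream(output, CompressionLevel.Optimal))
    {
        gzip.Write(serializedRows, 0, serializedRows.Length);
    }                                   // disposing the GZipStream flushes the gzip footer
    return output.ToArray();
}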

帅哥哥的热头脑 2024-09-16 07:15:42

The fastest way to get your data (supported by Amazon but not yet by Azure) is to ship them a USB disk (even a USB stick), have them put the data on the disk, and ship it back to you.

Another option is to use AppFabric Service Bus to get the data out to another system when it is created, instead of waiting to download it all at once.

虚拟世界 2024-09-16 07:15:42

Aside from suggestions about bandwidth limits, you could easily be running into storage account limits, as each table partition is limited to roughly 500 transactions per second.

Further: there's an optimization deployed (Nagle's algorithm) that could actually slow things down for small reads (such as your 1K data reads). Here's a blog post about disabling Nagling, which could potentially speed up your reads considerably, especially if you're running directly in an Azure service without Internet latency in the way.
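
In .NET those knobs live on ServicePointManager and have to be set before the first request goes out. A minimal sketch (the connection-limit line is an extra, commonly paired tweak, not something from the post above):

// Sketch: apply these once at startup, before any storage call is made.
using System.Net;

ServicePointManager.UseNagleAlgorithm = false;   // stop coalescing small TCP writes (Nagle)
ServicePointManager.Expect100Continue = false;   // skip the 100-Continue handshake round trip
ServicePointManager.DefaultConnectionLimit = 64; // raise the default of 2 outbound connections per host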

瞎闹 2024-09-16 07:15:42

Most likely, your limiting factor is network bandwidth, not processing. If that's the case, your only real hope is to expand out: more machines running more threads to download data.

BTW, doesn't Azure expose some "export" mechanism that will remove the need to download all of the rows manually?

过度放纵 2024-09-16 07:15:42

The big factor here is how the data is spread across partitions. A query that spans partition boundaries will return at each boundary, requiring a resubmit - even if the partition in question has 0 rows. If the data is 1 partition = 1 row, then it will be slow, but you could increase the thread count well above 8. If the data is in n partitions = m rows, then the ideas below should speed you up.

Assuming that you have multiple partitions, each with some number of rows, the fastest way to go will be to spin up as many threads as possible (in .NET, that means PLINQ, Parallel.ForEach(partition), or QueueWorkItem()) and have each thread scan its partition for all rows, process them, post to SQL, and delete before returning.
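
A sketch of that per-partition fan-out (modern Azure.Data.Tables SDK; the partition-key list, degree of parallelism, and the process/post/delete step are all placeholders):

// Sketch: one worker per partition, each scanning only its own PartitionKey
// so that no query ever crosses a partition boundary.
using System.Threading.Tasks;
using Azure.Data.Tables;

var table = new TableClient("<storage-connection-string>", "CommandLogTable");
string[] partitionKeys = LoadPartitionKeyList();   // hypothetical: however you enumerate partitions

Parallel.ForEach(partitionKeys,
    new ParallelOptions { MaxDegreeOfParallelism = 32 },
    pk =>
    {
        var rows = table.Query<TableEntity>(filter: $"PartitionKey eq '{pk}'", maxPerPage: 1000);
        foreach (var entity in rows)        // Pageable<T> follows continuation tokens as you enumerate
        {
            ProcessPostAndDelete(entity);   // hypothetical: push to SQL, then delete from the table
        }
    });

static string[] LoadPartitionKeyList() => new[] { "p0", "p1" };   // placeholder
static void ProcessPostAndDelete(TableEntity e) { /* placeholder */ }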

Given the latencies involved (tens of ms) and the multiple round trips, even with 8 threads you are probably not as busy as you might think. Also, you don't mention which VM size you are using, but you may want to profile different sizes.

Alternatively, another way to do this would be to leverage a queue and some 'n' workers. For each partition (or set of partitions) put a message in the queue. Have the workers pull from the queue (multi-threaded) and query/process/post/repeat. You could spin up as many workers as needed and be spread across more of the data center (i.e. more throughput, etc.).
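
A sketch of that queue-driven fan-out using Azure Storage queues (queue name, partition list, and the per-partition export step are placeholders; a real deployment would run the coordinator and workers as separate processes):

// Sketch: the coordinator enqueues one message per partition; each worker instance
// pulls a partition key, processes it, and repeats, so you scale by adding workers.
using System;
using Azure.Storage.Queues;

var queue = new QueueClient("<storage-connection-string>", "partitions-to-export");
queue.CreateIfNotExists();

// Coordinator side: one message per partition key.
foreach (var pk in new[] { "p0", "p1", "p2" })        // placeholder partition list
    queue.SendMessage(pk);

// Worker side (run on as many instances as you like):
while (true)
{
    var msg = queue.ReceiveMessage(visibilityTimeout: TimeSpan.FromMinutes(10));
    if (msg.Value is null) break;                     // queue drained (a real worker would poll or back off)

    ExportPartition(msg.Value.MessageText);           // hypothetical query/process/post/delete step
    queue.DeleteMessage(msg.Value.MessageId, msg.Value.PopReceipt);
}

static void ExportPartition(string partitionKey) { /* placeholder */ }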
