Bulk uploading a large number of images to Azure Blob Storage
I have about 110,000 images of various formats (jpg, png and gif) and sizes (2-40KB) stored locally on my hard drive. I need to upload them to Azure Blob Storage. While doing this, I need to set some metadata and the blob's ContentType, but otherwise it's a straight up bulk upload.
I'm currently using the following to handle uploading one image at a time (paralleled over 5-10 concurrent Tasks).
static void UploadPhoto(Image pic, string filename, ImageFormat format)
{
    // convert image to bytes
    using (MemoryStream ms = new MemoryStream())
    {
        pic.Save(ms, format);
        ms.Position = 0;

        // create the blob, set metadata and properties
        var blob = container.GetBlobReference(filename);
        blob.Metadata["Filename"] = filename;
        blob.Properties.ContentType = MimeHandler.GetContentType(Path.GetExtension(filename));

        // upload!
        blob.UploadFromStream(ms);
        blob.SetMetadata();
        blob.SetProperties();
    }
}
I was wondering if there was another technique I could employ to handle the uploading, to make it as fast as possible. This particular project involves importing a lot of data from one system to another, and for customer reasons it needs to happen as quickly as possible.
6 Answers
Okay, here's what I did. I tinkered around with running BeginUploadFromStream(), then BeginSetMetadata(), then BeginSetProperties() in an asynchronous chain, paralleled over 5-10 threads (a combination of ElvisLive's and knightpfhor's suggestions). This worked, but anything over 5 threads had terrible performance, taking upwards of 20 seconds for each thread (working on a page of ten images at a time) to complete.
So, to sum up the performance differences:
Okay, that's pretty interesting. One instance uploading blobs synchronously performed 5x better than each thread in the other approach. So, even running the best async balance of 5 threads nets essentially the same performance.
So, I tweaked my image file importing to separate the images into folders containing 10,000 images each. Then I used Process.Start() to launch an instance of my blob uploader for each folder. I have 170,000 images to work with in this batch, so that means 17 instances of the uploader. When running all of those on my laptop, performance across all of them leveled out at ~4.3 seconds per set.
Long story short, instead of trying to get threading working optimally, I just run a blob uploader instance for every 10,000 images, all on the one machine at the same time. Total performance boost?
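The fan-out described above can be sketched as a small launcher. The folder layout and the uploader executable name here are assumptions for illustration, not the poster's actual paths:

```csharp
using System.Diagnostics;
using System.IO;

class UploadLauncher
{
    static void Main()
    {
        // Hypothetical layout: one sub-folder per batch of 10,000 images.
        string root = @"C:\import\images";

        foreach (string folder in Directory.GetDirectories(root))
        {
            // "BlobUploader.exe" stands in for the uploader program above;
            // each process works through its own folder independently.
            Process.Start("BlobUploader.exe", "\"" + folder + "\"");
        }
    }
}
```

Separate processes sidestep any per-process connection or thread-tuning issues, at the cost of repeating startup overhead per batch.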
You should definitely upload in parallel in several streams (i.e. post multiple files concurrently), but before you run any experiment that (erroneously) appears to show there is no benefit, make sure you actually increase the value of ServicePointManager.DefaultConnectionLimit: with its default value of 2, you can have at most two outstanding HTTP requests against any destination.
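For example, the limit can be raised once at startup, before any request is issued (48 here is an arbitrary starting value to tune, not a figure from the answer):

```csharp
using System.Net;

class Program
{
    static void Main()
    {
        // Must run before the first HTTP request; ServicePoint objects
        // created earlier keep the old limit. The default of 2 silently
        // serializes anything beyond two concurrent uploads to the same
        // storage endpoint, no matter how many threads you start.
        ServicePointManager.DefaultConnectionLimit = 48;

        // ... kick off the parallel uploads here ...
    }
}
```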
As the files you're uploading are pretty small, I think the code you've written is probably about as efficient as you can get. Based on your comment it looks like you've already tried running these uploads in parallel, which was really the only other code suggestion I had.
I suspect that getting the greatest throughput will come down to finding the right number of threads for your hardware, your connection and your file size. You could try using the Azure Throughput Analyzer to make finding this balance easier.
Microsoft's Extreme Computing group has also published benchmarks and suggestions on improving throughput. They focus on throughput from worker roles deployed on Azure, but they will give you an idea of the best you could hope for.
You may want to increase ParallelOperationThreadCount. I haven't checked the latest SDK, but in 1.3 the limit was 64. Not setting this value resulted in lower concurrent operations.
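The answer's original snippet wasn't preserved; a minimal reconstruction, assuming the 1.x-era storage client library it refers to (the property moved in later SDKs, so treat this as illustrative):

```csharp
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient; // 1.x-era client library

class UploaderSetup
{
    static CloudBlobClient CreateClient(string connectionString)
    {
        CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
        CloudBlobClient blobClient = account.CreateCloudBlobClient();

        // 64 was the cap mentioned for SDK 1.3; leaving this unset
        // reportedly resulted in fewer concurrent operations.
        blobClient.ParallelOperationThreadCount = 64;

        return blobClient;
    }
}
```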
If the parallel method takes 5 times more to upload than the serial one, then you either
My command-line util gets quite a boost when running in parallel, even though I don't use memory streams or any other nifty stuff like that; I simply generate a string array of the filenames, then upload them with Parallel.ForEach.
Additionally, the Properties.ContentType call probably sets you back quite a bit. Personally, I never use it, and I guess it shouldn't even matter unless you want to view the blobs right in the browser via direct URLs.
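A minimal sketch of that approach, assuming the 1.x client library (where CloudBlob exposed an UploadFile overload that reads a local path directly):

```csharp
using System.IO;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.StorageClient; // 1.x-era client library

class ParallelUploader
{
    static void UploadFolder(CloudBlobContainer container, string folder)
    {
        // Plain string array of paths, uploaded concurrently; no memory
        // streams involved -- the client reads each file itself.
        string[] files = Directory.GetFiles(folder);

        Parallel.ForEach(files, path =>
        {
            CloudBlob blob = container.GetBlobReference(Path.GetFileName(path));
            blob.UploadFile(path);
        });
    }
}
```

Parallel.ForEach picks its own degree of parallelism; combined with a raised connection limit, this keeps several uploads in flight without hand-managing threads.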
You could always try the async methods of uploading: http://msdn.microsoft.com/en-us/library/windowsazure/ee772907.aspx
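A sketch of the Begin/End (APM) pattern those methods follow in the 1.x client; exact overloads may differ by SDK version:

```csharp
using System.IO;
using Microsoft.WindowsAzure.StorageClient; // 1.x-era client library

class AsyncUploader
{
    static void BeginUpload(CloudBlobContainer container, string path)
    {
        // Opened without 'using': the stream must stay alive until the
        // callback completes the asynchronous upload.
        FileStream fs = File.OpenRead(path);
        CloudBlob blob = container.GetBlobReference(Path.GetFileName(path));

        blob.BeginUploadFromStream(fs, ar =>
        {
            // End* must be called to complete the operation and surface errors.
            blob.EndUploadFromStream(ar);
            fs.Dispose();
        }, null);
    }
}
```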