Improving SimpleDB query performance with boto
I am trying to use SimpleDB in the following way.
I want to keep 48 hours' worth of data in SimpleDB at any time and query it for different purposes.
Each domain holds 1 hour's worth of data, so at any time there are 48 domains present in SimpleDB.
As new data is constantly uploaded, I delete the oldest domain and create a new domain for each new hour.
Each domain is about 50MB in size; the total size of all the domains is around 2.2 GB.
Each item in a domain has the following attributes:
identifier -- around 50 characters long -- 1 per item
timestamp -- timestamp value -- 1 per item
serial_n_data -- 500-1000 bytes of data -- 200 per item
I'm using the Python boto library to upload and query the data.
I send 1 item per second, each with around 200 attributes, into the current domain.
For one application of this data, I need to get all the data from all 48 domains.
The query looks like "SELECT * FROM domain", for each of the domains.
I use 8 threads to query the data, with each thread responsible for a few domains:
e.g. domains 1-6: thread 1
domains 7-12: thread 2, and so on
It takes close to 13 minutes to get the entire data set. I am using boto's select method for this. I need much faster performance than this. Any suggestions for speeding up the querying process? Is there another language I could use that would speed things up?
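For illustration, here is a minimal sketch of the kind of threaded fan-out described above, assuming boto 2.x's SimpleDB API; the log_hour_NN domain names are hypothetical:

    # A sketch of the current approach: 8 threads, each doing full-domain
    # Selects over a share of the 48 domains (boto 2.x; names hypothetical).
    import threading
    import boto

    def dump(names, out):
        conn = boto.connect_sdb()              # one connection per thread
        for name in names:
            dom = conn.get_domain(name)
            # Domain.select() follows NextToken pages transparently, so the
            # ~50 pages of a 50MB domain are fetched one after another.
            out[name] = list(dom.select('SELECT * FROM `%s`' % name))

    domains = ['log_hour_%02d' % h for h in range(48)]
    results = {}
    threads = [threading.Thread(target=dump, args=(domains[i::8], results))
               for i in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()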
3 Answers
Use more threads
I would suggest inverting your threads/domain ratio from 1/6 to something closer to 30/1. Most of the time taken to pull down large chunks of data from SimpleDB is going to be spent waiting. In this situation upping the thread count will vastly improve your throughput.
One of the limits of SimpleDB is the query response size cap at 1MB. This means pulling down the 50MB in a single domain will take a minimum of 50 Selects (the original plus 49 additional pages). These must occur sequentially, because the NextToken from the current response is needed for the next request. If each Select takes 2+ seconds (not uncommon with large responses and high request volume), you spend 2 minutes on each domain. If every thread has to iterate through each of 6 domains in turn, that's about 12 minutes right there. One thread per domain should cut that down to about 2 minutes easily.
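To make the sequential dependency concrete, here is a sketch of manual paging using boto 2.x's SDBConnection.select (the domain name is hypothetical); each request cannot be issued until the previous response has supplied its NextToken:

    # Paging through one domain by hand: each Select needs the next_token
    # returned by the previous one, so pages cannot be fetched in parallel.
    import boto

    conn = boto.connect_sdb()
    token, pages = None, 0
    while True:
        rs = conn.select('log_hour_00', 'SELECT * FROM `log_hour_00`',
                         next_token=token)
        pages += 1
        token = rs.next_token
        if not token:
            break
    print('fetched %d pages' % pages)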
But you should be able to do much better than that. SimpleDB is optimized for concurrency. I would try 30 threads per domain, giving each thread a portion of the hour to query on, since it is log data after all. For example:
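    SELECT * FROM domain WHERE timestamp BETWEEN '12:00' AND '12:02'
    SELECT * FROM domain WHERE timestamp BETWEEN '12:02' AND '12:04'
    SELECT * FROM domain WHERE timestamp BETWEEN '12:04' AND '12:06'
    ...
    SELECT * FROM domain WHERE timestamp BETWEEN '12:58' AND '13:00'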
(Obviously, you'd use real timestamp values.) All 30 queries can be kicked off without waiting for any responses. In this way you still need to make at least 50 queries per domain, but instead of making them all sequentially you can get a lot more concurrency. You will have to test for yourself how many threads give you the best throughput. I would encourage you to try up to 60 per domain, breaking the Select conditions down to one-minute increments. If it works for you, then you will have fully parallel queries and most likely will have eliminated all follow-up pages. If you get 503 ServiceUnavailable errors, scale back the threads.
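A sketch of that fan-out, again assuming boto 2.x; the domain name and timestamp slice values below are placeholders:

    # 30 threads per domain, one per two-minute slice of the hour; all
    # Selects are issued concurrently instead of waiting on NextToken.
    import threading
    import boto

    def fetch_slice(domain_name, start, end, out, idx):
        conn = boto.connect_sdb()
        dom = conn.get_domain(domain_name)
        q = ("SELECT * FROM `%s` WHERE timestamp >= '%s' "
             "AND timestamp < '%s'" % (domain_name, start, end))
        out[idx] = list(dom.select(q))

    slices = ['12:%02d' % m for m in range(0, 62, 2)]  # placeholder values
    out = [None] * 30
    threads = [threading.Thread(target=fetch_slice,
                                args=('log_hour_12', slices[i],
                                      slices[i + 1], out, i))
               for i in range(30)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()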
The domain is the basic unit of scalability for SimpleDB, so it is good that you have a convenient way to partition your data. You just need to take advantage of the concurrency. Rather than 13 minutes, I wouldn't be surprised if you were able to get the data in 13 seconds for an app running on EC2 in the same region. But the actual time it takes will depend on a number of other factors.
Cost Concerns
As a side note, I should mention the costs of what you are doing, even though you haven't raised the issue. CreateDomain and DeleteDomain are heavyweight operations. Normally I wouldn't advise using them so often. You are charged about 25 seconds of box usage each time, so creating and deleting one each hour adds up to about $70 per month just for domain management. You can store orders of magnitude more data in a domain than the 50MB you mention. So you might want to let the data accumulate longer before you delete. If your queries include the timestamp (or could be made to include the timestamp), query performance may not be hurt at all by having an extra GB of old data in the domain. In any case, GetAttributes and PutAttributes never suffer a performance hit from a large domain size; it is only queries that don't make good use of a selective index that do. You'd have to test your queries to see. That is just a suggestion; I realize that the create/delete approach is cleaner conceptually.
Also, writing 200 attributes at a time is expensive, due to a quirk in the box usage formula. The box usage for writes is proportional to the number of attributes raised to the power of 3! The formula in hours is:

    0.0000219907 + 0.0000000002 * N^3

for the base charge plus the per-attribute charge, where N is the number of attributes. In your situation, if you write all 200 attributes in a single request, the box usage charges will be about $250 per million items ($470 per million if you write 256 attributes). If you break each request into 4 requests with 50 attributes each, you will quadruple your PutAttributes volume, but reduce the box usage charges by an order of magnitude, to about $28 per million items. If you are able to break the requests down, then it may be worth doing. If you cannot (due to request volume, or just the nature of your app), it means that SimpleDB can end up being extremely unappealing from a cost standpoint.
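A quick back-of-the-envelope check of those estimates, using the formula above and an assumed box-usage price of roughly $0.14 per hour (an assumption for illustration, not an official AWS quote; the output lands in the same ballpark as the figures quoted):

    # Reproducing the cost estimates above; the $0.14/hour box-usage price
    # is an assumed figure for illustration.
    def put_box_usage_hours(n):
        return 0.0000219907 + 0.0000000002 * n ** 3

    PRICE = 0.14  # assumed $ per box-usage hour
    for attrs, requests in [(200, 1), (256, 1), (50, 4)]:
        per_item = requests * put_box_usage_hours(attrs)
        print('%d attrs in %d request(s): ~$%.0f per million items'
              % (attrs, requests, per_item * 1e6 * PRICE))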
I have had the same issue as you, Charlie. After profiling the code, I narrowed the performance problem down to SSL. It seems like that is where it is spending most of its time, and hence most of its CPU cycles.
I have read about a problem in the httplib library (which boto uses for SSL) where performance doesn't improve unless the packets are over a certain size, though that was for Python 2.5 and may have already been fixed.
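One hypothetical way to check whether SSL is the bottleneck in your own setup (this experiment is my suggestion, not something the answer above prescribes): boto connections accept an is_secure flag, so the same query can be timed over HTTPS and plain HTTP. Only try this with non-sensitive data; the domain name below is made up.

    # Time the same Select over HTTPS and plain HTTP to isolate SSL cost.
    import time
    import boto

    for secure in (True, False):
        conn = boto.connect_sdb(is_secure=secure)
        dom = conn.get_domain('log_hour_00')   # hypothetical domain
        start = time.time()
        list(dom.select('SELECT * FROM `log_hour_00` LIMIT 2500'))
        print('is_secure=%s: %.1f s' % (secure, time.time() - start))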
SDB Explorer uses multithreaded BatchPutAttributes to achieve high write throughput while uploading bulk data to Amazon SimpleDB. SDB Explorer allows multiple parallel uploads. If you have the bandwidth, you can take full advantage of it by running a number of BatchPutAttributes operations at once in parallel queues, which will reduce the time spent in processing.
http://www.sdbexplorer.com/
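For comparison, a minimal sketch of the same idea in boto 2.x (SDB Explorer itself is a GUI tool; this just illustrates parallel BatchPutAttributes, and the domain name, item names, and data layout here are made up). BatchPutAttributes accepts up to 25 items per call:

    # Parallel BatchPutAttributes: split items into 25-item batches and
    # upload batches from several threads at once.
    import threading
    import boto

    def upload(batches):
        conn = boto.connect_sdb()
        dom = conn.get_domain('log_hour_00')       # hypothetical domain
        for batch in batches:                      # each batch: <= 25 items
            dom.batch_put_attributes(batch)

    items = dict(('item%06d' % i,
                  {'timestamp': '%010d' % i, 'data': 'x' * 500})
                 for i in range(1000))
    names = sorted(items)
    batches = [dict((n, items[n]) for n in names[i:i + 25])
               for i in range(0, len(names), 25)]
    threads = [threading.Thread(target=upload, args=(batches[j::4],))
               for j in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()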