App Engine bulk loader performance

Posted 2024-09-18 07:01:02

I am using the App Engine bulk loader (Python runtime) to bulk-upload entities to the datastore. The data I am uploading is stored in a proprietary format, so I have implemented my own connector (registered in bulkload_config.py) to convert it into an intermediate Python dictionary:

from google.appengine.ext.bulkload import connector_interface

class MyCustomConnector(connector_interface.ConnectorInterface):
   ....
   # Overridden method: reads the proprietary file and yields one
   # intermediate dictionary per record.
   def generate_import_record(self, filename, bulkload_state=None):
      ....
      yield my_custom_dict

To convert this neutral Python dictionary into a datastore entity, I use a custom post-import function defined in my YAML:

def feature_post_import(input_dict, entity_instance, bulkload_state):
    ....
    return [all_entities_to_put]

Note: I am not using entity_instance or bulkload_state in my feature_post_import function. I simply create new datastore entities (based on my input_dict) and return them.
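For illustration only, here is a minimal sketch of what such a post-import function could look like; the 'Feature' kind and the property names are placeholders I made up for this sketch, not values from the original post:

from google.appengine.api import datastore

def feature_post_import(input_dict, entity_instance, bulkload_state):
    # entity_instance and bulkload_state are ignored, as noted above.
    # 'Feature' and the property names are assumed for this sketch.
    entity = datastore.Entity('Feature')
    entity.update({
        'name': input_dict.get('name'),
        'value': float(input_dict.get('value', 0.0)),
    })
    # The bulk loader puts every entity in the returned list.
    return [entity]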

Now, everything works. However, the bulk load itself takes far too long: about 20 hours for 1 GB of data (~1,000,000 entities). How can I improve the performance of the bulk load process? Am I missing something?

Some of the parameters I use with appcfg.py are 10 threads, with a batch size of 10 entities per thread.
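For reference, an upload invocation along these lines would express those settings; the config file, data file, kind and application URL below are placeholders rather than values from the post:

appcfg.py upload_data \
  --config_file=bulkloader.yaml \
  --filename=data.dat \
  --kind=Feature \
  --url=http://myapp.appspot.com/_ah/remote_api \
  --num_threads=10 \
  --batch_size=10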

Related Google App Engine Python group post: http://groups.google.com/group/google-appengine-python/browse_thread/thread/4c8def071a86c840

Update: to test the performance of the bulk load process, I loaded entities of a 'Test' kind. Even though this entity has only a very simple FloatProperty, it still took me the same amount of time to bulk load those entities.
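A kind like that can be as small as a single float field; a minimal sketch using the old db API, with an assumed property name:

from google.appengine.ext import db

class Test(db.Model):
    # One float per entity; 'value' is an assumed property name.
    value = db.FloatProperty()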

I am still going to try varying the bulk loader parameters rps_limit, bandwidth_limit and http_limit to see whether I can get more throughput.

Comments (1)

懒猫 2024-09-25 07:01:02

There is a parameter called rps_limit that determines the number of entities to upload per second. This was the major bottleneck. The default value is 20.

Also increase bandwidth_limit to something reasonable.

I increased rps_limit to 500 and everything improved. I achieved 5.5 to 6 seconds per 1,000 entities, which is a major improvement over 50 seconds per 1,000 entities.
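For context, 50 seconds per 1,000 entities is roughly 20 entities per second, which matches the default rps_limit, so the client-side throttle rather than the datastore was the limiting factor. Applied to an upload command (placeholder names; only the flag values quoted in this answer come from it), the change amounts to raising the throttle flags:

appcfg.py upload_data \
  --config_file=bulkloader.yaml \
  --filename=data.dat \
  --kind=Feature \
  --url=http://myapp.appspot.com/_ah/remote_api \
  --num_threads=10 \
  --batch_size=10 \
  --rps_limit=500
# --bandwidth_limit (and --http_limit) can be raised the same way;
# the answer does not give specific values for them.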
