How to exclude columns when exporting App Engine data
I'm planning to do some data mining on my Django app, which uses App Engine for storing data. However, one of my tables stores images in two of its columns, so it is gigabytes in size and far too slow to download every time I want to analyse new data. For data mining, I only care about the plain text columns in that table. How do I exclude the image columns while exporting data to a CSV file?
I'm aware that the csv connector in bulkloader.yaml has a "column_list" option that lets you include only certain columns when exporting data, but it looks like it still downloads the entire table row and only filters out the columns when it converts App Engine's intermediate sqlite3 data file to CSV.
For reference, I'm using the method described here to download my data: http://code.google.com/appengine/docs/python/tools/uploadingdata.html. I'm open to other solutions, preferably ones that let me automate this data export every few days.
Comments (3)
You can't. The App Engine datastore API, and the underlying GQL, only do two sorts of SELECT queries: __key__ only, and all fields. There's no way of getting a subset of fields.
As you've observed, the bulkloader downloads the entire record using remote_api, then outputs only the fields you care about to the CSV. If you want to download only selected fields, you'll have to write your own code to do this on the server side, possibly by using the new Files API in a mapreduce to write a file you can then download.
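The server-side filtering step described above can be sketched in plain Python. This is only an illustration under stated assumptions: entities are represented as ordinary dicts (in a real handler they would come from a datastore query), the output is returned as a string rather than written through the Files API, and the function name, field names, and sample data are all hypothetical.

```python
import csv
import io


def export_selected_fields(entities, fields):
    """Serialize only the named fields of each entity to CSV text.

    `entities` is any iterable of dict-like records; fields that are not
    listed (for example large image blobs) never reach the output.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for entity in entities:
        # Copy only the wanted fields; a missing field becomes an empty cell.
        writer.writerow({name: entity.get(name, "") for name in fields})
    return buf.getvalue()


# Hypothetical sample records: two text fields plus an image blob.
sample = [
    {"title": "first", "note": "plain text", "photo": b"\x89PNG..."},
    {"title": "second", "note": "more text", "photo": b"\x89PNG..."},
]
csv_text = export_selected_fields(sample, ["title", "note"])
```

Because the filtering happens before anything is serialized, only the small text fields would need to be transferred to the client, which is the point of doing the work on the server.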
Kind of late here, but all I did in a similar situation was delete the unwanted property from the automatically generated bulkloader.yaml file.
Here is an example, based on the Google documentation, that excludes the "account" property from the CSV file. I use it for things like blobs and it works fine there too:
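A sketch of what such an edit might look like, as a fragment of a generated bulkloader.yaml with the python_preamble omitted. The kind name Visit, the visitdate property, and the transforms shown are illustrative assumptions, not the poster's actual file. Removing (or commenting out) the account entry in property_map leaves that property out of the exported CSV:

```yaml
transformers:
- kind: Visit              # hypothetical kind name
  connector: csv
  property_map:
    - property: __key__
      external_name: key
      export_transform: transform.key_id_or_name_as_string

    # Entry removed so that "account" is not written to the CSV:
    # - property: account
    #   external_name: account

    - property: visitdate  # hypothetical property that is kept
      external_name: visitdate
      export_transform: transform.export_date_time('%Y-%m-%dT%H:%M:%S')
```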