Querying a database with Python pyodbc and exporting the result to an HDF5 file (memory error)
I've recently been working with a dataset of over 50 million rows and 40 columns. I used pyodbc and read the data in chunks, which took almost 40 minutes. My team members use R (the RODBC package) to read from MSSQL and export to an fst file. Then, for future use, they can just read that fst file back in (the fst package for R provides a fast, easy and flexible way to serialize data frames).
However, I don't think Python works with fst files. So, after reading the data in with pyodbc, I tried to export the result to an h5 file using df.to_hdf('data.h5', ".\input"), but ended up getting a memory error.
Is there any workaround for this kind of issue? Are there any fst-equivalent file types that I can use in Python?
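One common way around the memory error is to never materialize the full result at all: stream the query in chunks with pandas' `chunksize` and append each chunk to the HDF5 file using `format='table', append=True`. Below is a minimal sketch of that pattern; it uses an in-memory SQLite table as a stand-in for the MSSQL source (the table name, column names, and output path are made up for illustration, and writing HDF5 requires the PyTables package):

```python
import os
import sqlite3
import tempfile

import pandas as pd

# Stand-in for the real MSSQL connection; any DB-API connection works the same way.
conn = sqlite3.connect(":memory:")
pd.DataFrame({"id": range(100), "val": range(100)}).to_sql(
    "big_table", conn, index=False
)

out_path = os.path.join(tempfile.mkdtemp(), "data.h5")

# Stream the query result in chunks instead of loading all rows at once,
# appending each chunk to the HDF5 store ('table' format supports appends),
# so peak memory stays around one chunk rather than the whole result set.
for chunk in pd.read_sql("SELECT * FROM big_table", conn, chunksize=25):
    chunk.to_hdf(out_path, key="input", format="table", append=True)

result = pd.read_hdf(out_path, key="input")
print(len(result))
```

With a real 50-million-row table you would raise `chunksize` to something like a few hundred thousand rows and point the connection at MSSQL instead of SQLite; the append loop itself is unchanged.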
Comments (1)
For people who are having the same issue: I discovered the dask package, which handles big data well and has almost identical syntax to a pandas DataFrame. Read more about dask here.
These are the scripts I used:
Dask installation: https://docs.dask.org/en/stable/install.html