Querying a database with Python pyodbc and exporting the result to an HDF5 file (memory error)

Posted on 2025-02-02 21:20:59


I've recently been working on a dataset with over 50 million rows and 40 columns. I used pyodbc and read in the data in chunks, which took almost 40 minutes. My team members use R (the RODBC package) to read from MSSQL and export to an fst file. For future use, they can then just read that fst file back in (the fst package for R provides a fast, easy and flexible way to serialize data frames).
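
For context, here is a minimal sketch of the chunked read, assuming a SQL Server ODBC driver; the connection string, table name, and chunk size are placeholders, not my actual values:

import pandas as pd
import pyodbc

# Placeholder connection string; substitute your own server and database
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
)

# read_sql with chunksize returns an iterator of DataFrames
chunks = []
for chunk in pd.read_sql("SELECT * FROM my_table", conn, chunksize=1_000_000):
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)  # the full frame still ends up in memory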

However, I don't think Python works with fst files. So, after reading in the data with pyodbc, I tried to export the result to an HDF5 file using df.to_hdf('data.h5', ".\input"), but ended up getting a memory error.
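
For completeness, an HDF5 write can also be done incrementally; a minimal sketch, reusing the hypothetical conn and query from above, which appends each chunk in 'table' format instead of materializing the full DataFrame first:

import pandas as pd

# Append chunk by chunk so the full 50M-row frame never sits in memory at once.
# Table format requires consistent column dtypes across chunks.
with pd.HDFStore("data.h5", mode="w") as store:
    for chunk in pd.read_sql("SELECT * FROM my_table", conn, chunksize=1_000_000):
        store.append("input", chunk)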

Is there any workaround for this kind of issue? Are there any fst-equivalent file types that I can use in Python?


Comments (1)

诗笺 2025-02-09 21:20:59


For people who are having the same issue: I discovered the dask package, which handles big data well and has almost identical syntax to a pandas DataFrame. You can read more about dask in its documentation (https://docs.dask.org/en/stable/).

These are the scripts I used:

import dask.dataframe as dd

# Convert the pandas DataFrame to a dask DataFrame with 3 partitions
df_dask = dd.from_pandas(df, npartitions=3)
print(df_dask.divisions)  # index values at the partition boundaries

# Save as a Parquet file for future use (faster read-in)
df_dask.to_parquet('./input/data pull.parq', schema="infer")

# Re-read the Parquet partitions as a dask DataFrame.
# One of the biggest advantages of the Parquet format is its low storage consumption.
df = dd.read_parquet("./input/data pull.parq/", engine="fastparquet",
                     ignore_metadata_file=True)
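
A usage note (my addition, assuming the df from the script above): dask is lazy, so read_parquet returns almost instantly and work only happens when a result is actually requested:

# read_parquet is lazy: head() touches only the first partition,
# while len() scans all partitions to count rows
print(df.head())
print(len(df))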

Dask Installation: https://docs.dask.org/en/stable/install.html
