Querying a database with Python pyodbc and exporting the result to an HDF5 file (memory error)

Posted on 2025-02-02 21:20:59


I've recently been working on a dataset with over 50 million rows and 40 columns. I used pyodbc and read in the data in chunks, which took almost 40 minutes. My team members use R (the RODBC package) to read from MSSQL and export to an fst file. For future use, they can then just read that fst file back in (the fst package for R provides a fast, easy and flexible way to serialize data frames).
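
For context, here is a minimal sketch of the chunked read, assuming a SQL Server ODBC driver; the connection string, table name, and chunk size are placeholders, not my actual values:

import pandas as pd
import pyodbc

# Placeholder connection string; substitute your own server and database
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
)

# read_sql with chunksize returns an iterator of DataFrames
chunks = []
for chunk in pd.read_sql("SELECT * FROM my_table", conn, chunksize=1_000_000):
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)  # the full frame still ends up in memory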

However, I don't think Python works with fst files. So, after reading in the data with pyodbc, I tried to export the result to an HDF5 file using df.to_hdf('data.h5', ".\input"), but ended up getting a memory error.
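
For completeness, an HDF5 write can also be done incrementally; a minimal sketch, reusing the hypothetical conn and query from above, which appends each chunk in 'table' format instead of materializing the full DataFrame first:

import pandas as pd

# Append chunk by chunk so the full 50M-row frame never sits in memory at once.
# Table format requires consistent column dtypes across chunks.
with pd.HDFStore("data.h5", mode="w") as store:
    for chunk in pd.read_sql("SELECT * FROM my_table", conn, chunksize=1_000_000):
        store.append("input", chunk)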

Is there any workaround for this kind of issue? Are there any fst-equivalent file types that I can use in Python?


Comments (1)

诗笺 2025-02-09 21:20:59


For people who are having the same issue: I discovered the dask package, which handles big data well and has almost identical syntax to a pandas DataFrame. You can read more about dask in its documentation (https://docs.dask.org/en/stable/).

These are the scripts I used:

import dask.dataframe as dd

# Convert the pandas DataFrame to a dask DataFrame with 3 partitions
df_dask = dd.from_pandas(df, npartitions=3)
print(df_dask.divisions)  # index values at the partition boundaries

# Save as a Parquet file for future use (faster read-in)
df_dask.to_parquet('./input/data pull.parq', schema="infer")

# Re-read the Parquet partitions as a dask DataFrame.
# One of the biggest advantages of the Parquet format is its low storage consumption.
df = dd.read_parquet("./input/data pull.parq/", engine="fastparquet",
                     ignore_metadata_file=True)
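
A usage note (my addition, assuming the df from the script above): dask is lazy, so read_parquet returns almost instantly and work only happens when a result is actually requested:

# read_parquet is lazy: head() touches only the first partition,
# while len() scans all partitions to count rows
print(df.head())
print(len(df))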

Dask Installation: https://docs.dask.org/en/stable/install.html
