Exporting from/importing to numpy, scipy in SQLite and HDF5 formats
There seem to be many choices for Python to interface with SQLite (sqlite3, atpy) and HDF5 (h5py, pyTables). I wonder if anyone has experience using these together with numpy arrays or data tables (structured/record arrays), and which of them integrate most seamlessly with the "scientific" modules (numpy, scipy) for each data format (SQLite and HDF5).
Comments (1)
Most of it depends on your use case.
I have a lot more experience dealing with the various HDF5-based methods than with traditional relational databases, so I can't comment too much on SQLite libraries for Python...
At least as far as h5py vs pyTables goes, they both offer very seamless access via numpy arrays, but they're oriented towards very different use cases.
If you have n-dimensional data that you want to quickly access an arbitrary index-based slice of, then it's much simpler to use h5py. If you have data that's more table-like, and you want to query it, then pyTables is a much better option.
h5py is a relatively "vanilla" wrapper around the HDF5 libraries compared to pyTables. This is a very good thing if you're going to be regularly accessing your HDF file from another language (pyTables adds some extra metadata). h5py can do a lot, but for some use cases (e.g. what pyTables does) you're going to need to spend more time tweaking things.
pyTables has some really nice features. However, if your data doesn't look much like a table, then it's probably not the best option.
To give a more concrete example, I work a lot with fairly large (tens of GB) 3 and 4 dimensional arrays of data. They're homogeneous arrays of floats, ints, uint8s, etc. I usually want to access a small subset of the entire dataset.
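As a rough illustration of that access pattern, here is a minimal h5py sketch. The file name, the dataset name "volume", and the array shape are all made up for the example (and far smaller than tens of GB); it only shows a chunked dataset being written and a small sub-block being read back as a numpy array.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file and dataset names; the array is kept tiny so the sketch runs quickly.
path = os.path.join(tempfile.mkdtemp(), "volume.h5")
data = np.arange(20 * 30 * 40, dtype=np.float32).reshape(20, 30, 40)

with h5py.File(path, "w") as f:
    # chunks=True asks h5py to pick a chunk shape for the dataset.
    dset = f.create_dataset("volume", data=data, chunks=True)
    chunk_shape = dset.chunks

with h5py.File(path, "r") as f:
    dset = f["volume"]
    # Slicing the dataset reads only the requested region and returns a numpy array.
    block = dset[5:10, 10:20, ::4]

print(chunk_shape, block.shape)
```

The slice syntax is the same as for an in-memory numpy array, which is most of what "seamless" means here; only the selected region is loaded.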
h5py makes this very simple, and does a fairly good job of auto-guessing a reasonable chunk size. Grabbing an arbitrary chunk or slice from disk is much, much faster than for a simple memmapped file. (Emphasis on arbitrary... Obviously, if you want to grab an entire "X" slice, then a C-ordered memmapped array is impossible to beat, as all the data in an "X" slice are adjacent on disk.)
As a counter example, my wife collects data from a wide array of sensors that sample at minute to second intervals over several years. She needs to store and run arbitrary queries (and relatively simple calculations) on her data. pyTables makes this use case very easy and fast, and still has some advantages over traditional relational databases. (Particularly in terms of disk usage and the speed at which a large (index-based) chunk of data can be read into memory.)
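For the table-like case, a small pyTables sketch might look like the following. The column names (time, sensor, value), the table name "readings", and the synthetic data are assumptions for illustration only, not the actual sensor setup described above; the point is just that a numpy structured array goes in and a query returns one.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical sensor log: timestamp in seconds, sensor id, and a reading.
dtype = np.dtype([("time", "f8"), ("sensor", "i4"), ("value", "f8")])
records = np.zeros(1000, dtype=dtype)
records["time"] = np.arange(1000) * 60.0        # one sample per minute
records["sensor"] = np.arange(1000) % 4         # four made-up sensors
records["value"] = np.linspace(0.0, 50.0, 1000)

path = os.path.join(tempfile.mkdtemp(), "sensors.h5")

with tables.open_file(path, mode="w") as h5:
    # A structured array passed as obj= defines the table layout and its rows.
    h5.create_table("/", "readings", obj=records)

with tables.open_file(path, mode="r") as h5:
    table = h5.root.readings
    # The condition string is evaluated against the table's columns;
    # the matching rows come back as a numpy structured array.
    hits = table.read_where("(sensor == 2) & (value > 40.0)")

print(hits.dtype.names, len(hits))
```

The result can be used directly with numpy or scipy, e.g. hits["value"].mean() for a simple calculation over the selected rows.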