Read-only binary flat-file storage options with Python
I have been tasked with setting up a flat-file SKU database for use on embedded devices with limited storage and processor speed.
Basically the data I need to store consists of the following:
SKU
Description
Location
Price
Qty
The file will consist of several million records.
The most important considerations are storage space and retrieval time. Records will only need to be retrieved by SKU and it will be read-only, so the file can be sorted by SKU.
I would like to access this data with Python. So my question comes down to this:
Are there existing Python libraries that can provide this functionality for me, or do I need to roll my own?
If the answer comes down to rolling my own, does anyone have any suggestions or good references for doing so?
How about SQLite with Python bindings? It has a little more than you need, but it's standard software and well-tested.
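A minimal sketch with the standard-library sqlite3 module (the table and column names are just for illustration); the PRIMARY KEY on sku gives you an indexed lookup:

    import sqlite3

    conn = sqlite3.connect("skus.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS sku (
        sku TEXT PRIMARY KEY,   -- PRIMARY KEY builds the index for SKU lookups
        description TEXT,
        location TEXT,
        price REAL,
        qty INTEGER)""")

    # Bulk-load once, off-device; on the device the file is only queried.
    row = conn.execute(
        "SELECT description, location, price, qty FROM sku WHERE sku = ?",
        ("ABC123",)).fetchone()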
The old way would be to use a simple key/value data table like gdbm module. Python comes with support for that, but it's not built into the default Python installation on my machine.
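A quick sketch of that approach, assuming the GNU dbm bindings are available in your build (Python 3 spells the module dbm.gnu); the value packing is illustrative:

    import dbm.gnu as gdbm

    # Build once: keys are SKUs, values are the packed remaining fields.
    db = gdbm.open("skus.gdbm", "c")
    db[b"ABC123"] = b"Widget|Aisle 4|9.99|12"
    db.close()

    # On the device, open read-only and split the value back apart.
    db = gdbm.open("skus.gdbm", "r")
    description, location, price, qty = db[b"ABC123"].split(b"|")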
In general, use SQLite. As others wrote, it comes standard with Python, and it's used in a lot of embedded systems already.
If the records are fixed length then you can use the bisect module. The file size / the record size gives the number of records in the file. The bisect search will do an O(log(n)) lookup in the file, and you'll need to write an adapter to test for equality. While I haven't tested it, here's a sketch:
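Something along those lines, untested as noted; the 64-byte record and the 12-byte SKU field at the front of each record are assumptions for illustration:

    import bisect
    import os

    RECORD_SIZE = 64   # assumed fixed record length
    SKU_SIZE = 12      # assumed: SKU is the first 12 bytes of each record

    class SkuIndex:
        """Present the sorted SKU column of the file as a sequence,
        so the stock bisect module can binary-search it directly."""
        def __init__(self, f):
            self.f = f
            self.n = os.fstat(f.fileno()).st_size // RECORD_SIZE

        def __len__(self):
            return self.n

        def __getitem__(self, i):
            self.f.seek(i * RECORD_SIZE)
            return self.f.read(SKU_SIZE)

    def lookup(f, sku):
        """Return the full record for `sku` (padded to SKU_SIZE), or None."""
        index = SkuIndex(f)
        i = bisect.bisect_left(index, sku)
        if i < len(index) and index[i] == sku:
            f.seek(i * RECORD_SIZE)
            return f.read(RECORD_SIZE)
        return None

    # with open("skus.dat", "rb") as f:
    #     record = lookup(f, b"ABC123".ljust(SKU_SIZE))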
You could additionally gzip the file and seek on a gzip'ped file, but that's a tradeoff for space vs. time that you'll have to test.
May I suggest cdb? (Python bindings: python-cdb.)
It's a format used for read-only data, like you have; it's basically 256 giant hash tables, each able to have a different number of buckets. The cool thing about cdb is that the file doesn't need to be loaded into memory; it's structured in a way that you can do lookups by just mmap-ing in the bits you need. The cdb spec is a good read, not least because the lines are formatted to create a uniform right margin. :-D
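A sketch of what that looks like, assuming the python-cdb module's cdb.cdbmake/cdb.init interface (check the bindings you install, as the API has varied):

    import cdb

    # Build the .cdb once, off-device (cdbmake writes to a temp file,
    # then renames it into place on finish()).
    maker = cdb.cdbmake("skus.cdb", "skus.cdb.tmp")
    maker.add("ABC123", "Widget|Aisle 4|9.99|12")   # SKU -> packed record
    maker.finish()

    # Lookup on the device; the file is mmap'ed rather than read in.
    db = cdb.init("skus.cdb")
    record = db.get("ABC123")    # None if the SKU is absent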
How about HDF? If you don't need SQL and require fast access to your data, there's nothing faster... in Python... for numerical or structured data.
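For instance, with h5py and a NumPy structured array (the field widths here are assumptions), a dataset sorted by SKU can be binary-searched with searchsorted:

    import h5py
    import numpy as np

    # Assumed field widths; records must be written sorted by SKU.
    dt = np.dtype([("sku", "S12"), ("desc", "S40"), ("loc", "S20"),
                   ("price", "<f4"), ("qty", "<u4")])
    recs = np.array([(b"ABC123", b"Widget", b"Aisle 4", 9.99, 12)], dtype=dt)
    with h5py.File("skus.h5", "w") as f:
        f.create_dataset("records", data=recs)

    with h5py.File("skus.h5", "r") as f:
        ds = f["records"]
        skus = ds["sku"]                      # field selection: SKU column only
        i = np.searchsorted(skus, b"ABC123")
        if i < len(skus) and skus[i] == b"ABC123":
            record = ds[i]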
Take a look at the DatabaseInterfaces section on the Python wiki. It's comprehensive. There are a couple of "pure" Python options listed (like SnakeSQL), which are a tad nicer to deploy. And, of course, there's always Berkeley DB and the like, which are super lean & raw.
Honestly, SQLite will probably work fine for you. If you really need to eke out more performance, then you'd be looking at a record-based format like BDB.
A simple solution is cPickle. You can also find similar questions on SO.
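For completeness, a sketch (note that a pickled dict has to be loaded into memory in full, which may not suit millions of records on a small device):

    import pickle   # the Python 2 name was cPickle

    # Build once: a dict keyed by SKU.
    with open("skus.pkl", "wb") as f:
        pickle.dump({"ABC123": ("Widget", "Aisle 4", 9.99, 12)}, f,
                    pickle.HIGHEST_PROTOCOL)

    # Load the whole dict, then look up.
    with open("skus.pkl", "rb") as f:
        db = pickle.load(f)
    record = db.get("ABC123")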
A variation of Andrew Dalke's answer (so you can still use binary search to locate the SKU quickly) that may reduce the space requirements would be to have fixed-size records at the start of the file (one per SKU) and then all the Descriptions and Locations (as null-terminated strings, say).
You save space by not having to pad the Descriptions and Locations out to a fixed length, and you save more if there are lots of duplicate Locations.
Here is an example: say you have a fixed-size index entry per SKU (holding the price, quantity, and the offsets of its two strings) followed by a pool of null-terminated strings.
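A minimal sketch under those assumptions (the 12-byte SKU width and the struct layout are illustrative):

    import struct

    # Index entry: 12-byte SKU, price, qty, then the offsets of the
    # description and location strings within the pool.
    ENTRY = struct.Struct("<12sfIII")
    HEADER = struct.Struct("<I")     # record count

    def build(records, path):
        """records: (sku, description, location, price, qty) tuples,
        already sorted by SKU."""
        pool, entries, seen = bytearray(), [], {}

        def intern(text):            # dedupe repeated strings (e.g. Locations)
            data = text.encode("utf-8") + b"\0"
            if data not in seen:
                seen[data] = len(pool)
                pool.extend(data)
            return seen[data]

        for sku, desc, loc, price, qty in records:
            entries.append(ENTRY.pack(sku.encode("ascii").ljust(12),
                                      price, qty, intern(desc), intern(loc)))
        with open(path, "wb") as f:
            f.write(HEADER.pack(len(entries)))
            f.writelines(entries)
            f.write(pool)

    def lookup(path, sku):
        key = sku.encode("ascii").ljust(12)
        with open(path, "rb") as f:
            n = HEADER.unpack(f.read(HEADER.size))[0]
            pool_start = HEADER.size + n * ENTRY.size

            def sku_at(i):
                f.seek(HEADER.size + i * ENTRY.size)
                return f.read(12)

            # Plain binary search over the fixed-size index entries.
            lo, hi = 0, n
            while lo < hi:
                mid = (lo + hi) // 2
                if sku_at(mid) < key:
                    lo = mid + 1
                else:
                    hi = mid
            if lo == n or sku_at(lo) != key:
                return None

            f.seek(HEADER.size + lo * ENTRY.size)
            _, price, qty, d_off, l_off = ENTRY.unpack(f.read(ENTRY.size))

            def cstring(offset):     # read a null-terminated string
                f.seek(pool_start + offset)
                raw = bytearray()
                byte = f.read(1)
                while byte not in (b"", b"\0"):
                    raw.extend(byte)
                    byte = f.read(1)
                return raw.decode("utf-8")

            return sku, cstring(d_off), cstring(l_off), price, qty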