PyTables vs. SQLite3 insertion speed
I bought Kibot's stock data and it is enormous. I have about 125,000,000 rows to load (1,000 stocks * 125k rows/stock [1-minute bar data since 2010-01-01]), with each stock in its own CSV file whose fields are Date, Time, Open, High, Low, Close, Volume. I'm totally new to Python (I chose it because it's free and well supported by a community) and I chose SQLite to store the data because of Python's built-in support for it. (And I know the SQL language very well; SQLiteStudio is a gem of a free program.)
My loader program is working well, but is getting slower. The SQLite db is about 6 GB and it's only halfway loaded. I'm getting about 500k rows/hour loaded using INSERT statements, committing the transaction after each stock (approximately 125k rows).
So here's the question: is PyTables substantially faster than SQLite, making the effort to learn how to use it worth it? (And since I'm in learning mode, feel free to suggest alternatives to these two.) One thing that bothers me about PyTables is that the free version is really bare bones, almost like saving a binary file: no "where clause" functions or indexing, so you wind up scanning for the rows you need.
After I get the data loaded, I'm going to be doing statistical analysis (rolling regression & correlation, etc.) using something based on NumPy: Timeseries, larry, pandas, or a scikit. I haven't chosen the analysis package yet, so if you have a recommendation, and that recommendation is best used with either PyTables or pandas (or whatever), please factor that into your response.
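
For concreteness, here is the sort of computation I mean: a minimal sketch of a rolling correlation using the modern pandas rolling API (the db file name, tickers, and window length are all illustrative):

import sqlite3
import pandas as pd

conn = sqlite3.connect("bars.db")  # the SQLite db described above (name illustrative)
closes = pd.read_sql("select SopDate, Ticker, Close from MinuteBarPrices",
                     conn, index_col="SopDate")
closes = closes.pivot(columns="Ticker", values="Close")  # one column per ticker

# 390 one-minute bars is roughly one trading day
roll_corr = closes["IBM"].rolling(window=390).corr(closes["MSFT"])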
(For @John)
Python 2.6;
Windows XP SP3 32-bit;
Manufactured strings used as INSERT statements;
Memory usage is rock solid at 750M of the 2G physical memory;
CPU usage is 10% +/- 5%;
Totally I/O-bound (the disk is always crunching).
DB schema:
create table MinuteBarPrices (
SopDate smalldatetime not null,
Ticker char( 5 ) not null,
Open real,
High real,
Low real,
Close real not null,
Volume int,
primary key ( SopDate, Ticker )
);
create unique index MinuteBarPrices_IE1 on MinuteBarPrices (
Ticker,
SopDate
);
2 Answers
Back in 2003, a scientific paper comparing PyTables and SQLite was written by Francesc Alted, the author of PyTables. It shows that PyTables is usually faster, but not always.
On your point that PyTables feels "bare bones": I would say h5py is the bare-bones way of accessing HDF5 from Python, while PyTables brings in all kinds of extra machinery, such as querying and indexing, which HDF5 doesn't natively have.
Example of querying:
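
(A minimal sketch of PyTables' in-kernel queries; the file name, node name, and column names are illustrative, assuming a Table holding the bar data from the question:)

import tables as tb

h5 = tb.open_file("bars.h5", mode="r")
table = h5.root.minute_bars  # a tables.Table node

# In-kernel query: the condition is evaluated inside PyTables (via numexpr),
# so matching rows are selected without pulling every row into Python.
result = [(row["sop_date"], row["close"])
          for row in table.where("(ticker == b'IBM') & (close > 100.0)")]

h5.close()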
Note that PyTables Pro, which has even fancier options, has just ceased to exist as a separate product; the Pro features will be free from now on. This means yet more options to play with.
Suggestions:
You have 1 GB of memory that's not being used. Try the cache_size pragma (see the SQLite docs). Other pragmas of interest: synchronous and page_size ... though it may be too late for the latter, since page_size only takes effect before the tables are created. A sketch of all three follows below. Also, sometimes it is faster to load the base table without any indexes, then create the indexes afterwards.
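
(A sketch of those pragmas, issued before the bulk load; the cache_size value assumes the 1 KiB default page size of SQLite builds of that era, so check your own page size first:)

import sqlite3

conn = sqlite3.connect("bars.db")
# cache_size is measured in pages; with 1 KiB pages this is roughly 500 MB.
conn.execute("PRAGMA cache_size = 500000")
# OFF trades crash-safety for speed; acceptable for a re-runnable bulk load.
conn.execute("PRAGMA synchronous = OFF")
# Only takes effect before the first table is created (hence "too late").
conn.execute("PRAGMA page_size = 4096")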
"Manufactured strings used as INSERT statements" or any other SQL statements is a bad idea, both speed wise and security wise (google("SQL injection attack")). Get out of the habit now. Use parameterised SQL statements.