Collecting, storing, and retrieving large amounts of numeric data

Posted 2024-09-30 11:55:48

I am about to start collecting large amounts of numeric data in real-time (for those interested, the bid/ask/last or 'tape' for various stocks and futures). The data will later be retrieved for analysis and simulation. That's not hard at all, but I would like to do it efficiently and that brings up a lot of questions. I don't need the best solution (and there are probably many 'bests' depending on the metric, anyway). I would just like a solution that a computer scientist would approve of. (Or not laugh at?)

(1) Optimize for disk space, I/O speed, or memory?

For simulation, the overall speed is important. We want the I/O (really, I) speed of the data just faster than the computational engine, so we are not I/O limited.

(2) Store text, or something else (binary numeric)?

(3) Given a set of choices from (1)-(2), are there any standout language/library combinations to do the job-- Java, Python, C++, or something else?

I would classify this code as "write and forget", so more points for efficiency over clarity/compactness of code. I would very, very much like to stick with Python for the simulation code (because the sims do change a lot and need to be clear). So bonus points for good Pythonic solutions.

Edit: this is for a Linux system (Ubuntu)

Thanks

Comments (6)

橪书 2024-10-07 11:55:48
  1. Optimizing for disk space and IO speed is the same thing - these days, CPUs are so fast compared to IO that it's often overall faster to compress data before storing it (you may actually want to do that). I don't really see memory playing a big role (though you should probably use a reasonably-sized buffer to ensure you're doing sequential writes).

  2. Binary is more compact (and thus faster). Given the amount of data, I doubt whether being human-readable has any value. The only advantage of a text format would be that it's easier to figure out and correct if it gets corrupted or you lose the parsing code.
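
As a rough illustration of both points, here is a minimal Python sketch that packs ticks into fixed-width binary records and writes them through gzip. The record layout (timestamp, symbol id, tick type, price, size) and the helper names `write_ticks` and `read_ticks` are assumptions made for the example, not something from the question.

```python
import gzip
import os
import struct
import tempfile

# Hypothetical record layout (an assumption for this sketch):
# timestamp float64, symbol id uint16, tick type uint8, price float64, size uint32.
# "<" means little-endian with no padding, so every record is exactly 23 bytes.
TICK = struct.Struct("<dHBdI")


def write_ticks(path, ticks):
    """Write an iterable of tick tuples to a gzip-compressed binary file."""
    with gzip.open(path, "wb") as f:
        for tick in ticks:
            f.write(TICK.pack(*tick))


def read_ticks(path):
    """Yield tick tuples back, one fixed-size record at a time."""
    with gzip.open(path, "rb") as f:
        while True:
            chunk = f.read(TICK.size)
            if len(chunk) < TICK.size:
                break
            yield TICK.unpack(chunk)


ticks = [
    (1.0, 7, 0, 2175.25, 3),
    (1.5, 7, 1, 2175.00, 10),
    (2.0, 7, 3, 2175.50, 4),
]
path = os.path.join(tempfile.mkdtemp(), "ticks.bin.gz")
write_ticks(path, ticks)
restored = list(read_ticks(path))
```

Because every record has the same size, the reader needs no parsing beyond `struct.unpack`, and gzip handles the buffering and compression on the way to disk. Whether compression actually pays off depends on the data and the disk, so it is worth measuring on a sample.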

两个我 2024-10-07 11:55:48

Fame is an often-used commercial solution for time-series storage.

If you are serious about this, building your own will be a big job. HDF might be useful; its developers claim that it is suitable for tick-data handling, and it has C++ access. There is Python support here.

Useful real-life experience from somebody with the same problem here, including HDF5 refs.

花落人断肠 2024-10-07 11:55:48

Actually, this is quite similar to what I'm doing, which is monitoring changes players make to the world in a game. I'm currently using an SQLite database with Python.
At the start of the program, I load the disk database into memory for fast writes. Each change is put into two lists, one for the memory database and one for the disk database. Every x or so updates, the memory database is updated and a counter is incremented. This is repeated, and when the counter reaches 5, it is reset, the list of changes for the disk is flushed to the disk database, and that list is cleared. I have found this works well if I also set the write mode to WAL (Write-Ahead Logging). This method can sustain about 100-300 updates a second if I update memory every 100 updates and flush to disk every 5 memory updates. You should probably choose binary since, unless you have faults in your data sources, it would be the most logical choice.
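
A simplified sketch of the batching idea, using only Python's standard `sqlite3` module. It keeps a single in-memory list and flushes it to a WAL-mode disk database every `batch_size` changes; it does not reproduce the two-tier memory/disk database scheme described above. The class name `BatchedWriter`, the table layout, and the batch size are assumptions made for the example.

```python
import os
import sqlite3
import tempfile


class BatchedWriter:
    """Buffer rows in memory and write them to SQLite in batches (hypothetical helper)."""

    def __init__(self, path, batch_size=100):
        self.conn = sqlite3.connect(path)
        # WAL mode needs a file-backed database; it lets readers proceed during writes.
        self.conn.execute("PRAGMA journal_mode=WAL")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS changes (ts REAL, item TEXT, value REAL)"
        )
        self.batch_size = batch_size
        self.pending = []

    def add(self, ts, item, value):
        self.pending.append((ts, item, value))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            # "with conn" wraps the whole batch in one transaction and commits it.
            with self.conn:
                self.conn.executemany(
                    "INSERT INTO changes VALUES (?, ?, ?)", self.pending
                )
            self.pending.clear()

    def close(self):
        self.flush()
        self.conn.close()


path = os.path.join(tempfile.mkdtemp(), "changes.db")
writer = BatchedWriter(path, batch_size=100)
for i in range(250):
    writer.add(float(i), "block", i * 0.5)
buffered = len(writer.pending)  # rows still waiting in memory before close()
writer.close()

conn = sqlite3.connect(path)
row_count = conn.execute("SELECT COUNT(*) FROM changes").fetchone()[0]
conn.close()
```

Committing once per batch rather than once per row is usually where most of the speedup comes from; the right batch size depends on how much data you can afford to lose on a crash.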

假装不在乎 2024-10-07 11:55:48

Using D-Bus format to send the information may be to your advantage. The format is standard, binary, and D-Bus is implemented in multiple languages, and can be used to send both over the network and inter-process on the same machine.

嘿看小鸭子会跑 2024-10-07 11:55:48

If you are just storing, then use system tools. Don't write your own. If you need to do some real-time processing of the data before it is stored, then that's something completely different.

つ低調成傷 2024-10-07 11:55:48

It just occurred to me after reading this thread on storing integers efficiently given certain conditions that we are wasting a lot of bits when we store tick data as doubles or floats or whatever. THE PRICES ARE QUANTIZED! And quite severely, at that. For example, yesterday's NQ range was from about 2175-2191, or about 26 points, quantized by 0.25. So that limits the ticks to ~100 different prices. See where I'm going with this? You only need one byte for each price. Stocks are quantized by 0.01 so you'd need ~ 1 byte for each dollar in the daily range.

So the method I'm outlining is:
(1) store high price, low price, and increment as one line header
(2) store tick data after that as two bytes, with the two left-most bits used to encode the tick type (00 = last, 01 = bid, 11 = ask)

I think this is something a CS would approve of!
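
The two steps above could be sketched in Python roughly as follows. The tick-type codes are the ones listed in step (2); everything else (the function names `encode` and `decode`, a plain ASCII header line, big-endian byte order, and using the remaining 14 bits as a price index counted from the low price) is an assumption made for the example.

```python
import struct

TICK_CODES = {"last": 0b00, "bid": 0b01, "ask": 0b11}  # codes from step (2)
CODE_NAMES = {code: name for name, code in TICK_CODES.items()}
INDEX_BITS = 14                       # bits left over after the 2-bit tick type
INDEX_MASK = (1 << INDEX_BITS) - 1    # largest price index that fits: 16383


def encode(ticks, low, high, increment):
    """Return a one-line ASCII header followed by one 2-byte word per tick."""
    steps = round((high - low) / increment)
    if steps > INDEX_MASK:
        raise ValueError("price range does not fit in 14 bits")
    header = f"{low!r} {high!r} {increment!r}\n".encode("ascii")
    body = bytearray()
    for kind, price in ticks:
        index = round((price - low) / increment)
        if not 0 <= index <= steps:
            raise ValueError(f"price {price} is outside the header range")
        body += struct.pack(">H", (TICK_CODES[kind] << INDEX_BITS) | index)
    return header + bytes(body)


def decode(blob):
    """Invert encode(): parse the header line, then unpack each 2-byte word."""
    header, body = blob.split(b"\n", 1)
    low, high, increment = (float(x) for x in header.split())
    ticks = []
    for (word,) in struct.iter_unpack(">H", body):
        kind = CODE_NAMES[word >> INDEX_BITS]
        price = low + (word & INDEX_MASK) * increment
        ticks.append((kind, price))
    return ticks


ticks = [("bid", 2180.25), ("ask", 2180.5), ("last", 2180.5), ("last", 2191.0)]
blob = encode(ticks, low=2175.0, high=2191.0, increment=0.25)
decoded = decode(blob)
```

With an increment of 0.25 the decoded prices are exact. With 0.01 the multiplication can pick up floating-point rounding, so it may be safer to keep prices as integer tick counts (or cents) in the simulation and convert only for display.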
