Architecture and patterns for large-scale, time-series-based aggregation operations
I will try to describe my challenge and operation:
I need to calculate stock price indices over historical periods. For example, I will take 100 stocks and calculate their aggregated average price for each second (or even finer intervals) over the last year.
I need to create many different indices like this, where the stocks are picked dynamically out of roughly 30,000 different instruments.
The main consideration is speed. I need to output a few months of this kind of index as fast as I can.
For that reason, I think a traditional RDBMS would be too slow, so I am looking for a more sophisticated, original solution.
Here is something I had in mind, using a NoSQL or column-oriented approach:
Distribute all stocks into some kind of key-value pairs of time:price, with matching time rows across all of them. Then use some sort of map-reduce pattern to select only the required stocks and aggregate their prices while reading them line by line.
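For illustration, here is a minimal Python sketch of that map/reduce-style aggregation, assuming a hypothetical load_series(ticker) helper that yields (timestamp, price) pairs for one stock; the storage behind that helper is exactly what I am asking about:

    from collections import defaultdict

    def aggregate_index(tickers, load_series):
        """Average the prices of the selected tickers per timestamp."""
        sums = defaultdict(float)   # timestamp -> sum of prices
        counts = defaultdict(int)   # timestamp -> number of contributing stocks

        # "Map" phase: emit (timestamp, price) for every selected stock.
        for ticker in tickers:
            for timestamp, price in load_series(ticker):
                sums[timestamp] += price
                counts[timestamp] += 1

        # "Reduce" phase: combine the emitted values into the index level.
        return {t: sums[t] / counts[t] for t in sums}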
I would like some feedback on my approach, suggestions for tools and use cases, or a suggestion for a completely different design pattern. My guidelines for the solution are price (I would prefer open source), the ability to handle huge amounts of data, and, again, fast lookup (I don't care about inserts, since the data is loaded once and never changes).
Update: by fast lookup I don't mean real time, but a reasonably quick operation. Currently it takes me a few minutes to process each day of data, which translates to a few hours for a yearly calculation. I want to bring that down to minutes or so.
Comments (3)
In the past, I've worked on several projects that involved the storage and processing of time series using different storage techniques (files, RDBMS, NoSQL databases). In all these projects, the essential point was to make sure that the time series samples are stored sequentially on the disk. This made sure reading several thousand consecutive samples was quick.
Since you seem to have a moderate number of time series (approx. 30,000) each having a large number of samples (1 price a second), a simple yet effective approach could be to write each time series into a separate file. Within the file, the prices are ordered by time.
You then need an index for each file so that you can quickly find certain points of time within the file and don't need to read the file from the start when you just need a certain period of time.
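One way such an index could look (a sketch only, not tied to any particular storage engine) is a sorted list of (timestamp, file_offset) checkpoints recorded every N samples while the file is written; a binary search over the checkpoints then tells you where to seek before starting the sequential read:

    import bisect

    def build_checkpoints(samples_with_offsets, every=10_000):
        """samples_with_offsets: iterable of (timestamp, file_offset); keep every N-th entry."""
        return [entry for i, entry in enumerate(samples_with_offsets) if i % every == 0]

    def seek_offset(checkpoints, start_ts):
        """Return the file offset of the last checkpoint at or before start_ts."""
        timestamps = [ts for ts, _ in checkpoints]
        i = bisect.bisect_right(timestamps, start_ts) - 1
        return checkpoints[max(i, 0)][1]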
With this approach you can take full advantage of today's operating systems which have a large file cache and are optimized for sequential reads (usually reading ahead in the file when they detect a sequential pattern).
Aggregating several time series involves reading a certain period from each of these files into memory, computing the aggregated numbers and writing them somewhere. To fully leverage the operating system, read the full required period of each time series one by one and don't try to read them in parallel. If you need to compute a long period, then don’t break it into smaller periods.
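A rough sketch of that aggregation step, assuming a hypothetical read_period(path, start_ts, end_ts) helper that streams one file's (timestamp, price) samples for the requested period, with timestamps as epoch seconds:

    # Rough sketch only. read_period() is a hypothetical helper that streams one
    # file's (timestamp, price) samples for [start_ts, end_ts).
    def compute_index(paths, start_ts, end_ts, read_period):
        """Average price per second across the given per-instrument files."""
        n_seconds = end_ts - start_ts
        sums = [0.0] * n_seconds
        counts = [0] * n_seconds

        for path in paths:  # one full series at a time, not in parallel
            for ts, price in read_period(path, start_ts, end_ts):
                slot = ts - start_ts
                sums[slot] += price
                counts[slot] += 1

        return [(start_ts + i, sums[i] / counts[i])
                for i in range(n_seconds) if counts[i]]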
You mention that you have 25,000 prices a day when you reduce them to a single one per second. It seems to me that in such a time series, many consecutive prices would be the same as few instruments are traded (or even priced) more than once a second (unless you only process S&P 500 stocks and their derivatives). So an additional optimization could be to further condense your time series by only storing a new sample when the price has indeed changed.
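That optimization is just a filter over the per-second stream; a tiny sketch:

    def condense(samples):
        """Yield (timestamp, price) only when the price actually changes."""
        last_price = None
        for ts, price in samples:
            if price != last_price:
                yield ts, price
                last_price = price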
On a lower level, the time series files could be organized as binary files consisting of sample runs. Each run starts with the timestamp of the first price and the length of the run. After that, the prices for the consecutive seconds follow. The file offset of each run could be stored in the index, which could be implemented with a relational DBMS (such as MySQL). This database would also contain all the metadata for each time series.
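A sketch of that run layout in Python, with assumed field widths (int64 start timestamp, int32 run length, float64 prices); the returned (start_ts, offset) pairs are what would go into the index table:

    import struct

    RUN_HEADER = struct.Struct("<qi")   # start timestamp, number of prices in the run
    PRICE = struct.Struct("<d")         # one float64 price per consecutive second

    def write_runs(path, runs):
        """Write runs of consecutive-second prices; return (start_ts, file_offset) per run."""
        index_entries = []
        with open(path, "wb") as f:
            for start_ts, prices in runs:
                index_entries.append((start_ts, f.tell()))
                f.write(RUN_HEADER.pack(start_ts, len(prices)))
                for p in prices:
                    f.write(PRICE.pack(p))
        return index_entries

    def read_run(path, offset):
        """Read one run back, given its file offset from the index."""
        with open(path, "rb") as f:
            f.seek(offset)
            start_ts, length = RUN_HEADER.unpack(f.read(RUN_HEADER.size))
            prices = [PRICE.unpack(f.read(PRICE.size))[0] for _ in range(length)]
        return start_ts, prices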
(Do stay away from memory mapped files. They're slower because they aren’t optimized for sequential access.)
If the scenario you described is the ONLY requirement, then there are "low tech" simple solutions which are cheaper and easier to implement. The first that comes to mind is LogParser. In case you haven't heard of it, it is a tool that runs SQL queries against plain CSV files. It is unbelievably fast - typically around 500K rows/sec, depending on row size and the I/O throughput of your disks.
Dump the raw data into CSVs, run a simple aggregate SQL query via the command line, and you are done. Hard to believe it can be that simple, but it is.
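For illustration only, a query for a per-second average might look roughly like the following; prices.csv and its Second/Price columns are made-up names, and the exact syntax should be checked against the LogParser documentation:

    LogParser -i:CSV -o:CSV "SELECT Second, AVG(Price) AS IndexLevel INTO index.csv FROM prices.csv GROUP BY Second ORDER BY Second"

Restricting the calculation to the stocks of a particular index would then just be a WHERE clause on the symbol column, or a separate CSV dump per index.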
More info about logparser:
What you really need is a relational database with built-in time series functionality; IBM released one very recently, Informix 11.7 (note that it must be 11.7 to get this feature). Even better news: for what you are doing, the free edition, Informix Innovator-C, will be more than adequate.
http://www.freeinformix.com/time-series-presentation-technical.html