Since this question was asked in 2010, several database engines have been released, or have developed features, that specifically handle time series such as stock tick data:
With MongoDB or other document-oriented databases, if you target performance, the advice is to contort your schema to organize ticks in an object keyed by seconds (or an object of minutes, each minute being another object with 60 seconds); a sketch of that bucketing follows the query below. With a specialized time-series database, you can query the data simply with
SELECT open, close FROM market_data
WHERE symbol = 'AAPL' AND time > '2016-09-14' AND time < '2016-09-21'
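To make the bucketing advice above concrete, here is a minimal sketch assuming pymongo and a local MongoDB instance; the database, collection, and field names are made up for illustration:

from datetime import datetime, timezone
from pymongo import MongoClient

# Illustrative names; one document per symbol per minute, ticks keyed by second.
client = MongoClient("mongodb://localhost:27017")
ticks = client["market"]["ticks_per_minute"]

def record_tick(symbol, ts, price):
    """Upsert a single tick into its per-minute bucket."""
    minute = ts.replace(second=0, microsecond=0)
    ticks.update_one(
        {"symbol": symbol, "minute": minute},
        {"$set": {f"seconds.{ts.second}": price}},
        upsert=True,
    )

record_tick("AAPL", datetime(2016, 9, 14, 15, 30, 7, tzinfo=timezone.utc), 111.77)

The point of the contortion is that one document absorbs up to 60 writes and one read returns a whole minute of ticks, which is roughly the access pattern the time-series query above gives you for free.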
As the question also notes: "I was also thinking that I could sum/min/max rows of data by minute/hour/day/week/month etc. for even faster calculations."
With InfluxDB, this is very straightforward. Here's how to get the daily minimums and maximums:
SELECT MIN("close"), MAX("close") FROM "market_data" WHERE WHERE symbol = 'AAPL'
GROUP BY time(1d)
You can group by time intervals which can be in microseconds (u), seconds (s), minutes (m), hours (h), days (d) or weeks (w).
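If you stay on a document store instead, the pre-aggregation idea from the question can be sketched as a periodic batch rollup with MongoDB's aggregation pipeline. This is only an illustration, assuming pymongo, MongoDB 4.2+ (for $merge), and a raw ticks collection holding symbol, time, and close fields:

# Roll raw ticks up into a daily min/max summary collection so that
# later reads never have to touch the raw tick data.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["market"]

db["ticks"].aggregate([
    {"$group": {
        "_id": {
            "symbol": "$symbol",
            "day": {"$dateToString": {"format": "%Y-%m-%d", "date": "$time"}},
        },
        "low": {"$min": "$close"},
        "high": {"$max": "$close"},
    }},
    {"$merge": {"into": "daily_rollup", "whenMatched": "replace"}},
])

The contrast with the one-line GROUP BY time(1d) above is the answer's point: with the document store, you schedule and maintain this rollup yourself.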
TL;DR
Time-series databases are better choices than document-oriented databases for storing and querying large amounts of stock tick data.
The answer here will depend on scope.
MongoDB is a great way to get the data "in", and it's really fast at querying individual pieces. It's also nice because it's built to scale horizontally.
However, what you'll have to remember is that all of your significant "queries" are actually going to result from "batch job output".
As an example, Gilt Groupe has created a system called Hummingbird that they use for real-time analytics on their web site. Presentation here. They're basically dynamically rendering pages based on collected performance data in tight intervals (15 minutes).
In their case, they have a simple cycle: post data to mongo -> run map-reduce -> push data to webs for real-time optimization -> rinse / repeat.
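As a rough illustration of that batch step, here is a hedged sketch in Python using MongoDB's aggregation pipeline (which has largely superseded the old mapReduce command) via pymongo; the database, collection, and field names are invented for the example:

from datetime import datetime, timedelta, timezone
from pymongo import MongoClient

# Aggregate the last 15 minutes of collected performance data and publish
# the summary somewhere the web tier can read it on the next render.
client = MongoClient("mongodb://localhost:27017")
db = client["analytics"]

window_start = datetime.now(timezone.utc) - timedelta(minutes=15)

summary = list(db["page_events"].aggregate([
    {"$match": {"time": {"$gte": window_start}}},
    {"$group": {"_id": "$page",
                "views": {"$sum": 1},
                "avg_latency_ms": {"$avg": "$latency_ms"}}},
    {"$sort": {"views": -1}},
]))

# "Push data to webs": here simply rewritten into a collection the front end polls.
db["latest_summary"].delete_many({})
if summary:
    db["latest_summary"].insert_many(summary)

Run that on a 15-minute timer and you have the rinse/repeat part of the cycle.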
This is honestly pretty close to what you probably want to do. However, there are some limitations here:
Map-reduce is new to many people; if you're familiar with SQL, you'll have to accept its learning curve.
If you're pumping in lots of data, your map-reduces are going to be slower on those boxes. You'll probably want to look at slaving / replica pairs if response times are a big deal.
On the other hand, you'll run into different variants of these problems with SQL.
Of course there are some benefits here:
Horizontal scalability. If you have lots of boxes, you can shard them and get somewhat linear performance increases on Map/Reduce jobs (that's how they work); see the sketch after this list. Building such a "cluster" with SQL databases is a lot more costly.
Really fast speed and as with point #1, you get the ability to add RAM horizontally to keep up the speed.
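For the horizontal-scalability point, a minimal sketch of what enabling that sharding might look like, assuming pymongo connected to a mongos router; the database, collection, and shard-key choices are illustrative only:

from pymongo import MongoClient

# Shard the tick collection by symbol + time so writes and batch jobs
# spread across the shards of the cluster.
client = MongoClient("mongodb://mongos-host:27017")
client.admin.command("enableSharding", "market")
client.admin.command(
    "shardCollection", "market.ticks",
    key={"symbol": 1, "time": 1},
)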
As mentioned by others though, you're going to lose access to ETL and other common analysis tools. You'll definitely be on the hook to write a lot of your own analysis tools.
Here's my reservation with the idea - and I'm going to openly acknowledge that my working knowledge of document databases is weak. I’m assuming you want all of this data stored so that you can perform some aggregation or trend-based analysis on it.
If you use a document-based db to act as your source, the loading and manipulation of each row of data (CRUD operations) is very simple. Very efficient, very straightforward, basically lovely.
What sucks is that there are very few, if any, options to extract this data and cram it into a structure more suitable for statistical analysis, e.g. a columnar database or a cube. If you load it into a basic relational database, there are a host of tools, both commercial and open source (such as Pentaho), that will accommodate the ETL and analysis very nicely.
Ultimately though, what you want to keep in mind is that every financial firm in the world has a stock-analysis / auto-trader application; they just caused a major U.S. stock market tumble and they are not toys. :)
A simple datastore such as a key-value or document database is also beneficial in cases where performing the analytics reasonably exceeds a single system's capacity (or would require an exceptionally large machine to handle the load). In these cases, it makes sense to use a simple store, since the analytics require batch processing anyway. I would personally look at finding a horizontally scalable processing method for coming up with the unit/time analytics required.
I would investigate using something built on Hadoop for parallel processing. Either use the framework natively in Java/C++ or some higher level abstraction: Pig, Wukong, binary executables through the streaming interface, etc. Amazon offers reasonably cheap processing time and storage if that route is of interest. (I have no personal experience but many do and depend on it for their businesses.)
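As a rough sketch of the streaming-interface route, here is a mapper/reducer pair that computes a per-symbol daily high price; the tab-separated input layout (symbol, date, close) and the script name are assumptions for the example:

#!/usr/bin/env python3
# Hedged sketch of a Hadoop Streaming job; invoke as "tick_high.py map"
# for the mapper and "tick_high.py reduce" for the reducer.
# Input lines are assumed to be tab-separated: symbol, ISO date, close price.
import sys

def mapper():
    for line in sys.stdin:
        symbol, date, close = line.rstrip("\n").split("\t")[:3]
        # Key on symbol:date so the shuffle groups one trading day per symbol.
        print(f"{symbol}:{date}\t{close}")

def reducer():
    # Hadoop delivers mapper output sorted by key, so a single pass suffices.
    current_key, high = None, None
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        price = float(value)
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{high}")
            current_key, high = key, price
        else:
            high = max(high, price)
    if current_key is not None:
        print(f"{current_key}\t{high}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()

You would submit it through the hadoop-streaming jar with -mapper and -reducer pointing at the script; Pig or Wukong would express the same job at a higher level of abstraction.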