Database design question

Posted on 2024-07-10 13:26:41 · 564 characters · 2 views · 0 comments


I accumulated quite a lot of data in raw form (csv and binary) - 4GB per day for a few months, to be precise.

I decided to join the civilized world and use a database to access the data, and I wondered what the correct layout would be; the format is quite simple: a few rows for every time tick (bid, ask, timestamp, etc.) x up to 0.5 million/day x hundreds of financial instruments x months of data.

There is a MySQL server with MyISAM (which I understood would be the correct engine for this type of usage) running on commodity hardware (2 x 1GB RAID 0 SATA, Core 2 @ 2.7GHz).

What would be the correct layout of the database? What should the tables/indices look like? What are the general recommendations for this scenario? What pitfalls would you predict along the way?

Edit: my common usage will be simple queries to extract time-series information for a specific date and instrument, e.g.

SELECT (ask + bid) / 2
  FROM quotes  -- no FROM clause in the original; table name assumed
  WHERE instrument = 'GOOG'
  AND date = '2008-06-01'
  ORDER BY timeStamp;

Edit: I tried to stuff all my data into one table indexed by the timestamp, but it was way too slow - therefore I reckoned it would take a more elaborate scheme.


Comments (6)

素染倾城色 2024-07-17 13:26:41


You don't really say what your background is and how much you know about programming and database design. It sounds like you should do some reading. Conceptually though your design is fairly simple. Your description identifies a mere two entities:

  • Financial instrument; and
  • Quote.

So you then need to identify the attributes.

Financial instrument:

  • Security code;
  • Market;
  • etc.

Quote:

  • Timestamp;
  • Financial instrument;
  • Bid price; and
  • Ask price.

The reference to the financial instrument is what's called a foreign key. Each table also needs a primary key, probably just an auto-increment field.

Conceptually fairly simple.

CREATE TABLE instrument (
  id BIGINT NOT NULL AUTO_INCREMENT,
  code CHAR(4),
  company_name VARCHAR(100),
  PRIMARY KEY (id)
);

CREATE TABLE quote (
  id BIGINT NOT NULL AUTO_INCREMENT,
  instrument_id BIGINT NOT NULL,
  dt DATETIME NOT NULL,
  bid NUMERIC(8,3),
  ask NUMERIC(8,3),
  PRIMARY KEY (id)
);

CREATE INDEX instrument_idx1 ON instrument (code);

CREATE INDEX quote_idx1 ON quote (instrument_id, dt);

SELECT (bid + ask) / 2
FROM instrument i
JOIN quote q ON i.id = q.instrument_id
WHERE i.code = 'GOOG'
AND q.dt >= '2008-06-01' AND q.dt < '2008-06-02';

If your dataset is sufficiently large you might want to include (bid + ask) / 2 in the table so you don't have to calculate on the fly.

Ok, so that's the normalized view. After this you may need to start making performance optimizations. Consider this question about storing billions of rows in MySQL. Partitioning is a feature of MySQL 5.1+ (fairly new).

But another question to ask yourself is this: do you need to store all this data? The reason I ask is that I used to work in online broking, and we only stored all the trades for a very limited window; trades would be a smaller data set than quotes, which are what you seem to want.

Storing billions of rows of data is a serious problem and one you really need serious help to solve.
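The normalized schema and query above can be exercised end-to-end. The sketch below uses Python's built-in sqlite3 as a stand-in for MySQL (engine-specific types like AUTO_INCREMENT are adapted, and the sample rows are made up):

```python
import sqlite3

# SQLite stands in for MySQL here; the two-table layout mirrors the
# normalized design above (the sample data is invented).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE instrument (
        id INTEGER PRIMARY KEY,
        code TEXT,
        company_name TEXT
    );
    CREATE TABLE quote (
        id INTEGER PRIMARY KEY,
        instrument_id INTEGER NOT NULL,
        dt TEXT NOT NULL,
        bid REAL,
        ask REAL
    );
    CREATE INDEX instrument_idx1 ON instrument (code);
    CREATE INDEX quote_idx1 ON quote (instrument_id, dt);
""")

conn.execute("INSERT INTO instrument (id, code) VALUES (1, 'GOOG')")
conn.executemany(
    "INSERT INTO quote (instrument_id, dt, bid, ask) VALUES (1, ?, ?, ?)",
    [
        ("2008-06-01 09:30:00", 100.0, 100.5),
        ("2008-06-01 09:30:01", 100.5, 101.0),
        ("2008-06-02 09:30:00", 99.0, 99.5),   # outside the queried range
    ],
)

# Mid price for one instrument on one day, via a half-open date range.
mids = [
    row[0]
    for row in conn.execute("""
        SELECT (bid + ask) / 2
        FROM instrument i
        JOIN quote q ON i.id = q.instrument_id
        WHERE i.code = 'GOOG'
          AND q.dt >= '2008-06-01' AND q.dt < '2008-06-02'
        ORDER BY q.dt
    """)
]
print(mids)  # [100.25, 100.75]
```

Note how the half-open range (`>=` start, `<` end) picks out exactly one day's ticks without any date arithmetic, which is what makes the composite `(instrument_id, dt)` index effective.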

子栖 2024-07-17 13:26:41


What you need to do is to read up on database normalization. If you find that article too much, you should simply skim through a 3rd normal form tutorial.

回梦 2024-07-17 13:26:41


When storing data at tick level, many financial databases partition the data at least by instrument, as it is rare to want to run a query across instruments. So a table per instrument is normal. Some go further and also partition by date, giving a table per instrument/date combination. This can make querying a lot more difficult if queries across dates are the norm.

So two options:

  1. A tick-table per instrument, with a clustered index on timestamp
  2. A tick-table per instrument/date, with a clustered index on timestamp

It's a basic trade-off between speed of access and ease of querying.
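One caveat: MyISAM (which the question proposes) has no clustered index, so "a clustered index on timestamp" implies InnoDB, which clusters rows on the primary key. A hypothetical DDL generator for option 1 might look like this (all table and column names are illustrative, not from the thread):

```python
def tick_table_ddl(instrument: str) -> str:
    """Build CREATE TABLE DDL for option 1: one tick table per instrument.

    InnoDB clusters rows on the primary key, so keying on (ts, seq)
    keeps a day's ticks physically contiguous on disk; seq exists only
    to disambiguate ticks that share the same second.
    """
    return (
        f"CREATE TABLE ticks_{instrument} (\n"
        "  ts DATETIME NOT NULL,\n"
        "  seq INT NOT NULL,\n"
        "  bid DECIMAL(8,3),\n"
        "  ask DECIMAL(8,3),\n"
        "  PRIMARY KEY (ts, seq)\n"
        ") ENGINE=InnoDB;"
    )

print(tick_table_ddl("GOOG"))
```

Option 2 would simply fold a date suffix into the table name, trading easier pruning for harder cross-date queries.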

余生一个溪 2024-07-17 13:26:41


Or perhaps consider a star schema, dimensions and facts. Ralph Kimball has some nice stuff to tell you how to go about it.
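For reference, a minimal star schema for this data might look like the sketch below (sqlite3 as a stand-in; all table and column names are invented for illustration, not taken from Kimball):

```python
import sqlite3

# A minimal star schema: two dimension tables and one fact table keyed
# by them.  The fact table also carries a precomputed mid price.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_instrument (
        instrument_key INTEGER PRIMARY KEY,
        code TEXT,
        market TEXT
    );
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY,
        day TEXT,              -- e.g. '2008-06-01'
        weekday TEXT
    );
    CREATE TABLE fact_quote (
        instrument_key INTEGER NOT NULL REFERENCES dim_instrument,
        date_key INTEGER NOT NULL REFERENCES dim_date,
        ts TEXT NOT NULL,
        bid REAL,
        ask REAL,
        mid REAL               -- precomputed (bid + ask) / 2
    );
""")

tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)  # ['dim_date', 'dim_instrument', 'fact_quote']
```

Aggregation queries then group the fact table by whichever dimension attributes the analysis needs (per weekday, per market, etc.).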

装迷糊 2024-07-17 13:26:41


Dani,
I've been working with tick-by-tick data for years and would be happy to collaborate on this. Email me: IanTebbutt at Hotmail. (BTW, I've checked and there's no way to do private email on StackOverflow, and Jeff seems way against it.)

Briefly, I've found partitioning by date and instrument to work pretty well. You could choose to put a month's worth of data for instrument X into a set of tables using a pattern like InstrumentX_YYDD. Then, when accessing the data, you need at the very least a table-name generator, but more likely a SQL generator that can decide which single table to use, or potentially use UNION to look at multiple tables.

Whichever way you look at this, those kinds of data volumes are not easy to deal with. This verges into data-warehouse territory, and there are a huge number of ways of skinning that cat. Like I said, happy to collaborate - I've probably got half your issues fixed already.
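The table-name and SQL generators mentioned above could be sketched like this (a per-month YYMM suffix is assumed for the naming pattern, and every name here is hypothetical):

```python
from datetime import date, timedelta

def table_name(instrument: str, d: date) -> str:
    # Per-instrument, per-month naming in the spirit of the answer's
    # InstrumentX_YYDD pattern (a YYMM suffix is assumed here).
    return f"{instrument}_{d:%y%m}"

def build_query(instrument: str, start: date, end: date) -> str:
    """Generate a UNION ALL query over every monthly table that the
    half-open [start, end) range touches."""
    names = []
    d = start.replace(day=1)
    while d < end:
        names.append(table_name(instrument, d))
        # Jump to the first day of the next month.
        d = (d.replace(day=28) + timedelta(days=4)).replace(day=1)
    selects = [
        f"SELECT ts, bid, ask FROM {name}"
        f" WHERE ts >= '{start}' AND ts < '{end}'"
        for name in names
    ]
    return "\nUNION ALL\n".join(selects) + "\nORDER BY ts"

q = build_query("GOOG", date(2008, 6, 1), date(2008, 8, 1))
print(q)  # two monthly tables: GOOG_0806 and GOOG_0807
```

A single-month range degenerates to one plain SELECT with no UNION, which is the fast path the answer is after.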

只为守护你 2024-07-17 13:26:41


Just some general observations:

  • Don't use a TIMESTAMP column, as it's automatically set based on the INSERT time. Since you're importing data, that's not what you want.
  • If you use the MySQL DATETIME column type, you can use the MySQL Date and Time functions on it.
  • MyISAM doesn't support FOREIGN KEY constraints and silently ignores them.
  • Indexes, indexes, indexes. Make sure you have them on columns you'll use for lookups. However, if you have columns with a lot of text, you may want to use FULLTEXT searches on them instead.
  • If you plan on turning this into a live database with INSERTs as well as SELECT queries, consider using InnoDB with transactions and row-level locking (SELECT ... FOR UPDATE).