如何使用日终库存数据模式重用结果
我正在创建一个数据库模式,用于技术分析,例如成交量最高者、价格最高者等。我在这里检查了问题的答案,例如 设计问题。从 boe100 的答案中得到暗示后,我有一个几乎以此为模型的模式,因此:
Symbol - char 6 //primary
Date - date //primary
Open - decimal 18, 4
High - decimal 18, 4
Low - decimal 18, 4
Close - decimal 18, 4
Volume - int
现在这个包含日终(EOD)数据的表将在 3 年内大约有 300 万行。稍后,当我获取/需要更多数据时,可能会达到 2000 万行。
前端将提出诸如“给我 Y 天内 X 日价格涨幅最高的商品”之类的请求。我认为该请求是比较简单的请求之一,因此成本和时间都不算太高。
但像“以过去 100 天作为基准,给我过去 10 天成交量涨幅最大的股票”之类的请求可能会导致成本增加 10 到 100 倍。这样的请求的结果将是一个浮点数,表示体积增长了多少倍等。
我的一个选择是为每个这样的结果添加一列。如果用户要求 20 天内 10 天内的销量增长,则需要另一列。这些列的总数很容易超过 100,特别是如果我开始添加其他结果作为列,例如 MACD-10、MACD-100。每个都需要自己的列。
这是一个可行的解决方案吗?
另一个选择是我将结果保存在缓存的 html 文件中并将其呈现给用户。我在网络开发方面没有太多经验,所以对我来说它看起来很混乱;但我可能是错的(ofc!)。这也是一个选择吗?
让我补充一点,我正在/将使用 mod_perl 向用户呈现响应。 mysql 数据库的大部分工作都是使用 perl 完成的。我希望响应时间为 1-2 秒。
I am creating a database schema to be used for technical analysis like top-volume gainers, top-price gainers etc.I have checked answers to questions here, like the design question. Having taken the hint from boe100 's answer there I have a schema modeled pretty much on it, thusly:
Symbol - char 6 //primary
Date - date //primary
Open - decimal 18, 4
High - decimal 18, 4
Low - decimal 18, 4
Close - decimal 18, 4
Volume - int
Right now this table containing End Of Day( EOD) data will be about 3 million rows for 3 years. Later when I get/need more data it could be 20 million rows.
The front end will be asking requests like "give me the top price gainers on date X over Y days". That request is one of the simpler ones, and as such is not too costly, time wise, I assume.
But a request like " give me top volume gainers for the last 10 days, with the previous 100 days acting as baseline", could prove 10-100 times costlier. The result of such a request would be a float which signifies how many times the volume as grown etc.
One option I have is adding a column for each such result. And if the user asks for volume gain in 10 days over 20 days, that would require another column. The total such columns could easily cross 100, specially if I start adding other results as columns, like MACD-10, MACD-100. each of which will require its own column.
Is this a feasible solution?
Another option being that I keep the result in cached html files and present them to the user. I dont have much experience in web-development, so to me it looks messy; but I could be wrong ( ofc!) . Is that a option too?
Let me add that I am/will be using mod_perl to present the response to the user. With much of the work on mysql database being done using perl. I would like to have a response time of 1-2 seconds.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
您应该尽可能保持数据标准化,并让 RDBMS 完成其工作:根据标准化数据高效执行查询。
不要事后猜测什么会有效,什么不会有效;相反,仅针对 RDBMS 的查询解释器报告的特定的、可测量的低效率进行优化。
有效的优化工具包括(按粗略的优先顺序排列):
进一步规范化数据,以允许 RDBMS 自行决定如何最好地回答查询。
重构特定查询以消除查询解释器报告的低效率问题。这将为如何提高应用程序的效率提供良好的反馈,或者可能导致如上所述的关系更好的规范化。
创建属性索引,这些属性在实践中被用于大量事务中。这可能非常有效,但它是在维护索引时减慢大多数写入操作的权衡,以在使用索引时提高某些特定读取操作的速度。
创建补充表来保存中间预先计算的结果以供将来查询使用。这绝不是一个好主意,尤其是因为它完全打破了 DRY 原则;现在,您必须想出一种保持重复信息(原始数据和派生数据)同步的策略,这样,当没有重复数据时,RDBMS 将发挥最佳作用。
这些都不会涉及到存储主要数据的表内部的混乱。
You should keep your data normalised as much as possible, and let the RDBMS do its work: efficiently performing queries based on the normalised data.
Don't second-guess what will or will not be efficient; instead, only optimise in response to specific, measured inefficiencies as reported by the RDBMS's query explainer.
Valid tools for optimisation include, in rough order of preference:
Normalising the data further, to allow the RDBMS to decide for itself how best to answer the query.
Refactoring the specific query to remove the inefficiencies reported by the query explainer. This will give good feedback on how the application might be made more efficient, or might lead to a better normalisation of relations as above.
Creating indexes on attributes that turn out, in practice, to be used in a great many transactions. This can be quite effective, but it is a trade-off of slowdown on most write operations as indexes are maintained, to gain speed in some specific read operations when the indexes are used.
Creating supplementary tables to hold intermediary pre-computed results for use in future queries. This is rarely a good idea, not least because it totally breaks the DRY principle; you now have to come up with a strategy of keeping duplicate information (the original data and the derived data) in sync, when the RDBMS will do its job best when there is no duplicated data.
None of those involve messing around inside the tables that store the primary data.