MySQL - large metrics tables and heavy query performance - caching?
I've got a large database, expanding quite rapidly, with a number of busy tables logging every aspect of users' behaviour.
At the moment, I have a studio where users can see this usage and behaviour displayed in charts, etc. The thing is, it's seriously intensive to load this stuff now. One project had 80,000 users and it takes an age to load the stats.
Now, the tables are quite well structured and indexed on joins, etc. I've had advice and sought learning along the way on best practice, to try and prepare as well as possible for this data size. But, without much more scope for query/table optimisation, how else can I speed up this intensive process?
I notice most analytics and such allow you to view up until yesterday by default. Does that help?
- Does this mean the statistics can be cached by MySQL's query_cache? If the query always ends at tomorrow (thereby counting today's stats), will it never be cached?
- Is it more sensible to compile static XMLs etc. each hour, which can then be referenced instead of running the queries each time?
- How else?
Any thoughts very much welcome.
5 Answers
You'd want to split things up into two databases. One optimized for insertion, to capture the data. And a second one optimized for data retrieval. You can't do this with one single database handling both tasks. Optimizing for heavy data insertion means reducing the amount of indexing done to the absolute bare minimum (basically just primary keys), and removing keys kills performance when it comes time to do the data mining.
So... two databases. Capture all the data into the insert-optimized one. Then have a scheduled job slurp over the day's data capture into the other database, and run your analyses there.
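A minimal sketch of what that scheduled job might look like, assuming both databases live as schemas on the same MySQL server and a hypothetical events table (separate servers would need a dump/load or replication instead); it would typically run from cron just after midnight:

    -- Copy yesterday's rows from the insert-optimized schema into the
    -- reporting schema. Schema and column names are made up for illustration.
    INSERT INTO reporting.events (user_id, action, ts)
    SELECT user_id, action, ts
    FROM capture.events
    WHERE ts >= CURDATE() - INTERVAL 1 DAY
      AND ts <  CURDATE();

    -- Optionally trim the capture schema so it stays small and insert-friendly.
    DELETE FROM capture.events
    WHERE ts < CURDATE();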
As a side effect, this is where the "up until yesterday" restriction comes from. Today's data won't be available, as it's in a separate database.
If you don't need to show real-time results, you can cache the results in Memcache, APC, Redis or an equivalent, with the cache set to expire after one day.
MySQL will cache results in the query_cache, but remember that MySQL clears the query_cache whenever the table/rows change, and it has a limited size.
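To make those query_cache caveats concrete, a hedged sketch (this applies to MySQL 5.x; the query cache was removed entirely in MySQL 8.0). The user_actions table and the date literal are placeholders, not anything from the question:

    -- Never cached: CURDATE()/NOW() are non-deterministic, so any statement
    -- containing them is skipped by the query cache.
    SELECT action, COUNT(*)
    FROM user_actions
    WHERE ts >= CURDATE()
    GROUP BY action;

    -- Cacheable, until the next write to user_actions invalidates it:
    -- the range ends at a literal "yesterday" boundary.
    SELECT action, COUNT(*)
    FROM user_actions
    WHERE ts < '2010-11-10 00:00:00'
    GROUP BY action;

    -- Check the cache size and whether entries are being pruned for lack of room.
    SHOW VARIABLES LIKE 'query_cache%';
    SHOW STATUS LIKE 'Qcache%';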
Is extra hardware out of the question? Replicating the data to a few slaves would probably speed things up in this situation. You could also use a version of Marc B's suggestion for splitting the database, by only updating the slaves at off-peak times, overnight for example.
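As a rough illustration (host, credentials and log coordinates are placeholders; this is the classic MySQL replication syntax), the reporting queries would then be pointed at the replica:

    -- On the slave: point it at the master and start replicating.
    CHANGE MASTER TO
        MASTER_HOST = 'db-master.example.com',
        MASTER_USER = 'repl',
        MASTER_PASSWORD = '********',
        MASTER_LOG_FILE = 'mysql-bin.000001',
        MASTER_LOG_POS = 4;
    START SLAVE;

    -- To apply changes only at off-peak times, a scheduled job could stop the
    -- applier thread during the day and restart it overnight:
    STOP SLAVE SQL_THREAD;
    -- ... later, overnight ...
    START SLAVE SQL_THREAD;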
Marc B is right - you want to separate your data capture from your analytics/reporting system.
The conventional name for this is "data warehouse", or similar. These tend to have very different schemas to your production database - highly denormalized, or multi-dimensional "star" schemas are common.
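As a purely illustrative sketch of what a small star schema for this kind of usage data might look like (all table and column names invented), a central fact table referencing small dimension tables:

    -- Dimension tables: small, descriptive lookups.
    CREATE TABLE dim_date (
        date_id   INT PRIMARY KEY,      -- e.g. 20101109
        full_date DATE NOT NULL,
        week      TINYINT NOT NULL,
        month     TINYINT NOT NULL,
        year      SMALLINT NOT NULL
    );

    CREATE TABLE dim_action (
        action_id   INT PRIMARY KEY,
        action_name VARCHAR(50) NOT NULL,   -- 'login', 'logout', ...
        category    VARCHAR(50)             -- broader grouping, if useful
    );

    -- Fact table: one row per event (or per aggregate), keyed by the dimensions.
    CREATE TABLE fact_user_action (
        date_id     INT NOT NULL,
        action_id   INT NOT NULL,
        user_id     INT NOT NULL,
        event_count INT NOT NULL DEFAULT 1
    );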
If you see your product growing continuously, you may want to make the jump right now - but it's a whole new skill and technology set, so you might want to take baby steps.
In either case, run your data collection and reporting databases on physically separate hardware. If you do go the data warehouse route, budget for lots of disk space.
You don't say exactly how big the tables are, what kind of tables they are, how they are being populated and how they are being used. So, I'm just going to give some random thoughts :)
When you are reporting over large amounts of data, you are basically limited by the speed of your disk system, i.e. the rate at which your disks deliver data to MySQL. This rate is usually measured in megabytes/second. So if you can get 100 MB/s, then you cannot perform a select sum() or count(*) on a table bigger than 100 MB if you want sub-second response time (completely ignoring the DB cache for a moment). Please note that 100 MB would be something like 2 million records with a row size of 50 bytes.
This works up to a point, and then everything just dies - usually when the size of the database becomes larger than the available memory and the number of concurrent users increases.
You will want to investigate the possibility of creating aggregate tables, so that you can reduce the number of megabytes you need to scan through. It is best explained by an example. Say that your current measure table looks something like this:
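(A hypothetical sketch of that measure table, reconstructed from the description below; names and types are guesses.)

    CREATE TABLE user_actions (
        user_id INT NOT NULL,
        action  VARCHAR(50) NOT NULL,   -- 'logged in', 'logged out', 'clicked this', ...
        ts      DATETIME NOT NULL       -- when the action happened
    );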
For every single action performed (logged in, logged out, clicked this, farted, clicked that) you store the ID of the user and the timestamp when it happened.
If you want to plot the daily number of logins from the start of the year, you would have to perform a count(*) over all 100,000,000 rows, grouped by day(timestamp). Instead, you could provide a precalculated table such as:
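(Again a hypothetical shape for that precalculated table; names are made up.)

    CREATE TABLE daily_action_counts (
        action_date  DATE NOT NULL,
        action       VARCHAR(50) NOT NULL,
        action_count INT UNSIGNED NOT NULL,
        PRIMARY KEY (action_date, action)
    );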
That table would typically be loaded with something like:
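(A guess at that load, as a scheduled INSERT ... SELECT over yesterday's rows, using the hypothetical names above.)

    INSERT INTO daily_action_counts (action_date, action, action_count)
    SELECT DATE(ts), action, COUNT(*)
    FROM user_actions
    WHERE ts >= CURDATE() - INTERVAL 1 DAY
      AND ts <  CURDATE()
    GROUP BY DATE(ts), action;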
If you had 100 possible actions, you would need only 36,500 rows to store the activities of an entire year. Users running statistics, charts, reports and whatnot on that data wouldn't be any heavier than your typical OLTP transactions. Of course, you could store it on an hourly basis as well (or instead) and arrive at 876,000 rows for a year. You can also report on weekly, monthly, tertial or yearly figures using the above table.
If you can group your user actions into categories, say "Fun", "Not so fun", "Potentially harmful" and "Flat out wrong", you could reduce the storage further, from 100 possible actions down to 4.
Obviously, your data is more complicated than this, but you can almost always come up with a suitable number of aggregate tables that can answer almost any question at a high aggregate level. Once you have "drilled down" through the aggregate tables, you can use all those filters, and then you might find it is very possible to select against the lowest-level detail table using a specific date and a specific action.