MySql 中的大表和分析
对于我的初创公司,我自己跟踪一切,而不是依赖谷歌分析。这很好,因为我实际上可以拥有 ips 和用户 id 以及一切。
这种方法一直运行良好,直到我的跟踪表增加了大约 200 万行。该表称为 acts
,并记录:
- ip
- url
- note
- account_id
...(如果有)。
现在,尝试做这样的事情:
SELECT COUNT(distinct ip)
FROM acts
JOIN users ON(users.ip = acts.ip)
WHERE acts.url LIKE '%some_marketing_page%';
基本上永远不会完成。我切换到这个:
SELECT COUNT(distinct ip)
FROM acts
JOIN users ON(users.ip = acts.ip)
WHERE acts.note = 'some_marketing_page';
但它仍然很慢,尽管有一个索引。
我显然不是 mysql 专业人士。我的问题是:
拥有大量数据的公司如何跟踪渠道转化率等数据?是否可以在 mysql 中执行,而我只是缺少一些知识?如果没有,我可以阅读哪些关于网站如何做到这一点的书籍/博客?
For my startup, I track everything myself rather than rely on google analytics. This is nice because I can actually have ips and user ids and everything.
This worked well until my tracking table rose about 2 million rows. The table is called acts
, and records:
- ip
- url
- note
- account_id
...where available.
Now, trying to do something like this:
SELECT COUNT(distinct ip)
FROM acts
JOIN users ON(users.ip = acts.ip)
WHERE acts.url LIKE '%some_marketing_page%';
Basically never finishes. I switched to this:
SELECT COUNT(distinct ip)
FROM acts
JOIN users ON(users.ip = acts.ip)
WHERE acts.note = 'some_marketing_page';
But it is still very slow, despite having an index on note.
I am obviously not pro at mysql. My question is:
How do companies with lots of data track things like funnel conversion rates? Is it possible to do in mysql and I am just missing some knowledge? If not, what books / blogs can I read about how sites do this?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
虽然已接近“可观”,但 200 万行对于表来说仍然是一个相对较小的大小。 (因此通常可以实现更快的性能)
正如您所发现的,前端通配符效率特别低,如果该用例对于您的应用程序很常见,我们将必须找到解决方案。
可能只是您没有正确的索引集。然而,在继续之前,我想强调,虽然索引通常会提高各种 SELECT 语句的 DBMS 性能,但它系统地对“CUD”操作(即 SQL CREATE/INSERT、UPDATE 、DELETE 动词,即写入数据库而不是读取数据库的查询)。在某些情况下,索引对“写”查询的负面影响可能非常显着。
我特别强调索引的矛盾性质的原因是,您的应用程序似乎会作为其操作的正常部分进行大量的数据收集,并且您将需要注意随着 INSERT 查询变慢而可能出现的降级。一种可能的替代方案是将数据收集到一个相对较小的表/数据库中,没有或很少有索引,并定期将数据从该输入数据库导入到发生实际数据挖掘的数据库中。 (导入后,这些行可能会从“输入数据库”中删除,使其保持小而快速的 INSERT 功能。)
另一个关注/问题是关于转换表中行的宽度(列数和列数)。这些列的宽度之和)。性能不佳可能与行太宽有关,导致表的叶节点中的行太少,从而导致树结构比需要的更深。
返回索引...
鉴于问题中的几个查询,看来您可以从 ip + note 索引中受益(至少按此顺序使用这两个键创建的索引)。对索引情况的全面分析,坦率地说,无法在这里对数据库模式进行可能的审查(没有足够的信息......),但这样做的一般过程是列出最常见的用例和查看哪些数据库索引可以帮助解决这些情况。人们可以使用 mySQL 命令 EXPLAIN 深入了解特定查询的处理方式(最初或添加索引后)。
规范化或非规范化(或者实际上是两者的组合!)通常也是提高采矿作业性能的可行想法。
While getting towards 'respectable', 2 Millions rows is still a relatively small size for a table. (And therefore a faster performance is typically possible)
As you found out, the front-ended wildcard are particularly inefficient and we'll have to find a solution for this if that use case is common for your application.
It could just be that you do not have the right set of indexes. Before I proceed, however, I wish to stress that while indexes will typically improve the DBMS performance with SELECT statements of all kinds, it systematically has a negative effect on the performance of "CUD" operations (i.e. with the SQL CREATE/INSERT, UPDATE, DELETE verbs, i.e. the queries which write to the database rather than just read to it). In some cases the negative impact of indexes on "write" queries can be very significant.
My reason for particularly stressing the ambivalent nature of indexes is that it appears that your application does a fair amount of data collection as a normal part of its operation, and you will need to watch for possible degradation as the INSERTs queries get to be slowed down. A possible alternative is to perform the data collection into a relatively small table/database, with no or very few indexes, and to regularly import the data from this input database to a database where the actual data mining takes place. (After they are imported, the rows may be deleted from the "input database", keeping it small and fast for its INSERT function.)
Another concern/question is about the width of a row in the cast table (the number of columns and the sum of the widths of these columns). Bad performance could be tied to the fact that rows are too wide, resulting in too few rows in the leaf nodes of the table, and hence a deeper-than-needed tree structure.
Back to the indexes...
in view of the few queries in the question, it appears that you could benefit from an ip + note index (an index made at least with these two keys in this order). A full analysis of the index situation, and frankly a possible review of the database schema cannot be done here (not enough info for one...) but the general process for doing so is to make the list of the most common use case and to see which database indexes could help with these cases. One can gather insight into how particular queries are handled, initially or after index(es) are added, with mySQL command EXPLAIN.
Normalization OR demormalization (or indeed a combination of both!), is often a viable idea for improving performance during mining operations as well.
为什么加入?如果我们可以假设没有 IP 使其进入没有用户关联记录的行为,那么您不需要加入:
如果您确实需要 JOIN,则可能需要首先从行为中选择不同的 IP,然后将这些结果加入到用户(您必须查看执行计划并进行实验以确定这是否更快)。
其次,带有前导通配符的 LIKE 将导致行为的全表扫描,并且还需要一些昂贵的文本搜索。您可以通过三种选择来改进这一点:
在存储之前将 url 分解为组成部分,以便搜索与列值精确匹配。
在存储
要求搜索词出现在 url 字段的开头,而不是中间。
研究一个全文搜索引擎,该引擎将以这样的方式对 url 字段进行索引,甚至可以对索引执行内部 LIKE 搜索。
研究
最后,在对acts.notes进行搜索的情况下,如果笔记上的索引不能提供足够的搜索改进,我会考虑计算和存储笔记上的整数哈希并进行搜索。
Why the JOIN? If we can assume that no IP makes it into acts without an associated record in users then you don't need the join:
If you really do need the JOIN it might pay to first select the distinct IPs from acts, then JOIN those results to users (you'll have to look at the execution plan and experiment to see if this is faster).
Secondly, that LIKE with a leading wild card is going to cause a full table scan of acts and also necessitate some expensive text searching. You have three choices to improve this:
Decompose the url into component parts before you store it so that the search matches a column value exactly.
Require the search term to appear at the beginning of the of the url field, not in the middle.
Investigate a full text search engine that will index the url field in such a way that even an internal LIKE search can be performed against indexes.
Finally, in the case of searching on acts.notes, if an index on notes doesn't provide sufficient search improvement, I'd consider calculating and storing an integer hash on notes and searching for that.
尝试在查询上运行“EXPLAIN PLAN”并查看是否有任何表扫描。
这应该是左连接吗?
也许这个网站可以提供帮助。
Try running 'EXPLAIN PLAN' on your query and look to see if there are any table scans.
Should this be a LEFT JOIN?
Maybe this site can help.