How do I handle large datasets with varying numbers of columns for analytics?

Posted 2024-09-16 17:33:44


I'm building an analytics system for a mobile application and have had some difficulty deciding how to store and process large amounts of data.

Each row will represent a 'view' (like a web page) and store some fixed attributes, like user agent and date. Additionally, each view may have a varying number of extra attributes, which relate to actions performed or content identifiers.

I've looked at Amazon SimpleDB, which handles the varying number of attributes well, but has no support for GROUP BY and doesn't seem to perform well when COUNTing rows either. Generating a monthly graph with 30 data points would require a query for each day per dataset.

MySQL handles the COUNT and GROUP modifiers much better, but the additional attributes have to be stored in a link table, and a JOIN is needed to retrieve views whose attributes match a given value, which isn't very fast. MySQL 5.1's partitioning feature may help speed things up a bit.
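
For reference, the link-table layout I'm describing looks roughly like this (the table and column names here are just placeholders, not my real schema):

-- views matching one extra attribute
SELECT v.id, v.user_agent, v.view_date
FROM views AS v
JOIN view_attributes AS a ON a.view_id = v.id
WHERE a.`key` = 'content_id' AND a.`value` = '1234';

-- the per-day counts I'd want for a monthly graph
SELECT v.view_date, COUNT(*) AS views
FROM views AS v
JOIN view_attributes AS a ON a.view_id = v.id
WHERE a.`key` = 'content_id' AND a.`value` = '1234'
GROUP BY v.view_date;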

What I have gathered from a lot of reading and profiling queries on the aforementioned systems is that ultimately all of the data needs to be aggregated and stored in tables for quick report generation.
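
To make that concrete, the kind of pre-aggregation step I have in mind would look something like this sketch (reusing the placeholder names from above; daily_view_counts is invented for illustration):

CREATE TABLE daily_view_counts (
    view_date  DATE         NOT NULL,
    user_agent VARCHAR(255) NOT NULL,
    views      INT UNSIGNED NOT NULL,
    PRIMARY KEY (view_date, user_agent)
);

-- nightly rollup of yesterday's raw rows
INSERT INTO daily_view_counts (view_date, user_agent, views)
SELECT view_date, user_agent, COUNT(*)
FROM views
WHERE view_date = CURDATE() - INTERVAL 1 DAY
GROUP BY view_date, user_agent;

-- the 30-point monthly graph then becomes one cheap query
SELECT view_date, SUM(views) AS total_views
FROM daily_view_counts
WHERE view_date >= CURDATE() - INTERVAL 30 DAY
GROUP BY view_date;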

Have I missed anything obvious in my research, and is there a better way to do this than use MySQL? It doesn't feel like the right tool for the job, but I can't find anything capable of both GROUP/COUNT queries and a flexible table structure.

Comments (2)

停滞 2024-09-23 17:33:44


This is a case where you want to store the data once and read it over and over. Further, I think you'd want the queries to be preprocessed rather than calculated on every request.

My suggestion for you is to store your data in CouchDB for the following reasons:

  • Its documents are schemaless, so there is no fixed table structure
  • Its queries (views) are precomputed
  • Its map-reduce support lets your queries handle GROUP BY-style aggregation
  • It has a REST access model, which lets you connect from pretty much anything that can handle HTTP requests

You may find this suggestion a little out there considering how new CouchDB is. However, I'd suggest reading up on it, because personally I think running a CouchDB database is sweet and lightweight, more lightweight than MySQL.

甜扑 2024-09-23 17:33:44


Keeping it in MySQL: if writes are limited and reads are more common, and the data is relatively simple (i.e. you can predict the possible characters), you could add a text/blob column to the main table that is updated with comma-separated values or key/value pairs by an AFTER INSERT / UPDATE trigger on the join table. You keep the actual data in the separate table, so searching for MAXes or specific 'extra' attributes can still be done relatively fast, but retrieving the complete dataset for one of your 'views' becomes a single row read from the main table, which you can split into separate values in the script / application you're using, relieving much of the stress on the database itself.
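
A rough sketch of that layout, with made-up names (this minimal trigger only appends the new pair to the cached string; a full rebuild of the string is shown further down):

CREATE TABLE main_table (
    id         INT UNSIGNED PRIMARY KEY,
    user_agent VARCHAR(255),
    view_date  DATE,
    cache      TEXT                -- denormalised key=value pairs for fast retrieval
);

CREATE TABLE join_table (
    main_id INT UNSIGNED NOT NULL,
    `key`   VARCHAR(64)  NOT NULL,
    `value` VARCHAR(255) NOT NULL,
    KEY idx_main (main_id),
    KEY idx_kv (`key`, `value`)    -- supports "attribute matches value" searches
);

CREATE TRIGGER join_table_ai
AFTER INSERT ON join_table
FOR EACH ROW
  -- append the new key=value pair to the cached string (CONCAT_WS skips a NULL cache)
  UPDATE main_table
  SET cache = CONCAT_WS(';', cache, CONCAT(NEW.`key`, '=', NEW.`value`))
  WHERE id = NEW.main_id;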

The downside of this is a considerable increase in the cost of updates / inserts on the join table: every alteration of the data requires a query over all related data for that record, plus a second write to the cache column in the 'normal' table, something like:

UPDATE main_table
SET cache = (
    SELECT GROUP_CONCAT(CONCAT(join_table.`key`, '=', join_table.`value`) SEPARATOR ';')
    FROM join_table
    WHERE join_table.main_id = main_table.id
)
WHERE main_table.id = 'foo';

However, analytics data usually trails a little anyway, so not every update has to refresh the cache immediately; a daily cron script that fills the cache with yesterday's data could be enough.
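
In SQL terms, that nightly job could simply be the cache query above restricted to yesterday's rows (assuming main_table has a date column, here called view_date):

UPDATE main_table
SET cache = (
    SELECT GROUP_CONCAT(CONCAT(join_table.`key`, '=', join_table.`value`) SEPARATOR ';')
    FROM join_table
    WHERE join_table.main_id = main_table.id
)
WHERE main_table.view_date = CURDATE() - INTERVAL 1 DAY;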
