How do I handle large datasets with varying numbers of columns for analytics?

Posted 2024-09-16 17:33:44


I'm building an analytics system for a mobile application and have had some difficulty deciding how to store and process large amounts of data.

Each row will represent a 'view' (like a web page) and store some fixed attributes, like user agent and date. Additionally, each view may have a varying number of extra attributes, which relate to actions performed or content identifiers.

I've looked at Amazon SimpleDB, which handles the varying number of attributes well, but has no support for GROUP BY and doesn't seem to perform well when COUNTing rows either. Generating a monthly graph with 30 data points would require a query for each day per dataset.

MySQL handles the COUNT and GROUP modifiers much better, but the additional attributes have to be stored in a link table, and a JOIN is needed to retrieve views whose attributes match a given value, which isn't very fast. MySQL 5.1's partitioning feature may help speed things up a bit.
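
For reference, the link-table layout I'm describing looks roughly like this (the table and column names here are just placeholders, not my real schema):

-- views matching one extra attribute
SELECT v.id, v.user_agent, v.view_date
FROM views AS v
JOIN view_attributes AS a ON a.view_id = v.id
WHERE a.`key` = 'content_id' AND a.`value` = '1234';

-- the per-day counts I'd want for a monthly graph
SELECT v.view_date, COUNT(*) AS views
FROM views AS v
JOIN view_attributes AS a ON a.view_id = v.id
WHERE a.`key` = 'content_id' AND a.`value` = '1234'
GROUP BY v.view_date;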

What I have gathered from a lot of reading and profiling queries on the aforementioned systems is that ultimately all of the data needs to be aggregated and stored in tables for quick report generation.
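
To make that concrete, the kind of pre-aggregation step I have in mind would look something like this sketch (reusing the placeholder names from above; daily_view_counts is invented for illustration):

CREATE TABLE daily_view_counts (
    view_date  DATE         NOT NULL,
    user_agent VARCHAR(255) NOT NULL,
    views      INT UNSIGNED NOT NULL,
    PRIMARY KEY (view_date, user_agent)
);

-- nightly rollup of yesterday's raw rows
INSERT INTO daily_view_counts (view_date, user_agent, views)
SELECT view_date, user_agent, COUNT(*)
FROM views
WHERE view_date = CURDATE() - INTERVAL 1 DAY
GROUP BY view_date, user_agent;

-- the 30-point monthly graph then becomes one cheap query
SELECT view_date, SUM(views) AS total_views
FROM daily_view_counts
WHERE view_date >= CURDATE() - INTERVAL 30 DAY
GROUP BY view_date;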

Have I missed anything obvious in my research, and is there a better way to do this than use MySQL? It doesn't feel like the right tool for the job, but I can't find anything capable of both GROUP/COUNT queries and a flexible table structure.

Comments (2)

停滞 2024-09-23 17:33:44


This is a case where you want to store the data once and read it over and over. Further, I think you'd want the queries to be preprocessed rather than calculated on every request.

My suggestion for you is to store your data in CouchDB for the following reasons:

  • Its documents are schemaless, so there is no fixed table structure
  • Its queries (views) are precomputed
  • Its map-reduce support lets your queries handle GROUP BY-style aggregation
  • It has a REST access model, which lets you connect from pretty much anything that can handle HTTP requests

You may find this suggestion a little out there considering how new CouchDB is. However, I'd suggest reading up on it, because personally I think running a CouchDB database is sweet and lightweight, more lightweight than MySQL.

甜扑 2024-09-23 17:33:44


Keeping it in MySQL: if writes are limited and reads are more common, and the data is relatively simple (i.e. you can predict the possible characters), you could add a text/blob column to the main table that is updated with comma-separated values or key/value pairs by an AFTER INSERT / UPDATE trigger on the join table. You keep the actual data in the separate table, so searching for MAXes or specific 'extra' attributes can still be done relatively fast, but retrieving the complete dataset for one of your 'views' becomes a single row read from the main table, which you can split into separate values in the script / application you're using, relieving much of the stress on the database itself.
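
A rough sketch of that layout, with made-up names (this minimal trigger only appends the new pair to the cached string; a full rebuild of the string is shown further down):

CREATE TABLE main_table (
    id         INT UNSIGNED PRIMARY KEY,
    user_agent VARCHAR(255),
    view_date  DATE,
    cache      TEXT                -- denormalised key=value pairs for fast retrieval
);

CREATE TABLE join_table (
    main_id INT UNSIGNED NOT NULL,
    `key`   VARCHAR(64)  NOT NULL,
    `value` VARCHAR(255) NOT NULL,
    KEY idx_main (main_id),
    KEY idx_kv (`key`, `value`)    -- supports "attribute matches value" searches
);

CREATE TRIGGER join_table_ai
AFTER INSERT ON join_table
FOR EACH ROW
  -- append the new key=value pair to the cached string (CONCAT_WS skips a NULL cache)
  UPDATE main_table
  SET cache = CONCAT_WS(';', cache, CONCAT(NEW.`key`, '=', NEW.`value`))
  WHERE id = NEW.main_id;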

The downside of this is a considerable increase in the cost of updates / inserts on the join table: every alteration of the data requires a query over all related data for that record, plus a second write to the cache column in the 'normal' table, something like:

UPDATE main_table
SET cache = (
    SELECT GROUP_CONCAT(CONCAT(join_table.`key`, '=', join_table.`value`) SEPARATOR ';')
    FROM join_table
    WHERE join_table.main_id = main_table.id
)
WHERE main_table.id = 'foo';

However, analytics data usually trails a little anyway, so not every update has to refresh the cache immediately; a daily cron script that fills the cache with yesterday's data could be enough.
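
In SQL terms, that nightly job could simply be the cache query above restricted to yesterday's rows (assuming main_table has a date column, here called view_date):

UPDATE main_table
SET cache = (
    SELECT GROUP_CONCAT(CONCAT(join_table.`key`, '=', join_table.`value`) SEPARATOR ';')
    FROM join_table
    WHERE join_table.main_id = main_table.id
)
WHERE main_table.view_date = CURDATE() - INTERVAL 1 DAY;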
