Tips/links/books for designing a database for very large, low-granularity data?
SAS programmers at my company work with researchers to analyze data stored in a number of text files around 1 TB in size. The resulting SAS processes can take days to run. Whenever the researchers want to change a question slightly, the processes have to be re-run, requiring further hours or days.
The SAS programmers approached our DBA team for a way of storing their data with the aim of greatly improving query performance.
Two main difficulties are:
- We have only a handful of example queries, and there is no particularly typical set of queries to expect.
- Many of the queries will be of a form like

      SELECT COUNT(DISTINCT id)
      FROM some_table t
      WHERE a = TRUE
        AND b = 3
        AND c BETWEEN 3 AND 10;

  but in which the WHERE filter parameters are unknown and could include any combination of columns and attributes (a couple of hypothetical variants are sketched below).

That is to say, it seems to me (having read up a bit about data warehouses) that our requirements exclude a typical data warehouse approach, in which we would perform some aggregations and work with records at a coarser grain.
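For illustration only, here are a couple of hypothetical variants of that shape (the columns d and region are made up; only id, a, b and c come from the example above):

    SELECT COUNT(DISTINCT id)
    FROM some_table t
    WHERE d = 'open'
      AND region IN ('NE', 'SW');

    SELECT COUNT(DISTINCT id)
    FROM some_table t
    WHERE a = FALSE
      AND b BETWEEN 1 AND 5;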
I'm looking for any resources that speak to designing databases with similar constraints. In Bill Inmon's Building the Data Warehouse, he briefly mentions "exploration warehouses" and "data mining warehouses". Using these terms I found this article that was slightly helpful: "Designing the Data Warehouse for Effective Data Mining" [pdf], but that's more or less it. Most of what I find when searching re: "data mining" regards OLAP.
I'm a novice DBA and I've been tasked with coming up with some suggestions for this design. I think at this point my most helpful suggestion will be to suggest we design to avoid expensive joins as much as possible. I'm out on a limb here--not expecting miracles, but any sage advice or reading recommendations would be very welcome.
Thanks!
2 Answers
Read everything you can by Ralph Kimball.
http://www.amazon.com/Data-Warehouse-Toolkit-Complete-Dimensional/dp/0471200247
Your typical query (SELECT aggregate FROM fact JOIN dimension WHERE criteria) is the sweet spot for the star schema. Forget "data mining"; it isn't a helpful term.
Focus on the star schema. Build the right data structure.
I wanted to leave a comment to get a bit more clarification, but it seems I can't yet! For example: do you maintain a bunch of SAS datasets to which you append the incremental data?
Here are some suggestions...
If funding is not an issue, then switching to a backend DBMS like Netezza would help with this problem.
A simpler approach may be to split the data into smaller datasets and then change the queries to dynamically look at the correct datasets. For example, if all the queries filter on variable A being either true or false, and the true/false split is roughly 50/50, then splitting the data into two datasets on A may halve your query time for that example (a sketch of this idea follows). The only problem with this approach is that it really depends on finding the best split to accommodate all the query types.
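One way to express that split on a relational back end (my sketch, assuming PostgreSQL-style declarative partitioning; table and column names are illustrative):

    -- Partition the big table on the column most queries filter on.
    CREATE TABLE observations (
        id VARCHAR(20),
        a  BOOLEAN,
        b  INTEGER,
        c  INTEGER
    ) PARTITION BY LIST (a);

    CREATE TABLE observations_a_true  PARTITION OF observations FOR VALUES IN (TRUE);
    CREATE TABLE observations_a_false PARTITION OF observations FOR VALUES IN (FALSE);

    -- A query that filters on a scans only one partition (partition pruning):
    SELECT COUNT(DISTINCT id)
    FROM observations
    WHERE a = TRUE
      AND b = 3;

The equivalent with plain SAS datasets is to keep one physical dataset per split value and have the query code pick the right one; as noted above, the hard part is choosing a split column that most queries actually filter on.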
Indexing may also help speed things up. You would need to analyze which variables are good candidates for an index (a trivial sketch follows).
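For instance (again just a sketch with assumed names), separate single-column indexes on the most frequently filtered columns keep arbitrary WHERE combinations usable, since many engines can combine such indexes (e.g. via bitmap scans):

    -- Illustrative only: index the columns that appear most often in filters.
    CREATE INDEX idx_observations_a ON observations (a);
    CREATE INDEX idx_observations_b ON observations (b);
    CREATE INDEX idx_observations_c ON observations (c);

A wide composite index, by contrast, only helps queries whose filters match its leading columns, which fits the "unknown combination of filters" requirement less well.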
Please let me know if you need any more information.
Thanks,
M