我正在开始一个分析项目,该项目将从多个不同来源获取数据并将它们相互比较。来源可以是从 API(例如 Google Analytics API)到本地托管数据库的任何内容。
我应该建立一个数据库来定期导入这些数据吗?
谁能建议一些最佳实践、模式或文章?我真的不知道从哪里开始,所以任何信息都会很棒!谢谢!
我将使用 SQL Server 2008 R2、C# 4.0。
I am starting on an analytics project that will be getting data from several different sources and comparing them to one another. Sources can be anything from an API such as google analytics API to a locally hosted database.
Should I build a single database to import this data into on a regular basis?
Can anyone suggest some best practices, patterns or articles? I really don't know where to start with this so any information would be great! Thanks!
I will be using SQL Server 2008 R2, C# 4.0.
发布评论
评论(3)
这是一个大问题,Mike - 很多人的整个职业生涯都只做数据仓库。
对于您的第一个问题,我会给一个合格的“是”——DWH 的主要吸引力之一是您可以将多个数据源合并为一个信息源。 (条件是,在某些情况下您可能不想这样做 - 例如,出于安全或性能原因。)
一如既往,维基百科 是获取有关此主题的信息的合理第一站。由于您的问题已标记为 data-warehouse< /a>,StackOverflow 是另一个可能的来源。
关于该主题的规范书籍可能是:
请注意,Inmon和 Kimball 的方法截然不同 - Inmon 专注于自上而下、规范化的关系方法来构建企业 DWH,而 Kimball 的方法更加自下而上、多维度、基于功能性数据集市。
DWH 工具包专注于构建 DWH 的技术方面,而 DWH 生命周期 工具包不仅基于技术细节,还基于组织挑战。
祝你好运!
That's a big question, Mike - plenty of people have entire careers doing nothing but Data Warehousing.
I would give a qualified "yes" to your first question - one of the main attractions of a DWH is that you can consolidate multiple data sources into a single source of information. (The qualification is that there may be circumstances where you don't want to do this - for example, for security or performance reasons.)
As ever, Wikipedia is a reasonable first stop for information on this subject. Since your question is already tagged with data-warehouse, StackOverflow is another possible source.
The canonical books on the subject are probably:
Note that the Inmon and Kimball approaches are radically different - Inmon concentrates on a top-down, normalised relational approach to constructing an enterprise DWH, while Kimball's approach is more bottom-up, dimensional, functional datamart-based.
The DWH Toolkit concentrates on the technical aspects of building a DWH, while The DWH Lifecycle Toolkit is based as much on the organisational challenges as on the technical details.
Good luck!
我将从 SSIS 开始,这是一种 SQL 附带的数据集成技术服务器。它可以处理您需要的大量数据源。如果您使用 Google 等 API 来获取数据,您可能需要首先将其放入临时表中。
从单个临时数据库开始,您将使用该数据库作为主要源将数据加载到 Analysis Services 中,并查看效果如何。使用 SSIS 填充该临时数据库。
I would start with SSIS which is a data integration technology that comes with SQL Server. It may handle a lot of the data sources you need. If you are using APIs such as Googles to get data you may need to put that in a staging table first.
Start with a single staging database which you will use as your primary source to load data into Analysis Services and see how that works out. Use SSIS to populate that staging database.
您需要采取以下步骤:
1.首先你需要选择ETL平台,如SSIS、Informatica或其他ETL工具等。
2.然后,您需要选择合适的数据库,如Oracle或SQL Server等。
3. 此后,需要对数据仓库进行逻辑建模(星型或雪花型)并
4. 最后,您需要开发整个数据仓库。
我建议创建两个数据库,即
1. ODS,用于存储不同来源的数据并进行清理和处理
2. 仓库数据库,用于存储所有相关数据。
You need to take up the following steps:
1. First you need to pick up the ETL platform like SSIS, Informatica, or other ETL tools, etc.
2. Then, you need to pick up the appropriate database like Oracle or SQL server, etc.
3. Thereafter, you need to make the logical data warehouse modeling (Star or Snowflake) and
4. Finally, you need to develop the whole data ware house.
I would advise making two databases, i.e.
1. ODS for storing the data from different sources and for cleansing and
2. Warehouse database for storing all the relevant data.