使用SSAS分析IIS日志
我想聚合我们的 IIS 日志,并能够快速对它们执行简单的查询(例如,过去一个月中页面 x 被点击了多少次等)。
我想聚合这些数据(可能会降低到 10 分钟左右的粒度 )在时间维度上)到 SSAS 立方体中。
我们已经通过 SSIS 将日志导入到表中。表变得非常大,我想开始删除较旧的数据(比如 3 个多月前),同时将历史记录保留在多维数据集中(这样我仍然可以查询超过 3 个月的时间)。这是合理的事情吗 - 显然,如果我想更改它,我将无法轻松重建我的多维数据集......而且我想我需要开始备份多维数据集,就好像它是数据库一样?
目前我对数据没有 PK - 添加身份列或创建过时、时间和 url 的复合 PK 会更好吗?
来自成功实施此功能的人的任何反馈都将非常棒:)
谢谢!
I would like to aggregate our IIS logs and be able to quickly perform simple queries against them (e.g. how many times as page x hit in the past month etc.)
I'd like aggregrate this data (maybe down to 10 minutes or so granularity on the time dimension) into an SSAS cube.
We already import the logs via SSIS into tables. The tables are becoming very large, and I'd like to start deleting older data (say more than 3 months ago), whilst keeping the history in the cube (so I can still query over longer than 3 months). Is this a reasonable thing to do - clearly I'd not be able to rebuild my cube easily should I wish to change it... and I guess I'd need to start backing up the cube as if it was a database?
Currently I don't have a PK on the data - Would it be better to add an identity column or create a composite PK out of date,time and url?
Any feedback from someone that has successfully implemented this would be fantastic :)
Thanks!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
我还没有完全做到这一点,但我会尽可能地给你意见:
表变得很大,为什么这是一个问题 - 是存储空间还是速度?
如果速度快,请考虑使用表分区来拆分大型表。您可以按日期范围对它们进行分区,然后将分区切换到另一个表(减少原始表的大小),这是一个纯粹的元数据操作,并且是即时的。然后,SSAS 可以使用在处理时合并两个表的视图(如果您需要重建的话)。
如果是存储空间,您是否查看过 SQL Server 中的压缩(2008 年可用,不确定您使用的是哪个版本?)。
就我个人而言,如果没有重建多维数据集的能力,我不会感到高兴 - 还要记住,多维数据集会复制源数据(或者更确切地说,它根据 DSV 使用的部分),因此您可能不会像您想象的那样节省那么多如果您删除旧数据&将立方体视为“存储设备”。您的多维数据集仅使用部分表格吗?与基础数据相比,它的大小是多少?
SSAS 并不严格要求对数据进行 PK - 但是 - 我总是使用它们,主要是为了防止重复加载(我也按时间加载 - 检查数据比上次加载的数据更新),但最好有一个 PK 约束来防止重复加载。
对于您的 PK,日期、时间、URL 听起来不错,但取决于您网站的繁忙程度。您的示例不允许两个人在同一秒查看相同的 URL。能否将 IP 地址添加到 PK 中?如果访客刷新得很快怎么办?您能否将其视为重复项,并将其从 SSIS 数据流中删除?
祝你好运,如果您对我所说的内容有任何疑问,请告诉我。
I haven't done exactly this, but I'll give you opinions on as much as I can:
The tables becoming large, why is this a problem - Is it storage space, or speed?
If speed, consider using table partitioning to split your large tables. You can partition them by date range, then switch the partitions into another table (Reducing the size of the original), this is a pure metadata operation and will be instant. SSAS can then use a view that unions both tables when it processes, if you ever need to rebuild you can.
If storage space, have you looked at compression in SQL Server (Available in 2008, not sure what version you're on?).
Personally I just wouldn't be happy without the ability to rebuild the cube - Also remember that the cube makes a copy of the source data (Or rather the parts it uses as per the DSV) so you may not save as much as you think if you delete old data & treat the cube as the 'storage device'. Does your cube only use a portion of the tables? What size is it compared to the underlying data?
A PK on the data is not strictly required for SSAS - BUT - I always use them, primarily to prevent duplicate loads (I also load by time - Check data is newer than last loaded) but it's good to have a PK constraint preventing duplicate loads.
For your PK, Date, Time, URL sounds good, but depends on how busy your site is. Your example would not allow two people to view the same URL at the same second. Could you add IP Address to the PK? What if a visitor refreshed quickly? Would/Could you treat that as a duplicate, and remove it in the SSIS dataflow?
Good luck, let me know if you have any questions on what I've said.