MapReduce 是一种如何分析 http 服务器日志的好方法？

发布于 2024-07-21 22:28:51 字数 548 浏览 10 评论 0原文

我研究MapReduce有一段时间了，它似乎是实现容错分布式计算的一个非常好的方法。我阅读了很多有关该主题的论文和文章，在一系列虚拟机上安装了 Hadoop，并做了一些非常有趣的测试。我真的认为我理解映射和减少步骤。

但这是我的问题：我不知道它如何帮助进行http服务器日志分析。

我的理解是大公司（例如Facebook）使用MapReduce来计算他们的http日志为了加快从中提取观众统计数据的过程。我工作的公司虽然比 Facebook 小，但每天都有大量的网络日志需要计算（100Go 每月增长 5% 到 10%）。现在我们在一台服务器上处理这些日志，并且运行得很好。但分布式计算工作立即浮现在我的脑海中，这是一种很快就会变得有用的优化。

以下是我现在无法回答的问题，任何帮助将不胜感激：

MapReduce 概念真的可以应用于博客分析吗？
MapReduce 是最聪明的方法吗？
您将如何拆分各个计算实例之间的网络日志文件？

谢谢。
尼古拉斯

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

女中豪杰 2024-07-28 22:28:51

MapReduce 概念真的可以应用于博客分析吗？

可以。

您可以将 hudge 日志文件分成 10,000 或 1,000,000 行的块（对于您的日志文件类型来说，什么是好的块 - 对于 apache 日志文件，我会选择更大的数量），将它们提供给一些映射器，这些映射器会提取特定的内容（例如浏览器，IP地址，...，用户名，...）从每个日志行中，然后通过计算每个日志行出现的次数来减少（简化）：

  192.168.1.1,FireFox x.x,username1
  192.168.1.1,FireFox x.x,username1
  192.168.1.2,FireFox y.y,username1
  192.168.1.7,IE 7.0,username1

您可以提取浏览器，忽略版本，使用映射操作来获取这个列表：

FireFox
FireFox
FireFox
IE

然后减少得到这个：
火狐,3
IE,1

MapReduce 是最聪明的方法吗？

它很聪明，但您需要非常大才能获得任何好处...拆分 PETABYTES 日志。

为了做这种事情，我更喜欢使用消息队列和一致的存储引擎（如数据库），处理客户端从队列中提取工作，执行作业并将结果推送到另一个队列，而不是作业在某个时间范围内执行可供其他人处理。这些客户端将是执行特定操作的小程序。

您可以从 1 个客户端开始，然后扩展到 1000 个...您甚至可以拥有一个在 LAN 上的所有 PC 上作为屏幕保护程序运行的客户端，并在 8 核服务器上运行 8 个客户端，在双核 PC 上运行 2 个客户端...

使用 Pull：您可以有 100 或 10 个客户端在工作，多核计算机可以有多个客户端在运行，并且客户端完成的任何操作都可以用于下一步。并且您不需要为要完成的工作进行任何散列或分配。它是 100% 动态的。

http://img355.imageshack.us/img355/7355/mqlogs.png

如何您是否在各个计算实例之间分割 Web 日志文件？

如果是基于文本的日志文件，则按元素或行数。

为了测试MapReduce，我建议你使用Hadoop。

Can the MapReduce concept really be applied to weblogs analysis ?

Yes.

You can split your hudge logfile into chunks of say 10,000 or 1,000,000 lines (whatever is a good chunk for your type of logfile - for apache logfiles I'd go for a larger number), feed them to some mappers that would extract something specific (like Browser,IP Address, ..., Username, ... ) from each log line, then reduce by counting the number of times each one appeared (simplified):

  192.168.1.1,FireFox x.x,username1
  192.168.1.1,FireFox x.x,username1
  192.168.1.2,FireFox y.y,username1
  192.168.1.7,IE 7.0,username1

You can extract browsers, ignoring version, using a map operation to get this list:

FireFox
FireFox
FireFox
IE

Then reduce to get this :
FireFox,3
IE,1

Is MapReduce the most clever way of doing it ?

It's clever, but you would need to be very big in order to gain any benefit... Splitting PETABYTES of logs.

To do this kind of thing, I would prefer to use Message Queues, and a consistent storage engine (like a database), with processing clients that pull work from the queues, perform the job, and push results to another queue, with jobs not being executed in some timeframe made available for others to process. These clients would be small programs that do something specific.

You could start with 1 client, and expand to 1000... You could even have a client that runs as a screensaver on all the PCs on a LAN, and run 8 clients on your 8-core servers, 2 on your dual core PCs...

With Pull: You could have 100 or 10 clients working, multicore machines could have multiple clients running, and whatever a client finishes would be available for the next step. And you don't need to do any hashing or assignment for the work to be done. It's 100% dynamic.

http://img355.imageshack.us/img355/7355/mqlogs.png

How would you split the web log files between the various computing instances ?

By number of elements or lines if it's a text-based logfile.

In order to test MapReduce, I'd like to suggest that you play with Hadoop.

回复收藏 0 原文