Is MapReduce a good way to analyze http server logs?

Posted on 2024-07-21 22:28:51

I've been looking at MapReduce for a while, and it seems to be a very good way to implement fault-tolerant distributed computing. I read a lot of papers and articles on the topic, installed Hadoop on an array of virtual machines, and ran some very interesting tests. I really think I understand the Map and Reduce steps.

But here is my problem: I can't figure out how it can help with http server log analysis.

My understanding is that big companies (Facebook, for instance) use MapReduce to compute their http logs, in order to speed up the process of extracting audience statistics from them. The company I work for, while smaller than Facebook, has a big volume of web logs to compute every day (100 GB, growing between 5 and 10 percent every month). Right now we process these logs on a single server, and it works just fine. But distributing the computing jobs instantly comes to mind as a soon-to-be-useful optimization.

Here are the questions I can't answer right now; any help would be greatly appreciated:

  • Can the MapReduce concept really be applied to web log analysis?
  • Is MapReduce the most clever way of doing it?
  • How would you split the web log files between the various computing instances?

Thank you.
Nicolas

Comments (2)

女中豪杰 2024-07-28 22:28:51

Can the MapReduce concept really be applied to web log analysis?

Yes.

You can split your huge log file into chunks of, say, 10,000 or 1,000,000 lines (whatever makes a good chunk for your type of log file - for apache log files I'd go for a larger number), feed them to some mappers that extract something specific (like Browser, IP Address, ..., Username, ...) from each log line, then reduce by counting the number of times each one appeared (simplified):

  192.168.1.1,FireFox x.x,username1
  192.168.1.1,FireFox x.x,username1
  192.168.1.2,FireFox y.y,username1
  192.168.1.7,IE 7.0,username1

You can extract the browsers, ignoring the versions, using a map operation to get this list:

FireFox
FireFox
FireFox
IE

Then reduce to get this:
FireFox,3
IE,1
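
As a rough sketch, here is what that map and reduce could look like with Hadoop's Java API. The comma-separated format and the field positions follow the sample lines above; the class names are made up for illustration:

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  // Mapper: one log line in, one (browser, 1) pair out.
  public class BrowserCountMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text browser = new Text();

      @Override
      protected void map(LongWritable offset, Text line, Context ctx)
              throws IOException, InterruptedException {
          // Assumed line format: ip,browser version,username
          String[] fields = line.toString().split(",");
          if (fields.length < 2) return;  // skip malformed lines
          // Keep only the browser family: "FireFox x.x" -> "FireFox".
          browser.set(fields[1].trim().split(" ")[0]);
          ctx.write(browser, ONE);
      }
  }

  // Reducer: sums the 1s emitted for each browser family.
  class BrowserCountReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      protected void reduce(Text browser, Iterable<IntWritable> counts, Context ctx)
              throws IOException, InterruptedException {
          int total = 0;
          for (IntWritable c : counts) {
              total += c.get();
          }
          ctx.write(browser, new IntWritable(total));  // e.g. "FireFox  3"
      }
  }

Hadoop groups the mapper's output by key before calling the reducer, which is exactly the FireFox/IE counting shown above.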

Is MapReduce the most clever way of doing it?

It's clever, but you would need to be very big in order to gain any benefit from it... splitting PETABYTES of logs.

To do this kind of thing, I would rather use message queues and a consistent storage engine (like a database), with processing clients that pull work from the queues, perform the job, and push the results to another queue; jobs that are not completed within some timeframe are made available again for others to process. These clients would be small programs that do something specific.

You could start with 1 client and expand to 1000... You could even have a client that runs as a screensaver on all the PCs on a LAN, with 8 clients on your 8-core servers and 2 on your dual-core PCs...

With pull: you could have 10 or 100 clients working, multicore machines could have multiple clients running, and whatever a client finishes would be available for the next step. And you don't need to do any hashing or assignment for the work to be done. It's 100% dynamic.

http://img355.imageshack.us/img355/7355/mqlogs.png
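
For illustration, here is a minimal in-process sketch of that pull pattern; a BlockingQueue stands in for a real message broker (which would normally live on the network), and the chunk names are made up:

  import java.util.ArrayList;
  import java.util.List;
  import java.util.concurrent.BlockingQueue;
  import java.util.concurrent.LinkedBlockingQueue;

  // Workers pull chunks from a shared work queue, process them, and push
  // results to a results queue - the shape of the design described above.
  public class PullWorkerDemo {
      public static void main(String[] args) throws InterruptedException {
          BlockingQueue<String> work = new LinkedBlockingQueue<>();
          BlockingQueue<String> results = new LinkedBlockingQueue<>();

          // Fake work items; in reality each would name a log-file chunk.
          for (int i = 1; i <= 6; i++) work.add("chunk-" + i);

          // One worker per core; scaling out just means starting more
          // workers, possibly on other machines talking to a real broker.
          List<Thread> workers = new ArrayList<>();
          for (int i = 0; i < Runtime.getRuntime().availableProcessors(); i++) {
              Thread t = new Thread(() -> {
                  String chunk;
                  // poll() returns null once the queue is drained; a
                  // long-lived client would block on take() instead.
                  while ((chunk = work.poll()) != null) {
                      results.add(chunk + ":done");  // real parsing/counting goes here
                  }
              });
              t.start();
              workers.add(t);
          }
          for (Thread t : workers) t.join();
          System.out.println(results);  // e.g. [chunk-1:done, chunk-2:done, ...]
      }
  }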

How would you split the web log files between the various computing instances?

By number of elements or lines, if it's a text-based log file.
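
A simple sketch of such a line-based split in plain Java - the input name, output names, and chunk size are arbitrary placeholders:

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.FileWriter;
  import java.io.IOException;
  import java.io.PrintWriter;

  // Splits a text log file into fixed-size chunks of CHUNK_LINES lines each
  // (access.log -> access.log.part0, access.log.part1, ...), so each chunk
  // can be handed to a separate worker.
  public class LogSplitter {
      private static final int CHUNK_LINES = 100_000;

      public static void main(String[] args) throws IOException {
          try (BufferedReader in = new BufferedReader(new FileReader("access.log"))) {
              String line;
              int lineNo = 0, part = 0;
              PrintWriter out = null;
              while ((line = in.readLine()) != null) {
                  if (lineNo % CHUNK_LINES == 0) {  // start a new chunk file
                      if (out != null) out.close();
                      out = new PrintWriter(new FileWriter("access.log.part" + part++));
                  }
                  out.println(line);
                  lineNo++;
              }
              if (out != null) out.close();
          }
      }
  }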

In order to test MapReduce, I'd like to suggest that you play with Hadoop.

最终幸福 2024-07-28 22:28:51
  • Can the MapReduce concept really be applied to web log analysis?

Sure. What sort of data are you storing?

  • Is MapReduce the most clever way of doing it?

It would allow you to query across many commodity machines at once, so yes, it can be useful. Alternatively, you could try sharding.

  • How would you split the web log files between the various computing instances?

Generally you would distribute your data using a consistent hashing algorithm, so that you can easily add more instances later. You should hash by whatever would be your primary key in an ordinary database: a user id, an IP address, a referrer, a page, an advert; whatever the subject of your logging is.
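
For illustration, a minimal consistent-hash ring in Java; the instance names are placeholders, and CRC32 is used only to keep the sketch dependency-free (a real ring would pick a stronger hash):

  import java.nio.charset.StandardCharsets;
  import java.util.SortedMap;
  import java.util.TreeMap;
  import java.util.zip.CRC32;

  // Keys (e.g. IP addresses or user ids from the logs) map to the first
  // node clockwise on the ring, so adding a node only remaps the keys
  // that fall between it and its predecessor.
  public class ConsistentHashRing {
      private final TreeMap<Long, String> ring = new TreeMap<>();
      private static final int REPLICAS = 100;  // virtual nodes smooth the distribution

      private static long hash(String s) {
          CRC32 crc = new CRC32();
          crc.update(s.getBytes(StandardCharsets.UTF_8));
          return crc.getValue();
      }

      public void addNode(String node) {
          for (int i = 0; i < REPLICAS; i++) {
              ring.put(hash(node + "#" + i), node);
          }
      }

      public String nodeFor(String key) {
          // First ring position at or after the key's hash, wrapping around.
          SortedMap<Long, String> tail = ring.tailMap(hash(key));
          return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
      }

      public static void main(String[] args) {
          ConsistentHashRing ring = new ConsistentHashRing();
          ring.addNode("instance-1");
          ring.addNode("instance-2");
          ring.addNode("instance-3");
          // Route each log line by its "primary key", here the client IP.
          System.out.println(ring.nodeFor("192.168.1.1"));
          System.out.println(ring.nodeFor("192.168.1.7"));
      }
  }

Adding an instance-4 later only remaps the keys that land between it and its ring predecessors, instead of reshuffling everything.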
