What is the best component stack for building a distributed log aggregator (like Splunk)?

Posted 2024-09-06 14:58:29

I'm trying to find the best components to build something similar to Splunk, in order to aggregate logs from a large number of servers in a computing grid. It also needs to be distributed, because I get gigabytes of logs every day and no single machine will be able to store them all.

I'm particularly interested in something that works with Ruby and runs on both Windows and the latest Solaris (yeah, I've got a zoo).

I see the architecture as:

  • Log crawler (Ruby script).
  • Distributed log storage.
  • Distributed search engine.
  • Lightweight front end.

The log crawler and the distributed search engine are already settled: logs will be parsed by a Ruby script, and ElasticSearch will be used to index the log messages. The front end is also an easy choice: Sinatra.
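
For concreteness, here's a minimal sketch of that crawler-to-index path using only Ruby's standard library, assuming an ElasticSearch node at localhost:9200 and an index named "logs" (both illustrative). The _doc endpoint matches recent ElasticSearch versions; older releases expect a mapping type name in the URL instead.

    require 'net/http'
    require 'json'
    require 'time'

    ES_URL = URI('http://localhost:9200/logs/_doc')  # assumed node and index name

    # Parse a syslog-style line: "2010-07-01 12:34:56 host1 app: message"
    LINE_RE = /\A(?<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?<host>\S+) (?<prog>[^:]+): (?<msg>.*)\z/

    def index_line(line)
      m = LINE_RE.match(line) or return  # skip lines that don't parse
      doc = {
        '@timestamp' => Time.parse(m[:ts]).utc.iso8601,
        'host'       => m[:host],
        'program'    => m[:prog],
        'message'    => m[:msg]
      }
      res = Net::HTTP.post(ES_URL, doc.to_json, 'Content-Type' => 'application/json')
      warn "index failed: #{res.body}" unless res.is_a?(Net::HTTPSuccess)
    end

    File.foreach(ARGV[0]) { |line| index_line(line.chomp) }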

My main problem is distributed log storage. I looked at MongoDB, CouchDB, HDFS, Cassandra and HBase.

  • MongoDB was rejected because it doesn't work on Solaris.
  • CouchDB doesn't support sharding (smartproxy is required to make it work, but that's something I don't even want to try).
  • Cassandra works great, but it's just a disk-space hog, and it requires running autobalance every day to spread the load between Cassandra nodes.
  • HDFS looked promising, but its FileSystem API is Java-only, and JRuby was a pain.
  • HBase looked like the best solution around, but deploying and monitoring it is just a disaster: in order to start HBase I need to start HDFS first and check that it came up without problems, then start HBase and check it as well, and then start the REST service and check that too (the REST service does have one upside; see the sketch after this list).
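
One silver lining to that REST service: once it's up, plain Ruby can write to HBase over HTTP with no JRuby at all. The gateway speaks JSON, with row keys, column names, and values base64-encoded. A hedged sketch, assuming the REST gateway (Stargate) on localhost:8080 and a "logs" table with a "raw" column family (all names illustrative):

    require 'net/http'
    require 'json'
    require 'base64'

    # Assumed: HBase REST gateway on localhost:8080, table 'logs',
    # column family 'raw' (illustrative names).
    def hbase_put(row_key, value)
      b64 = ->(s) { Base64.strict_encode64(s) }
      body = {
        'Row' => [{
          'key'  => b64.(row_key),
          'Cell' => [{ 'column' => b64.('raw:line'), '$' => b64.(value) }]
        }]
      }.to_json
      uri = URI("http://localhost:8080/logs/#{row_key}")
      req = Net::HTTP::Put.new(uri, 'Content-Type' => 'application/json',
                                    'Accept' => 'application/json')
      req.body = body
      Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
    end

    # A host-plus-timestamp row key keeps one host's log lines adjacent,
    # which makes range scans by host and day cheap.
    hbase_put('host1-20100701123456', 'app: something happened')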

So I'm stuck. Something tells me HDFS or HBase is the best thing to use as log storage, but HDFS only works smoothly from Java, and HBase is just a deployment/monitoring nightmare.
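
To show what the JRuby route actually looks like, here's a hedged sketch of writing a log file through the Java FileSystem API from JRuby, assuming the Hadoop client jars are on the classpath and a NameNode at hdfs://namenode:9000 (both illustrative; fs.default.name is the classic pre-2.x config key):

    # Run under JRuby with the Hadoop client jars on the classpath, e.g.
    #   jruby -J-cp "$(hadoop classpath)" hdfs_write.rb
    require 'java'

    java_import 'org.apache.hadoop.conf.Configuration'
    java_import 'org.apache.hadoop.fs.FileSystem'
    java_import 'org.apache.hadoop.fs.Path'

    conf = Configuration.new
    conf.set('fs.default.name', 'hdfs://namenode:9000')  # assumed NameNode address

    fs  = FileSystem.get(conf)
    out = fs.create(Path.new('/logs/host1/2010-07-01.log'))  # illustrative path
    out.write("app: something happened\n".to_java_bytes)     # write(byte[])
    out.close
    fs.close

It works, but every call crosses the Ruby/Java boundary, and shipping the right jars to every platform in the zoo is where the pain starts.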

Can anyone share their thoughts or experience building similar systems, either with the components I described above or with something completely different?

Comments (2)

黯然#的苍凉 2024-09-13 14:58:29

I'd recommend using Flume to aggregate your data into HBase. You could also use the Elastic Search Sink for Flume to keep a search index up to date in real time.

For more, see my answer to a similar question on Quora.
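
For what it's worth, a hedged sketch of that wiring as a Flume NG agent config, fanning one source out to the HBase sink and the ElasticSearch sink over separate channels; the host, table, and index names are illustrative, and the exact sink property names should be checked against your Flume release:

    agent.sources  = logsrc
    agent.channels = ch-hbase ch-es
    agent.sinks    = sink-hbase sink-es

    # One source replicated into two channels, one per sink
    agent.sources.logsrc.type = exec
    agent.sources.logsrc.command = tail -F /var/log/app.log
    agent.sources.logsrc.channels = ch-hbase ch-es
    agent.sources.logsrc.selector.type = replicating

    agent.channels.ch-hbase.type = memory
    agent.channels.ch-es.type = memory

    # Durable store: HBase
    agent.sinks.sink-hbase.type = hbase
    agent.sinks.sink-hbase.table = logs
    agent.sinks.sink-hbase.columnFamily = raw
    agent.sinks.sink-hbase.channel = ch-hbase

    # Search index: ElasticSearch
    agent.sinks.sink-es.type = elasticsearch
    agent.sinks.sink-es.hostNames = localhost:9300
    agent.sinks.sink-es.indexName = logs
    agent.sinks.sink-es.channel = ch-es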

最舍不得你 2024-09-13 14:58:29

With regard to Java and HDFS: using a tool like BeanShell, you can interact with the HDFS store through scripts written in Java syntax.
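
To make that concrete (BeanShell scripts are loosely typed Java), here's a hedged sketch of listing an HDFS directory through the same FileSystem API, assuming the Hadoop jars are on the classpath and an illustrative NameNode address:

    // hdfs_ls.bsh -- run with something like:
    //   java -cp bsh.jar:$(hadoop classpath) bsh.Interpreter hdfs_ls.bsh
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    conf = new Configuration();                           // BeanShell: no type declarations needed
    conf.set("fs.default.name", "hdfs://namenode:9000");  // assumed NameNode address

    fs = FileSystem.get(conf);
    for (s : fs.listStatus(new Path("/logs")))            // illustrative directory
        print(s.getPath());                               // print() is a BeanShell builtin
    fs.close();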
