Near real-time with Hadoop

Published 2024-09-02 18:58:54

I need some good references on using Hadoop for real-time systems, such as search with short response times. I know Hadoop has the overhead of HDFS, but what's the best way of doing this with Hadoop?


Comments (3)

薄情伤 2024-09-09 18:58:54


You need to provide a lot more information about the goals and challenges of your system to get good advice. Perhaps Hadoop is not what you need, and you just require some distributed systems foo? (Oh and are you totally sure you require a distributed system? There's an awful lot you can do with a replicated database on top of a couple of large-memory machines).

Knowing nothing about your problem, I'll give you a few shot-in-the-dark attempts at an answer.

  1. Take a look at HBase, which provides a structured, queryable datastore on top of HDFS, similar to Google's BigTable (see the lookup sketch after this list). http://hadoop.apache.org/hbase/
  2. It could be that you just need some help with managing replication and sharding of data. Check out Gizzard, a middleware to do just that: http://github.com/twitter/gizzard
  3. Processing can always be done beforehand. If that means you materialize too much data, maybe something like Lucandra can help -- Lucene running on top of Cassandra as a backend? http://github.com/tjake/Lucandra
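To make point (1) concrete, here is a minimal sketch of the low-latency point lookup HBase is built for, using the standard Java client API. The `docs` table and the `content:body` column are made-up names for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("docs"))) { // "docs" is hypothetical
            // Point lookup by row key: this is the access pattern HBase serves
            // in milliseconds, as opposed to a MapReduce scan over HDFS.
            Get get = new Get(Bytes.toBytes("doc#42"));
            Result result = table.get(get);
            byte[] body = result.getValue(Bytes.toBytes("content"), Bytes.toBytes("body"));
            System.out.println(body == null ? "not found" : Bytes.toString(body));
        }
    }
}
```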

If you really, really need to do serious processing at query time, the way to do that is to run dedicated processes that perform the specific kinds of computation you need, and use something like Thrift to send requests for computation and receive results back. Optimize them to keep all the needed data in memory. The process that receives the query then needs to do nothing more than break the problem into pieces, send the pieces to the compute nodes, and collect the results. This sounds like Hadoop, but isn't, because it is built for computing specific problems over pre-loaded data rather than as a generic model for arbitrary computation.
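A minimal sketch of that scatter-gather shape, assuming each compute node already holds its shard of the data in memory. `ComputeClient` is a hypothetical stand-in for the Thrift-generated client stub, and summing the partial results is just a placeholder for whatever merge your computation needs:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class ScatterGather {

    interface ComputeClient {
        long compute(String queryPiece); // hypothetical RPC call (e.g. via Thrift)
    }

    private final List<ComputeClient> nodes;
    private final ExecutorService pool = Executors.newCachedThreadPool();

    ScatterGather(List<ComputeClient> nodes) {
        this.nodes = nodes;
    }

    long query(String query) throws Exception {
        // Scatter: send the query (or a partition of it) to every node in parallel.
        List<Future<Long>> futures = new ArrayList<>();
        for (ComputeClient node : nodes) {
            futures.add(pool.submit(() -> node.compute(query)));
        }
        // Gather: combine partial results, bounding how long we wait per node.
        long total = 0;
        for (Future<Long> f : futures) {
            total += f.get(200, TimeUnit.MILLISECONDS);
        }
        return total;
    }
}
```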

离笑几人歌 2024-09-09 18:58:54

Hadoop 是完全不适合这种需求的工具。它针对运行几分钟到几小时甚至几天的大型批处理作业进行了明确优化。

FWIW,HDFS 与开销无关。事实上,Hadoop 作业将 jar 文件部署到每个节点上,设置工作区域,启动每个作业运行,通过文件在计算阶段之间传递信息,与作业运行器通信进度和状态等。

Hadoop is completely the wrong tool for this kind of requirement. It is explicitly optimised for large batch jobs that run for several minutes up to hours or even days.

FWIW, HDFS has nothing to do with the overhead. It's the fact that Hadoop jobs deploy a jar file onto every node, set up a working area, start each job running, pass information via files between stages of the computation, communicate progress and status with the job runner, and so on.

失去的东西太少 2024-09-09 18:58:54


This question is old, but it begs an answer. Even with millions of documents, as long as they are not changing in real time (like FAQ docs), Lucene + SOLR for distribution should pretty much suffice. Hathi Trust indexes billions of documents using the same combination.

It is a completely different problem if the index is changing in real time. Even Lucene will have problems dealing with an index that is being updated, and you have to look at real-time search engines. There have been attempts at reworking Lucene for real time, and maybe that will work. You can also look at HSearch, a real-time distributed search engine built on Hadoop and HBase, hosted at http://bizosyshsearch.sourceforge.net
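For the mostly-static case, here is a minimal single-node sketch using the classic Lucene Java API (8.x/9.x era); the `body` field name and the in-memory directory are just for illustration, and SOLR would handle distribution across machines:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory(); // in-memory index, for the sketch only
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index one document. For a mostly-static corpus, this step can be done
        // offline and the finished index simply reopened by the searchers.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("body", "near real time search with hadoop", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Search: serving queries from a prebuilt index is a sub-second operation,
        // unlike launching a batch job.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query q = new QueryParser("body", analyzer).parse("hadoop");
            for (ScoreDoc hit : searcher.search(q, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("body"));
            }
        }
    }
}
```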
