Fast writes to a persistent queue

Posted on 2025-01-07 10:33:58

I'm trying to change my current application to scale.

It can currently handle at most a few million events per hour, but the volume is expected to grow 10 to 100 fold as I switch to a SaaS model, so it is important to be able to execute the processing in a distributed fashion.

The app is a web application currently receiving 1.2 million events/hour at peak. It uses 2 Tomcat servers, each listening with 500 threads, and a workManager to queue the events; it then spawns a couple of hundred worker threads to post-process them.
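
For reference, here is a minimal sketch of that current pattern: request threads hand events to a bounded queue that feeds a fixed worker pool. This is an illustration only, not the actual workManager configuration; the pool and queue sizes are made up.

```java
import java.util.concurrent.*;

// Sketch of the current setup: Tomcat request threads hand events off to a
// bounded queue; a fixed pool of worker threads post-processes them.
public class EventWorkerPool {
    private final ExecutorService workers = new ThreadPoolExecutor(
            200, 200,                                   // ~a couple hundred workers
            0L, TimeUnit.MILLISECONDS,
            new LinkedBlockingQueue<>(100_000),         // bounded backlog
            new ThreadPoolExecutor.CallerRunsPolicy()); // back-pressure when full

    // Called from a request thread; returns as soon as the event is queued.
    public void submit(String event) {
        workers.execute(() -> postProcess(event));
    }

    private void postProcess(String event) {
        // aggregation / filtering of the event goes here
    }
}
```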

What I'm trying to do is decouple the writing from the processing and move the processing to a distributed environment.

  1. Fast write of the events to disk.

    Here the solution could be as simple as writing to a LinkedBlockingQueue and dumping batches of hundreds or thousands of entries to a file (see the sketch after this list), using a good library that already does this, or tuning the database to support this type of queueing in a reasonable fashion.

    Failing to capture the last events if the system becomes unavailable is not paramount; the focus is performance while the server is working.

  2. Move the event processing to a distributed system.

    I need to move the data to a distributed system (e.g. HDFS). What other options are there?

    The processing is of medium complexity (e.g. some of the complexity is in a self-join generating a frequent itemset and further filtering down this set; other parts involve aggregating data across multiple hierarchies). I currently use databases (MySQL & DB2) and am thinking about Hadoop. Any other options?

  3. Store the results in a read-only, fast-read system.

I am currently using SOLR, any better options?
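
Below is a minimal sketch of the batching idea from step 1, assuming line-oriented text events; the unbounded queue, the batch size of 1,000, and the file handling are placeholders, not a hardened implementation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Step 1 sketch: request threads enqueue events; one writer thread drains
// them in batches and appends each batch to a file in a single call.
public class BatchedEventWriter implements Runnable {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Path file;

    public BatchedEventWriter(Path file) { this.file = file; }

    // Cheap handoff on the request thread.
    public void enqueue(String event) throws InterruptedException {
        queue.put(event);
    }

    @Override
    public void run() {
        List<String> batch = new ArrayList<>(1000);
        try {
            while (!Thread.currentThread().isInterrupted()) {
                batch.add(queue.take());   // block until at least one event
                queue.drainTo(batch, 999); // then grab up to 999 more
                Files.write(file, batch,
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
                batch.clear();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

A single writer thread keeps the appends sequential; whatever is still in the queue when the process dies is lost, which matches the constraint above that losing the last events is acceptable.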

I know the question spans multiple topics; any input is appreciated. Let me know if there's a better tag I could use.

Thanks!

Sebi


Comments (3)

奢望 2025-01-14 10:33:58

The best system today capable of both insertions and queries is indeed an RDBMS. But it is not scalable. NoSQL systems are scalable not because they are built better, but because they gave up something.
Let's see what can be built from them.
Both HBase and Cassandra are built specifically to translate random event insertions into sequential disk IO. In other words, they are write-optimized systems, and you can think of them as a perfect distributed database index. So you can reach any insertion rate you need by adding more nodes.

Joins and aggregations are a problematic point there, though.
If you succeed in designing your key so that the data to be aggregated is colocated, the data can be pulled and aggregated efficiently (see the key-design sketch below).
Joins are also problematic, but there is the option of writing the data already pre-joined; you would do this at the application level.
For more complex processing you will need to resort to MapReduce, but that will probably affect insertion rates.
DataStax's Brisk sounds like a good fit for your case, since it comes with Cassandra pre-integrated with MapReduce and can run MapReduce directly over Cassandra data. It can also reduce MapReduce's impact on the OLTP part of the story.
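
A minimal sketch of the key-design idea, independent of any client library: compose the row key so that everything you want to aggregate together sorts, and therefore lives, together. The tenant/metric fields and the hourly bucket are hypothetical.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Row keys composed so that one aggregation unit (tenant + metric + hour)
// forms a contiguous key range, e.g. "tenant42|clicks|2025-01-07T10:00:00Z|ev-1".
public final class EventKey {
    static String rowKey(String tenant, String metric, Instant ts, String eventId) {
        Instant hourBucket = ts.truncatedTo(ChronoUnit.HOURS); // colocate per hour
        return tenant + '|' + metric + '|' + hourBucket + '|' + eventId;
    }
}
```

With keys like this, a prefix scan over tenant|metric|hour pulls one aggregation unit as a sequential read; the same trick (a composite key that encodes the relationship) is one way to store pre-joined data.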

夜空下最亮的亮点 2025-01-14 10:33:58

A couple of your problems sound like they have JMS as the solution. It's a queue, it's supposed to be fast, it's reliable (across machine failures), and it's persistent.

ActiveMQ, for example, can be configured to force a client to wait until the data has been committed to disk on more than one machine, by setting it up as a "network of brokers". See http://activemq.apache.org/networks-of-brokers.html

It also allows you to flag messages as persistent, so that the brokers can survive restarts (a minimal producer sketch follows). For the persistence store I strongly recommend KahaDB, described at http://activemq.apache.org/kahadb.html , as the older versions have serious issues.
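
As a sketch, marking messages persistent with the standard JMS API looks roughly like this; the broker URL and queue name are placeholders.

```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.DeliveryMode;
import javax.jms.JMSException;
import javax.jms.MessageProducer;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;

// Producer that flags messages as persistent, so the broker writes them to
// its store (e.g. KahaDB) and they survive a broker restart.
public class PersistentEventProducer {
    public static void main(String[] args) throws JMSException {
        ConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        try {
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer =
                    session.createProducer(session.createQueue("events"));
            producer.setDeliveryMode(DeliveryMode.PERSISTENT); // durable delivery
            producer.send(session.createTextMessage("{\"event\":\"click\"}"));
        } finally {
            connection.close();
        }
    }
}
```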

This helps with the distribution of events, but doesn't help at all with the processing, nor with the actual eventual storage of the data. How many clients are going to need access to how much of the data, and how long after it has been produced? You can use "topics" in JMS to distribute messages to all clients, and concepts like "last image topics" to store some state on the broker so your clients can restart. http://activemq.apache.org/subscription-recovery-policy.html explains these.

However, despite all that, it sounds like you're going to end up with Hadoop to process the information anyway, so you may as well use whatever is built into that stack. :)

陪你到最终 2025-01-14 10:33:58

You could use memory mapped files as a persisted queue.

This library supports persisted, event-driven messages in the millions per second (not per hour), with sub-microsecond latencies between processes. It's also pretty simple (too low-level for most usages, but you can use it as a start):

https://github.com/peter-lawrey/Java-Chronicle
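
To make the idea concrete without reproducing Chronicle's actual API, here is a bare-bones java.nio illustration of appending length-prefixed records to a memory-mapped file; the fixed capacity and record format are arbitrary.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Appends length-prefixed records into a memory-mapped region. Writes land in
// the page cache at memory speed; the OS (or force()) flushes them to disk.
public class MappedQueueWriter {
    private final MappedByteBuffer buffer;

    public MappedQueueWriter(Path file, int capacityBytes) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            // The mapping stays valid after the channel is closed.
            buffer = ch.map(FileChannel.MapMode.READ_WRITE, 0, capacityBytes);
        }
    }

    public void append(String event) {
        byte[] payload = event.getBytes(StandardCharsets.UTF_8);
        buffer.putInt(payload.length); // length prefix (no overflow handling here)
        buffer.put(payload);           // record body
    }

    public void flush() {
        buffer.force(); // force dirty pages to the storage device
    }
}
```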
