Persistence strategy for low-latency reads and writes

Published 2024-08-11 12:35:28

I am building an application that includes a feature to bulk tag millions of records, more or less interactively. The user interaction is very similar to Gmail, where users can tag individual emails or bulk tag large numbers of emails. I also need quick read access to these tag memberships, and the read pattern is more or less random.

Right now we're using MySQL and inserting one row for every tag-document pair. Writing millions of rows to MySQL takes a while (high I/O), even with bulk insertions and heavy optimization. We need this to be an interactive process, not a batch process.

For the data that we're storing and reading, consistency and availability of the data are not as important as performance and scalability. So in the event of system failure while the writes are occurring, I can deal with some data loss. However, the data definitely needs to be persisted to secondary storage at some point.

So, to sum up, here are the requirements:

  • Low latency bulk writes of potentially tens of millions of records
  • Data needs to be persisted in some way
  • Low latency random reads
  • Durable writes not required
  • Eventual consistency is okay

Here are some solutions I've looked at:

  • Write behind caches (Terracotta, Gigaspaces, Coherence) where records are written to memory and drained to the database asynchronously. These scare me a little because they appear to add a certain amount of complexity to the app that I'd want to avoid.
  • Highly scalable key-value stores, like MongoDB, HBase, Tokyo Tyrant
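As a rough illustration of the second option, here is a minimal sketch of a non-durable bulk write using MongoDB's Java driver. The connection string, database and collection names, and document shape are all assumptions made for the example; the unacknowledged write concern is one way to match the "durable writes not required" requirement by trading durability for write latency.

    import com.mongodb.WriteConcern;
    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    import java.util.ArrayList;
    import java.util.List;

    public class BulkTagWrite {
        public static void main(String[] args) {
            // Connection string, database and collection names are placeholders.
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> tags = client
                        .getDatabase("tagging")
                        .getCollection("tag_membership")
                        // Fire-and-forget writes: low latency, but not durable.
                        .withWriteConcern(WriteConcern.UNACKNOWLEDGED);

                List<Document> batch = new ArrayList<>();
                for (long docId = 0; docId < 1_000_000; docId++) {
                    batch.add(new Document("doc", docId).append("tag", "inbox"));
                    if (batch.size() == 1_000) {   // insert in chunks to bound memory
                        tags.insertMany(batch);
                        batch = new ArrayList<>();
                    }
                }
                if (!batch.isEmpty()) {
                    tags.insertMany(batch);
                }
            }
        }
    }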

Comments (4)

初雪 2024-08-18 12:35:28

If you have the budget to use Coherence for this, I highly recommend doing so. There is direct support for write-behind, eventual consistency behavior in Coherence and it is very survivable to both a database outage and Coherence cluster node outages (if you use >= 3 Coherence nodes on separate JVMs, preferably on separate hosts). I have implemented this for doing high-volume CRM for a Fortune 100 company's e-commerce site and it works fantastically.

One of the best aspects of this architecture is that you write your Java application code as if none of the write-behind behavior were taking place, and then plug in the Coherence topology and configuration that makes it happen. If you need to change the behavior or topology of Coherence later, no change in your application is required. I know there are probably a handful of reasonable ways to do this, but this behavior is directly supported in Coherence rather than having to invent or hand-roll a way of doing it.

To put a fine point on it - your worry about adding application complexity is a good one. With Coherence, you simply write updates to the cache (or, if you're using Hibernate, it can be the L2 cache provider). Depending upon your Coherence configuration and topology, you have the option to deploy your application to use write-behind, distributed caches. So your application is no more complex because of (and is, frankly, unaware of) the features of the cache.
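For illustration only, a minimal sketch of what the application-facing code might look like. The cache name and key format are assumptions; enabling write-behind (for example, a read-write backing map scheme with a write delay and a CacheStore that flushes to MySQL) would live entirely in the Coherence cache configuration rather than in this code.

    import com.tangosol.net.CacheFactory;
    import com.tangosol.net.NamedCache;

    import java.util.HashMap;
    import java.util.Map;

    public class TagCacheWriter {
        public static void main(String[] args) {
            // Cache name is a placeholder; write-behind is configured in the
            // cache configuration, so this code looks like a plain Map write.
            NamedCache tags = CacheFactory.getCache("tag-membership");

            // Single update, Gmail-style "tag one email".
            tags.put("doc:42|tag:inbox", Boolean.TRUE);

            // Bulk update, "tag thousands of emails at once".
            Map<String, Boolean> batch = new HashMap<>();
            for (long docId = 0; docId < 100_000; docId++) {
                batch.put("doc:" + docId + "|tag:starred", Boolean.TRUE);
            }
            tags.putAll(batch);

            // Random read of a tag membership.
            Boolean starred = (Boolean) tags.get("doc:7|tag:starred");
            System.out.println("doc 7 starred: " + starred);

            CacheFactory.shutdown();
        }
    }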

Finally, I implemented the solution mentioned above from 2005-2007 when Coherence was made by Tangosol and they had the best possible support. I'm not sure how things are now under Oracle - hopefully still good.

北方的韩爷 2024-08-18 12:35:28

I've worked on a large project that used asynchronous writes, although in that case it was just hand-written using background threads. You could also implement something like that by offloading the db write process to a JMS queue.

One thing that will certainly speed up db writes is to do them in batches. JDBC batch updates can be orders of magnitude faster than individual writes, and if you're doing them asynchronously you can just write them 500 at a time.
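A rough sketch of the hand-rolled version of this idea, assuming a hypothetical doc_tags(doc_id, tag_id) table: the interactive request thread only enqueues pairs, and a background thread drains the queue and writes JDBC batches of 500 rows.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class AsyncTagWriter implements Runnable {
        private static final int BATCH_SIZE = 500;
        private final BlockingQueue<long[]> queue = new LinkedBlockingQueue<>();
        private final String jdbcUrl;   // e.g. "jdbc:mysql://localhost/tags?user=app&password=secret"

        public AsyncTagWriter(String jdbcUrl) {
            this.jdbcUrl = jdbcUrl;
        }

        // Called from the interactive request thread: cheap, non-blocking enqueue.
        public void tag(long docId, long tagId) {
            queue.offer(new long[] { docId, tagId });
        }

        // Background thread: drain the queue and flush batches of up to 500 rows.
        @Override
        public void run() {
            try (Connection conn = DriverManager.getConnection(jdbcUrl);
                 PreparedStatement ps = conn.prepareStatement(
                         "INSERT INTO doc_tags (doc_id, tag_id) VALUES (?, ?)")) {
                conn.setAutoCommit(false);
                List<long[]> pending = new ArrayList<>(BATCH_SIZE);
                while (!Thread.currentThread().isInterrupted()) {
                    pending.clear();
                    pending.add(queue.take());               // block until at least one row arrives
                    queue.drainTo(pending, BATCH_SIZE - 1);  // grab up to 499 more without blocking
                    for (long[] pair : pending) {
                        ps.setLong(1, pair[0]);
                        ps.setLong(2, pair[1]);
                        ps.addBatch();
                    }
                    ps.executeBatch();
                    conn.commit();                           // commit once per batch, not once per row
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();          // shut down quietly
            } catch (Exception e) {
                e.printStackTrace();                         // some data loss is acceptable per the question
            }
        }
    }

Swapping the in-process BlockingQueue for a JMS queue, as suggested above, would let the buffered writes survive an application restart, at the cost of one more moving part.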

薔薇婲 2024-08-18 12:35:28

Depending on how your data is organized, perhaps you would be able to use sharding. If the read latency isn't low enough, you can also try to add caching; Memcache is one popular solution.
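A minimal sketch of both ideas, with hypothetical names throughout: a tag key is hashed to pick one of N MySQL shards, and reads go through a cache-aside lookup. A ConcurrentHashMap stands in for Memcache here purely to keep the example self-contained; a real deployment would use a memcached client instead.

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.BiFunction;

    public class ShardedTagReader {
        private final List<String> shardJdbcUrls;                 // one MySQL instance per shard
        private final Map<String, Boolean> cache = new ConcurrentHashMap<>(); // stand-in for Memcache
        private final BiFunction<String, String, Boolean> loadFromShard; // (shard JDBC URL, key) -> tagged?

        public ShardedTagReader(List<String> shardJdbcUrls,
                                BiFunction<String, String, Boolean> loadFromShard) {
            this.shardJdbcUrls = shardJdbcUrls;
            this.loadFromShard = loadFromShard;
        }

        // Deterministically route a key like "doc:42|tag:inbox" to a shard.
        public String shardFor(String key) {
            int index = Math.floorMod(key.hashCode(), shardJdbcUrls.size());
            return shardJdbcUrls.get(index);
        }

        // Cache-aside read: check the cache, fall back to the shard, then populate the cache.
        public boolean isTagged(String key) {
            Boolean cached = cache.get(key);
            if (cached != null) {
                return cached;                               // cache hit: no database round trip
            }
            boolean loaded = Boolean.TRUE.equals(loadFromShard.apply(shardFor(key), key));
            cache.put(key, loaded);                          // populate the cache for the next read
            return loaded;
        }
    }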

泛泛之交 2024-08-18 12:35:28

Berkeley DB has a very high performance disk-based hash table that supports transactions, and integrates with a Java EE environment if you need that. If you're able to model the data as key/value pairs, this can be a very scalable solution.

http://www.oracle.com/technology/products/berkeley-db/je/index.html

(Note: Oracle bought Berkeley DB about 5-10 years ago; the original product has been around for 15-20 years.)
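A small sketch of how the tag memberships might map onto Berkeley DB Java Edition as key/value pairs. The environment directory, database name, and key format are made up for the example, and deferred-write mode is used as one way to trade durability for write latency, flushing with sync() when convenient.

    import com.sleepycat.je.Database;
    import com.sleepycat.je.DatabaseConfig;
    import com.sleepycat.je.DatabaseEntry;
    import com.sleepycat.je.Environment;
    import com.sleepycat.je.EnvironmentConfig;
    import com.sleepycat.je.OperationStatus;

    import java.io.File;
    import java.nio.charset.StandardCharsets;

    public class BdbTagStore {
        public static void main(String[] args) {
            File home = new File("tag-store-env");   // hypothetical environment directory
            home.mkdirs();

            EnvironmentConfig envConfig = new EnvironmentConfig();
            envConfig.setAllowCreate(true);
            Environment env = new Environment(home, envConfig);

            DatabaseConfig dbConfig = new DatabaseConfig();
            dbConfig.setAllowCreate(true);
            dbConfig.setDeferredWrite(true);   // buffer writes in memory; not durable until sync()
            Database db = env.openDatabase(null, "tag-membership", dbConfig);

            // Bulk write: one key per doc/tag pair, minimal value.
            for (long docId = 0; docId < 100_000; docId++) {
                DatabaseEntry key = new DatabaseEntry(
                        ("doc:" + docId + "|tag:inbox").getBytes(StandardCharsets.UTF_8));
                DatabaseEntry value = new DatabaseEntry(new byte[] { 1 });
                db.put(null, key, value);
            }
            db.sync();                          // persist the buffered writes when convenient

            // Random read of one membership.
            DatabaseEntry key = new DatabaseEntry("doc:7|tag:inbox".getBytes(StandardCharsets.UTF_8));
            DatabaseEntry found = new DatabaseEntry();
            OperationStatus status = db.get(null, key, found, null);
            System.out.println("doc 7 tagged inbox: " + (status == OperationStatus.SUCCESS));

            db.close();
            env.close();
        }
    }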
