What options do I have for storing and querying large amounts of repeating data?

Posted on 2024-07-11 19:58:10

I am evaluating options for efficient data storage in Java. The data set consists of timestamped data values with a named primary key, e.g.:

Name: A|B|C:D
Value: 124
TimeStamp: 01/06/2009 08:24:39,223

This could be a stock price at a given point in time, so it is, I suppose, a classic time series data pattern. However, I really need a generic RDBMS solution which will work with any reasonable JDBC-compatible database, as I would like to use Hibernate. Consequently, time series extensions to databases like Oracle are not really an option, as I would like the implementor to be able to use their own JDBC/Hibernate-capable database.

The challenge here is simply the massive volume of data that can accumulate in a short period of time. So far, my implementations are focused around defining periodical rollup and purge schedules where raw data is aggregated into DAY, WEEK, MONTH etc. tables, but the downside is the early loss of granularity and the slight inconvenience of period mismatches between periods stored in different aggregates.
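The rollup step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `DailyRollup`, the use of UTC day boundaries, and the choice of `long` values are all hypothetical, and a real implementation would read from and write to database tables rather than arrays.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.LongSummaryStatistics;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: roll raw samples for one key up into per-day summaries. */
final class DailyRollup {

    /**
     * Groups raw (timestamp, value) samples by UTC calendar day.
     * Each day keeps count, sum, min, max and average; the raw rows
     * could then be purged, which is where the granularity is lost.
     */
    static Map<LocalDate, LongSummaryStatistics> rollUp(long[] epochMillis, long[] values) {
        Map<LocalDate, LongSummaryStatistics> days = new TreeMap<>();
        for (int i = 0; i < epochMillis.length; i++) {
            LocalDate day = Instant.ofEpochMilli(epochMillis[i])
                    .atZone(ZoneOffset.UTC)   // assumption: day boundaries in UTC
                    .toLocalDate();
            days.computeIfAbsent(day, d -> new LongSummaryStatistics()).accept(values[i]);
        }
        return days;
    }
}
```

The period mismatch mentioned above shows up here as the hard-coded day boundary: a WEEK or MONTH rollup built the same way would cut the data at different points.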

The challenge has limited options since there is an absolute limit to how much data can be physically compressed while retaining the original granularity of the data, and this limit is exacerbated by the directive of using a relational database, and a generic JDBC capable one at that.

Borrowing a notional concept from classic data compression algorithms, and leveraging the fact that many consecutive values for the same named key can be expected to be identical, I am wondering if there is a way I can seamlessly reduce the number of stored records by conflating repeating values into one logical row while also storing a counter that indicates, effectively, "the next n records have the same value". The implementation of just that seems simple enough, but the trade-off is that the data model is now hideously complicated to query against using standard SQL, especially when using any sort of aggregate SQL functions. This significantly reduces the usefulness of the data store, since only complex custom code can restore the data back to a "decompressed" state, resulting in an impedance mismatch with hundreds of tools that will not be able to render this data properly.
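As a minimal sketch of the "one row plus a counter" idea, assuming Java 16+ for records: the names `RunLengthSeries`, `Sample` and `Run` are hypothetical, and the sketch keeps only the first timestamp of each run, so the individual timestamps inside a run are recoverable only if the sampling interval is known.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: run-length encoding of consecutive identical values for one key. */
final class RunLengthSeries {

    /** One raw sample as originally received. */
    record Sample(long timestamp, long value) {}

    /** One stored row: first timestamp of the run, the value, and how many samples it covers. */
    record Run(long firstTimestamp, long value, int count) {}

    /** Collapses consecutive samples that share a value into a single Run. */
    static List<Run> compress(List<Sample> samples) {
        List<Run> runs = new ArrayList<>();
        for (Sample s : samples) {
            int last = runs.size() - 1;
            if (last >= 0 && runs.get(last).value() == s.value()) {
                Run r = runs.get(last);
                runs.set(last, new Run(r.firstTimestamp(), r.value(), r.count() + 1));
            } else {
                runs.add(new Run(s.timestamp(), s.value(), 1));
            }
        }
        return runs;
    }

    /** Count-weighted average: the result a plain average over the raw samples would give. */
    static double average(List<Run> runs) {
        long sum = 0;
        long n = 0;
        for (Run r : runs) {
            sum += r.value() * r.count();
            n += r.count();
        }
        return n == 0 ? Double.NaN : (double) sum / n;
    }
}
```

The `average` method hints at why aggregates get awkward: every aggregate has to be rewritten to weight by the counter. In SQL the equivalent would be something like `SUM(value * run_count) / SUM(run_count)` over a hypothetical `run_count` column, which generic reporting tools would not know to generate.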

I considered the possibility of defining custom Hibernate types that would basically "understand" the compressed data set, blow it back up, and return query results with dynamically created synthetic rows. (The database will be read-only to all clients except the tightly controlled input stream.) Several of the tools I had in mind will integrate with Hibernate/POJOs in addition to raw JDBC (e.g. JasperReports), but this does not really address the aggregate functions issue and probably has a bunch of other issues as well.

So I am part way to resigning myself to possibly having to use a more proprietary [possibly non-SQL] data store (any suggestions appreciated) and then focus on the possibly less complex task of writing a pseudo JDBC driver to at least ease integration with external tools.

I heard reference to something called a "bit packed file" as a mechanism to achieve this data compression, but I do not know of any databases that supply this, and the last thing I want to do (or can do, really...) is write my own database.

Any suggestions or insight?

他不在意 2024-07-18 19:58:10

Hibernate (or any JPA solution) is the wrong tool for this job.

JPA/Hibernate isn't a lightweight solution. In high-volume applications, the overhead is not only significant but prohibitive. You really need to look into grid and cluster solutions. I won't repeat the overview of the various technologies here.

I've got a lot of experience in financial market information systems. A few of the things you said stuck out to me:

You have a lot of raw data;
You want to apply various aggregations to that data (e.g. open/high/low/close daily summaries);
High availability is probably an issue (it always is in these kinds of systems); and
Low latency is probably an issue (ditto).

Now for grid/cluster type solutions I divide them loosely into two categories:

Map-based solutions like Coherence or Terracotta; and
Javaspaces-based solutions like GigaSpaces.

I've used Coherence a lot and the Map solution can be nice but it can be problematic too. Coherence maps can have listeners on them and you can use this sort of thing to do things like:

Market price alerts (users may want a notification when a price reaches a certain level);
Derivative pricing (e.g. an exchange-traded option pricing system will want to reprice when an underlying security changes last traded price);
A trade-matching/booking system may want to match received trade notifications for reconciliation purposes;
etc.

All of these can be done with listeners, but in Coherence, for example, listeners have to be cheap, which leads to things like a Map having a listener that writes something to another Map, and this can chain on for a while. Also, modifying the cache entry can be problematic (although there are mechanisms for dealing with that kind of problem too; I'm talking about situations like turning off a market price alert so it doesn't trigger a second time).

I found GigaSpaces type grid solutions to be far more compelling for this kind of application. The read (or destructive read) operation is a highly elegant and scalable solution and you can get transactional grid updates with sub-millisecond performance.

Consider the two classic queueing architectures:

Request/Response: a bad message can block the queue, and while you can have many senders and receivers (for scalability), scaling up the number of pipes isn't always straightforward; and
Publish/Subscribe: this decouples the sender and receiver but lacks scalability in that if you have multiple subscribers they'll each receive the message (not necessarily what you want with say a booking system).

In GigaSpaces, a destructive read is like a scalable publish-subscribe system and a read operation is like the traditional publish-subscribe model. There is a Map and JMS implementation built on top of the grid and it can do FIFO ordering.
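To make the difference between a read and a destructive read concrete, here is a toy in-memory sketch. It is not the GigaSpaces or JavaSpaces API; `TinySpace` and its methods are hypothetical names, and matching is done with a plain `Predicate` rather than entry templates.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Predicate;

/** Hypothetical toy "space" showing read versus destructive read (take). */
final class TinySpace<T> {
    private final ConcurrentLinkedQueue<T> entries = new ConcurrentLinkedQueue<>();

    /** Publishes an entry into the space. */
    void write(T entry) {
        entries.add(entry);
    }

    /** Non-destructive read: the entry stays, so every reader can see it. */
    Optional<T> read(Predicate<T> matcher) {
        for (T e : entries) {
            if (matcher.test(e)) {
                return Optional.of(e);
            }
        }
        return Optional.empty();
    }

    /** Destructive read: the entry is removed, so at most one caller receives it. */
    Optional<T> take(Predicate<T> matcher) {
        for (T e : entries) {
            // remove() returns false if another thread already took this entry
            if (matcher.test(e) && entries.remove(e)) {
                return Optional.of(e);
            }
        }
        return Optional.empty();
    }
}
```

With `take`, adding more consumers spreads the work because each entry is handed to exactly one of them; with `read`, every consumer sees every entry, as with ordinary subscribers.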

Now what about persistence, I hear you ask? Persistence is a consequence of deciding all the other stuff. For this kind of application, I like the Persistence as a Service model (ironically written about Hibernate, but it applies to anything).

Basically this means your data store hits are asynchronous, and it works nicely with summary data. For example, you can have a service listening for trade notifications that persists just the ones it's interested in (aggregating in memory if required). You can do open/high/low/close prices this way.
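A rough sketch of such an in-memory aggregating service might look like the following. The names `OhlcAggregator`, `Bar` and `onTrade` are hypothetical, prices are plain `long` values for simplicity, and the class assumes a single listener thread; the asynchronous write of each finished bar to the data store is left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: aggregate trade notifications into open/high/low/close bars in memory. */
final class OhlcAggregator {

    /** Running summary for one symbol over the current period. */
    static final class Bar {
        final long open;
        long high;
        long low;
        long close;

        Bar(long firstPrice) {
            this.open = firstPrice;
            this.high = firstPrice;
            this.low = firstPrice;
            this.close = firstPrice;
        }

        void update(long price) {
            if (price > high) high = price;
            if (price < low) low = price;
            close = price; // the latest trade seen so far is the provisional close
        }
    }

    private final Map<String, Bar> bars = new HashMap<>();

    /** Called once per trade notification; not thread-safe (assumes one listener thread). */
    void onTrade(String symbol, long price) {
        Bar bar = bars.get(symbol);
        if (bar == null) {
            bars.put(symbol, new Bar(price));
        } else {
            bar.update(price);
        }
    }

    /** Returns the current bar for a symbol, or null if no trade has been seen. */
    Bar bar(String symbol) {
        return bars.get(symbol);
    }
}
```

Only the finished bars would need to be persisted, for instance at the end of each period, so the database sees one write per symbol per period instead of one per trade.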

For high volume data you don't really want to write it all to the database. Not synchronously anyway. A persistent store plus a data warehouse is probably more the route you want to go but again this depends on requirements, volumes, etc.

It's a complicated topic and I've only really touched on it. Hope that helps you.

缱倦旧时光 2024-07-18 19:58:10

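You will probably find it interesting to listen to Michael Stonebraker's presentation at Money:Tech. He touches on a number of the things you mention needing, and he illustrates how the big three elephants (SQL Server, Oracle, and DB2) will never be able to meet the needs of tick stores (which it looks like you are building). He goes beyond column stores, which I agree are the right direction. He even discusses compression and speed, which are both issues for you.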

Here are some more links you may find interesting:

黄昏下泛黄的笔记 2024-07-18 19:58:10

I would take a look at a column-oriented database. It would be great for this sort of application.

星星的軌跡 2024-07-18 19:58:10

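Many JDBC-capable database management systems (e.g. Oracle) provide compression in the physical storage engine. Oracle, for example, has the notion of a "compressed" table without decompression overhead: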

http://www.ardentperf.com/wp-content/uploads/2007/07/advanced-compression-datasheet.pdf

各自安好 2024-07-18 19:58:10

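Thanks for the answers.

Cletus, I appreciate the outline, but one of the trade-offs I cannot make is abandoning database flexibility and compatibility with JDBC/Hibernate, which allows the use of all the available tools. Moreover, although I did not state this explicitly, I do not want to force my users into adopting a [possibly expensive] commercial solution. If they have database brand X, let them use it. If they don't care, we recommend open source database brand Y. Basically, the application has multiple faces: one of them is a repository for incoming data, but another is a reporting source, and I really do not want to get into the business of writing report generators.

Although I have not really load tested it yet, I am very impressed with LucidDB. It is a column-oriented database that provides good query performance and seemingly good data compression. It has a JDBC driver, though as far as I can tell no Hibernate dialect exists for it yet. It also supports user-defined transformations which, in short, I think will allow me to seamlessly implement my idea of compressing repeating and consecutive values into one "row" while blowing them back out into multiple "synthetic" rows at query time, all done invisibly to the query caller. Lastly, it supports a nifty feature called foreign tables, where tables from other JDBC-capable databases can be fronted in LucidDB. I think this may be invaluable for providing some level of support for other databases.

Thanks for the pointer, Javaman. It got me focused on LucidDB.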
