How to handle a huge amount of data
We have a part of an application where, say, 20% of the time it needs to read in a huge amount of data that exceeds memory limits. While we could increase the memory limits, we hesitate to do so since it would mean keeping a high allocation that is unnecessary most of the time.
We are considering using a customized java.util.List implementation that spools to disk when we hit peak loads like this, but remains in memory under lighter circumstances.
The data is loaded once into the collection, subsequently iterated over and processed, and then thrown away. It doesn't need to be sorted once it's in the collection.
Does anyone have pros/cons regarding such an approach?
Is there an open source product that provides some sort of List impl like this?
Thanks!
Updates:
- Not to be cheeky, but by 'huge' I mean exceeding the amount of memory we're willing to allocate without interfering with other processes on the same hardware. What other details do you need?
- The application is essentially a batch processor that loads data from multiple database tables and conducts extensive business logic on it. All of the data in the list is required, since aggregate operations are part of the logic being performed.
- I just came across this post which offers a very good option: STXXL equivalent in Java
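For concreteness, here is a minimal sketch of the kind of spill-to-disk collection described above. It is not a full java.util.List (only add and iteration are implemented), and the class name and the maxInMemory threshold are invented for illustration: elements up to the limit stay in an ArrayList, and anything beyond that is serialized to a temporary file and read back during iteration.

    import java.io.*;
    import java.util.*;

    // Minimal sketch of a spill-to-disk collection: only add() and iteration are
    // implemented, not the full java.util.List contract. The class name and the
    // maxInMemory threshold are illustrative.
    public class SpillingCollection<T extends Serializable> implements Iterable<T> {
        private final int maxInMemory;
        private final List<T> memory = new ArrayList<T>();
        private File spillFile;
        private ObjectOutputStream spillOut;
        private int spilledCount = 0;

        public SpillingCollection(int maxInMemory) {
            this.maxInMemory = maxInMemory;
        }

        public void add(T element) {
            if (memory.size() < maxInMemory) {
                memory.add(element);
                return;
            }
            try {
                if (spillOut == null) {
                    spillFile = File.createTempFile("spill", ".bin");
                    spillFile.deleteOnExit();
                    spillOut = new ObjectOutputStream(
                            new BufferedOutputStream(new FileOutputStream(spillFile)));
                }
                spillOut.writeObject(element);
                spilledCount++;
            } catch (IOException e) {
                throw new RuntimeException("Failed to spill element to disk", e);
            }
        }

        public Iterator<T> iterator() {
            try {
                if (spillOut != null) {
                    spillOut.flush();   // make spilled elements visible to the reader
                }
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
            final Iterator<T> memoryIterator = memory.iterator();
            return new Iterator<T>() {
                private ObjectInputStream spillIn;
                private int readFromDisk = 0;

                public boolean hasNext() {
                    return memoryIterator.hasNext() || readFromDisk < spilledCount;
                }

                @SuppressWarnings("unchecked")
                public T next() {
                    if (memoryIterator.hasNext()) {
                        return memoryIterator.next();   // in-memory portion first
                    }
                    try {
                        if (spillIn == null) {
                            spillIn = new ObjectInputStream(
                                    new BufferedInputStream(new FileInputStream(spillFile)));
                        }
                        readFromDisk++;
                        return (T) spillIn.readObject();
                    } catch (Exception e) {
                        throw new RuntimeException("Failed to read spilled element", e);
                    }
                }

                public void remove() {
                    throw new UnsupportedOperationException();
                }
            };
        }
    }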
Answers (5)
Do you really need to use a List? Write an implementation of Iterator (it may help to extend AbstractIterator) that steps through your data instead. Then you can make use of helpful iterator utilities (for example, Guava's Iterators class) with that iterator. None of this will cause huge amounts of data to be loaded eagerly into memory; instead, records are read from your source only as the iterator is advanced.
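To make that concrete, here is a rough sketch assuming the utilities referred to are Guava's (AbstractIterator is com.google.common.collect.AbstractIterator) and that the rows come straight from a JDBC query. The query string and the "payload" column are invented placeholders.

    import com.google.common.collect.AbstractIterator;

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    // Streams rows lazily; nothing is read until the iterator is advanced.
    // The query and the "payload" column are placeholders for your own schema.
    public class RowIterator extends AbstractIterator<String> {
        private final ResultSet resultSet;

        public RowIterator(Connection connection, String query) throws SQLException {
            // Forward-only, read-only cursor; with most drivers this keeps memory flat.
            this.resultSet = connection
                    .createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)
                    .executeQuery(query);
        }

        @Override
        protected String computeNext() {
            try {
                if (!resultSet.next()) {
                    return endOfData();   // tells AbstractIterator we're exhausted
                }
                return resultSet.getString("payload");
            } catch (SQLException e) {
                throw new RuntimeException(e);
            }
        }
    }

Utilities such as Iterators.partition or Iterators.transform can then be applied to this iterator without ever materializing the whole data set.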
If you're working with huge amounts of data, you might want to consider using a database instead.
Back it up to a database and do lazy loading on the items.
An ORM framework may be in order. It depends on your usage: it may be pretty straightforward, or it may be the worst of your nightmares; it's hard to tell from what you've described.
I'm an optimist, and I think that using an ORM framework (such as Hibernate) would solve your problem in about 3-5 days.
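As a hedged sketch of what that could look like with Hibernate's classic (3.x-5.x) API, a common batch pattern is to scroll a forward-only cursor and clear the session periodically instead of loading everything into a list. The HQL entity name, batch size and business-logic step below are placeholders.

    import org.hibernate.ScrollMode;
    import org.hibernate.ScrollableResults;
    import org.hibernate.Session;

    // Sketch only: "Order" is a placeholder entity; adapt the HQL, batch size
    // and business logic to your own mapping.
    public class OrmBatchJob {
        public void process(Session session) {
            ScrollableResults results = session.createQuery("from Order")
                    .setFetchSize(1000)                  // hint to the JDBC driver
                    .scroll(ScrollMode.FORWARD_ONLY);    // stream rows instead of building a List
            int count = 0;
            while (results.next()) {
                Object entity = results.get(0);          // the mapped entity for this row
                applyBusinessLogic(entity);
                if (++count % 1000 == 0) {
                    session.clear();                     // evict processed entities from the session cache
                }
            }
            results.close();
        }

        private void applyBusinessLogic(Object entity) {
            // aggregate / business rules go here
        }
    }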
Is there sorting/processing that's going on while the data is being read into the collection? Where is it being read from?
If it's being read from disk already, would it be possible to simply batch-process it directly from disk, instead of reading it into a list completely and then iterating? How inter-dependent is the data?
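If the data does start life as a file, a simple way to batch-process it directly from disk might look like the following; the file name, batch size and processBatch step are placeholders.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    // Sketch of batch-processing straight from disk instead of loading everything first.
    // The file name, BATCH_SIZE and processBatch are placeholders.
    public class DiskBatchProcessor {
        private static final int BATCH_SIZE = 10000;

        public static void main(String[] args) throws IOException {
            List<String> batch = new ArrayList<String>(BATCH_SIZE);
            try (BufferedReader reader = new BufferedReader(new FileReader("input.dat"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    batch.add(line);
                    if (batch.size() == BATCH_SIZE) {
                        processBatch(batch);   // only one batch is ever held in memory
                        batch.clear();
                    }
                }
            }
            if (!batch.isEmpty()) {
                processBatch(batch);           // handle the final partial batch
            }
        }

        private static void processBatch(List<String> batch) {
            // apply business logic / update aggregates here
        }
    }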
I would also question why you need to load all of the data in memory to process it. Typically, you should be able to do the processing as it is being loaded and then use the result. That would keep the actual data out of memory.
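For example, if the end result is a set of aggregates, they can be computed while streaming rows from JDBC so that only the running totals stay in memory; in this sketch the table and column names are invented.

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    // Sketch of computing aggregates while streaming rows, so only the running
    // totals stay in memory. Table and column names are invented.
    public class StreamingAggregator {
        public static Summary aggregate(Connection connection) throws SQLException {
            long count = 0;
            double total = 0.0;
            try (Statement stmt = connection.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT amount FROM transactions")) {
                while (rs.next()) {
                    double amount = rs.getDouble("amount");
                    count++;
                    total += amount;   // update the aggregate, then discard the row
                }
            }
            return new Summary(count, total);
        }

        public static class Summary {
            public final long count;
            public final double total;

            public Summary(long count, double total) {
                this.count = count;
                this.total = total;
            }
        }
    }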