Problem
Following up on this question, it seems that a file- or disk-based Map implementation may be the right solution to the problems I mentioned there. Short version:
- Right now, I have a Map implemented as a ConcurrentHashMap.
- Entries are added to it continually, at a fairly fixed rate. Details on this later.
- Eventually, no matter what, this means the JVM runs out of heap space.
At work, it was (strongly) suggested that I solve this problem using SQLite, but after asking that previous question, I don't think that a database is the right tool for this job. So - let me know if this sounds crazy - I think a better solution would be a Map stored on disk.
Bad idea: implement this myself. Better idea: use someone else's library! Which one?
Requirements
Must-haves:
- Free.
- Persistent. The data needs to stick around between JVM restarts.
- Some sort of searchability. Yes, I need the ability to retrieve this darn data as well as put it away. Basic result set filtering is a plus.
- Platform-independent. Needs to be production-deployable on Windows or Linux machines.
- Purgeable. Disk space is finite, just like heap space. I need to get rid of entries that are n days old. It's not a big deal if I have to do this manually.
Nice-to-haves:
- Easy to use. It would be great if I could get this working by the end of the week. Better still: the end of the day. It would be really, really great if I could add one JAR to my classpath, change new ConcurrentHashMap<Foo, Bar>(); to new SomeDiskStoredMap<Foo, Bar>(); and be done.
- Decent scalability and performance. Worst case: new entries are added (on average) 3 times per second, every second, all day long, every day. However, inserts won't always happen that smoothly. It might be (no inserts for an hour) then (insert 10,000 objects at once).
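To make the "purgeable" requirement concrete, here is a minimal sketch of one way to meet it with nothing but the JDK: stamp each value with its insertion time and periodically sweep out anything older than n days. All names here (TimestampedStore, purgeOlderThan) are invented for illustration; a real disk-backed library would have its own eviction API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

// Sketch of the "purgeable" requirement: wrap each value with its
// insertion time, then drop entries older than a cutoff on demand.
class TimestampedStore<K, V> {
    private static final class Stamped<V> {
        final V value;
        final long insertedAtMillis;
        Stamped(V value, long insertedAtMillis) {
            this.value = value;
            this.insertedAtMillis = insertedAtMillis;
        }
    }

    private final Map<K, Stamped<V>> map = new ConcurrentHashMap<>();

    void put(K key, V value) {
        map.put(key, new Stamped<>(value, System.currentTimeMillis()));
    }

    V get(K key) {
        Stamped<V> s = map.get(key);
        return s == null ? null : s.value;
    }

    // Remove every entry inserted more than maxAgeDays ago.
    void purgeOlderThan(long maxAgeDays) {
        long cutoff = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(maxAgeDays);
        map.entrySet().removeIf(e -> e.getValue().insertedAtMillis < cutoff);
    }

    int size() {
        return map.size();
    }
}
```

Running the purge manually (say, from a scheduled task once a day) would satisfy the "manual is fine" clause above.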
Possible Solutions
- Ehcache? I've never used it before. It was a suggested solution to my previous question.
- Berkeley DB? Again, I've never used it, and I really don't know anything about it.
- Hadoop (and which subproject)? Haven't used it. Based on these docs, its cross-platform-readiness is ambiguous to me. I don't need distributed operation in the foreseeable future.
- A SQLite JDBC driver after all?
- ???
Ehcache and Berkeley DB both look reasonable right now. Any particular recommendations in either direction?
UPDATE (some 4 years after first post...): beware that in newer versions of ehcache, persistence of cache items is available only in the paid product. Thanks @boday for pointing this out.
ehcache is great. It will give you the flexibility you need to implement the map in memory, on disk, or in memory with spillover to disk. If you use this very simple wrapper for java.util.Map then using it is blindingly simple:
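The wrapper the answer refers to is essentially a Map adapter that delegates to a cache. The sketch below shows the shape of such an adapter; the CacheBackend interface and InMemoryBackend class are stand-ins invented here, and with the real wrapper the backend would be an Ehcache cache configured to overflow to disk.

```java
import java.util.AbstractMap;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// A hypothetical backend interface standing in for a disk-spilling cache.
interface CacheBackend<K, V> {
    V fetch(K key);
    void store(K key, V value);
    Set<Map.Entry<K, V>> entries();
}

// Trivial in-memory backend, used here only so the sketch is runnable.
class InMemoryBackend<K, V> implements CacheBackend<K, V> {
    private final Map<K, V> data = new HashMap<>();
    public V fetch(K key) { return data.get(key); }
    public void store(K key, V value) { data.put(key, value); }
    public Set<Map.Entry<K, V>> entries() { return data.entrySet(); }
}

// The Map wrapper: java.util.Map calls delegate straight to the backend,
// so swapping the backend changes where the data lives, not the calling code.
class CacheMap<K, V> extends AbstractMap<K, V> {
    private final CacheBackend<K, V> backend;
    CacheMap(CacheBackend<K, V> backend) { this.backend = backend; }

    @Override public V get(Object key) {
        @SuppressWarnings("unchecked") K k = (K) key;
        return backend.fetch(k);
    }
    @Override public V put(K key, V value) {
        V old = backend.fetch(key);
        backend.store(key, value);
        return old;
    }
    @Override public Set<Entry<K, V>> entrySet() { return backend.entries(); }
}
```

The appeal is exactly the one the question asks for: calling code keeps using a plain Map while the storage strategy changes underneath it.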
Have you never heard of prevalence frameworks?
EDIT: some clarifications on the term.
As James Gosling now says, no SQL DB is as efficient as in-memory storage. Prevalence frameworks (the best known being Prevayler and Space4J) are built on this idea of an in-memory, possibly disk-backed, store. How do they work? It's deceptively simple: a storage object contains all persistent entities, and that storage can only be changed by serializable operations. As a consequence, putting an object in storage is a Put operation performed in an isolated context. Because the operation is serializable, it can (depending on configuration) also be saved to disk for long-term persistence. The main data repository remains memory, however, which provides undoubtedly fast access times, at the cost of high memory usage.
Another advantage is that, because of their obvious simplicity, these frameworks rarely contain more than ten classes.
Considering your question, Space4J immediately came to mind (it supports "passivation" of rarely used objects: their index key stays in memory, but the objects themselves are kept on disk as long as they're not used).
Note that you can also find some information at c2wiki.
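The mechanism described above (an in-memory map mutated only through serializable operations that are journaled to disk and replayed on startup) can be sketched with plain JDK serialization. This is a toy illustration of the idea, not the real Prevayler or Space4J API; PrevalentMap and PutOp are names made up here.

```java
import java.io.*;
import java.util.HashMap;
import java.util.Map;

// Toy prevalence-style store: the in-memory map is the system of record,
// and every change is a serializable operation appended to a journal file,
// so the map can be rebuilt after a JVM restart by replaying the journal.
class PrevalentMap {
    // The only way to mutate the map: a serializable Put operation.
    static final class PutOp implements Serializable {
        private static final long serialVersionUID = 1L;
        final String key;
        final String value;
        PutOp(String key, String value) { this.key = key; this.value = value; }
    }

    private final Map<String, String> state = new HashMap<>();
    private final File journal;

    PrevalentMap(File journal) throws IOException, ClassNotFoundException {
        this.journal = journal;
        replay();
    }

    void put(String key, String value) throws IOException {
        PutOp op = new PutOp(key, value);
        // Serialize the operation, then append it as a length-prefixed record.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(op);
        }
        try (DataOutputStream log =
                 new DataOutputStream(new FileOutputStream(journal, true))) {
            byte[] bytes = buf.toByteArray();
            log.writeInt(bytes.length);
            log.write(bytes);
        }
        apply(op);
    }

    String get(String key) { return state.get(key); }

    private void apply(PutOp op) { state.put(op.key, op.value); }

    // Rebuild in-memory state by re-applying every journaled operation.
    private void replay() throws IOException, ClassNotFoundException {
        if (!journal.exists()) return;
        try (DataInputStream in = new DataInputStream(
                 new BufferedInputStream(new FileInputStream(journal)))) {
            while (true) {
                int len;
                try { len = in.readInt(); } catch (EOFException eof) { break; }
                byte[] bytes = new byte[len];
                in.readFully(bytes);
                apply((PutOp) new ObjectInputStream(
                          new ByteArrayInputStream(bytes)).readObject());
            }
        }
    }
}
```

Real frameworks add snapshots so the journal doesn't grow forever, which also covers the questioner's purge requirement.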
Berkeley DB Java Edition has a Collections API. Within that API, StoredMap in particular is a drop-in replacement for a ConcurrentHashMap. You'll need to create the Environment and Database before creating the StoredMap, but the Collections tutorial should make that pretty easy.
Per your requirements, Berkeley DB is designed to be easy to use, and I think you'll find that it has exceptional scalability and performance. Berkeley DB is available under an open source license; it's persistent, platform independent, and allows you to search for data. The data can certainly be purged/deleted as needed. Berkeley DB has a long list of other features which you may find highly useful to your application, especially as your requirements change and grow with the success of the application.
If you decide to use Berkeley DB Java Edition, please be sure to ask questions on the BDB JE Forum. There's an active developer community that's happy to help answer questions and resolve problems.
We have a similar solution implemented using Xapian. It's fast, it's scalable, it provides almost all the search functionality you requested, it's free, multiplatform, and of course purgeable.
I came across jdbm2 a few weeks ago. The usage is very simple; you should be able to get it working in half an hour. One drawback is that any object put into the map must be serializable, i.e. implement Serializable. Other cons are given on their website. However, no object-persistence database is a permanent solution for storing objects of your own Java classes: if you decide to change the fields of the class, you will no longer be able to retrieve the old objects from the map collection. It is ideal for storing standard serializable classes like String, Integer, etc.
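The Serializable requirement and the class-evolution caveat come straight from plain JDK serialization, which stores like jdbm2 rely on to turn objects into bytes. The sketch below shows the round-trip; Foo is a made-up example class, and pinning serialVersionUID is the standard way to keep old bytes readable across compatible class changes.

```java
import java.io.*;

// A made-up user class. Without an explicit serialVersionUID, any field
// change alters the computed version and old serialized bytes fail to load
// with InvalidClassException; pinning it keeps compatible changes readable.
class Foo implements Serializable {
    private static final long serialVersionUID = 1L;
    final String name;
    final int count;
    Foo(String name, int count) { this.name = name; this.count = count; }
}

// The serialize/deserialize round-trip a jdbm2-style map performs per entry.
class SerializationDemo {
    static byte[] toBytes(Object o) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(o);
        }
        return buf.toByteArray();
    }

    static Object fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }
}
```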
The google-collections library, part of http://code.google.com/p/guava-libraries/, has some really useful Map tools. MapMaker in particular lets you make concurrent hash maps with timed evictions, soft values that will be swept up by the garbage collector if you're running out of heap, and computing functions.
That will give you a Map cache that cleans up after itself and can compute its own values. If you're able to compute values like that then great; otherwise it would map perfectly onto http://redis.io/, which you'd be writing into (to be fair, redis would probably be fast enough on its own!).
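Two of the MapMaker features mentioned above can be sketched with the JDK alone: values held through SoftReferences (so the GC may reclaim them under memory pressure) and a computing function that fills in missing values on demand. MapMaker configures this declaratively; none of the names below come from google-collections.

```java
import java.lang.ref.SoftReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// JDK-only sketch: soft-valued map with a computing function. If a value
// was never cached, or the GC cleared its SoftReference under memory
// pressure, it is recomputed and cached again.
class SoftComputingMap<K, V> {
    private final Map<K, SoftReference<V>> map = new ConcurrentHashMap<>();
    private final Function<K, V> computer;

    SoftComputingMap(Function<K, V> computer) { this.computer = computer; }

    V get(K key) {
        SoftReference<V> ref = map.get(key);
        V value = (ref == null) ? null : ref.get();
        if (value == null) {
            value = computer.apply(key);
            map.put(key, new SoftReference<>(value));
        }
        return value;
    }
}
```

Note this only bounds heap use; unlike a disk-backed map, reclaimed values are lost rather than persisted, which is why the answer pairs it with redis for durability.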