My application currently stores millions of Double elements for a calculation. These values are only temporary: they are held until a specific algorithm, run at the end of the calculation, consumes them. Once that calculation is done, the millions of values can be discarded.
The full story is here, if you need more details.
One of the solutions that was proposed is to use an in-memory database.
So if I go with this solution, I will use this database to store my values in a table to replace my current Map<String, List<Double>>, like:
create table CALCULATION_RESULTS_XXX (
    deal_id varchar2(40),
    deal_value number
);
(one table per calculation, where XXX is the calculation ID)
So during the calculation, I will do the following:
- When the calculation is started, I create the CALCULATION_RESULTS_XXX table.
- Every time I need to add a value, I insert a record in this table.
- At the end of the calculation, I use the table content for my algorithm.
- Finally, I drop this table.
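The four steps above can be sketched as plain SQL strings. Note that `values` is a reserved word in most SQL dialects (including H2), so the column is called `deal_value` here, typed as `double`, which H2 accepts; the `buildCreate`/`buildInsert`/`buildDrop` helpers and the commented H2 JDBC URL are illustrative assumptions, not any library's API:

```java
// Sketch of the per-calculation table lifecycle: create, insert, use, drop.
public class CalcTableSql {
    static String tableName(String calcId) {
        return "CALCULATION_RESULTS_" + calcId;
    }
    static String buildCreate(String calcId) {
        // "values" is reserved, so the value column is named deal_value.
        return "create table " + tableName(calcId)
             + " (deal_id varchar(40), deal_value double)";
    }
    static String buildInsert(String calcId) {
        return "insert into " + tableName(calcId)
             + " (deal_id, deal_value) values (?, ?)";
    }
    static String buildDrop(String calcId) {
        return "drop table " + tableName(calcId);
    }
    public static void main(String[] args) {
        // With H2 on the classpath, these strings would be executed over
        // a connection such as:
        //   DriverManager.getConnection("jdbc:h2:mem:calc")
        System.out.println(buildCreate("42"));
        // prints: create table CALCULATION_RESULTS_42 (deal_id varchar(40), deal_value double)
    }
}
```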
As explained in the other subject, my calculation may currently hold several hundred MB of data in memory: 30 lists of 1,000,000 Double values need about 240 MB.
The questions now:
- If I go with an in-memory database, will my memory consumption decrease?
- What specific points will I have to take care of regarding database usage (or table creation), data insertion, etc.?
- I think I will choose the H2 database. Do you think it's the best choice for my needs?
A simple HashMap backed by Terracotta would do better, and it allows storing collections bigger than the JVM's virtual memory.
Embedded databases, especially SQL-based ones, will add complexity and overhead to your code, so it isn't worth it. If you really need persistent storage with random access, try one of the NoSQL DBs, like CouchDB, Cassandra, or neo4j.
The problem is sufficiently simple that you really need to just give it a go and see how the (performance) results work out.
You already have an implementation that just uses simple in-memory structures. Personally, given that even the cheapest computer from Dell comes with 1GB+ of RAM, you might as well stick with that. That aside, it should be fairly simple to whack in a database or two. I'd consider Sleepycat Berkeley DB (which is now owned by Oracle...), because you don't need to use SQL and they should be quite efficient. (They do support Java.)
If the results are promising, I'd then consider further investigation, but this really should only take a few days' work at most, including the benchmarking.
I don't know whether it will be faster, so you'd have to try it. What I do want to recommend is to do batch inserts of an entire list once you no longer immediately need that list. Don't save value by value :)
If your final algorithm can be expressed in SQL, it might also be worth your while to do that, rather than loading all the lists back in. In any case, don't put anything like an index or constraint on the values, and preferably don't allow NULL either (if possible). Maintaining indices and constraints costs time, and allowing NULL can also cost time or create overhead. The deal_ids can (and are) of course indexed, as they're primary keys.
This isn't very much but at least better than a single down-voted answer :)
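The batching advice can be sketched as follows: buffer the values, split them into fixed-size chunks, and send each chunk as one round trip. The chunking below is plain Java and the JDBC usage is shown only in comments, as an assumed pattern with PreparedStatement.addBatch()/executeBatch():

```java
import java.util.ArrayList;
import java.util.List;

// Split a list of values into fixed-size batches instead of
// inserting them one at a time.
public class BatchingSketch {
    static List<List<Double>> chunks(List<Double> values, int batchSize) {
        List<List<Double>> out = new ArrayList<>();
        for (int i = 0; i < values.size(); i += batchSize) {
            out.add(values.subList(i, Math.min(i + batchSize, values.size())));
        }
        return out;
    }
    public static void main(String[] args) {
        List<Double> values = new ArrayList<>();
        for (int i = 0; i < 2500; i++) values.add((double) i);
        // With a real JDBC connection, each chunk would become one batch:
        //   PreparedStatement ps = conn.prepareStatement(
        //       "insert into CALCULATION_RESULTS_XXX (deal_id, deal_value) values (?, ?)");
        //   for (List<Double> batch : chunks(values, 1000)) {
        //       for (Double v : batch) {
        //           ps.setString(1, dealId); ps.setDouble(2, v); ps.addBatch();
        //       }
        //       ps.executeBatch(); // one round trip per 1000 values
        //   }
        System.out.println(chunks(values, 1000).size()); // prints 3 (1000 + 1000 + 500)
    }
}
```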
There really is no reason at all to add an external component that will make your program run slower. Compress the data block and write it to a file if you need to handle more than the available internal memory. A workstation now takes 192GB of RAM, so you can't afford to waste much time on this.
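The compress-and-spill idea above needs nothing beyond the JDK: stream the doubles through GZIP into a temp file, then read them back when the final algorithm runs. The class and file names here are illustrative:

```java
import java.io.*;
import java.nio.file.*;
import java.util.zip.*;

// Spill an array of doubles to a compressed temp file and read it back.
public class SpillToDisk {
    static Path write(double[] values) {
        try {
            Path file = Files.createTempFile("calc_results", ".gz");
            try (DataOutputStream out = new DataOutputStream(
                    new GZIPOutputStream(new BufferedOutputStream(
                        Files.newOutputStream(file))))) {
                out.writeInt(values.length);          // length header
                for (double v : values) out.writeDouble(v);
            }
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
    static double[] read(Path file) {
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(new BufferedInputStream(
                    Files.newInputStream(file))))) {
            double[] values = new double[in.readInt()];
            for (int i = 0; i < values.length; i++) values[i] = in.readDouble();
            return values;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
    public static void main(String[] args) throws IOException {
        double[] values = new double[100_000];
        for (int i = 0; i < values.length; i++) values[i] = i * 0.5;
        Path file = write(values);
        double[] back = read(file);
        System.out.println(back.length + " " + back[99_999]); // prints: 100000 49999.5
        Files.delete(file);
    }
}
```

Sequential values compress well; whether this beats an in-memory database for this workload would still have to be measured.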