I have a data set saved as a text file that basically contains vectors stored line by line. Each vector has 10k dimensions and I have 250 such vectors. Each vector entry is a double. Here's an example:
Vector 1 -> 0.0 0.0 0.0 0.439367 0.0 .....10k such entries
Vector 2 -> 0.0 0.0 0.0 0.439367 0.0 0.0 0.0 0.0 .....10k such entries
...
...
Vector 250 -> 0.0 1.203973 0.0 0.0 0.0 .....10k such entries
Now if I do the math, this should take up 10k X 16 bytes X 250 of space (assuming each vector entry is a double taking up 16 bytes), which is ~40MB. However, I see that the file size is shown as only 9.8MB. Am I going wrong somewhere?
The thing is, I am using this data in my Java code. The space complexity of my algorithm is O(number of entries in the vector X number of entries). Even when I run my code with something like 4GB of memory allocated, I still run out of heap space. What am I missing?
Thanks.
Andy
Comments (5)
After so many people guessing about the size, I did three simple tests and used the Eclipse Memory Analyzer to determine the size. (Win7, 1.6.0_21 Java HotSpot(TM) 64-Bit Server VM)
double[][] = Size: 19.2 MB, Classes: 328, Objects: 2.7k
Double[][] structure = Size: 76.5 MB, Classes: 332, Objects: 2.5m
ArrayList<ArrayList<Double>> = Size: 79.6 MB, Classes: 330, Objects: 2.5m
256MB (java -Xmx256m Huge) was enough to run the tests.
So I guess the problem is not the size, it could be two things:
If somebody is interested in the code:
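The code from the original comment is not preserved here. As a rough sketch only, a test along these lines would allocate the three structures compared above. The class name Huge is taken from the command line in the comment; the method names and everything else are assumptions made for illustration, not the commenter's actual code.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative reconstruction, NOT the code from the original comment.
class Huge {
    static final int VECTORS = 250;       // number of vectors in the question
    static final int DIMENSIONS = 10000;  // entries per vector in the question

    // Primitive storage: 8 bytes per entry plus one array header per row.
    static double[][] primitive() {
        return new double[VECTORS][DIMENSIONS];
    }

    // Boxed array: every entry is a reference to a separate Double object.
    static Double[][] boxed() {
        Double[][] data = new Double[VECTORS][DIMENSIONS];
        for (Double[] row : data) {
            for (int j = 0; j < row.length; j++) {
                row[j] = Double.valueOf(j);
            }
        }
        return data;
    }

    // Nested lists: boxed entries plus the ArrayList bookkeeping per row.
    static List<List<Double>> lists() {
        List<List<Double>> data = new ArrayList<List<Double>>(VECTORS);
        for (int i = 0; i < VECTORS; i++) {
            List<Double> row = new ArrayList<Double>(DIMENSIONS);
            for (int j = 0; j < DIMENSIONS; j++) {
                row.add(Double.valueOf(j));
            }
            data.add(row);
        }
        return data;
    }

    public static void main(String[] args) {
        // Keep all three alive so a heap dump or profiler can inspect them.
        double[][] a = primitive();
        Double[][] b = boxed();
        List<List<Double>> c = lists();
        System.out.println("primitive rows: " + a.length);
        System.out.println("boxed rows:     " + b.length);
        System.out.println("list rows:      " + c.size());
    }
}
```

Taking a heap dump while the three references are still reachable lets a tool such as the Eclipse Memory Analyzer report the retained size of each structure separately.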
Without seeing the code, I can't say for certain, but it sounds like you're over-allocating, either a) when you read the data from the file or b) somewhere in your algorithm. I would advise that you use a tool such as VisualVM to review your object allocation; it will be able to tell you how you're allocating and what mistakes you're making.
Where you're going wrong is the assumption that every double takes 16 bytes of space when saved as text. You seem to have lots of 0 values, which take only 4 bytes each in string form (including the separator).
As for running out of heap space, that depends on your code. One reason might be that you're storing your data in an ArrayList<Double> or (worse) a TreeSet<Double>: the Double wrapper objects will easily cause a memory overhead of 200%, and the Set/Map structures are much worse.
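To make the arithmetic concrete, here is a small sketch comparing the size of the text form with the payload of primitive storage. It assumes an ASCII-encoded file with exactly one separator (space or newline) after each value, and it counts only the raw 8 bytes per Java double, ignoring array headers. The class and method names are made up for this example.

```java
// Hypothetical helper for back-of-the-envelope size estimates.
class SizeEstimate {
    // Bytes occupied in an ASCII text file: each token plus one separator.
    static long textBytes(String[] tokens) {
        long bytes = 0;
        for (String token : tokens) {
            bytes += token.length() + 1;
        }
        return bytes;
    }

    // Raw payload of primitive doubles in memory (8 bytes each, no headers).
    static long primitiveBytes(int vectors, int dimensions) {
        return (long) vectors * dimensions * (Double.SIZE / 8);
    }

    public static void main(String[] args) {
        String[] sample = {"0.0", "0.0", "0.0", "0.439367", "0.0"};
        System.out.println("sample line prefix: " + textBytes(sample) + " bytes");

        // If every entry were the 4-byte text "0.0 ", the whole file would be:
        long allZeros = 250L * 10000 * textBytes(new String[] {"0.0"});
        System.out.println("all-zero text file: " + allZeros + " bytes");

        System.out.println("double[][] payload: " + primitiveBytes(250, 10000) + " bytes");
    }
}
```

Under these assumptions an all-zero file would be 10,000,000 bytes, which is in the same range as the 9.8MB reported in the question, while the same data as primitive doubles needs about 20,000,000 bytes before any boxing overhead.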
Hard to say without seeing the code and VM arguments. But note that variables in your algorithm also consume memory, and that file size vs. memory usage depends on how you construct your in-memory objects; for example, a simple object without a double takes up space on its own.
Get a proper tool for benchmarking memory usage. Check out the TPTP Eclipse distribution.
Also, you might want to check out sparse matrices.
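Since most entries in the sample data are 0.0, a sparse representation stores only the non-zero entries. The sketch below is one minimal way to do that with two parallel arrays; SparseVector and its methods are hypothetical names for this illustration, not a class from any library, and a real project might prefer an existing linear algebra package.

```java
import java.util.Arrays;

// Hypothetical sparse vector: keeps only non-zero entries and their indices.
class SparseVector {
    private final int dimension;
    private final int[] indices;    // positions of non-zero entries, ascending
    private final double[] values;  // values at those positions

    private SparseVector(int dimension, int[] indices, double[] values) {
        this.dimension = dimension;
        this.indices = indices;
        this.values = values;
    }

    // Build from a dense array, dropping every entry equal to zero.
    static SparseVector fromDense(double[] dense) {
        int count = 0;
        for (double d : dense) {
            if (d != 0.0) count++;
        }
        int[] idx = new int[count];
        double[] val = new double[count];
        int k = 0;
        for (int i = 0; i < dense.length; i++) {
            if (dense[i] != 0.0) {
                idx[k] = i;
                val[k] = dense[i];
                k++;
            }
        }
        return new SparseVector(dense.length, idx, val);
    }

    // Entries that were not stored are implicitly zero.
    double get(int i) {
        if (i < 0 || i >= dimension) {
            throw new IndexOutOfBoundsException("index " + i);
        }
        int pos = Arrays.binarySearch(indices, i);
        return pos >= 0 ? values[pos] : 0.0;
    }

    int nonZeroCount() {
        return values.length;
    }

    // Dot product by walking both sorted index arrays in step.
    double dot(SparseVector other) {
        double sum = 0.0;
        int a = 0;
        int b = 0;
        while (a < indices.length && b < other.indices.length) {
            if (indices[a] == other.indices[b]) {
                sum += values[a] * other.values[b];
                a++;
                b++;
            } else if (indices[a] < other.indices[b]) {
                a++;
            } else {
                b++;
            }
        }
        return sum;
    }
}
```

Each stored entry costs 12 bytes of payload here (a 4-byte index plus an 8-byte value), so this layout only saves memory when well under two thirds of the entries are non-zero.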
If we can't see the code (which is fair enough), all I can say is to use the -XX:+HeapDumpOnOutOfMemoryError command line option when you start your application, then analyse the resulting heap dump with jhat.