声明一个带有微小单元的巨大动态数组 [C++]
我正在做这个项目。满足以下条件
- 在这个项目中,我需要创建一个巨大的数组(希望我能够创建一个像 ~7.13e+17 这样大的数组,但这个目标仍然领先。)
- 数组内的每个单元格可以包含以下之一三个值:0,1,2
- 我使用 C++ 作为我的语言。
我尝试使用普通的动态数组命令
int * p;
int i;
i=[size]; //This is calculated somewhere else.
p= new (nothrow) int[i];
,但据我了解,该数组创建的数组的可能最大大小为 int 的最大大小。如果我更改代码并使用以下代码
long long * p;
long long i;
i=[size]; //This is calculated somewhere else.
p= new (nothrow) long long [i];
,那么数组中的每个单元格都是“long long”类型,使得数组内存非常重。 有没有办法使用 long long 创建一个数组来确定数组中的单元格数量并使每个单元格的大小为 int ?
多谢, 乌列尔。
编辑:了解更多信息。
- 这个问题主要是理论问题,是我硕士论文的一部分。我仍然希望这个程序能够尽可能地工作。
- 我当前的步骤是使这项工作适用于具有 2.56e+09 项的数组,快速计算表明我们正在讨论至少 0.6 GB 的数组,我的系统应该能够处理。然而,即使所需的空间量实际上为 4.5GB,我仍无法使用当前的编码解决方案实现此目标。
I have this project I'm working on. The following conditions apply
- In this project I need to create one huge array (hopefully I will be able to create one as big as ~7.13e+17, but this target is still up ahead.)
- Each cell inside the array can contain one of the three values: 0,1,2
- I'm using C++ as my language.
I tried using the normal dynamic array command
int * p;
int i;
i=[size]; //This is calculated somewhere else.
p= new (nothrow) int[i];
But as far as I understand, this array makes an array with a possible maximum size of the maximum size of int. If I change my code and use the following code
long long * p;
long long i;
i=[size]; //This is calculated somewhere else.
p= new (nothrow) long long [i];
Then each cell in the array is of type "long long" making the array very memory heavy.
Is there any way to create an array using long long to determine the number of cells in the array and have every cell of size int?
Thanks a lot,
Uriel.
EDIT: for further information.
- This problem is mainly theoretical, it is part of my Master's thesis. I would still like this program to work as best as it can.
- My current step is to make this work for an array with 2.56e+09 items, quick calculation shows we are talking about an array that's at least 0.6 Gigabytes, something my system should be able to cope with. Yet I cannot achieve this goal with my current coding solution even though the amount of space required is really at 4.5GB.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(7)
数组的类型没有必要与用于指定大小的变量的类型相同。因此,使用
long long
作为指定大小的变量,然后使用int
作为数组的类型。然而,当你说你需要创建一个“像〜7.13e + 17一样大”的数组时,我很担心。我不知道你指的是字节还是元素,但无论哪种方式对于直数组来说都非常巨大。这已经进入了 PB 级数据的领域。
在 32 位程序中,这是不可能的。理论上,您可以拥有一个高达几千兆字节的数组(尽管实际上大多数情况下要少得多)。
据我所知,在 64 位程序中理论上您可以分配这么大的数组。然而,我怀疑大多数机器是否真的可以处理它。由于这个数据量将远远超过机器中的 RAM,操作系统将被迫将该数组的大部分内容推送到页面文件中。但 PB 大小的页面文件将远远超过目前大多数典型机器上的硬盘空间。
无论哪种方式,您可能需要认真考虑一种不同的方案,而不是立即分配整个巨大的数组。
There is no reason the type of the array has to be the same as the type of the variable used to specify the size. So use
long long
for the variable that specifies the size and thenint
for the type of the array.However, I am concerned when you say you need to create an array "as big as ~7.13e+17". I don't know whether you mean bytes or elements but either way that is incredibly huge for a straight array. That's getting into the realm of petabytes of data.
In a 32-bit program, that's simply not possible. In theory you could have an array up to a couple gigabytes (though in practice considerably less most times).
In a 64-bit program you could in theory allocate an array that large, as far as I know. However, I'm skeptical that most machines could actually handle it. Since this amount of data would far exceed the RAM in the machine, the operating system would be forced to push most of this array into a page file. But a petabyte sized page file would far exceed the harddrive space on most typical machines right now.
Either way, you will probably need to seriously consider a different scheme than just allocating that whole huge array at once.
由于您希望最大化包装密度,因此最好使用位字段:
然后您可以创建这些位字段的数组并使用代理对象来支持读取和写入单个项目 - 附带条件是有一些限制你可以用代理对象做多少事情,所以你必须要小心如何使用它。稍微看一下有关
vector
的一些文章应该会提供一些合理的提示——它的大部分特征都源于这种通用的实现类型。尽管作为通用容器存在缺点,但它可以在一定范围内工作,并且比大多数明显的替代方案提供更紧密的信息打包。Since you want a to maximize packing density, you'd probably be best off using bit fields:
Then you can create an array of these and use proxy objects to support reading and writing individual items -- with the proviso that there are some limitations on how much you can do with proxy objects, so you'll have to be a bit careful about how you try to use this. A bit of looking at some of the articles about
vector<bool>
should provide some reasonable hints -- most of its characteristics stem from this general type of implementation. Despite the shortcomings as a general purpose container, this can work within limits, and provides much tighter packing of information than most of the obvious alternatives.这需要创建一个专用的结构,a la 数字树(或b-tree),键为索引,以避免进行大量分配。
大量分配,尤其是重新分配可能会导致不必要的内存碎片。如果将大数组分割成更小的块,那么不仅数组扩展变得容易,而且稀疏数组的表示也成为可能。
注意
~7.13e+17
大约为 60 位长。您是否有可以支持那么多 RAM 的硬件?这并不是说我密切关注行业,而是我简单地听说过具有 58 位地址总线的 NUMA 架构,但对 60+ 位架构一无所知。如果单元格可能只包含 3 个值(2.2 可以表示为 2),那么它就是 2 位信息。这意味着您可以打包到
uint32_t
16 个值和uint64_t
32 个值中。您可以尝试找到一些现有的数字树实现(或自己推出)并用作索引的关键高位。原始索引的剩余位将是树叶的索引,该树叶将是具有打包值的数组。举例来说,使用
std::map
代替 trie,未经测试:这种方式仍然浪费 0.5 位信息,因为存储是 2 位,允许 4 个值,但只使用了 3 个值。这也可以得到改善,但访问性能成本要高得多。
That calls to create a dedicated structure, a la digital tree (or b-tree) with key being the index, to avoid doing large allocations.
Large allocations and especially reallocations might cause unnecessary memory fragmentation. If you split large array into smaller chunks, then not only array extension becomes easy, but also presentation of sparse array becomes possible.
N.B.
~7.13e+17
is about 60 bit long. Do you even have hardware which can support that much RAM? It's not that I'm following industry closely, but I heard briefly about NUMA archs with 58 bit address bus - but nothing about 60+ bit archs.If cell might contain only 3 values (2.2 can be represented as 2) that makes it 2 bits of information. That means that you can pack into the
uint32_t
16 values and into theuint64_t
32 values.You can try to find some existing digital tree implementation (or roll your own) and use as the key upper bits of index. Remaining bits of the original index would be an index into the tree leaf which would be an array with packed values. To exemplify using
std::map
in place of the trie, untested:This way one still wastes 0.5 bit of information since storage is 2 bits what allows to 4 values, yet only 3 are used. That can be also improved, but at much higher access performance cost.
由于所有值都小于 255,因此您可能希望将其设为 char 数组。
无论如何,指针类型并不决定其最大可分配大小。
As all of your values are smaller than 255, you might want to make this an array of char.
In any case, the pointer type does not dictate the maximum allocatable size for same.
那绝对是错误的!数组的大小与数组类型的最大值完全无关。
所以不需要将其设为
long long
数组。相反,您甚至应该将其设为char
数组,甚至小于该数组。如果您只需要存储三个不同的值,则应该使用
char
或任何其他类型中的位。然后制作一个数组。一个
char
通常是 1 个字节,即 8 位。要存储 3 个值,需要 2 位;因此,您可以在char
中存储 4 个值。使用二进制掩码,您应该找到一种优化方法。
That's absolutely wrong ! The size of the array is completely independent from the maximum value of the type of the array.
So there is no need to make it a
long long
array. Instead you should even make it achar
array or even less than that.If you need to store only three different values, you should play with bits inside a
char
or any other type. Then make an array of these.A
char
is typically 1 byte, so 8 bits. To store 3 values, you need 2 bits; you can therefore store 4 values in achar
.Using binary masks you should work out a way to optimize that.
用于指定数组大小的大小需要为
size_t
类型。new
表达式中使用的类型是数组元素的类型。无论示例中的i
类型如何,它都会转换为size_t
以创建数组。现在在 32 位机器上,最大
size_t
约为 4e+9,因此创建大小为 1e+17 的数组是正确的。在 64 位计算机上,size_t
理论上可以达到 1e+19 左右,但是您不可能拥有接近该数量的内存,因此分配将会失败。因此,正如其他人讨论的那样,您需要某种稀疏数据结构。这里的关键是确定 3 个值中哪一个最常见,并且仅存储数组是其他 2 个值之一的值。您可以使用 std::map 来保存这些值(甚至支持使用
[index]
语法),或各种其他值,具体取决于您想要执行的操作以及详细信息你是数据。The size used to specify the size of an array needs to be type
size_t
. The type used in thenew
expression is the type of the array elements. Regardless of the type ofi
in your example, it will be converted tosize_t
to create the array.Now on a 32-bit machine, the maximum
size_t
is around 4e+9, so making an array of size 1e+17 is right out. On a 64-bit machine,size_t
can theoretically go up to around 1e+19, but there's no way you can have anywhere near that amount of memory, so the allocation will fail.So instead you need some kind of sparse data structure as others have discussed. The key here is to decide which of your 3 values is the most common, and only store values for where the array is one of the other 2 values. You can use a std::map to hold these values (even supports using the
[index]
syntax), or a variety of others, depending on exactly what you're trying to do and the details of you're data.由于值列表是有限的,因此可以仅使用 char 数组。一个字节可以很容易地容纳三个不同的值。
价值观:
0-> 0
1-> 1
2.2-> 2
存储值:
检索值:
获取最后一个值需要额外小心,但数组会很小(1 个字节)。
数组分配:
Since there is a finite list of value it might be possible to just use a char array. One byte can hold the three different values very easily.
Values:
0 -> 0
1 -> 1
2.2 -> 2
Storing values:
Retrieving values:
A little extra care is needed to get the last value but the array will be small (1 byte).
Array Allocation: