声明一个带有微小单元的巨大动态数组 [C++]

发布于 2024-09-18 18:49:49 字数 821 浏览 11 评论 0原文

我正在做这个项目。满足以下条件

在这个项目中，我需要创建一个巨大的数组（希望我能够创建一个像 ~7.13e+17 这样大的数组，但这个目标仍然领先。）
数组内的每个单元格可以包含以下之一三个值：0,1,2
我使用 C++ 作为我的语言。

我尝试使用普通的动态数组命令

int * p;
int i;    
i=[size]; //This is calculated somewhere else.
p= new (nothrow) int[i];

，但据我了解，该数组创建的数组的可能最大大小为 int 的最大大小。如果我更改代码并使用以下代码

long long * p;
long long i;    
i=[size]; //This is calculated somewhere else.
p= new (nothrow) long long [i];

，那么数组中的每个单元格都是“long long”类型，使得数组内存非常重。有没有办法使用 long long 创建一个数组来确定数组中的单元格数量并使每个单元格的大小为 int ？

多谢，乌列尔。

编辑：了解更多信息。

这个问题主要是理论问题，是我硕士论文的一部分。我仍然希望这个程序能够尽可能地工作。
我当前的步骤是使这项工作适用于具有 2.56e+09 项的数组，快速计算表明我们正在讨论至少 0.6 GB 的数组，我的系统应该能够处理。然而，即使所需的空间量实际上为 4.5GB，我仍无法使用当前的编码解决方案实现此目标。

原文

I have this project I'm working on. The following conditions apply

In this project I need to create one huge array (hopefully I will be able to create one as big as ~7.13e+17, but this target is still up ahead.)
Each cell inside the array can contain one of the three values: 0,1,2
I'm using C++ as my language.

I tried using the normal dynamic array command

int * p;
int i;    
i=[size]; //This is calculated somewhere else.
p= new (nothrow) int[i];

But as far as I understand, this array makes an array with a possible maximum size of the maximum size of int. If I change my code and use the following code

long long * p;
long long i;    
i=[size]; //This is calculated somewhere else.
p= new (nothrow) long long [i];

Then each cell in the array is of type "long long" making the array very memory heavy.
Is there any way to create an array using long long to determine the number of cells in the array and have every cell of size int?

Thanks a lot,
Uriel.

EDIT: for further information.

This problem is mainly theoretical, it is part of my Master's thesis. I would still like this program to work as best as it can.
My current step is to make this work for an array with 2.56e+09 items, quick calculation shows we are talking about an array that's at least 0.6 Gigabytes, something my system should be able to cope with. Yet I cannot achieve this goal with my current coding solution even though the amount of space required is really at 4.5GB.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

双手揣兜 2024-09-25 18:49:50

有什么方法可以使用 long long 创建一个数组来确定数组中的单元格数量并使每个单元格的大小为 int 吗？

数组的类型没有必要与用于指定大小的变量的类型相同。因此，使用long long作为指定大小的变量，然后使用int作为数组的类型。

int * p;
long long i;    
i=[size]; //This is calculated somewhere else.
p= new (nothrow) int [i];

然而，当你说你需要创建一个“像〜7.13e + 17一样大”的数组时，我很担心。我不知道你指的是字节还是元素，但无论哪种方式对于直数组来说都非常巨大。这已经进入了 PB 级数据的领域。

在 32 位程序中，这是不可能的。理论上，您可以拥有一个高达几千兆字节的数组（尽管实际上大多数情况下要少得多）。

据我所知，在 64 位程序中理论上您可以分配这么大的数组。然而，我怀疑大多数机器是否真的可以处理它。由于这个数据量将远远超过机器中的 RAM，操作系统将被迫将该数组的大部分内容推送到页面文件中。但 PB 大小的页面文件将远远超过目前大多数典型机器上的硬盘空间。

无论哪种方式，您可能需要认真考虑一种不同的方案，而不是立即分配整个巨大的数组。

Is there any way to create an array using long long to determine the number of cells in the array and have every cell of size int?

There is no reason the type of the array has to be the same as the type of the variable used to specify the size. So use long long for the variable that specifies the size and then int for the type of the array.

int * p;
long long i;    
i=[size]; //This is calculated somewhere else.
p= new (nothrow) int [i];

However, I am concerned when you say you need to create an array "as big as ~7.13e+17". I don't know whether you mean bytes or elements but either way that is incredibly huge for a straight array. That's getting into the realm of petabytes of data.

In a 32-bit program, that's simply not possible. In theory you could have an array up to a couple gigabytes (though in practice considerably less most times).

In a 64-bit program you could in theory allocate an array that large, as far as I know. However, I'm skeptical that most machines could actually handle it. Since this amount of data would far exceed the RAM in the machine, the operating system would be forced to push most of this array into a page file. But a petabyte sized page file would far exceed the harddrive space on most typical machines right now.

Either way, you will probably need to seriously consider a different scheme than just allocating that whole huge array at once.

回复收藏 0 原文

無心 2024-09-25 18:49:50

由于您希望最大化包装密度，因此最好使用位字段：

struct item_pack { 
    char a:2;
    char b:2:
    char c:2;
    char d:2;
};

然后您可以创建这些位字段的数组并使用代理对象来支持读取和写入单个项目 - 附带条件是有一些限制你可以用代理对象做多少事情，所以你必须要小心如何使用它。稍微看一下有关 vector 的一些文章应该会提供一些合理的提示——它的大部分特征都源于这种通用的实现类型。尽管作为通用容器存在缺点，但它可以在一定范围内工作，并且比大多数明显的替代方案提供更紧密的信息打包。

Since you want a to maximize packing density, you'd probably be best off using bit fields:

struct item_pack { 
    char a:2;
    char b:2:
    char c:2;
    char d:2;
};

Then you can create an array of these and use proxy objects to support reading and writing individual items -- with the proviso that there are some limitations on how much you can do with proxy objects, so you'll have to be a bit careful about how you try to use this. A bit of looking at some of the articles about vector<bool> should provide some reasonable hints -- most of its characteristics stem from this general type of implementation. Despite the shortcomings as a general purpose container, this can work within limits, and provides much tighter packing of information than most of the obvious alternatives.

回复收藏 0 原文

一生独一 2024-09-25 18:49:50

在这个项目中，我需要创建一个巨大的数组（希望我能够创建一个像〜7.13e+17一样大的数组，但这个目标仍然遥遥领先。）

这需要创建一个专用的结构，a la 数字树（或b-tree)，键为索引，以避免进行大量分配。

大量分配，尤其是重新分配可能会导致不必要的内存碎片。如果将大数组分割成更小的块，那么不仅数组扩展变得容易，而且稀疏数组的表示也成为可能。

注意 ~7.13e+17 大约为 60 位长。您是否有可以支持那么多 RAM 的硬件？这并不是说我密切关注行业，而是我简单地听说过具有 58 位地址总线的 NUMA 架构，但对 60+ 位架构一无所知。

数组内的每个单元格都可以包含三个值之一：0、1、2.2。

如果单元格可能只包含 3 个值（2.2 可以表示为 2），那么它就是 2 位信息。这意味着您可以打包到 uint32_t 16 个值和 uint64_t 32 个值中。

您可以尝试找到一些现有的数字树实现（或自己推出）并用作索引的关键高位。原始索引的剩余位将是树叶的索引，该树叶将是具有打包值的数组。举例来说，使用 std::map 代替 trie，未经测试：

enum {
   LS_BITS = 16,
   MS_BITS = 64-LS_BITS
};

enum {
   VALUE_BITS = 2,
   VALUE_MASK = ((1<<VALUE_BITS)-1)
};

// this represents an array of `1<<LS_BITS` values
struct leaf_node {
   uint64_t packed_data[ ((1<<LS_BITS)*VALUE_BITS) / (sizeof(uint64_t)*8) ];
};

// that should be a trie, to provide faster look-up
typedef std::map< uint64_t, leaf_node > big_array_type;

void
big_array_set_value( big_array_type &b, uint64_t index, uint64_t value )
{
   leaf_node &n = b[index >> LS_BITS];
   uint64_t li = index & ((1<<LS_BITS)-1);
   li *= VALUE_BITS;   // convert into bit offset
   uint64_t &x = n.packed_data[ li / (sizeof(uint64_t)*8) ];
   li %= (sizeof(uint64_t)*8);
   x = (x & (VALUE_MASK<<li)) | (value << li);
}

int
big_array_get_value( big_array_type &b, uint64_t index, uint64_t value )
{
   leaf_node &n = b[index >> LS_BITS];
   uint64_t li = index & ((1<<LS_BITS)-1);
   li *= VALUE_BITS;   // convert into bit offset
   uint64_t &x = n.packed_data[ li / (sizeof(uint64_t)*8) ];
   li %= (sizeof(uint64_t)*8);
   return (x >> li) & VALUE_MASK;
}

这种方式仍然浪费 0.5 位信息，因为存储是 2 位，允许 4 个值，但只使用了 3 个值。这也可以得到改善，但访问性能成本要高得多。

In this project I need to create one huge array (hopefully I will be able to create one as big as ~7.13e+17, but this target is still up ahead.)

That calls to create a dedicated structure, a la digital tree (or b-tree) with key being the index, to avoid doing large allocations.

Large allocations and especially reallocations might cause unnecessary memory fragmentation. If you split large array into smaller chunks, then not only array extension becomes easy, but also presentation of sparse array becomes possible.

N.B. ~7.13e+17 is about 60 bit long. Do you even have hardware which can support that much RAM? It's not that I'm following industry closely, but I heard briefly about NUMA archs with 58 bit address bus - but nothing about 60+ bit archs.

Each cell inside the array can contain one of the three values: 0, 1, 2.2.

If cell might contain only 3 values (2.2 can be represented as 2) that makes it 2 bits of information. That means that you can pack into the uint32_t 16 values and into the uint64_t 32 values.

You can try to find some existing digital tree implementation (or roll your own) and use as the key upper bits of index. Remaining bits of the original index would be an index into the tree leaf which would be an array with packed values. To exemplify using std::map in place of the trie, untested:

enum {
   LS_BITS = 16,
   MS_BITS = 64-LS_BITS
};

enum {
   VALUE_BITS = 2,
   VALUE_MASK = ((1<<VALUE_BITS)-1)
};

// this represents an array of `1<<LS_BITS` values
struct leaf_node {
   uint64_t packed_data[ ((1<<LS_BITS)*VALUE_BITS) / (sizeof(uint64_t)*8) ];
};

// that should be a trie, to provide faster look-up
typedef std::map< uint64_t, leaf_node > big_array_type;

void
big_array_set_value( big_array_type &b, uint64_t index, uint64_t value )
{
   leaf_node &n = b[index >> LS_BITS];
   uint64_t li = index & ((1<<LS_BITS)-1);
   li *= VALUE_BITS;   // convert into bit offset
   uint64_t &x = n.packed_data[ li / (sizeof(uint64_t)*8) ];
   li %= (sizeof(uint64_t)*8);
   x = (x & (VALUE_MASK<<li)) | (value << li);
}

int
big_array_get_value( big_array_type &b, uint64_t index, uint64_t value )
{
   leaf_node &n = b[index >> LS_BITS];
   uint64_t li = index & ((1<<LS_BITS)-1);
   li *= VALUE_BITS;   // convert into bit offset
   uint64_t &x = n.packed_data[ li / (sizeof(uint64_t)*8) ];
   li %= (sizeof(uint64_t)*8);
   return (x >> li) & VALUE_MASK;
}

This way one still wastes 0.5 bit of information since storage is 2 bits what allows to 4 values, yet only 3 are used. That can be also improved, but at much higher access performance cost.

回复收藏 0 原文

胡渣熟男 2024-09-25 18:49:50

由于所有值都小于 255，因此您可能希望将其设为 char 数组。
无论如何，指针类型并不决定其最大可分配大小。

回复收藏 0 原文

野鹿林 2024-09-25 18:49:50

但据我了解，这个数组创建了一个可能的最大大小为 int 最大大小的数组。如果我更改代码并使用以下代码

那绝对是错误的！数组的大小与数组类型的最大值完全无关。

所以不需要将其设为long long数组。相反，您甚至应该将其设为 char 数组，甚至小于该数组。

如果您只需要存储三个不同的值，则应该使用 char 或任何其他类型中的位。然后制作一个数组。

一个 char 通常是 1 个字节，即 8 位。要存储 3 个值，需要 2 位；因此，您可以在 char 中存储 4 个值。

使用二进制掩码，您应该找到一种优化方法。

回复收藏 0 原文

春风十里 2024-09-25 18:49:50

用于指定数组大小的大小需要为 size_t 类型。 new 表达式中使用的类型是数组元素的类型。无论示例中的 i 类型如何，它都会转换为 size_t 以创建数组。

现在在 32 位机器上，最大 size_t 约为 4e+9，因此创建大小为 1e+17 的数组是正确的。在 64 位计算机上，size_t 理论上可以达到 1e+19 左右，但是您不可能拥有接近该数量的内存，因此分配将会失败。

因此，正如其他人讨论的那样，您需要某种稀疏数据结构。这里的关键是确定 3 个值中哪一个最常见，并且仅存储数组是其他 2 个值之一的值。您可以使用 std::map 来保存这些值（甚至支持使用 [index] 语法），或各种其他值，具体取决于您想要执行的操作以及详细信息你是数据。

回复收藏 0 原文

时光暖心i 2024-09-25 18:49:50

由于值列表是有限的，因此可以仅使用 char 数组。一个字节可以很容易地容纳三个不同的值。

价值观：
0-> 0
1-> 1
2.2-> 2

存储值：

char values[i];
values[i] = 0;
values[i] = 1;
values[i] = 2;  // really the 2.2 value

检索值：

int zero = values[i] - 0;
int one  = values[i] - 0;
double two_point_two values[i] - 0;
if (two_point_two == 2)
    two_point_tow = 2.2;

获取最后一个值需要额外小心，但数组会很小（1 个字节）。

数组分配：

int main ()
{   
    // static allocation requires a const size
    const int static_array_size = 100;
    char static_array[static_array_size];
    std::cout << "static array size is:" << sizeof(static_array) << std::endl;

    // heap allocation can vary in size (i.e. non const heap_array_size variable)
    int heap_array_size = 200;
    char* heap_array = new char[heap_array_size];
    std::cout << "static array size is:" << sizeof(heap_array_size) << std::endl;
}

Since there is a finite list of value it might be possible to just use a char array. One byte can hold the three different values very easily.

Values:
0 -> 0
1 -> 1
2.2 -> 2

Storing values:

char values[i];
values[i] = 0;
values[i] = 1;
values[i] = 2;  // really the 2.2 value

Retrieving values:

int zero = values[i] - 0;
int one  = values[i] - 0;
double two_point_two values[i] - 0;
if (two_point_two == 2)
    two_point_tow = 2.2;

A little extra care is needed to get the last value but the array will be small (1 byte).

Array Allocation:

int main ()
{   
    // static allocation requires a const size
    const int static_array_size = 100;
    char static_array[static_array_size];
    std::cout << "static array size is:" << sizeof(static_array) << std::endl;

    // heap allocation can vary in size (i.e. non const heap_array_size variable)
    int heap_array_size = 200;
    char* heap_array = new char[heap_array_size];
    std::cout << "static array size is:" << sizeof(heap_array_size) << std::endl;
}

回复收藏 0 原文

~没有更多了~