Transforming a nested list without copying or losing precision

I am using Mathematica 7 to process a large data set. The data set is a three-dimensional array of signed integers. The three levels may be thought of as corresponding to X points per shot, Y shots per scan, and Z scans per set.

I also have a "zeroing" shot (containing X points, which are signed fractions of integers), which I would like to subtract from every shot in the data set. Afterwards, I will never again need the original data set.

How can I perform this transformation without creating new copies of the data set, or parts of it, in the process? Conceptually, the data set is located in memory, and I would like to scan through each element, and change it at that location in memory, without permanently copying it to some other memory location.

The following self-contained code captures all the aspects of what I am trying to do:

(* Create some offset data, and a zero data set. *)
myData = Table[Table[Table[RandomInteger[{1, 100}], {k, 500}], {j, 400}], {i, 200}];
myZero = Table[RandomInteger[{1, 9}]/RandomInteger[{1, 9}] + 50, {i, 500}];

(* Method 1 *)
myData = Table[
   f1 = myData[[i]];
   Table[
     f2 = f1[[j]];
     f2 - myZero, {j, 400}], {i, 200}];

(* Method 2 *)
Do[
 Do[
  myData[[i]][[j]] = myData[[i]][[j]] - myZero, {j, 400}], {i, 200}]

(* Method 3 *)
Attributes[Zeroing] = {HoldFirst};
Zeroing[x_] := Module[{}, 
   Do[
     Do[
       x[[i]][[j]] = x[[i]][[j]] - myZero, {j, Length[x[[1]]]}
       ], {i, Length[x]}
     ]
 ];

(Note: Hat tip to Aaron Honecker for Method #3.)

On my machine (Intel Core2 Duo CPU 3.17 GHz, 4 GB RAM, 32-bit Windows 7), all three methods use roughly 1.25 GB of memory, with #2 and #3 faring slightly better.

If I don't mind losing precision, wrapping N[ ] around the innards of myData and myZero when they are created increases their initial size in memory by 150 MB, but reduces the memory required for zeroing (by Methods #1-#3 above) from 1.25 GB down to just 300 MB! That's my working solution, but it would be great to know the best way of handling this problem.
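Concretely, the working solution just applies N at creation time; a minimal sketch of that construction (same shapes as above, relying on Table auto-packing uniform machine numbers):

myData = Table[N@RandomInteger[{1, 100}], {i, 200}, {j, 400}, {k, 500}];
myZero = Table[N[RandomInteger[{1, 9}]/RandomInteger[{1, 9}] + 50], {i, 500}];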

Answer 1 (by Szabolcs)

Unfortunately I have little time now, so I must be concise ...

When working with large data, you need to be aware that Mathematica has a different storage format called packed arrays, which is much more compact and much faster than the regular one but works only for machine reals or integers.

Please evaluate ?Developer`*Packed* to see what functions are available for directly converting to/from them, if this doesn't happen automatically.
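For instance, a small sketch (not from the original answer) of the conversion and inspection functions, with ByteCount showing the compactness gain:

packed = Developer`ToPackedArray@N@Range[10^6];
unpacked = Developer`FromPackedArray[packed];
Developer`PackedArrayQ /@ {packed, unpacked}  (* {True, False} *)
ByteCount /@ {packed, unpacked}  (* the packed form is several times smaller *)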

So the brief explanation behind why my solution is fast and memory-efficient is that it uses packed arrays. I tested with Developer`PackedArrayQ that my arrays never get unpacked, and I used machine reals (I applied N[] to everything).

In[1]:= myData = N@RandomInteger[{1, 100}, {200, 400, 500}];

In[2]:= myZero = 
  Developer`ToPackedArray@
   N@Table[RandomInteger[{1, 9}]/RandomInteger[{1, 9}] + 50, {i, 500}];

In[3]:= myData = Map[# - myZero &, myData, {2}]; // Timing

Out[3]= {1.516, Null}

Also, the operation you were asking for ("I would like to scan through each element, and change it at that location in memory") is called mapping (see Map[] or /@).
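For readers new to the terminology, a minimal illustration (f, a, b, c, d are placeholder symbols, not from the answer):

f /@ {a, b, c}  (* {f[a], f[b], f[c]}; /@ is Map at level 1 *)
Map[f, {{a, b}, {c, d}}, {2}]  (* {{f[a], f[b]}, {f[c], f[d]}}; level spec {2}, as used above *)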

Answer 2

Let me start by noting that this answer must be viewed as complementary to the one by @Szabolcs, which is, in my conclusion, the better option. While the solution of @Szabolcs is probably the fastest and best overall, it falls short of the original spec in that Map returns a (modified) copy of the original list rather than "scan through each element, and change it at that location in memory". Such behavior, AFAIK, is only provided by the Part command. I will use his idea (converting everything into packed arrays) to show code that makes in-memory changes to the original list:

In[5]:= 
Do[myData[[All, All, i]] = myData[[All, All, i]] - myZero[[i]],
   {i, Last@Dimensions@myData}]; // Timing

Out[5]= {4.734,Null}

This is conceptually equivalent to Method 3 listed in the question, but runs much faster because it is a partly vectorized solution and only a single loop is needed. It is, however, still at least an order of magnitude slower than the solution of @Szabolcs.

In theory, this looks like a classic speed/memory tradeoff: if you need speed and have some spare memory, @Szabolcs's solution is the way to go. If your memory constraints are tight, this slower method should in theory save on intermediate memory consumption (in the method of @Szabolcs, the original list is garbage-collected after myData is assigned the result of Map, so the final memory usage is the same, but during the computation one extra array of the size of myData is maintained by Map).
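One way to probe this tradeoff empirically (a sketch I am adding, assuming myData and myZero are freshly created as packed machine-real arrays as in @Szabolcs' answer):

m0 = MaxMemoryUsed[];
myData = Map[# - myZero &, myData, {2}];  (* Map-based: builds a full-size result *)
mapGrowth = MaxMemoryUsed[] - m0;

(* ... re-create myData, then measure the in-place variant: *)
m0 = MaxMemoryUsed[];
Do[myData[[All, All, i]] = myData[[All, All, i]] - myZero[[i]],
   {i, Last@Dimensions@myData}];
inPlaceGrowth = MaxMemoryUsed[] - m0;

{mapGrowth, inPlaceGrowth}  (* growth of the kernel's peak memory, in bytes *)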

In practice, however, the memory consumption does not seem to be smaller, since in both cases an extra copy of the list is for some reason kept in the Out variable during (or right after) the computation, even when the output is suppressed (it may also be that this effect does not manifest itself in all cases). I don't quite understand this yet, but my current conclusion is that the method of @Szabolcs is just as good in terms of intermediate memory consumption as the present one based on in-place list modifications. Therefore, his method seems to be the way to go in all cases, but I still decided to publish this answer as a complement.
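One practical aside (standard Mathematica advice rather than part of this answer): the Out-history copies mentioned above can usually be avoided by disabling the session history before running the computation:

$HistoryLength = 0;  (* the kernel then retains no Out[n] values, so large results are not duplicated in the history *)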
