Dealing with melt() or memory in combination with big data

Posted 2025-02-06 08:23:52

I get an error when I try to run pd.melt().
I checked this post and tried to modify the code, but I still get the error. (LINK)

Here is my original code:

melted = pd.melt(df, ['ID', 'Col2', 'Col3', 'Year'], var_name='New_Var', value_name='Value').sort_values('ID')

After modifying:

pivot_list = list()
chunk_size = 100000
for i in range(0, len(df), chunk_size):
    row_pivot = pd.melt(df.iloc[i:i+chunk_size], ['ID', 'Col2', 'Col3', 'Year'], var_name='New_Var', value_name='Value')
    pivot_list.append(row_pivot)
melted = pd.concat(pivot_list).sort_values('ID')
Here is the traceback:

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/path/envs/myenvs/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/path/envs/myenvs/lib/python3.9/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/path/Current_Proj/Main_Dir/Python_Program.py", line 122, in My_Function
    melted = pd.concat(pivot_list).sort_values('ID')
  File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/reshape/concat.py", line 307, in concat
    return op.get_result()
  File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/reshape/concat.py", line 532, in get_result
    new_data = concatenate_managers(
  File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/internals/concat.py", line 222, in concatenate_managers
    values = _concatenate_join_units(join_units, concat_axis, copy=copy)
  File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/internals/concat.py", line 486, in _concatenate_join_units
    to_concat = [
  File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/internals/concat.py", line 487, in <listcomp>
    ju.get_reindexed_values(empty_dtype=empty_dtype, upcasted_na=upcasted_na)
  File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/internals/concat.py", line 466, in get_reindexed_values
    values = algos.take_nd(values, indexer, axis=ax)
  File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/array_algos/take.py", line 108, in take_nd
    return _take_nd_ndarray(arr, indexer, axis, fill_value, allow_fill)
  File "/path/envs/myenvs/lib/python3.9/site-packages/pandas/core/array_algos/take.py", line 149, in _take_nd_ndarray
    out = np.empty(out_shape, dtype=dtype)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 27.1 GiB for an array with shape (2, 1819900000) and data type object
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/path/Current_Proj/Main_Dir/Python_Program.py", line 222, in <module>
    result = pool.starmap(My_Function, zip(arg1, arg2, arg3))
  File "/path/envs/myenvs/lib/python3.9/multiprocessing/pool.py", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/path/envs/myenvs/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
numpy.core._exceptions.MemoryError: Unable to allocate 27.1 GiB for an array with shape (2, 1819900000) and data type object

I think the main issue comes from the melt() and concat() parts.
Any ideas on how to deal with this would be appreciated.

极度宠爱 2025-02-13 08:23:52

Usually, when you get a "MemoryError: unable to allocate" error, this falls into the "user error" category of requesting a reshape operation which is simply too large to fit into memory.

pd.melt is a memory-intensive operation: not only does it create new arrays for every element in your data, it also reshapes the data into a less efficient format, duplicating the existing id values many times. The exact result and memory penalty depend on the structure of your data and the number of value columns.
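
As a rough illustration (a toy sketch with made-up column names and sizes, not your actual data), melting a wide frame repeats every id value once per value column and adds a string-typed variable column:

import numpy as np
import pandas as pd

# Toy wide frame: 996 float32 value columns plus 4 id columns.
df = pd.DataFrame(np.random.rand(1_000, 996).astype("float32"),
                  columns=[f"V{i}" for i in range(996)])
for col in ["ID", "Col2", "Col3", "Year"]:
    df[col] = 0

melted = df.melt(id_vars=["ID", "Col2", "Col3", "Year"],
                 var_name="New_Var", value_name="Value")

print(df.shape)       # (1000, 1000)
print(melted.shape)   # (996000, 6) -- every id value repeated 996 times
print(melted.memory_usage(deep=True).sum())  # far larger than the original frame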

Give the pandas docs on reshaping by melt a close read, and calculate whether you can afford to create an array of all elements in your id_vars columns and repeat them for every column listed in value_vars.

As an example, if your dataframe has 1M rows and 1,000 columns, with every cell a float32, the dataframe takes up roughly 4 GB in memory. If you then melt it with 4 id_vars, you have 4 * 1M id cells, each repeated 996 times, i.e. 4 * 1e6 * 996, roughly 4 billion cells for the id columns. On top of that you get a "variable" column with 1e6 * 996 entries and a "value" column of the same length. The exact cost depends on the dtypes of the cells and on the length of the column names, but even with relatively compact float32 values this simple example results in roughly a 23 GB array.
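
A quick back-of-envelope check, in plain arithmetic with assumed dtypes, tells you before calling melt whether the result can possibly fit in RAM:

# Rough size estimate for the melted frame in the example above.
n_rows, n_value_cols, n_id_vars = 1_000_000, 996, 4

melted_rows = n_rows * n_value_cols        # 996,000,000 rows after melting
id_bytes = melted_rows * n_id_vars * 4     # id columns, assumed float32
var_bytes = melted_rows * 8                # "variable" column: one object pointer per row (string storage not counted)
val_bytes = melted_rows * 4                # "value" column, float32

print((id_bytes + var_bytes + val_bytes) / 2**30)  # about 26 GiB under these assumptions, before any overhead

Keep in mind that with multiprocessing each worker builds its own result, so whatever you estimate here gets multiplied by the number of workers holding a result at the same time.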

Melt is a helpful convenience function for reshaping small dataframes. If you have a dataframe anywhere near the size in this example, my main suggestion is: don't. If you really do need to reshape this way, you need to get serious about understanding the operation and chunking the data in a way that is tailored to your data's size. You may want to write the data out iteratively rather than trying to concatenate it all at the end. This isn't something that will work out of the box, so expect some trial & error. You could also look into out-of-core computation tools: dask.dataframe has a port of melt which can leverage multiple cores and write to disk in parallel.
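
As a minimal sketch of the "write iteratively instead of concatenating" idea, assuming df, the column names, and the chunk size from your code, and with a made-up output path, each melted chunk goes straight to disk so only one chunk is ever held in memory:

import pandas as pd

id_vars = ["ID", "Col2", "Col3", "Year"]
chunk_size = 100_000

# Melt chunk by chunk and append each piece to a CSV on disk instead of
# collecting the pieces in a list and concatenating them at the end.
with open("melted.csv", "w") as f:  # illustrative output path
    for i in range(0, len(df), chunk_size):
        chunk = pd.melt(df.iloc[i:i + chunk_size], id_vars,
                        var_name="New_Var", value_name="Value")
        chunk.to_csv(f, header=(i == 0), index=False)

# Note: the final sort_values('ID') also has to happen out of core (dask,
# a database, or an external sort); sorting the full melted result in
# pandas would pull it all back into memory.

If you go the dask.dataframe route instead, dd.from_pandas(df, npartitions=...) followed by .melt(...) and .to_parquet(...) gives you the same partition-by-partition behaviour with parallel writes, though a global sort remains the expensive step.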
