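I am having problems running the 00_train.py script in the anomaly_detection benchmark, which is part of the MLPerf Tiny (TinyML) benchmarks. I am unsure what the cause is: the issue persists regardless of the batch size I use, and my system memory never gets close to running out during execution.
I am running the script on two machines, both with 32 GB of RAM: my personal PC, which has a 6 GB GPU, and a server with a 24 GB GPU. Both use CUDA 11.6 and TensorFlow 2.9.
On the server I have tested batch sizes up to an enormous 260k+, and the script runs without any problems, never using more than ~50% of the system memory. On my personal machine I cannot get the script to run even with a batch size of 1 (which, I might mention, took forever to run), and it also never uses a significant portion of the system's memory.
An example of the output is as follows: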
2022-06-01 14:46:55.918879: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 1806336000 exceeds 10% of free system memory.
2022-06-01 14:46:56.979387: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 1806336000 exceeds 10% of free system memory.
2022-06-01 14:46:57.674953: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 1806336000 exceeds 10% of free system memory.
2022-06-01 14:46:58.187186: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 1806336000 exceeds 10% of free system memory.
Epoch 1/100
1374/1379 [============================>.] - ETA: 0s - loss: 96.1444
2022-06-01 14:47:24.023059: W tensorflow/core/common_runtime/bfc_allocator.cc:462] Allocator (GPU_0_bfc) ran out of memory trying to allocate 191.41MiB (rounded to 200704000)requested by op _EagerConst
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
Current allocation summary follows.
Current allocation summary follows.
2022-06-01 14:47:24.023086: I tensorflow/core/common_runtime/bfc_allocator.cc:1010] BFCAllocator dump for GPU_0_bfc
2022-06-01 14:47:24.023100: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (256): Total Chunks: 50, Chunks in use: 50. 12.5KiB allocated for chunks. 12.5KiB in use in bin. 544B client-requested in use in bin.
2022-06-01 14:47:24.023110: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (512): Total Chunks: 88, Chunks in use: 88. 44.2KiB allocated for chunks. 44.2KiB in use in bin. 44.0KiB client-requested in use in bin.
2022-06-01 14:47:24.023118: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (1024): Total Chunks: 1, Chunks in use: 1. 1.2KiB allocated for chunks. 1.2KiB in use in bin. 1.0KiB client-requested in use in bin.
2022-06-01 14:47:24.023126: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (2048): Total Chunks: 3, Chunks in use: 3. 7.5KiB allocated for chunks. 7.5KiB in use in bin. 7.5KiB client-requested in use in bin.
2022-06-01 14:47:24.023135: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (4096): Total Chunks: 6, Chunks in use: 6. 24.5KiB allocated for chunks. 24.5KiB in use in bin. 24.0KiB client-requested in use in bin.
2022-06-01 14:47:24.023143: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (8192): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-06-01 14:47:24.023149: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (16384): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-06-01 14:47:24.023156: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (32768): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-06-01 14:47:24.023164: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (65536): Total Chunks: 18, Chunks in use: 18. 1.24MiB allocated for chunks. 1.24MiB in use in bin. 1.12MiB client-requested in use in bin.
2022-06-01 14:47:24.023171: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (131072): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-06-01 14:47:24.023179: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (262144): Total Chunks: 7, Chunks in use: 6. 2.19MiB allocated for chunks. 1.88MiB in use in bin. 1.88MiB client-requested in use in bin.
2022-06-01 14:47:24.023186: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (524288): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-06-01 14:47:24.023193: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (1048576): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-06-01 14:47:24.023200: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (2097152): Total Chunks: 1, Chunks in use: 0. 2.44MiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-06-01 14:47:24.023207: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (4194304): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-06-01 14:47:24.023214: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (8388608): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-06-01 14:47:24.023221: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (16777216): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-06-01 14:47:24.023230: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (33554432): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-06-01 14:47:24.023237: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (67108864): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-06-01 14:47:24.023245: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (134217728): Total Chunks: 1, Chunks in use: 0. 181.04MiB allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-06-01 14:47:24.023253: I tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (268435456): Total Chunks: 2, Chunks in use: 2. 3.36GiB allocated for chunks. 3.36GiB in use in bin. 3.36GiB client-requested in use in bin.
2022-06-01 14:47:24.023262: I tensorflow/core/common_runtime/bfc_allocator.cc:1033] Bin for 191.41MiB was 128.00MiB, Chunk State:
2022-06-01 14:47:24.023275: I tensorflow/core/common_runtime/bfc_allocator.cc:1039] Size: 181.04MiB | Requested Size: 4B | in_use: 0 | bin_num: 19, prev: Size: 256B | Requested Size: 8B | in_use: 1 | bin_num: -1
2022-06-01 14:47:24.023280: I tensorflow/core/common_runtime/bfc_allocator.cc:1046] Next region of size 3808755712
2022-06-01 14:47:24.023288: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac000000 of size 1280 next 1
2022-06-01 14:47:24.023294: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac000500 of size 256 next 2
2022-06-01 14:47:24.023300: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac000600 of size 256 next 3
2022-06-01 14:47:24.023306: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac000700 of size 256 next 4
2022-06-01 14:47:24.023312: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac000800 of size 512 next 5
2022-06-01 14:47:24.023317: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac000a00 of size 256 next 8
2022-06-01 14:47:24.023323: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac000b00 of size 512 next 9
2022-06-01 14:47:24.023328: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac000d00 of size 512 next 10
2022-06-01 14:47:24.023334: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac000f00 of size 512 next 11
2022-06-01 14:47:24.023339: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac001100 of size 512 next 12
2022-06-01 14:47:24.023345: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac001300 of size 256 next 13
2022-06-01 14:47:24.023351: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac001400 of size 256 next 14
2022-06-01 14:47:24.023356: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac001500 of size 512 next 15
2022-06-01 14:47:24.023362: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac001700 of size 512 next 16
2022-06-01 14:47:24.023367: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac001900 of size 512 next 19
2022-06-01 14:47:24.023373: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac001b00 of size 512 next 20
2022-06-01 14:47:24.023378: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac001d00 of size 512 next 21
2022-06-01 14:47:24.023386: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac001f00 of size 128768 next 17
2022-06-01 14:47:24.023392: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac021600 of size 65536 next 18
2022-06-01 14:47:24.023397: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac031600 of size 512 next 22
2022-06-01 14:47:24.023403: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac031800 of size 512 next 25
2022-06-01 14:47:24.023409: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac031a00 of size 512 next 26
2022-06-01 14:47:24.023415: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac031c00 of size 512 next 27
2022-06-01 14:47:24.023420: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac031e00 of size 512 next 28
2022-06-01 14:47:24.023426: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac032000 of size 512 next 31
2022-06-01 14:47:24.023431: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac032200 of size 512 next 32
2022-06-01 14:47:24.023437: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac032400 of size 512 next 33
2022-06-01 14:47:24.023442: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac032600 of size 512 next 34
2022-06-01 14:47:24.023448: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac032800 of size 512 next 35
2022-06-01 14:47:24.023453: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac032a00 of size 256 next 36
2022-06-01 14:47:24.023459: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac032b00 of size 256 next 37
2022-06-01 14:47:24.023464: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac032c00 of size 256 next 38
2022-06-01 14:47:24.023470: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac032d00 of size 256 next 39
2022-06-01 14:47:24.023475: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac032e00 of size 256 next 42
2022-06-01 14:47:24.023481: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac032f00 of size 256 next 43
2022-06-01 14:47:24.023486: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac033000 of size 256 next 44
2022-06-01 14:47:24.023492: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac033100 of size 512 next 55
2022-06-01 14:47:24.023497: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac033300 of size 512 next 56
2022-06-01 14:47:24.023503: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac033500 of size 512 next 57
2022-06-01 14:47:24.023508: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac033700 of size 512 next 58
2022-06-01 14:47:24.023514: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac033900 of size 512 next 59
2022-06-01 14:47:24.023519: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac033b00 of size 512 next 60
2022-06-01 14:47:24.023525: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac033d00 of size 512 next 61
2022-06-01 14:47:24.023530: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac033f00 of size 512 next 63
2022-06-01 14:47:24.023536: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac034100 of size 512 next 64
2022-06-01 14:47:24.023541: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac034300 of size 512 next 65
2022-06-01 14:47:24.023547: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac034500 of size 512 next 66
2022-06-01 14:47:24.023552: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac034700 of size 512 next 67
2022-06-01 14:47:24.023558: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac034900 of size 256 next 71
2022-06-01 14:47:24.023563: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac034a00 of size 256 next 72
2022-06-01 14:47:24.023569: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac034b00 of size 256 next 73
2022-06-01 14:47:24.023574: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac034c00 of size 256 next 40
2022-06-01 14:47:24.023580: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac034d00 of size 4096 next 41
2022-06-01 14:47:24.023586: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac035d00 of size 512 next 45
2022-06-01 14:47:24.023592: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1eac035f00 of size 512 next 48
2022-06-01 14:47:24.023598: I tensorflow/core/common_runtime/bfc_allocator.cc:1066]
[Some repeating, similar lines removed so that the output fits]
2022-06-01 14:47:24.024066: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1f83884d00 of size 256 next 172
2022-06-01 14:47:24.024069: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] Free at 7f1f83884e00 of size 327680 next 232
2022-06-01 14:47:24.024073: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1f838d4e00 of size 256 next 206
2022-06-01 14:47:24.024077: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] Free at 7f1f838d4f00 of size 2556416 next 178
2022-06-01 14:47:24.024081: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f1f83b45100 of size 256 next 179
2022-06-01 14:47:24.024085: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] Free at 7f1f83b45200 of size 189836800 next 18446744073709551615
2022-06-01 14:47:24.024088: I tensorflow/core/common_runtime/bfc_allocator.cc:1071] Summary of in-use Chunks by size:
2022-06-01 14:47:24.024094: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 50 Chunks of size 256 totalling 12.5KiB
2022-06-01 14:47:24.024099: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 87 Chunks of size 512 totalling 43.5KiB
2022-06-01 14:47:24.024104: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 768 totalling 768B
2022-06-01 14:47:24.024108: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 1280 totalling 1.2KiB
2022-06-01 14:47:24.024113: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 3 Chunks of size 2560 totalling 7.5KiB
2022-06-01 14:47:24.024117: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 5 Chunks of size 4096 totalling 20.0KiB
2022-06-01 14:47:24.024121: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 4608 totalling 4.5KiB
2022-06-01 14:47:24.024126: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 16 Chunks of size 65536 totalling 1.00MiB
2022-06-01 14:47:24.024130: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 127232 totalling 124.2KiB
2022-06-01 14:47:24.024135: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 1 Chunks of size 128768 totalling 125.8KiB
2022-06-01 14:47:24.024139: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 6 Chunks of size 327680 totalling 1.88MiB
2022-06-01 14:47:24.024144: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 2 Chunks of size 1806336000 totalling 3.36GiB
2022-06-01 14:47:24.024148: I tensorflow/core/common_runtime/bfc_allocator.cc:1078] Sum Total of in-use chunks: 3.37GiB
2022-06-01 14:47:24.024152: I tensorflow/core/common_runtime/bfc_allocator.cc:1080] total_region_allocated_bytes_: 3808755712 memory_limit_: 3808755712 available bytes: 0 curr_region_allocation_bytes_: 7617511424
2022-06-01 14:47:24.024159: I tensorflow/core/common_runtime/bfc_allocator.cc:1086] Stats:
Limit: 3808755712
InUse: 3616034816
MaxInUse: 3626576640
NumAllocs: 368395
MaxAllocSize: 1806336000
Reserved: 0
PeakReserved: 0
LargestFreeBlock: 0
2022-06-01 14:47:24.024167: W tensorflow/core/common_runtime/bfc_allocator.cc:474] ************************************************************************************************____
Traceback (most recent call last):
  File "/home/alxhoff/git/GitHub/tiny/benchmark/training/anomaly_detection/00_train.py", line 208, in <module>
    history = model.fit(train_data[:len(train_data)],
  File "/home/alxhoff/.local/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/alxhoff/.local/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 106, in convert_to_eager_tensor
    return ops.EagerTensor(value, ctx.device_name, dtype)
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not ini
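I have found that if I shorten the training data input to around 700,000 items (train_data[:700000]), there are no problems. However, I do not understand the training process well enough to work out why this is the case.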
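As a sanity check, the byte counts reported in the log above can be put side by side. The sketch below only redoes arithmetic on values copied from that log (the variable names are made up for illustration). Reading the two 1806336000-byte chunks as copies of the training array, and the failed 200704000-byte request as a held-out slice of the same data, is an assumption and not something the log states.

```python
# Figures copied verbatim from the log output above.
GIB = 1024 ** 3
MIB = 1024 ** 2

big_chunk = 1_806_336_000      # size of each of the 2 largest in-use GPU chunks
failed_request = 200_704_000   # "rounded to 200704000" in the out-of-memory warning
gpu_limit = 3_808_755_712      # "Limit" in the allocator stats
in_use = 3_616_034_816         # "InUse" in the allocator stats
largest_free = 189_836_800     # last "Free at ..." chunk in the dump

two_chunks_gib = 2 * big_chunk / GIB
headroom = gpu_limit - in_use            # bytes the allocator has not handed out
fits = failed_request <= largest_free    # can the request be served from one free chunk?
ratio = failed_request / (big_chunk + failed_request)

print(f"two big chunks: {two_chunks_gib:.2f} GiB of {gpu_limit / GIB:.2f} GiB")
print(f"headroom: {headroom / MIB:.1f} MiB, largest free chunk: {largest_free / MIB:.2f} MiB")
print(f"failed request: {failed_request / MIB:.2f} MiB, fits: {fits}")
print(f"failed request / (chunk + failed request) = {ratio:.3f}")
```

On these numbers the failing request (about 191 MiB) is roughly 10 MiB larger than the largest free GPU chunk (about 181 MiB), so the shortfall is in GPU memory and system RAM does not enter into it. The failed request is also exactly one tenth of the combined size, which would be consistent with a 90/10 train/validation split, although that interpretation remains an assumption.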
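My understanding is that this input data array lives in system memory and is passed to the GPU in batch-sized chunks for training, so as long as the train_data array fits in system memory and a batch fits in GPU memory, there should be no issue. Given that the 24 GB GPU can handle a batch roughly 240,000 times larger while having only 4 times the memory, and that system memory usage never really exceeds ~60%, I do not understand why my PC cannot run this script.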
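The mental model described above (array stays in host memory, only one batch at a time reaches the GPU) can be written down explicitly as a generator. This is a minimal sketch, not the benchmark's code: `iter_batches` and `steps_per_epoch` are hypothetical helper names, and pairing each batch with itself as the target assumes the model is trained to reconstruct its input.

```python
import math


def iter_batches(data, batch_size):
    """Yield (inputs, targets) batches from `data`, one slice at a time.

    `data` is any sliceable sequence with a length (for example a NumPy
    array that stays in host memory). Targets equal inputs, which ASSUMES
    the model is trained to reconstruct its input; adjust if it is not.
    """
    n = len(data)
    while True:  # loop forever so the generator can serve several epochs
        for start in range(0, n, batch_size):
            batch = data[start:start + batch_size]
            yield batch, batch


def steps_per_epoch(n_samples, batch_size):
    """Number of batches needed to cover every sample once."""
    return math.ceil(n_samples / batch_size)
```

Keras `model.fit` accepts a Python generator as input, so a call along the lines of `model.fit(iter_batches(train_data, 512), steps_per_epoch=steps_per_epoch(len(train_data), 512), epochs=100)` would hand over one batch per step (512 is a placeholder batch size). This is untested against the benchmark script, and `validation_split` is not supported for generator input, so validation data would have to be supplied separately.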
Cheers,
Alex
2022-06-01 14:47:24.024144: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] 2 Chunks of size 1806336000 totalling 3.36GiB
2022-06-01 14:47:24.024148: I tensorflow/core/common_runtime/bfc_allocator.cc:1078] Sum Total of in-use chunks: 3.37GiB
2022-06-01 14:47:24.024152: I tensorflow/core/common_runtime/bfc_allocator.cc:1080] total_region_allocated_bytes_: 3808755712 memory_limit_: 3808755712 available bytes: 0 curr_region_allocation_bytes_: 7617511424
2022-06-01 14:47:24.024159: I tensorflow/core/common_runtime/bfc_allocator.cc:1086] Stats:
Limit: 3808755712
InUse: 3616034816
MaxInUse: 3626576640
NumAllocs: 368395
MaxAllocSize: 1806336000
Reserved: 0
PeakReserved: 0
LargestFreeBlock: 0
2022-06-01 14:47:24.024167: W tensorflow/core/common_runtime/bfc_allocator.cc:474] ************************************************************************************************____
Traceback (most recent call last):
File "/home/alxhoff/git/GitHub/tiny/benchmark/training/anomaly_detection/00_train.py", line 208, in <module>
history = model.fit(train_data[:len(train_data)],
File "/home/alxhoff/.local/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/home/alxhoff/.local/lib/python3.8/site-packages/tensorflow/python/framework/constant_op.py", line 106, in convert_to_eager_tensor
return ops.EagerTensor(value, ctx.device_name, dtype)
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not ini
I should mention that the suggested environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' didn't make a difference.
I found that reducing the train_data input to roughly 700,000 items (train_data[:700000]) lets the script run without problems, but I don't know enough about the training process to understand why.
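As a sanity check on the numbers in the allocator dump, here is a small sketch that only redoes the arithmetic from the log; the constant names are my own, and every figure is copied from the output above. The two largest in-use GPU chunks are each 1806336000 bytes, the same size as the host allocations warned about at the top of the log. That might mean the whole array, rather than a single batch, ended up on the GPU, but I can't confirm it.

```python
# All figures are copied from the log output above, in bytes.
GPU_LIMIT = 3_808_755_712          # "Limit" / memory_limit_
IN_USE = 3_616_034_816             # "InUse"
BIG_CHUNK = 1_806_336_000          # size of each of the 2 largest in-use chunks
FAILED_REQUEST = 200_704_000       # "rounded to 200704000" in the OOM warning
LARGEST_FREE_CHUNK = 189_836_800   # last "Free at ..." entry in the dump

GIB = 1024 ** 3
MIB = 1024 ** 2

big_chunks_total = 2 * BIG_CHUNK
free_bytes = GPU_LIMIT - IN_USE

print(f"two big chunks: {big_chunks_total / GIB:.2f} GiB "
      f"({big_chunks_total / GPU_LIMIT:.0%} of the limit)")
print(f"free in pool:   {free_bytes / MIB:.1f} MiB")
print(f"requested:      {FAILED_REQUEST / MIB:.1f} MiB")
print(f"request fits in largest free chunk: {FAILED_REQUEST <= LARGEST_FREE_CHUNK}")
```

By this arithmetic the two large chunks alone take up about 95% of the memory TensorFlow was allowed to use, and the failing 191.4 MiB request is slightly larger than the biggest free chunk left over.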
My understanding is that the input array sits in system memory and is passed to the GPU in batch-sized chunks, so as long as the train_data array fits in system memory and a single batch fits in GPU memory, there shouldn't be a problem. Given that the 24GB GPU handles batch sizes roughly 240,000 times larger while having only 4 times the memory, and that system memory usage never really goes above ~60%, I don't understand why my PC cannot run this script.
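To make that mental model concrete, this is a minimal sketch of the batch-wise feeding I had assumed was happening. The helper name iter_batches and the toy data are hypothetical, and pairing each batch with itself assumes the model is an autoencoder that reconstructs its input, which the traceback above does not confirm.

```python
def iter_batches(data, batch_size):
    """Yield consecutive slices of `data`, each at most `batch_size` items long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Assumption: an autoencoder-style model, so the target equals the input.
        yield batch, batch


# Tiny stand-in for train_data (hypothetical values, not the real dataset).
toy_data = [[float(i)] * 4 for i in range(10)]
batches = list(iter_batches(toy_data, batch_size=4))

# Only one slice at a time would need to be moved to the GPU.
print([len(inputs) for inputs, _ in batches])
```

The same slicing works on a NumPy array. A generator like this could presumably be wrapped in a keras.utils.Sequence or passed to tf.data.Dataset.from_generator, but I have not tested whether feeding model.fit that way avoids the error on the 6GB card.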
Cheers,
Alex