Batch-wise converting parquet to Arrow with categorical values: Arrow IPC files only support a single non-delta dictionary for a given field across all batches


I have a large parquet file with categorical/dictionary values in its schema (dictionary<values=string, indices=int32, ordered=0>), and I'm trying to convert the parquet file into the pyarrow IPC format using RecordBatchFileWriter, but I'm getting this error:

pyarrow.lib.ArrowInvalid: Dictionary replacement detected when writing IPC file format. Arrow IPC files only support a single non-delta dictionary for a given field across all batches.

Since I'm reading batch by batch, Arrow IPC can't figure out the dictionary values (which makes perfect sense). But how can I work around this? I can't read the dataset into RAM (too large), but I'm happy to read the dictionary columns separately one by one and pre-compute the categoricals - but I can't figure out how to do that.

My best guess would be to somehow pre-compute the categorical values based on the parquet file, but how? (Two possible sketches follow after the traceback below.)

>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> from pyarrow import fs

>>> parquet_file = pq.ParquetFile('data.parq')
>>> local = fs.LocalFileSystem()
>>> with local.open_output_stream("data.arrow") as file:
...     with pa.RecordBatchFileWriter(file, parquet_file.schema_arrow) as writer:
...         for record_batch in parquet_file.iter_batches():
...             writer.write_batch(record_batch)
...

Traceback (most recent call last):
  File "arrow-playground.py", line 47, in <module>
    writer.write_batch(record_batch)
  File "pyarrow/ipc.pxi", line 483, in pyarrow.lib._CRecordBatchWriter.write_batch
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Dictionary replacement detected when writing IPC file format. Arrow IPC files only support a single non-delta dictionary for a given field across all batches.

Regards,
Niklas
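
One way to realize the pre-compute idea is sketched below: read only the dictionary column up front, unify its per-chunk dictionaries into one global dictionary, then retarget each batch's indices against that dictionary before writing, so the writer only ever sees a single dictionary per field. This is a minimal, unverified sketch: the column name 'category' is hypothetical, it assumes that one column fits in RAM, and it leans on Table.unify_dictionaries and pyarrow.compute's index_in/take.

import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq
from pyarrow import fs

parquet_file = pq.ParquetFile('data.parq')

# Read just the dictionary column and merge its per-chunk dictionaries
# into one global dictionary ('category' is a hypothetical column name).
unified = (parquet_file.read(columns=['category'])
           .unify_dictionaries()
           .column('category').chunk(0).dictionary)

local = fs.LocalFileSystem()
with local.open_output_stream("data.arrow") as file:
    with pa.RecordBatchFileWriter(file, parquet_file.schema_arrow) as writer:
        for record_batch in parquet_file.iter_batches():
            arrays = []
            for arr in record_batch.columns:
                if pa.types.is_dictionary(arr.type):
                    # Where each batch-local dictionary value sits in the
                    # unified dictionary, then retarget the batch's indices.
                    remap = pc.index_in(arr.dictionary, value_set=unified)
                    arr = pa.DictionaryArray.from_arrays(
                        pc.take(remap, arr.indices), unified)
                arrays.append(arr)
            writer.write_batch(pa.RecordBatch.from_arrays(
                arrays, names=record_batch.schema.names))

Since every rewritten batch points at the identical unified dictionary, the file writer should never detect a replacement.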

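Alternatively, if a somewhat larger file is acceptable, it may be simpler to drop dictionary encoding from the IPC file altogether and decode each batch to plain values, which sidesteps the single-dictionary restriction entirely. Again a hedged sketch, not a verified solution:

import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs

parquet_file = pq.ParquetFile('data.parq')

# Build a target schema with every dictionary field decoded to its value
# type (e.g. dictionary<values=string, indices=int32> -> string).
dense_schema = pa.schema([
    pa.field(f.name, f.type.value_type, nullable=f.nullable)
    if pa.types.is_dictionary(f.type) else f
    for f in parquet_file.schema_arrow
])

local = fs.LocalFileSystem()
with local.open_output_stream("data.arrow") as file:
    with pa.RecordBatchFileWriter(file, dense_schema) as writer:
        for record_batch in parquet_file.iter_batches():
            # Table.cast performs the dictionary -> string decode per batch.
            writer.write_table(
                pa.Table.from_batches([record_batch]).cast(dense_schema))

And if the IPC file format itself is not a hard requirement, the IPC stream format (pa.ipc.new_stream) may also be worth testing, since the quoted error is explicitly about the file format.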