How to provide a Parquet schema when writing a Parquet file with PyArrow

Posted on 2025-01-23 08:04:57 · 375 characters · 2 views · 0 comments

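I have raw input CSV data in which every field is a string. I want to convert this CSV to Parquet format, and I want to write it with a custom schema for the data. I am using PyArrow for the CSV-to-Parquet conversion.

How can I provide a custom schema when writing the file to Parquet with PyArrow?

Here is the code I used: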

import pyarrow as pa 
import pyarrow.parquet as pq

# records is a list of lists containing the rows of the csv
table = pa.Table.from_pylist(records)
pq.write_table(table,"sample.parquet")

Comments (1)

万劫不复 2025-01-30 08:04:57

Could you give an example of records? If I try to use a list of lists as suggested, it fails:

>>> pa.Table.from_pylist([["1", "2"], ["first", "second"]])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 3682, in pyarrow.lib.Table.from_pylist
    return _from_pylist(cls=Table,
  File "pyarrow/table.pxi", line 5199, in pyarrow.lib._from_pylist
    names = list(mapping[0].keys())
AttributeError: 'list' object has no attribute 'keys'

From the documentation, I would expect records to be a list of dicts:

    data = [{'strs': '', 'floats': 4.5},
        {'strs': 'foo', 'floats': 5},
        {'strs': 'bar', 'floats': None}]
    table = pa.Table.from_pylist(data)

In that case, you can pass the schema when building the table with from_pylist:

schema = pa.schema([('a', pa.int64()),
                    ('c', pa.int32()),
                    ('d', pa.int16())
                    ])
table = pa.Table.from_pylist(
    [{'a': 1, 'b': 3}, {'a': 2, 'b': 4}, {'a': 3, 'b': 5}],
    schema=schema
)
data = [{'a': 1, 'c': None, 'd': None},
        {'a': 2, 'c': None, 'd': None},
        {'a': 3, 'c': None, 'd': None}]
assert table.schema == schema
assert table.to_pylist() == data
