Using a dictionary with natural keys in pandas/pyarrow
I am using the Parquet file format to store a structured version of raw data received in CSV format. For this I use Pandas to perform DataFrame manipulation and PyArrow to save the DataFrame in Parquet format. I also use a PyArrow schema as a means to perform some level of data validation.
My challenge here is to convert a string column to a categorical data type, and preferably I want to maintain the original codes from the raw data. For example, if the original data has a column Gender { "M", "F" }, I want to use "M" as the code for "Male" and "F" as the code for "Female". I don't want to use the numerical codes assigned by pd.Categorical, i.e. 0, 1. This may not be best practice from a Kimball methodology perspective, which advocates the use of surrogate keys rather than natural keys, but it's what I want to achieve now.
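To make the point about assigned codes concrete, the following minimal sketch (illustration only, not part of the pipeline in the question) shows the integer codes pd.Categorical produces for these values by default:

import pandas as pd

s = pd.Series(["SR", "C", "M", "SR", "M", pd.NA], dtype="category")
print(list(s.cat.categories))  # ['C', 'M', 'SR']
print(s.cat.codes.tolist())    # [2, 0, 1, 2, 1, -1]; -1 marks the missing value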
In the example below, I have the codes for EmployeeCategory { "SR", "C", "M" } and I have to create another column, EmployeeCategoryDescription, and store it together with the data. Is it possible to store only the original EmployeeCategory, but denote it as categorical and store its mapping somewhere as Parquet metadata?
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame(
    {
        'Name': ['John', 'James', 'Jack', 'Jonathan', 'Jerry', 'Joseph'],
        'EmployeeCategory': ["SR", "C", "M", "SR", "M", pd.NA]
    })

# Code -> description lookup for the EmployeeCategory column
employeeCategoryMap = {
    "SR": "Senior",
    "C": "Consultant",
    "M": "Manager"
}

# Currently the description has to be materialized as an extra column
df['EmployeeCategoryDescription'] = df['EmployeeCategory'].map(employeeCategoryMap).fillna(pd.NA)

schema = pa.schema([
    pa.field('Name', pa.string()),
    pa.field('EmployeeCategory', pa.string()),
    pa.field('EmployeeCategoryDescription', pa.string())
])

tbl = pa.Table.from_pandas(df, schema)
pq.write_table(tbl, 'test.parquet')
1 Answer
It's not clear to me where the employee category map is coming from. However, if your goal is to store it alongside a column for later use, then you could use field metadata. Here is a complete example.
The relevant pieces are:
Storing the mapping in the field metadata.
Retrieving and applying the categories.
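A minimal sketch of those two pieces follows; it re-creates the question's DataFrame so it runs standalone. The metadata key employee_category_map and the use of JSON to serialize the mapping are illustrative choices here, not anything PyArrow requires; field metadata simply round-trips through the Parquet file and comes back as bytes.

import json

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame(
    {
        'Name': ['John', 'James', 'Jack', 'Jonathan', 'Jerry', 'Joseph'],
        'EmployeeCategory': ["SR", "C", "M", "SR", "M", pd.NA]
    })

employeeCategoryMap = {
    "SR": "Senior",
    "C": "Consultant",
    "M": "Manager"
}

# Piece 1: store the mapping in the field metadata of EmployeeCategory.
# Only the original codes are written as column data.
schema = pa.schema([
    pa.field('Name', pa.string()),
    pa.field('EmployeeCategory', pa.string(),
             metadata={'employee_category_map': json.dumps(employeeCategoryMap)})
])

arrays = [
    pa.array(df['Name'], type=pa.string(), from_pandas=True),
    pa.array(df['EmployeeCategory'], type=pa.string(), from_pandas=True),
]
tbl = pa.Table.from_arrays(arrays, schema=schema)
pq.write_table(tbl, 'test.parquet')

# Piece 2: read the file back, retrieve the mapping from the field
# metadata (keys and values are returned as bytes) and apply it.
tbl2 = pq.read_table('test.parquet')
raw = tbl2.schema.field('EmployeeCategory').metadata[b'employee_category_map']
lookup = json.loads(raw)

df2 = tbl2.to_pandas()
df2['EmployeeCategoryDescription'] = df2['EmployeeCategory'].map(lookup)
df2['EmployeeCategory'] = df2['EmployeeCategory'].astype('category')
print(df2)

With this approach the descriptions and the categorical dtype never touch the file: only the short codes plus the JSON mapping in the field metadata are written, and both are reconstructed in memory after reading.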