Using a dictionary with natural keys in pandas/pyarrow
I am using the Parquet file format to store a structured version of raw data received in CSV format. For this I use Pandas to perform DataFrame manipulation and PyArrow to save the DataFrame in Parquet format. I also use a PyArrow schema as a means to perform some level of data validation.
My challenge here is to convert a string column to a categorical data type, and preferably I want to maintain the original codes from the raw data. For example, if the original data has a column Gender { "M", "F" }, I want to use "M" as the code for "Male" and "F" as the code for "Female". I don't want to use the numerical codes assigned by pd.Categorical, i.e. 0, 1. This may not be best practice from a Kimball methodology perspective, which advocates the use of surrogate keys rather than natural keys, but it's what I want to achieve now.
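To make the point about assigned codes concrete, the following minimal sketch (illustration only, not part of the pipeline in the question) shows the integer codes pd.Categorical produces for these values by default:

import pandas as pd

s = pd.Series(["SR", "C", "M", "SR", "M", pd.NA], dtype="category")
print(list(s.cat.categories))  # ['C', 'M', 'SR']
print(s.cat.codes.tolist())    # [2, 0, 1, 2, 1, -1]; -1 marks the missing value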
In the example below, I have the codes for EmployeeCategory { "SR", "C", "M" } and I have to create another column, EmployeeCategoryDescription, and store it together with the data. Is it possible to store only the original EmployeeCategory, but denote it as categorical and store its mapping somewhere as Parquet metadata?
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame(
    {
        'Name': ['John', 'James', 'Jack', 'Jonathan', 'Jerry', 'Joseph'],
        'EmployeeCategory': ["SR", "C", "M", "SR", "M", pd.NA]
    })

# Code -> description lookup for the EmployeeCategory column
employeeCategoryMap = {
    "SR": "Senior",
    "C": "Consultant",
    "M": "Manager"
}

# Currently the description has to be materialized as an extra column
df['EmployeeCategoryDescription'] = df['EmployeeCategory'].map(employeeCategoryMap).fillna(pd.NA)

schema = pa.schema([
    pa.field('Name', pa.string()),
    pa.field('EmployeeCategory', pa.string()),
    pa.field('EmployeeCategoryDescription', pa.string())
])

tbl = pa.Table.from_pandas(df, schema)
pq.write_table(tbl, 'test.parquet')
1 Answer
It's not clear to me where the employee category map is coming from. However, if your goal is to store it alongside a column for later use, then you could use field metadata. Here is a complete example.
The relevant pieces are:
Storing the mapping in the field metadata.
Retrieving and applying the categories.
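A minimal sketch of those two pieces follows; it re-creates the question's DataFrame so it runs standalone. The metadata key employee_category_map and the use of JSON to serialize the mapping are illustrative choices here, not anything PyArrow requires; field metadata simply round-trips through the Parquet file and comes back as bytes.

import json

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame(
    {
        'Name': ['John', 'James', 'Jack', 'Jonathan', 'Jerry', 'Joseph'],
        'EmployeeCategory': ["SR", "C", "M", "SR", "M", pd.NA]
    })

employeeCategoryMap = {
    "SR": "Senior",
    "C": "Consultant",
    "M": "Manager"
}

# Piece 1: store the mapping in the field metadata of EmployeeCategory.
# Only the original codes are written as column data.
schema = pa.schema([
    pa.field('Name', pa.string()),
    pa.field('EmployeeCategory', pa.string(),
             metadata={'employee_category_map': json.dumps(employeeCategoryMap)})
])

arrays = [
    pa.array(df['Name'], type=pa.string(), from_pandas=True),
    pa.array(df['EmployeeCategory'], type=pa.string(), from_pandas=True),
]
tbl = pa.Table.from_arrays(arrays, schema=schema)
pq.write_table(tbl, 'test.parquet')

# Piece 2: read the file back, retrieve the mapping from the field
# metadata (keys and values are returned as bytes) and apply it.
tbl2 = pq.read_table('test.parquet')
raw = tbl2.schema.field('EmployeeCategory').metadata[b'employee_category_map']
lookup = json.loads(raw)

df2 = tbl2.to_pandas()
df2['EmployeeCategoryDescription'] = df2['EmployeeCategory'].map(lookup)
df2['EmployeeCategory'] = df2['EmployeeCategory'].astype('category')
print(df2)

With this approach the descriptions and the categorical dtype never touch the file: only the short codes plus the JSON mapping in the field metadata are written, and both are reconstructed in memory after reading.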