Using enumerate to get partition columns from a DataFrame
I am trying to get all columns and their data types into one variable, and only the partition columns into another variable as a Python list.
I am getting the details from describe extended:
df = spark.sql("describe extended schema_name.table_name")
+----------------------------+-----------------------------+
|col_name                    |data_type                    |
+----------------------------+-----------------------------+
|col1                        |string                       |
|col2                        |int                          |
|col3                        |string                       |
|col4                        |int                          |
|col5                        |string                       |
|# Partition Information     |                             |
|# col_name                  |data_type                    |
|col4                        |int                          |
|col5                        |string                       |
|                            |                             |
|# Detailed Table Information|                             |
|Database                    |schema_name                  |
|Table                       |table_name                   |
|Owner                       |owner.name                   |
Converting the result into a list:
des_list = df.select(df.col_name, df.data_type).rdd.map(lambda x: (x[0], x[1])).collect()
Here is how I am trying to get all the columns (all items before '# Partition Information'):
all_cols_name_type = []
for index, item in enumerate(des_list):
    if item[0] == '# Partition Information':
        all_cols_name_type.append(des_list[:index])
For the partitions, I would like to get everything between the item '# col_name' and the blank line before '# Detailed Table Information'.
Any help to achieve this is appreciated.
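A minimal sketch of this enumerate-based slicing, assuming des_list is the collected list above and the marker rows match the output shown (partition_cols and part_start are just illustrative names):

all_cols_name_type = []
partition_cols = []
part_start = None

for index, item in enumerate(des_list):
    if item[0] == '# Partition Information':
        all_cols_name_type = des_list[:index]   # rows before the marker are the table columns
        part_start = index + 2                  # skip the marker row and the '# col_name' header
    elif part_start is not None and not item[0].strip():
        # the blank row just before '# Detailed Table Information' closes the partition block
        partition_cols = [name for name, _ in des_list[part_start:index]]
        break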
Comments (2)
You can try using the following answer, or an equivalent in Scala:
If the table is not defined in the catalog (e.g. reading a parquet dataset directly from S3 using spark.read.parquet("s3://path/...")), then you can use the following snippet in Scala:
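For the catalog-backed table in the question, a minimal PySpark sketch of splitting regular and partition columns (an illustration rather than the Scala snippet referenced above) could use spark.catalog.listColumns, whose Column entries carry an isPartition flag; the table and schema names below follow the question:

# Sketch: works when the table is registered in the catalog.
cols = spark.catalog.listColumns("table_name", dbName="schema_name")

all_cols_name_type = [(c.name, c.dataType) for c in cols]   # every column with its type
partition_cols = [c.name for c in cols if c.isPartition]    # partition columns only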
There is a trick to do so: you can use monotonically_increasing_id to give each row a number, find the row that has '# col_name', and get that index. Something like this:
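A rough PySpark sketch of that trick (an illustration, not the original code from the answer; it assumes df is the describe extended DataFrame from the question and that the partition block ends at the blank row before '# Detailed Table Information'):

from pyspark.sql import functions as F

# Number the rows of the describe output; for this small, single-partition
# DataFrame the monotonically increasing ids follow the original row order.
numbered = df.withColumn("idx", F.monotonically_increasing_id())

# Index of the '# col_name' header row that opens the partition block.
marker_idx = numbered.filter(F.col("col_name") == "# col_name").first()["idx"]

# Walk the rows after the marker; the first blank col_name closes the block.
rows_after = (numbered.filter(F.col("idx") > marker_idx)
                      .orderBy("idx")
                      .select("col_name")
                      .collect())

partition_cols = []
for row in rows_after:
    if not row.col_name.strip():   # blank separator row ends the partition block
        break
    partition_cols.append(row.col_name)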