Handling umlauted characters in PySpark
I have run into an issue where I want a pandas df created from a Spark df to preserve umlauted characters.
This is a minimal reproducible example:
from pyspark.sql.types import StructType, StructField, StringType

data = [("Citroën",)]
schema = StructType([
    StructField("car", StringType(), True),
])
df = spark.createDataFrame(data=data, schema=schema)
The Spark df looks like this:
+-------+
|    car|
+-------+
|Citroën|
+-------+
I want to convert the Spark df into a pandas df. I do this via df.toPandas(), and these are some of the outputs I get:
pdf = df.toPandas()
print(pdf)
print(pdf["car"].unique())
0 Citro??n
[u'Citro\xc3\xabn']
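The second output above is a telltale sign of mojibake: the UTF-8 bytes for ë (0xC3 0xAB) appear to have been decoded as Latin-1. If that diagnosis is right, a round trip recovers the text (a diagnostic sketch only, not a fix for the Spark-to-pandas pipeline itself):

```python
# u'Citro\xc3\xabn' is 'Citroën' whose UTF-8 bytes were read as Latin-1.
garbled = u'Citro\xc3\xabn'

# Re-encode as Latin-1 to get the raw bytes back, then decode as UTF-8.
repaired = garbled.encode('latin-1').decode('utf-8')
print(repaired)  # Citroën
```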
Question: How do I get Pandas to understand these special characters?
I have tried browsing forums and SO itself, but cannot find anything that works for me. I have tried setting PYTHONIOENCODING=utf8 as suggested by this. I have also tried adding # -*- coding: UTF-8 -*- to the .py file.
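One way to confirm whether the PYTHONIOENCODING setting actually took effect is to inspect the interpreter's stdout encoding (a quick check, not a fix):

```python
import sys

# If PYTHONIOENCODING=utf8 was picked up by the interpreter, stdout
# should report a UTF-8 codec; an ASCII-ish codec here would explain
# umlauted characters being mangled on print.
print(sys.stdout.encoding)
```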
UPDATE 1
Converting the pandas df back to spark:
test_sdf = spark.createDataFrame(pdf)
test_sdf.show()
+--------+
| car|
+--------+
|Citroën|
+--------+
1 Answer
I think the encoding should be fine. To check, you could try a word containing only regular letters.
But I think the problem is the data structure itself. Try moving the comma, so data contains a list of one tuple. The parentheses by themselves won't make a tuple, but putting the comma in there will force it into a tuple in the list.
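The distinction described above can be checked directly in plain Python (illustrative values, not the asker's actual data):

```python
# Parentheses alone do not create a tuple; the trailing comma does.
just_a_string = ("Citroën")   # just a parenthesized string
one_tuple = ("Citroën",)      # a 1-element tuple

print(type(just_a_string).__name__)  # str
print(type(one_tuple).__name__)      # tuple
```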
I don't have any issues with pandas understanding these characters - it may just be the way your system is displaying the output. You could test this by converting back to spark and see if it looks the same as before.
Edit - showing pandas working...
This works fine for me:
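The original code sample appears to have been lost from this post; a minimal sketch of what "pandas working" likely looked like, using plain pandas without Spark:

```python
import pandas as pd

# Plain pandas stores and round-trips the umlauted string correctly,
# which suggests the '??' in the question is a display/encoding issue
# rather than a pandas one.
pdf = pd.DataFrame({"car": ["Citroën"]})
print(pdf["car"].unique())
```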
You could try: