Handling umlaut characters in PySpark

Posted 2025-02-01 19:11:52 · 925 words · 3 views · 0 comments

I have hit an issue: I want a pandas df created from a spark df to handle umlauted characters correctly.

This is a minimal reproducible example:

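from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# Reuses the active session if there is one (e.g. in a PySpark shell).
spark = SparkSession.builder.getOrCreate()

data = [("Citroën",)]  # a list containing one single-element tuple
schema = StructType([
    StructField("car", StringType(), True),
])
df = spark.createDataFrame(data=data, schema=schema)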

The spark df looks like this

+-------+
|    car|
+-------+
|Citroën|
+-------+

I want to convert the spark df into a pandas df. I tried this via df.toPandas(), and these are some of the outputs I get:

pdf = df.toPandas()
print(pdf)
print(pdf["car"].unique())

0  Citro??n
[u'Citro\xc3\xabn']

Question: How do I get Pandas to understand these special characters?

I searched forums and SO itself but could not find anything that works for me. I have tried setting PYTHONIOENCODING=utf8 as suggested by this. I have also tried adding # -*- coding: UTF-8 -*- to the .py file.

UPDATE 1

Converting the pandas df back to spark:

test_sdf = spark.createDataFrame(pdf)
test_sdf.show()
+--------+
|     car|
+--------+
|CitroÃ«n|
+--------+

习惯成性 2025-02-08 19:11:52

I think the encoding should be fine. To check, you could try a word that contains only plain ASCII letters.

But I think the problem is the data structure itself. Try moving the comma so that data contains a list of one tuple. Parentheses alone do not make a tuple, but a trailing comma inside them does, so the list then holds a single-element tuple.

data =[("Citroën",)]

I don't have any issues with pandas understanding these characters, so it may just be the way your system is displaying the output. You could test this by converting back to spark and checking whether it looks the same as before.
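One way to tell a display problem from a data problem is to look at the code points actually stored in the string rather than at what the terminal prints. The sketch below is plain Python 3 with no Spark or pandas involved; `describe` is a hypothetical helper name, and the two sample strings are written with escapes so the example does not depend on the file's own encoding.

```python
def describe(text):
    """Return the Unicode code points of text as 4-digit hex strings."""
    return [format(ord(ch), "04x") for ch in text]

# The intended value: 'ë' stored as a single character, U+00EB.
good = "Citro\u00ebn"

# The value seen in the question's round trip: 'Ã' (U+00C3) followed by '«' (U+00AB),
# which is what the UTF-8 bytes of 'ë' look like when read as Latin-1.
bad = "Citro\u00c3\u00abn"

print(len(good), describe(good))
print(len(bad), describe(bad))
```

If the value coming out of the pandas df has the extra character (eight characters instead of seven here), the data itself was mis-decoded somewhere upstream and the terminal is not to blame.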

Edit: showing pandas working.
This works fine for me:

import pandas as pd
print(pd.DataFrame({'car':['Citroën']}))

You could try:

pdf["car"] = pdf["car"].str.decode('utf-8')
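Note that `.str.decode('utf-8')` expects the column to hold byte strings. If the values are already text strings that look like `CitroÃ«n`, a common repair is to encode them back to Latin-1 and then decode as UTF-8. The following is a minimal sketch in plain Python 3, assuming the damage really is UTF-8 bytes that were read as Latin-1; `repair_mojibake` is a hypothetical helper, not part of pandas or PySpark.

```python
def repair_mojibake(text):
    """Undo a UTF-8-read-as-Latin-1 mix-up; return text unchanged if it does not apply."""
    try:
        # Latin-1 maps each character U+0000..U+00FF back to the same byte value,
        # which recovers the original UTF-8 byte sequence.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in Latin-1, or not valid UTF-8 afterwards:
        # assume the string was fine to begin with.
        return text

broken = "Citro\u00c3\u00abn"   # displays as CitroÃ«n
fixed = repair_mojibake(broken)
print(fixed)
```

On a pandas column the same idea could be applied with `pdf["car"].map(repair_mojibake)`. This only treats the symptom; if the environment is Python 2, as the `u'...'` prefix in the question's output suggests, declaring the literal as `u"Citroën"` or moving to Python 3 may avoid the mix-up in the first place.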

About the author

二货你真萌