Unable to rename/replace categories in a DataFrame after removing the Unicode u

Posted on 2025-01-17 21:38:13

I am trying to rename the categories in a DataFrame column after removing the Unicode prefix u with .replace('u', '', regex=True). The problem is that this call also removes every other 'u' in the text. I have tried both replace and rename_categories with a dictionary to map the categories to the desired format, but the values remain unchanged after the u is removed. Does anyone know a better way to approach this? I have attached a link to the CSV I am working with.

import io
import pandas as pd
from google.colab import files

uploaded = files.upload()
yelpdf = pd.read_csv(io.BytesIO(uploaded['yelp_reviews.csv']))
print(yelpdf['NoiseLevel'].value_counts())
yelpdf['NoiseLevel'] = yelpdf['NoiseLevel'].astype(str)
update_NoiseLevel = {'average': 'Average', 'lod': 'Loud', 'qiet': 'Quiet', 'very_lod': 'Very Loud'}
# This removes every 'u' in the value, not only the leading Unicode prefix.
yelpdf['NoiseLevel'] = yelpdf['NoiseLevel'].replace('u','',regex=True)
yelpdf['NoiseLevel'] = yelpdf['NoiseLevel'].astype('category')
yelpdf['NoiseLevel'] = yelpdf['NoiseLevel'].cat.rename_categories(update_NoiseLevel)
yelpdf['NoiseLevel'] = yelpdf['NoiseLevel'].replace(update_NoiseLevel)

print(yelpdf['NoiseLevel'].value_counts())

It's a CSV file with Yelp data, and the issue occurs in the NoiseLevel column.
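
For illustration, here is a minimal sketch of what the global replacement does. It assumes the raw values look like u'loud'; the sample values below are made up and not taken from the CSV.

```python
import pandas as pd

# Hypothetical sample values imitating the u'...' format in the NoiseLevel column.
noise = pd.Series(["u'loud'", "u'quiet'", "u'very_loud'"])

# With regex=True, Series.replace substitutes every match inside each string,
# so all 'u' characters disappear, not just the leading prefix.
stripped = noise.replace("u", "", regex=True)

print(stripped.tolist())
```

If the values do carry quotes like this, the quotes survive the replacement, so a key such as 'lod' never matches "'lod'" exactly. That may be why the mapping appears to have no effect.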


Comments (2)

っ左 2025-01-24 21:38:13

Try str.extract before creating the category (if needed):

df = pd.read_excel('yelp_reviews.xlsx')
df['NoiseLevel'] = df['NoiseLevel'].str.extract("(?:u')?([^']*)")

Output:

>>> df['NoiseLevel'].unique()
array(['average', 'quiet', nan, 'loud', 'very_loud'], dtype=object)

>>> df['NoiseLevel'].head(10)
0    average
1    average
2    average
3    average
4    average
5    average
6    average
7      quiet
8        NaN
9        NaN
Name: NoiseLevel, dtype: object
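
As a rough sketch of how that pattern behaves, the snippet below runs it on a few made-up values: an optional u' prefix is skipped and everything up to the next quote is captured. The second pattern is an assumption added here, not part of the answer; it also skips a bare leading quote, which the original pattern would turn into an empty string.

```python
import pandas as pd

# Hypothetical sample values imitating the mixed formats discussed in this thread.
raw = pd.Series(["u'average'", "u'quiet'", "loud", "'very_loud'", None])

# Pattern from the answer: optional u' prefix, then capture up to the next quote.
# expand=False returns a Series instead of a one-column DataFrame.
from_answer = raw.str.extract(r"(?:u')?([^']*)", expand=False)

# Variant (an assumption, not from the answer): also skip a bare leading quote.
tolerant = raw.str.extract(r"^(?:u')?'?([^']*)", expand=False)

print(from_answer.tolist())
print(tolerant.tolist())
```

Because only the prefix is consumed, the 'u' inside "quiet" and "loud" is left alone, and missing values stay missing.
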
枕头说它不想醒 2025-01-24 21:38:13

I posted a solution last night, but it was incorrect.
This is the corrected one:

Basically, all the issues arise from the way the data is stored. In my case (in my Google Colab notebook), some values came in with a "u'" prefix and others with just the first quote.

This inconsistency is the source of all the problems, and in particular it is why the renaming is not working.
Your names are also misspelled (no?):
update_NoiseLevel = {'average': 'Average', 'lod': 'Loud', 'qiet': 'Quiet', 'very_lod': 'Very Loud'}
It should be:
update_NoiseLevel = {'average': 'Average', 'loud': 'Loud', 'quiet': 'Quiet', 'very_loud': 'Very Loud'}

Before you upload the CSV to Colab, you need to make a small adjustment to the "NoiseLevel" column so that the values are quoted consistently.
I used this formula in Excel:

FORMULA =IF(H2="","unclassified",IF(LEFT(H2,1)<>"u",CONCATENATE("'",H2),IF(LEFT(H2,1)="u",H2)))

H2 is the "NoiseLevel" column in Excel. You could also do this on the CSV using Python.
For the empty ones (that is, NaN), I used the term "unclassified".
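
A possible Python counterpart to that Excel step is sketched below. It is not the formula itself: it assumes the goal is simply to end up with clean labels, so it strips the u' prefix and any quotes with an anchored regex and fills missing values with "unclassified". The sample values are made up.

```python
import pandas as pd

# Hypothetical sample values with the inconsistent quoting described above.
raw = pd.Series(["u'average'", "'quiet'", "u'very_loud'", "loud", None])

cleaned = (
    raw
    # ^u' only matches the prefix at the start of the string, so the 'u'
    # inside words such as "quiet" or "loud" is preserved; the second
    # alternative removes any remaining single quotes.
    .str.replace(r"^u'|'", "", regex=True)
    # Missing values get an explicit label, mirroring the Excel formula.
    .fillna("unclassified")
)

print(cleaned.tolist())
```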

Steps to resolve this inconsistency and make the renaming work:

1. Import the file in Colab:

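 import io
 import pandas as pd
 from google.colab import files

 # Upload the CSV through the Colab file picker; the result maps filename to bytes.
 uploaded = files.upload()
 df = pd.read_csv(io.StringIO(uploaded['stelios_copy_of_yelp_reviews - yelp_reviews (4).csv'].decode('utf-8')))
 df.head()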

2. Create a function that removes all the quotes and the "u'" prefix from the values in that column, and apply it:

def remove_another(x):
  # Convert to str so NaN and other non-string values can be processed too.
  string = str(x)
  # Drop the u' prefix first, then any remaining single quotes.
  aaa = string.replace("u'", "").replace("'", "")
  return aaa

df['NoiseLevel_u_removed'] = df['NoiseLevel_modified_excel'].apply(remove_another)

3. Correct the mapping dictionary:

update_NoiseLevel = {'average': 'Average', 'loud': 'Loud', 'quiet': 'Quiet', 'very_loud': 'Very Loud', 'unclassified': 'Unclassified'}

4. Finally, convert the column to a category and rename the categories:

print(df['NoiseLevel_u_removed'].value_counts())

df['NoiseLevel_category'] = df['NoiseLevel_u_removed'].astype('category')

df['NoiseLevel_u_removed'] = df['NoiseLevel_category'].cat.rename_categories(update_NoiseLevel)
df['NoiseLevel_u_removed'][0:23]
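
The last step can be tried on its own with a small in-memory example. This is only a sketch: the Series below is made up and stands in for the already-cleaned column.

```python
import pandas as pd

# Hypothetical, already-cleaned values standing in for df['NoiseLevel_u_removed'].
cleaned = pd.Series(["average", "loud", "quiet", "very_loud", "unclassified", "average"])

update_NoiseLevel = {
    "average": "Average",
    "loud": "Loud",
    "quiet": "Quiet",
    "very_loud": "Very Loud",
    "unclassified": "Unclassified",
}

# Convert to a categorical first, then rename the categories via the mapping.
as_category = cleaned.astype("category")
renamed = as_category.cat.rename_categories(update_NoiseLevel)

print(list(renamed.cat.categories))
print(renamed.value_counts())
```

Categories that are missing from the dictionary are passed through unchanged, so a misspelled key fails silently rather than raising an error.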

I hope this helps!
