Unable to rename/replace categories in a DataFrame after removing the unicode u

Posted on 2025-01-17 21:38:13

I am trying to rename the categories in a DataFrame column after removing the unicode u prefix with .replace('u', '', regex=True). The trouble is that this call also removes every other 'u' in the text. I then tried both replace and rename_categories with a dictionary that maps the categories to the desired format, but the values remain unchanged after the u is removed. Does anyone know a better way to approach this? I have attached a link to the CSV I am working with.

import io

import pandas as pd
from google.colab import files  # assumes a Google Colab notebook

uploaded = files.upload()
yelpdf = pd.read_csv(io.BytesIO(uploaded['yelp_reviews.csv']))
print(yelpdf['NoiseLevel'].value_counts())
yelpdf['NoiseLevel'] = yelpdf['NoiseLevel'].astype(str)
# Keys are spelled without 'u' because the replace below strips every 'u'
update_NoiseLevel = {'average': 'Average', 'lod': 'Loud', 'qiet': 'Quiet', 'very_lod': 'Very Loud'}
yelpdf['NoiseLevel'] = yelpdf['NoiseLevel'].replace('u','',regex=True)
yelpdf['NoiseLevel'] = yelpdf['NoiseLevel'].astype('category')
yelpdf['NoiseLevel'] = yelpdf['NoiseLevel'].cat.rename_categories(update_NoiseLevel)
yelpdf['NoiseLevel'] = yelpdf['NoiseLevel'].replace(update_NoiseLevel)

print(yelpdf['NoiseLevel'].value_counts())

It's a CSV file with Yelp data, and the issue occurs in the NoiseLevel column.

っ左 2025-01-24 21:38:13

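Try str.extract before creating the category (if needed):

import pandas as pd

df = pd.read_excel('yelp_reviews.xlsx')
# Skip an optional leading u' and keep everything up to the next quote
df['NoiseLevel'] = df['NoiseLevel'].str.extract("(?:u')?([^']*)", expand=False)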

Output:

>>> df['NoiseLevel'].unique()
array(['average', 'quiet', nan, 'loud', 'very_loud'], dtype=object)

>>> df['NoiseLevel'].head(10)
0    average
1    average
2    average
3    average
4    average
5    average
6    average
7      quiet
8        NaN
9        NaN
Name: NoiseLevel, dtype: object
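
To connect this back to the renaming in the question, here is a minimal sketch of the full clean-then-rename flow. The sample Series is made up for illustration (it is not the real Yelp file), and the pattern is a slightly stricter variant of the one above: it is anchored at the start and also tolerates values that have only a leading quote or no quote at all.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the NoiseLevel column.
raw = pd.Series(["u'average'", "'quiet'", np.nan, "u'very_loud'", "loud"])

# Optionally skip a leading u' or ', then capture everything up to the next quote.
# A bare word that merely starts with "u" is left alone because the u is only
# dropped when a quote follows it.
cleaned = raw.str.extract(r"^(?:u?')?([^']+)", expand=False)

# Convert to a categorical and rename with correctly spelled keys.
labels = {"average": "Average", "loud": "Loud", "quiet": "Quiet", "very_loud": "Very Loud"}
renamed = cleaned.astype("category").cat.rename_categories(labels)

print(renamed)
```

Missing values stay missing here; they are not turned into a category of their own.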

枕头说它不想醒 2025-01-24 21:38:13

I posted a solution last night, but it was incorrect.
This is the corrected one:

Basically, all the issues arise from the way the data is stored. In my case (in my Google Colab notebook), some values came in with a leading u' and others with just the first quote.

This inconsistency is the source of all the problems, and in particular it is why the renaming does not work.
Your names are also misspelled (no?):
update_NoiseLevel = {'average': 'Average', 'lod': 'Loud', 'qiet': 'Quiet', 'very_lod': 'Very Loud'}
It should be:
update_NoiseLevel = {'average': 'Average', 'loud': 'Loud', 'quiet': 'Quiet', 'very_loud': 'Very Loud'}
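
As a small illustration of why the rename appears to do nothing, the sketch below uses made-up values that mimic what .replace('u', '', regex=True) leaves behind: the u is gone, but so are the other u's, and the quotes are still there. rename_categories with a dictionary passes through any category that is not a key, so no error is raised and nothing changes.

```python
import pandas as pd

# Hypothetical values: quotes still attached, every 'u' stripped.
mangled = pd.Series(["'average'", "'qiet'", "'lod'"]).astype("category")

# The keys have no quotes, so none of them matches an existing category.
labels = {"average": "Average", "qiet": "Quiet", "lod": "Loud"}
unchanged = mangled.cat.rename_categories(labels)

print(list(unchanged.cat.categories))
```

The printed categories still carry their quotes, which is the "remains unchanged" symptom from the question.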

Before you upload the CSV to Colab, you need to make a small adjustment to the NoiseLevel column so that all values have quotes on both sides.
I used (in Excel):

FORMULA =IF(H2="","unclassified",IF(LEFT(H2,1)<>"u",CONCATENATE("'",H2),IF(LEFT(H2,1)="u",H2)))

H2 is the NoiseLevel column in Excel. You could also do this on the CSV using Python.
For the empty values (NaN), I used the term "unclassified".
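
If you would rather skip Excel, a rough pandas equivalent of that adjustment might look like the sketch below. It is not a translation of the formula: instead of adding the missing quote it removes the u' prefix and any stray quotes directly, and it fills empty cells with "unclassified". The sample values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the inconsistent quoting described above.
raw = pd.Series(["u'average'", "quiet'", np.nan, "u'loud'"])

normalized = (
    raw.fillna("unclassified")               # empty cells get their own label
       .str.replace(r"^u'", "", regex=True)  # drop the u' prefix only at the start
       .str.strip("'")                       # remove leftover leading/trailing quotes
)

print(normalized.tolist())
```

Anchoring the pattern at the start of the string keeps the u in "unclassified" and in "loud" intact.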

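Steps to resolve this inconsistency and make the renaming work:

1. Import the file in Colab:

import io

import pandas as pd
from google.colab import files  # assumes a Google Colab notebook

uploaded = files.upload()  # pick the CSV in the upload dialog
df = pd.read_csv(io.StringIO(uploaded['stelios_copy_of_yelp_reviews - yelp_reviews (4).csv'].decode('utf-8')))
df.head()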

2. Create a function that removes every u' prefix and every remaining quote from the values in that column, and apply it:

def remove_another(x):
    # Strip the u' prefix first, then any single quotes that are left
    return str(x).replace("u'", "").replace("'", "")

df['NoiseLevel_u_removed'] = df['NoiseLevel_modified_excel'].apply(remove_another)

3. Correct the mapping and run the remaining commands:

update_NoiseLevel = {'average': 'Average', 'loud': 'Loud', 'quiet': 'Quiet', 'very_loud': 'Very Loud', 'unclassified': 'Unclassified'}

print(df['NoiseLevel_u_removed'].value_counts())

# Convert to a categorical, then rename the categories via the mapping
df['NoiseLevel_category'] = df['NoiseLevel_u_removed'].astype('category')

df['NoiseLevel_u_removed'] = df['NoiseLevel_category'].cat.rename_categories(update_NoiseLevel)
df['NoiseLevel_u_removed'].head(23)

I hope this helps!
