Fuzzy-correct column values based on a list

Posted on 2025-01-09 22:08:04

I have a dirty dataframe as shown below

Region
Berlin
Munich
Berlin-Spandau
Spandau-Berlin
Shop-Munich
munich-rest
Frankfurt

I also have a list with the clean city names

city = ['Berlin','Munich','Frankfurt']

I need help creating a new column in the data frame with clean cities as shown

Region	Clean Region
Berlin	Berlin
Munich	Munich
Berlin-Spandau	Berlin
Spandau-Berlin	Berlin
Shop-Munich	Munich
munich-rest	Munich
Frankfurt-pla	Frankfurt

I am not sure how to create this column and need help doing it in Python.

默嘫て 2025-01-16 22:08:04

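You can use apply with a small helper function (wrapping it in a lambda is not needed).

city = ["Berlin", "Munich", "Frankfurt"]

def func(x):
    # Return the first clean city whose name appears in x, ignoring case
    for k in city:
        if k.lower() in x.lower():
            return k
    # No known city found: keep the original value
    return x

df["Clean Region"] = df["Region"].apply(func)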
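
As a possible vectorized alternative to the row-by-row helper above, here is a minimal sketch that uses `Series.str.extract` with a case-insensitive regular expression. The sample frame is rebuilt from the question; the names `pattern`, `found` and `canonical` are illustrative only and not part of the original answer.

```python
import re

import pandas as pd

cities = ["Berlin", "Munich", "Frankfurt"]
df = pd.DataFrame({"Region": ["Berlin", "Munich", "Berlin-Spandau", "Spandau-Berlin",
                              "Shop-Munich", "munich-rest", "Frankfurt"]})

# One capture group that matches any of the clean city names
pattern = "(" + "|".join(re.escape(c) for c in cities) + ")"

# First city found in each cell, in whatever case it was written (NaN if none)
found = df["Region"].str.extract(pattern, flags=re.IGNORECASE, expand=False)

# Map the lowercased hit back to its canonical spelling
canonical = {c.lower(): c for c in cities}

# Cells without a known city keep their original value
df["Clean Region"] = found.str.lower().map(canonical).fillna(df["Region"])
print(df)
```

Like the helper function, this keeps the original value when no city is found. One difference: when a cell contains several cities, it picks the one that appears first in the cell rather than the first one in the list.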

作业与我同在 2025-01-16 22:08:04

Assuming that you start from a list of cities, you could first use .str.contains to find which city is included in each cell:

>>> cities = ["Berlin", "Munich", "Frankfurt"]
>>> for city in cities:
...     df[city] = df["Region"].str.lower().str.contains(city.lower())

>>> df
           Region  Berlin  Munich  Frankfurt
0          Berlin    True   False      False
1          Munich   False    True      False
2  Berlin-Spandau    True   False      False
3  Spandau-Berlin    True   False      False
4     Shop-Munich   False    True      False
5     munich-rest   False    True      False
6       Frankfurt   False   False       True

Now you can use .melt to turn the city columns into a single "Clean Region" column of city names, and then .loc to keep only the rows where the match is True:

>>> df = df.melt(id_vars=["Region"], value_vars=["Berlin", "Munich", "Frankfurt"], var_name="Clean Region")
>>> df = df.loc[df["value"], ["Region", "Clean Region"]]
>>> df
            Region Clean Region
0           Berlin       Berlin
2   Berlin-Spandau       Berlin
3   Spandau-Berlin       Berlin
8           Munich       Munich
11     Shop-Munich       Munich
12     munich-rest       Munich
20       Frankfurt    Frankfurt
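
The melt approach above returns a filtered, re-indexed frame instead of adding a column to the original one. The following is a rough sketch of one way to write the result back to the original rows, assuming pandas 1.1 or later for the `ignore_index` parameter of `melt`. The names `flags`, `melted` and `matches` are illustrative and not from the answer.

```python
import pandas as pd

cities = ["Berlin", "Munich", "Frankfurt"]
df = pd.DataFrame({"Region": ["Berlin", "Munich", "Berlin-Spandau", "Spandau-Berlin",
                              "Shop-Munich", "munich-rest", "Frankfurt"]})

# Work on a copy so the boolean helper columns do not end up in df
flags = df.copy()
for city in cities:
    flags[city] = flags["Region"].str.lower().str.contains(city.lower(), regex=False)

# ignore_index=False keeps the original row labels through the melt
melted = flags.melt(id_vars=["Region"], value_vars=cities,
                    var_name="Clean Region", ignore_index=False)
matches = melted.loc[melted["value"], "Clean Region"]

# If a row matched several cities, keep only the first match
matches = matches[~matches.index.duplicated(keep="first")]

# Index alignment puts each city on its original row; unmatched rows get NaN
df["Clean Region"] = matches
print(df)
```

Rows that contain none of the cities end up as NaN here, whereas the filtered frame above drops them entirely.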
