Using RegEx in replace() to remap values in a Pandas column based on dictionary key/value pairs

Posted 2025-01-11 12:43:08

I have the following Pandas dataframe:

import numpy as np
import pandas as pd

foo = {
    "first_name" : ["John", "Sally", "Mark", "Jane", "Phil"],
    "last_name" : ["O'Connor", "Jones P.", "Williams", "Connors", "Lee"],
    "salary" : [101000, 50000, 56943, 330532, 92750],
}

df = pd.DataFrame(foo)

I'd like to be able to validate column data using a RegEx pattern, then replace with NaN if the validation fails.

To do this, I use the following hard-coded RegEx patterns in the .replace() method:

df[['first_name']] = df[['first_name']].replace(r'[^A-Za-z \/\-\.\']', np.nan, regex=True)
df[['last_name']] = df[['last_name']].replace(r'[^A-Za-z \/\-\.\']', np.nan, regex=True)
df[['salary']] = df[['salary']].replace('[^0-9 ]', np.nan, regex=True)

This approach works, but I have 15-20 columns, so it is going to be difficult to maintain.

I'd like to set up a dictionary that looks as follows:

regex_patterns = {
    'last_name' : r'[^A-Za-z \/\-\.\']',
    'first_name' : r'[^A-Za-z \/\-\.\']',
    'salary' : '[^0-9 ]'
}

Then, I'd like to pass a value to the .replace() function based on the name of the column in the df. It would look as follows:

df[['first_name']] = df[['first_name']].replace('<reference_to_regex_patterns_dictionary>', np.nan, regex=True)
df[['last_name']] = df[['last_name']].replace('<reference_to_regex_patterns_dictionary>', np.nan, regex=True)
df[['salary']] = df[['salary']].replace('<reference_to_regex_patterns_dictionary>', np.nan, regex=True)

How would I reference the name of the df column, then use that to look up the key in the dictionary and get its associated value?

For example, look up first_name, then access its dictionary value [^A-Za-z \/\-\.\'] and pass this value into .replace()?

Thanks!

P.S. If there is a more elegant approach, I'm all ears.

说好的呢 2025-01-18 12:43:09

You can pass a nested dictionary of the form {'col': {'match': 'replacement'}} to replace().

In your case:

d = {k: {v: np.nan} for k, v in regex_patterns.items()}

# replace() returns a new DataFrame, so assign the result back
df = df.replace(d, regex=True)
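
As a rough end-to-end illustration of the nested-dictionary form, the sketch below runs it on a small frame. The sample values (including the deliberately invalid "S@lly" and "Will1ams") and the names `nested` and `cleaned` are made up for this example and are not from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with two deliberately invalid entries
df = pd.DataFrame({
    "first_name": ["John", "S@lly", "Mark"],
    "last_name": ["O'Connor", "Jones P.", "Will1ams"],
})

# One "invalid character" pattern per column (raw strings avoid escape warnings)
regex_patterns = {
    "first_name": r"[^A-Za-z \/\-\.']",
    "last_name": r"[^A-Za-z \/\-\.']",
}

# Build {'column': {'pattern': replacement}} and apply it in a single call
nested = {col: {pattern: np.nan} for col, pattern in regex_patterns.items()}
cleaned = df.replace(nested, regex=True)

print(cleaned)
```

Because the replacement is NaN rather than a string, any cell in which the pattern matches anywhere is replaced as a whole, which is the validation behaviour the question asks for.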

怎樣才叫好 2025-01-18 12:43:08

One approach is to loop over the columns attribute:

regex_patterns = {
    'last_name' : r'[^A-Za-z \/\-\.\']',
    'first_name' : r'[^A-Za-z \/\-\.\']',
    'salary' : '[^0-9 ]'
}
# Every column in df needs an entry in regex_patterns, otherwise this raises KeyError
for column in df.columns:
    df[column] = df[column].replace(regex_patterns[column], np.nan, regex=True)
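
One caveat with either answer: regex replacement in pandas only acts on string values, so an integer column such as salary is left untouched by replace(). A possible workaround, sketched below, is to test each column's text form and mask the offending cells. The helper name `mask_invalid`, the sample values, and the extra `middle_name` pattern are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

def mask_invalid(df, patterns):
    """Return a copy of df where cells matching a column's 'invalid' pattern become NaN."""
    out = df.copy()
    for column, pattern in patterns.items():
        if column not in out.columns:
            continue  # ignore patterns for columns that are not in the frame
        # Cast to str so numeric columns are checked as text too
        invalid = out[column].astype(str).str.contains(pattern, regex=True)
        out[column] = out[column].mask(invalid)
    return out

# Hypothetical sample data: one bad name and one negative salary
df = pd.DataFrame({
    "first_name": ["John", "S@lly", "Mark"],
    "salary": [101000, -50000, 56943],
})
patterns = {
    "first_name": r"[^A-Za-z \/\-\.']",
    "salary": "[^0-9 ]",
    "middle_name": "[^A-Za-z]",  # no such column; skipped by the helper
}

cleaned = mask_invalid(df, patterns)
print(cleaned)
```

Looping over the dictionary rather than over df.columns means columns without a pattern are simply left alone. Note that masking an integer column converts it to float, since NaN has no integer representation in the default dtypes.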
