Using RegEx in replace() to remap values in Pandas columns based on dictionary key/value pairs

Posted 2025-01-11 12:43:08 · 1514 characters · 0 views · 0 comments

I have the following Pandas dataframe:

import numpy as np
import pandas as pd

foo = {
    "first_name" : ["John", "Sally", "Mark", "Jane", "Phil"],
    "last_name" : ["O'Connor", "Jones P.", "Williams", "Connors", "Lee"],
    "salary" : [101000, 50000, 56943, 330532, 92750],
}

df = pd.DataFrame(foo)

I'd like to be able to validate column data using a RegEx pattern, then replace with NaN if the validation fails.

To do this, I use the following hard-coded RegEx patterns in the .replace() method:

df[['first_name']] = df[['first_name']].replace(r"[^A-Za-z \/\-\.']", np.nan, regex=True)
df[['last_name']] = df[['last_name']].replace(r"[^A-Za-z \/\-\.']", np.nan, regex=True)
df[['salary']] = df[['salary']].replace('[^0-9 ]', np.nan, regex=True)

This approach works, but I have 15-20 columns, so repeating one hard-coded line per column is going to be difficult to maintain.

I'd like to set up a dictionary that looks as follows:

regex_patterns = {
    'last_name' : r"[^A-Za-z \/\-\.']",
    'first_name' : r"[^A-Za-z \/\-\.']",
    'salary' : '[^0-9 ]'
}

Then, I'd like to pass a value to the .replace() function based on the name of the column in the df. It would look as follows:

df[['first_name']] = df[['first_name']].replace('<reference_to_regex_patterns_dictionary>', np.nan, regex=True)
df[['last_name']] = df[['last_name']].replace('<reference_to_regex_patterns_dictionary>', np.nan, regex=True)
df[['salary']] = df[['salary']].replace('<reference_to_regex_patterns_dictionary>', np.nan, regex=True)

How would I reference the name of the df column, then use that to look up the key in the dictionary and get its associated value?

For example, look up first_name, then access its dictionary value [^A-Za-z \/\-\.\'] and pass this value into .replace()?

Thanks!

P.S. If there is a more elegant approach, I'm all ears.

Comments (2)

说好的呢 2025-01-18 12:43:09

You can actually pass a nested dictionary of the form {'col': {'match': 'replacement'}} to replace.

In your case:

d = {k: {v: np.nan} for k, v in regex_patterns.items()}

# replace returns a new DataFrame, so assign the result back
df = df.replace(d, regex=True)
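
The sample data in the question contains no invalid names, so the effect is hard to see there. Below is a minimal, self-contained sketch of the nested-dictionary approach. The DataFrame values ("J0hn", "Jones #2") are hypothetical and chosen only so that some cells fail validation; the character class is the question's pattern written as a raw string.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "J0hn" and "Jones #2" are deliberately invalid.
df = pd.DataFrame({
    "first_name": ["J0hn", "Sally", "Mark"],
    "last_name": ["O'Connor", "Jones #2", "Williams"],
})

# One "forbidden character" pattern per column, keyed by column name.
regex_patterns = {
    "first_name": r"[^A-Za-z /\-.']",
    "last_name": r"[^A-Za-z /\-.']",
}

# Build {column: {pattern: replacement}} for DataFrame.replace.
nested = {col: {pattern: np.nan} for col, pattern in regex_patterns.items()}

# Any cell containing a forbidden character is replaced by NaN as a whole.
cleaned = df.replace(nested, regex=True)
print(cleaned)
```

Because the replacement value is NaN rather than a string, a cell is replaced entirely as soon as the pattern matches anywhere in it, which is what turns a "forbidden character" pattern into a validation check.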

怎樣才叫好 2025-01-18 12:43:08

One approach is to loop over the DataFrame's columns attribute and look up each column's pattern by name:

regex_patterns = {
    'last_name' : r"[^A-Za-z \/\-\.']",
    'first_name' : r"[^A-Za-z \/\-\.']",
    'salary' : '[^0-9 ]'
}
for column in df.columns:
    # Look up this column's pattern by its name
    df[column] = df[column].replace(regex_patterns[column], np.nan, regex=True)
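
One caveat with the loop above: per the pandas documentation, regular expressions in replace only act on strings, so a numeric column such as salary passes through unchanged. The sketch below is an alternative, not the answerer's method: it casts each column to string for the check only and blanks out failing cells with Series.mask. The sample values ("Sm1th", a negative salary) are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "Sm1th" and -50000 are deliberately invalid.
df = pd.DataFrame({
    "last_name": ["O'Connor", "Sm1th", "Lee"],
    "salary": [101000, -50000, 92750],
})

regex_patterns = {
    "last_name": r"[^A-Za-z /\-.']",
    "salary": r"[^0-9 ]",
}

# Iterate over the dictionary so columns without a pattern are left alone.
for column, pattern in regex_patterns.items():
    if column not in df.columns:
        continue
    # Cast to str only for the check, so numeric columns are validated too.
    invalid = df[column].astype(str).str.contains(pattern, regex=True)
    # mask() sets cells to NaN where the condition is True.
    df[column] = df[column].mask(invalid)

print(df)
```

Iterating over the dictionary rather than df.columns also avoids a KeyError when the DataFrame has columns that need no validation. Note that an integer column becomes float once it holds NaN.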