Using RegEx in the replace() function to remap values in Pandas columns based on dictionary key/value pairs
I have the following Pandas dataframe:
import numpy as np
import pandas as pd

foo = {
"first_name" : ["John", "Sally", "Mark", "Jane", "Phil"],
"last_name" : ["O'Connor", "Jones P.", "Williams", "Connors", "Lee"],
"salary" : [101000, 50000, 56943, 330532, 92750],
}
df = pd.DataFrame(foo)
I'd like to be able to validate column data using a RegEx pattern, then replace with NaN if the validation fails.
To do this, I use the following hard-coded RegEx patterns in the .replace() method:
df[['first_name']] = df[['first_name']].replace('[^A-Za-z \/\-\.\']', np.nan, regex=True)
df[['last_name']] = df[['last_name']].replace('[^A-Za-z \/\-\.\']', np.nan, regex=True)
df[['salary']] = df[['salary']].replace('[^0-9 ]', np.nan, regex=True)
This approach works, but I have 15-20 columns, so it is going to be difficult to maintain.
I'd like to set up a dictionary that looks as follows:
regex_patterns = {
'last_name' : '[^A-Za-z \/\-\.\']',
'first_name' : '[^A-Za-z \/\-\.\']',
'salary' : '[^0-9 ]'
}
Then, I'd like to pass a value to the .replace() function based on the name of the column in the df. It would look as follows:
df[['first_name']] = df[['first_name']].replace('<reference_to_regex_patterns_dictionary>', np.nan, regex=True)
df[['last_name']] = df[['last_name']].replace('<reference_to_regex_patterns_dictionary>', np.nan, regex=True)
df[['salary']] = df[['salary']].replace('<reference_to_regex_patterns_dictionary>', np.nan, regex=True)
How would I reference the name of the df column, then use that to look up the key in the dictionary and get its associated value?
For example, look up first_name, then access its dictionary value [^A-Za-z \/\-\.\'] and pass this value into .replace()?
Thanks!
P.S. if there is a more elegant approach, I'm all ears.
Comments (2)
You can actually pass a nested dictionary of the form {'col': {'match': 'replacement'}} to replace.
In your case:
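The code that originally followed this answer did not survive on this page, so here is a minimal sketch of the nested-dictionary idea rather than the answerer's exact code. It assumes the regex_patterns dictionary from the question (rewritten as equivalent raw strings), and the sample rows "S@lly" and "W1lliams" are made-up invalid values added only so that something actually gets replaced.

```python
import numpy as np
import pandas as pd

# Sample frame based on the question; "S@lly" and "W1lliams" are
# hypothetical invalid values added for illustration.
df = pd.DataFrame({
    "first_name": ["John", "S@lly", "Mark"],
    "last_name": ["O'Connor", "Jones P.", "W1lliams"],
    "salary": [101000, 50000, 56943],
})

# Same character classes as in the question, written as raw strings.
regex_patterns = {
    "last_name": r"[^A-Za-z /\-.']",
    "first_name": r"[^A-Za-z /\-.']",
    "salary": r"[^0-9 ]",
}

# Build {'col': {'pattern': replacement}} from the flat dictionary.
nested = {col: {pattern: np.nan} for col, pattern in regex_patterns.items()}

# With regex=True, each pattern is only applied to its own column.
cleaned = df.replace(nested, regex=True)
print(cleaned)
```

Because the replacement is NaN rather than a string, a cell that contains any disallowed character should be replaced as a whole. Note that regex replacement only acts on string values, so the integer salary column is left as it is; to validate it this way it would first need to be converted to strings.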
One approach would be using the columns attribute:
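The answer's code is missing from this page, so the following is only a sketch of what a loop over df.columns could look like, not the answerer's original. The helper name blank_invalid is hypothetical, the patterns are the question's regex_patterns written as raw strings, and "S@lly" and "W1lliams" are made-up invalid values for demonstration.

```python
import numpy as np
import pandas as pd


def blank_invalid(df, patterns):
    """Return a copy of df where cells matching a column's pattern become NaN.

    `patterns` maps column names to regex strings; columns without an
    entry are left unchanged. (Hypothetical helper, for illustration.)
    """
    out = df.copy()
    for col in out.columns:
        pattern = patterns.get(col)  # look up the pattern by column name
        if pattern is not None:
            out[col] = out[col].replace(pattern, np.nan, regex=True)
    return out


regex_patterns = {
    "last_name": r"[^A-Za-z /\-.']",
    "first_name": r"[^A-Za-z /\-.']",
    "salary": r"[^0-9 ]",
}

# "S@lly" and "W1lliams" are hypothetical invalid values.
df = pd.DataFrame({
    "first_name": ["John", "S@lly", "Mark"],
    "last_name": ["O'Connor", "Jones P.", "W1lliams"],
    "salary": [101000, 50000, 56943],
})

cleaned = blank_invalid(df, regex_patterns)
print(cleaned)
```

Compared with passing a nested dictionary, the explicit loop is a little longer but makes it easy to skip columns that have no pattern or to add per-column handling later. As before, the numeric salary column is not affected, since regex replacement only applies to strings.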