python - Replace all elements in a dataframe that do not contain certain words

Posted on 2025-01-18 09:02:35 · 1176 characters · 3 views · 0 comments

I have a very large dataframe and I want to substitute all elements that do not contain a specific word with NaN (while keeping the first "id" column unchanged).

For example:

index  id    text1                        text2                        ...
1      123   {'"key'": '"living_space'"   '"value'": '"01.04.2022'"}   ...
2      124   {'"key'": '"rooms'"          '"value'": '"3'"}            ...
3      125   23                           {'"key'": '"rooms'"          ...
4      126   45                           Apartment sold               ...

I want to keep all elements in the dataframe that contain the words key or value and substitute all else with nan, so I would get a dataframe like:

index  id    text1                        text2                        ...
1      123   {'"key'": '"living_space'"   '"value'": '"01.04.2022'"}   ...
2      124   {'"key'": '"rooms'"          '"value'": '"3'"}            ...
3      125   nan                          {'"key'": '"rooms'"          ...
4      126   nan                          nan                          ...

I have tried using the following code, but it just clears the whole dataset.

l1 = ['key', 'value']
df.iloc[:,1:] = df.iloc[:,1:].applymap(lambda x: x if set(x.split()).intersection(l1) else '')

Thanks in advance.

Comments (1)

在巴黎塔顶看东京樱花 2025-01-25 09:02:35

Consider the following approach to solve the problem. It consists of two parts. (1) The logic that decides whether to keep or erase a value is implemented in the function substring_filter: it simply checks whether the target string contains any word from words. (2) The actual filtering is performed with np.where, a very convenient helper function from numpy.

import numpy as np
import pandas as pd


def substring_filter(target, words):
    # True if at least one of the words occurs as a substring of target.
    return any(word in target for word in words)


if __name__ == '__main__':

    df = pd.DataFrame({
        'A': [1, 2, 3, 4],
        'B': [True, False, False, True],
        'C': ['{"key": 1}', '{"value": 2}', 'text', 'abc']})

    words_to_search = ('key', 'value')
    # Keep the values of column C that match; replace the rest with None.
    df.loc[:, 'C'] = np.where(
        df.loc[:, 'C'].apply(lambda x: substring_filter(x, words_to_search)),
        df.loc[:, 'C'],
        None)
    print(df)

Result is:

   A      B             C
0  1   True    {"key": 1}
1  2  False  {"value": 2}
2  3  False          None
3  4   True          None
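
The example above filters a single string column. The question asks for every column except the first, with NaN as the replacement and with some numeric cells mixed in. A minimal sketch of that variant follows. It assumes the cells are plain strings or numbers; the sample frame and the helper name contains_any are made up for illustration. It also hints at why the original attempt cleared everything: x.split() yields tokens such as {"key": that still carry punctuation, so they never equal key exactly, whereas a substring test matches them.

```python
import numpy as np
import pandas as pd

words = ("key", "value")

# Hypothetical sample data loosely modelled on the question.
df = pd.DataFrame({
    "id": [123, 124, 125, 126],
    "text1": ['{"key": "living_space"', '{"key": "rooms"', 23, 45],
    "text2": ['"value": "01.04.2022"}', '"value": "3"}',
              '{"key": "rooms"', "Apartment sold"],
})


def contains_any(cell, words):
    # Non-string cells (numbers, NaN) are treated as "no match".
    return isinstance(cell, str) and any(word in cell for word in words)


# Every column except the first ("id") is filtered.
text_cols = df.columns[1:]

# Build a boolean mask column by column; Series.map is used here
# instead of DataFrame.applymap, which newer pandas releases deprecate.
keep = df[text_cols].apply(
    lambda col: col.map(lambda cell: contains_any(cell, words)))

# DataFrame.where keeps values where the mask is True and puts NaN elsewhere.
df[text_cols] = df[text_cols].where(keep)
print(df)
```

Because the mask is computed once and applied with where, the id column is never touched, and swapping in a stricter test (for example a regular expression) only requires changing contains_any.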