Find all unique combinations of pandas DataFrame columns

Posted 2025-02-09 19:33:45

I have a data-balancing problem: my images are multi-label, i.e. each image can belong to one class or to several. My label file has the columns A to G (one per class) and fn (the image name). Each class column holds 0 or 1, where 0 means the class is absent from the image and 1 means it is present. I want to split the dataframe into separate dataframes, one for each combination of classes.

[Image: labels dataframe (https://i.sstatic.net/woux2.png)]

The issue is that selecting a single combination requires spelling out a condition for every column, for example (here pp denotes the dataframe):

pp_A_B = pp[(pp['A'] == 1) & (pp['B'] == 1) & (pp['C'] == 0) & (pp['D'] == 0) & (pp['E'] == 0) & (pp['F'] == 0) & (pp['G'] == 0)]

Here, pp_A_B contains the images that have only classes A and B.

Done this way, I would have to write one such variable by hand for every combination. How can this be automated so that all possible combinations are obtained quickly?

亢潮 2025-02-16 19:33:45

Hi, you should use the groupby and get_group methods to extract the desired rows.

Here is an example for selecting the rows where A = 0 and B = 0, starting from simulated data:

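import numpy as np
import pandas as pd

# Simulate your data: 10000 rows of random 0/1 labels in 5 class columns
nb_rows = 10000
nb_columns = 5
df_array = np.random.randint(0, 2, size=(nb_rows, nb_columns))
df = pd.DataFrame(df_array)
df.columns = ["A", "B", "C", "D", "E"]
# Extra non-label column, standing in for the image name
df["infos"] = [f"Example of data {i}" for i in range(len(df))]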

Now apply the methods mentioned above:

df.groupby(["A", "B"]).get_group((0, 0))

This returns all rows where A = 0 and B = 0.

You can then iterate through every combination of the target columns like this:

import itertools

columns_to_explore = ["A", "B", "C"]
grouped = df.groupby(columns_to_explore)  # group once, reuse inside the loop
# Enumerate every 0/1 tuple: (0, 0, 0), (0, 0, 1), ..., (1, 1, 1)
for values in itertools.product([0, 1], repeat=len(columns_to_explore)):
    if values not in grouped.groups:
        continue  # get_group raises KeyError for a combination absent from the data
    df_selected = grouped.get_group(values)
    # Do something with df_selected ...
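
If only the combinations that actually occur in the data are needed, the loop above can be shortened by iterating over the groupby object itself. The following is a minimal sketch under assumed names: the toy table `labels`, the column list `class_columns`, and the dictionary `subsets` are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Toy label table (assumed layout): one row per image, one 0/1 column per class.
rng = np.random.default_rng(0)
labels = pd.DataFrame(rng.integers(0, 2, size=(200, 3)), columns=["A", "B", "C"])
labels["fn"] = [f"img_{i}.png" for i in range(len(labels))]

class_columns = ["A", "B", "C"]

# Iterating over a groupby yields (key, sub-frame) pairs,
# and only for combinations that are present in the data.
subsets = {key: group for key, group in labels.groupby(class_columns)}

# Look up one combination (A=1, B=1, C=0); fall back to an empty frame if it is absent.
only_a_b = subsets.get((1, 1, 0), labels.iloc[0:0])
```

Each key is a tuple such as (1, 1, 0), so a single dictionary replaces the hand-written variables, and absent combinations never raise an error.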

少钕鈤記 2025-02-16 19:33:45

Let us suppose you have the following data frame:

import pandas as pd
import random


attr = [0, 1]
N = 10000
rg = range(N)

df = pd.DataFrame(
    {
        'A': [random.choice(attr) for i in rg],
        'B': [random.choice(attr) for i in rg],
        'C': [random.choice(attr) for i in rg],
        'D': [random.choice(attr) for i in rg],
        'E': [random.choice(attr) for i in rg],
        'F': [random.choice(attr) for i in rg],
        'G': [random.choice(attr) for i in rg],
    }
)

Suppose you also want to store the sub-data-frame for every combination in a list. You can then write the following function to get, for each distinct combination of 0s and 1s, the indices of all rows that match it:

from numba import njit  # JIT compiler used to speed up the loop below

@njit
def _get_index_combinations(possible_combinations, values):
    index_outpus = []
    for combination in possible_combinations:
        mask = values == combination
        _temp = [i for i in range(len(mask)) if mask[i].all()]
        index_outpus.append(_temp)
    return index_outpus

possible_combinations = df.drop_duplicates().values
index_outpus = _get_index_combinations(possible_combinations, df.values)

Finally, you can split the data frame into chunks by iterating over the index lists:

sliced_dfs = [df.loc[df.index.isin(index)] for index in index_outpus]

If you then, for instance, run

print(sliced_dfs[0])

you will get the rows for one possible combination.
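
For comparison, the same kind of list can be built with pandas alone, without numba. This is a sketch of an alternative technique, not the method used in this answer; the names `demo` and `slices` are assumptions made for illustration. Passing sort=False makes groupby return the groups in order of first appearance, which matches the row order of drop_duplicates().

```python
import random

import pandas as pd

random.seed(0)
# Small stand-in for the data frame above: three 0/1 columns instead of seven.
demo = pd.DataFrame(
    {col: [random.choice([0, 1]) for _ in range(500)] for col in "ABC"}
)

# Group on all columns; every group holds the rows of one distinct 0/1 combination.
slices = [group for _, group in demo.groupby(list(demo.columns), sort=False)]

print(slices[0].head())
```

Whether this is faster than the compiled loop depends on the data size, so it is worth timing both on the real label file.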

Note:

You can even go further and create a separate, individually named data frame for each combination (instead of storing them in a list). If you are willing to use a quick-and-dirty approach, something like this works:

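import numpy as np

col_names = "ABCDEFG"
final_output = {"all_names": [], "all_querys": []}
for numerator, i in enumerate(possible_combinations):
    # Build the name from the columns that are 1, e.g. [1, 1, 0, 0, 0, 0, 1] -> "ABG"
    df_name = ""
    col_pos = np.where(i)[0]
    for pos in col_pos:
        df_name += col_names[pos]
    final_output["all_names"].append(f"df_{df_name}")
    # Generate the assignment statement as a string ...
    query_code = f"df_{df_name} = df.loc[df.index.isin({index_outpus[numerator]})]"
    final_output["all_querys"].append(query_code)
    # ... and execute it, which creates a variable such as df_ABG
    exec(query_code)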

This creates a dictionary named final_output that stores the names of all created data frames and the code used to create them. For example:

{'all_names': ['df_ABG', 'df_G', 'df_AC', ...], 'all_querys': [...]}

You can then print any of the frames listed in all_names, for example df_ABG, which gives:

      A  B  C  D  E  F  G
0     1  1  0  0  0  0  1
59    1  1  0  0  0  0  1
92    1  1  0  0  0  0  1
207   1  1  0  0  0  0  1
211   1  1  0  0  0  0  1
284   1  1  0  0  0  0  1
321   1  1  0  0  0  0  1
387   1  1  0  0  0  0  1
415   1  1  0  0  0  0  1
637   1  1  0  0  0  0  1
....
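
A possible alternative to exec, sketched here with assumed names, is to keep the frames in a dictionary keyed by the same kind of name. The toy frame `demo`, the dictionary `frames_by_name`, and the fallback key "df_none" for a row with no class present are all hypothetical choices made for this example.

```python
import pandas as pd

# Tiny illustrative label table with three classes.
demo = pd.DataFrame(
    {
        "A": [1, 1, 0, 0, 1],
        "B": [1, 0, 0, 1, 1],
        "G": [1, 0, 1, 0, 1],
    }
)
class_columns = list(demo.columns)

frames_by_name = {}
for key, group in demo.groupby(class_columns):
    # Join the names of the columns whose flag is 1, e.g. (1, 1, 1) -> "ABG".
    present = "".join(col for col, flag in zip(class_columns, key) if flag == 1)
    frames_by_name[f"df_{present}" if present else "df_none"] = group

print(sorted(frames_by_name))
print(frames_by_name["df_ABG"])
```

Looking frames up by key avoids generating variable names at runtime, and the available names can be listed directly from the dictionary.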
