Extracting rows from a dataframe so that every label is included

Posted 2025-02-11 11:44:24


In the example dataframe shown below, I have 5 labels (class_name). There are 31 files (31 rows) in all to work with.

I am attempting to extract 80% of the rows (this percentage can vary), which here works out to 24 rows (as an integer). However, I want to ensure that I extract at least 1 row from each class_name.

In my attempt, I can only do this manually. This approach gets tedious once there are many more than 10 class_name values. Can you help me extract the correct % of rows while including at least 1 entry from each label (class_name)?

Here is my attempt:

import math
import pandas as pd

base_path = 'G:/PandasFileSeperation'
original_df = pd.read_csv(f'{base_path}/Book2.csv')

original_df = original_df.astype(str)
length = original_df.class_name.count()
length

# Get number of groups
dfg = original_df.groupby('class_name')
numgroups = dfg.ngroups
numgroups

# Get the sizes of each group
group_size =  original_df.groupby('class_name').size()

# Get length of original dataframe
Total_dataset_size = len(original_df)

# Get number of Training samples
TrainPercent = 0.80

Train_size = int(Total_dataset_size * TrainPercent)
Train_size

# How can I change this to automatically change the label size length to give at least 1 row from each class?
Label_0_size = 5
Label_1_size = 3
Label_2_size = 7
Label_3_size = 7
Label_4_size = 2

# Training Dataset
label_percent = { 'pigs' : Label_0_size, 'goats' : Label_1_size, 'chickens' : Label_2_size, 'hens' : Label_3_size, 'sheep' : Label_4_size}

flag = True
for label, num_rows in label_percent.items():
  
  row_num = num_rows
  
  if label == 'pigs':
    row_num0 = Label_0_size
    df = original_df[original_df['class_name'] == label]
    df = df.head(row_num0)

  elif label == 'goats':
    row_num1 = Label_1_size
    df = original_df[original_df['class_name'] == label]
    df = df.head(row_num1)
    
    
  elif label == 'chickens':
    row_num2 = Label_2_size
    df = original_df[original_df['class_name'] == label]
    df = df.head(row_num2)
    
  elif label == 'hens':
    row_num3 = Label_3_size
    df = original_df[original_df['class_name'] == label]
    df = df.head(row_num3)
    
  else:
    row_num4 = Label_4_size
    df = original_df[original_df['class_name'] == label]
    df = df.head(row_num4)

  if flag == True:
    Train_df = df
    flag = False
  else:
    Train_df = pd.concat([Train_df, df])

Train_df.to_csv(f'{base_path}/Train_df.csv', encoding='utf-8')
Dataframe:

slice_file_name           fold  classID  class_name
1-1000020520400.wav       1     1        pigs
1-100004024000001.wav     1     1        pigs
1-10000406050001.wav      1     1        pigs
1-1000050120400.wav       1     1        pigs
1-1000050320400.wav       1     1        pigs
1-1000050520400.wav       1     2        goats
1-10000601400001000.wav   1     2        goats
1-1000060340000.wav       1     2        goats
1-100006070500.wav        1     3        chickens
1-100007020800.wav        1     3        chickens
1-100007024000001.wav     1     3        chickens
1-1000070320400.wav       1     3        chickens
1-100007050800.wav        1     3        chickens
1-100007064000001.wav     1     3        chickens
1-100010620400.wav        1     3        chickens
1-100040620400.wav        1     3        chickens
1-10006020500.wav         1     3        chickens
1-10006030500.wav         1     3        chickens
1-100060520400.wav        1     4        hens
1-10007020500.wav         1     4        hens
2-100070420400.wav        1     4        hens
2-100070540000.wav        1     4        hens
2-1313131313004.wav       1     4        hens
2-1313131313043.wav       1     4        hens
2-1313131313044.wav       1     5        sheep
2-150002020500.wav        1     5        sheep
2-150002060800.wav        1     5        sheep
2-150004022040001.wav     1     5        sheep
2-15000406050001.wav      1     5        sheep
2-150006014000001.wav     1     5        sheep
2-150006024000001.wav     1     5        sheep

As a starting point, can we do a for loop that starts with each class = 1, then increments each class, checking the sum after each iteration until the total across classes = Train_size?
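That incremental idea can be sketched roughly as follows. This is a minimal sketch, not the asker's code: `allocate_per_class` is a hypothetical helper name, and the group sizes mirror the 31-row, 5-class example rather than the real CSV.

```python
import pandas as pd

def allocate_per_class(group_size, train_size):
    """Start every class at 1 row, then add one row at a time
    (round-robin, capped at each class's available rows) until
    the allocations sum to train_size."""
    alloc = {label: 1 for label in group_size.index}
    while sum(alloc.values()) < train_size:
        grew = False
        for label in group_size.index:
            if sum(alloc.values()) >= train_size:
                break
            if alloc[label] < group_size[label]:
                alloc[label] += 1
                grew = True
        if not grew:  # every class exhausted; train_size unreachable
            break
    return alloc

# Hypothetical sizes matching the question's five classes (31 rows total)
sizes = pd.Series({'pigs': 5, 'goats': 3, 'chickens': 10,
                   'hens': 6, 'sheep': 7})
alloc = allocate_per_class(sizes, int(0.8 * sizes.sum()))  # target 24 rows
```

With the allocation in hand, each class's rows could be taken as in the question, e.g. `original_df[original_df['class_name'] == label].head(alloc[label])`, without hand-maintaining a `Label_N_size` variable per class.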


Comments (1)

俏︾媚 2025-02-18 11:44:24


I think this should help. Sample the dataframe twice: once using groupby to sample one row for each class_name, then sample from the rest of the dataframe randomly to complete the 80% training set.

TrainPercent = 0.8
# sample one row for each class_name (5 rows)
one_each = original_df.groupby('class_name').sample(n=1)
# from the rest of the rows, sample int(0.8 * len(original_df)) - len(one_each) rows (19 rows)
rest = original_df.loc[~original_df.index.isin(one_each.index)].sample(n=int(TrainPercent * len(original_df)) - len(one_each))
# concatenate the two
res = pd.concat([one_each, rest])
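The approach above can be sanity-checked on a toy frame. This is a hypothetical stand-in for the asker's CSV (same class counts as the example dataframe), with a fixed `random_state` added only for reproducibility:

```python
import pandas as pd

# Hypothetical stand-in data: 31 rows across the question's 5 classes
toy = pd.DataFrame({
    'class_name': ['pigs'] * 5 + ['goats'] * 3 + ['chickens'] * 10
                  + ['hens'] * 6 + ['sheep'] * 7
})

TrainPercent = 0.8
# one guaranteed row per class
one_each = toy.groupby('class_name').sample(n=1, random_state=0)
# top up with random rows from the remainder, excluding rows already taken
rest = toy.loc[~toy.index.isin(one_each.index)].sample(
    n=int(TrainPercent * len(toy)) - len(one_each), random_state=0)
res = pd.concat([one_each, rest])

print(len(res))                     # 24 -> 80% of 31 rows
print(res['class_name'].nunique())  # 5  -> every class represented
```

Because the `rest` sample explicitly excludes the indices already in `one_each`, no row is picked twice, and the per-class guarantee holds no matter how the random top-up falls.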