Extracting rows from a dataframe so that every label is included

Posted 2025-02-11 11:44:24


In the example dataframe shown below, I have 5 labels (class_name). There are 31 files (31 rows) in all to work with.

I am attempting to extract 80% of the rows (this percentage can vary), which here works out to 24 rows (as an integer). However, I want to ensure that I extract at least 1 row from each class_name.

In my attempt, I can only do this manually. This approach gets tedious once there are many more than 10 class_name values. Can you help me extract the correct % of rows while including at least 1 entry from each label (class_name)?

Here is my attempt:

import math
import pandas as pd

base_path = 'G:/PandasFileSeperation'
original_df = pd.read_csv(f'{base_path}/Book2.csv')

original_df = original_df.astype(str)
length = original_df.class_name.count()
length

# Get number of groups
dfg = original_df.groupby('class_name')
numgroups = dfg.ngroups
numgroups

# Get the sizes of each group
group_size =  original_df.groupby('class_name').size()

# Get length of original dataframe
Total_dataset_size = len(original_df)

# Get number of Training samples
TrainPercent = 0.80

Train_size = int(Total_dataset_size * TrainPercent)
Train_size

# How can I change this to automatically change the label size length to give at least 1 row from each class?
Label_0_size = 5
Label_1_size = 3
Label_2_size = 7
Label_3_size = 7
Label_4_size = 2

# Training Dataset
label_percent = { 'pigs' : Label_0_size, 'goats' : Label_1_size, 'chickens' : Label_2_size, 'hens' : Label_3_size, 'sheep' : Label_4_size}

flag = True
for label, num_rows in label_percent.items():
  
  row_num = num_rows
  
  if label == 'pigs':
    row_num0 = Label_0_size
    df = original_df[original_df['class_name'] == label]
    df = df.head(row_num0)

  elif label == 'goats':
    row_num1 = Label_1_size
    df = original_df[original_df['class_name'] == label]
    df = df.head(row_num1)
    
    
  elif label == 'chickens':
    row_num2 = Label_2_size
    df = original_df[original_df['class_name'] == label]
    df = df.head(row_num2)
    
  elif label == 'hens':
    row_num3 = Label_3_size
    df = original_df[original_df['class_name'] == label]
    df = df.head(row_num3)
    
  else:
    row_num4 = Label_4_size
    df = original_df[original_df['class_name'] == label]
    df = df.head(row_num4)

  if flag == True:
    Train_df = df
    flag = False
  else:
    Train_df = pd.concat([Train_df, df])

Train_df.to_csv(f'{base_path}/Train_df.csv', encoding='utf-8')
Dataframe:

slice_file_name           fold  classID  class_name
1-1000020520400.wav       1     1        pigs
1-100004024000001.wav     1     1        pigs
1-10000406050001.wav      1     1        pigs
1-1000050120400.wav       1     1        pigs
1-1000050320400.wav       1     1        pigs
1-1000050520400.wav       1     2        goats
1-10000601400001000.wav   1     2        goats
1-1000060340000.wav       1     2        goats
1-100006070500.wav        1     3        chickens
1-100007020800.wav        1     3        chickens
1-100007024000001.wav     1     3        chickens
1-1000070320400.wav       1     3        chickens
1-100007050800.wav        1     3        chickens
1-100007064000001.wav     1     3        chickens
1-100010620400.wav        1     3        chickens
1-100040620400.wav        1     3        chickens
1-10006020500.wav         1     3        chickens
1-10006030500.wav         1     3        chickens
1-100060520400.wav        1     4        hens
1-10007020500.wav         1     4        hens
2-100070420400.wav        1     4        hens
2-100070540000.wav        1     4        hens
2-1313131313004.wav       1     4        hens
2-1313131313043.wav       1     4        hens
2-1313131313044.wav       1     5        sheep
2-150002020500.wav        1     5        sheep
2-150002060800.wav        1     5        sheep
2-150004022040001.wav     1     5        sheep
2-15000406050001.wav      1     5        sheep
2-150006014000001.wav     1     5        sheep
2-150006024000001.wav     1     5        sheep

As a starting point, can we do a for loop that starts with each class = 1, then increments each class, checking the sum after each iteration until the total across classes = Train_size?
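That incremental idea can be sketched roughly as follows. This is a minimal sketch, not the asker's code: `allocate_per_class` is a hypothetical helper name, and the group sizes mirror the 31-row, 5-class example rather than the real CSV.

```python
import pandas as pd

def allocate_per_class(group_size, train_size):
    """Start every class at 1 row, then add one row at a time
    (round-robin, capped at each class's available rows) until
    the allocations sum to train_size."""
    alloc = {label: 1 for label in group_size.index}
    while sum(alloc.values()) < train_size:
        grew = False
        for label in group_size.index:
            if sum(alloc.values()) >= train_size:
                break
            if alloc[label] < group_size[label]:
                alloc[label] += 1
                grew = True
        if not grew:  # every class exhausted; train_size unreachable
            break
    return alloc

# Hypothetical sizes matching the question's five classes (31 rows total)
sizes = pd.Series({'pigs': 5, 'goats': 3, 'chickens': 10,
                   'hens': 6, 'sheep': 7})
alloc = allocate_per_class(sizes, int(0.8 * sizes.sum()))  # target 24 rows
```

With the allocation in hand, each class's rows could be taken as in the question, e.g. `original_df[original_df['class_name'] == label].head(alloc[label])`, without hand-maintaining a `Label_N_size` variable per class.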


Comments (1)

俏︾媚 2025-02-18 11:44:24


I think this should help. Sample the dataframe twice: once using groupby to sample one row for each class_name, then sample from the rest of the dataframe randomly to complete the 80% training set.

TrainPercent = 0.8
# sample one row for each class_name (5 rows)
one_each = original_df.groupby('class_name').sample(n=1)
# from the rest of the rows, sample int(0.8 * len(original_df)) - len(one_each) rows (19 rows)
rest = original_df.loc[~original_df.index.isin(one_each.index)].sample(n=int(TrainPercent * len(original_df)) - len(one_each))
# concatenate the two
res = pd.concat([one_each, rest])
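The approach above can be sanity-checked on a toy frame. This is a hypothetical stand-in for the asker's CSV (same class counts as the example dataframe), with a fixed `random_state` added only for reproducibility:

```python
import pandas as pd

# Hypothetical stand-in data: 31 rows across the question's 5 classes
toy = pd.DataFrame({
    'class_name': ['pigs'] * 5 + ['goats'] * 3 + ['chickens'] * 10
                  + ['hens'] * 6 + ['sheep'] * 7
})

TrainPercent = 0.8
# one guaranteed row per class
one_each = toy.groupby('class_name').sample(n=1, random_state=0)
# top up with random rows from the remainder, excluding rows already taken
rest = toy.loc[~toy.index.isin(one_each.index)].sample(
    n=int(TrainPercent * len(toy)) - len(one_each), random_state=0)
res = pd.concat([one_each, rest])

print(len(res))                     # 24 -> 80% of 31 rows
print(res['class_name'].nunique())  # 5  -> every class represented
```

Because the `rest` sample explicitly excludes the indices already in `one_each`, no row is picked twice, and the per-class guarantee holds no matter how the random top-up falls.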