Extracting rows from a dataframe so that every label is included
In the example dataframe shown below, I have 5 labels (class_name). There are 31 files (31 rows) in all to work with. I am attempting to extract 80% of the rows (the percentage can be variable), i.e. 24 rows as an integer. However, I want to ensure that I extract at least 1 row from each class_name. In my attempt I can only do this manually, which gets tedious once there are many more than 10 class_name values. Can you help me extract the correct percentage of rows while including at least 1 entry from each label (class_name)?
Here is my attempt:
import math
import pandas as pd
base_path = 'G:/PandasFileSeperation'
original_df = pd.read_csv(f'{base_path}/Book2.csv')
original_df = original_df.astype(str)
length = original_df.class_name.count()
length
# Get number of groups
dfg = original_df.groupby('class_name')
numgroups = dfg.ngroups
numgroups
# Get the sizes of each group
group_size = original_df.groupby('class_name').size()
# Get length of original dataframe
Total_dataset_size = len(original_df)
# Get number of Training samples
TrainPercent = 0.80
Train_size = int(Total_dataset_size * TrainPercent)
Train_size
# How can I change this to automatically change the label size length to give at least 1 row from each class?
Label_0_size = 5
Label_1_size = 3
Label_2_size = 7
Label_3_size = 7
Label_4_size = 2
# Training Dataset
label_percent = { 'pigs' : Label_0_size, 'goats' : Label_1_size, 'chickens' : Label_2_size, 'hens' : Label_3_size, 'sheep' : Label_4_size}
flag = True
for label, num_rows in label_percent.items():
    if label == 'pigs':
        row_num0 = Label_0_size
        df = original_df[original_df['class_name'] == label]
        df = df.head(row_num0)
    elif label == 'goats':
        row_num1 = Label_1_size
        df = original_df[original_df['class_name'] == label]
        df = df.head(row_num1)
    elif label == 'chickens':
        row_num2 = Label_2_size
        df = original_df[original_df['class_name'] == label]
        df = df.head(row_num2)
    elif label == 'hens':
        row_num3 = Label_3_size
        df = original_df[original_df['class_name'] == label]
        df = df.head(row_num3)
    else:
        row_num4 = Label_4_size
        df = original_df[original_df['class_name'] == label]
        df = df.head(row_num4)
    if flag:
        Train_df = df
        flag = False
    else:
        Train_df = pd.concat([Train_df, df])
Train_df.to_csv(f'{base_path}/Train_df.csv', encoding='utf-8')
Dataframe (the start, end, salience, fold, and original_class columns are empty in the sample and omitted here):

slice_file_name | fsID | classID | class_name
---|---|---|---
1-1000020520400.wav | 1 | 1 | pigs
1-100004024000001.wav | 1 | 1 | pigs
1-10000406050001.wav | 1 | 1 | pigs
1-1000050120400.wav | 1 | 1 | pigs
1-1000050320400.wav | 1 | 1 | pigs
1-1000050520400.wav | 1 | 2 | goats
1-10000601400001000.wav | 1 | 2 | goats
1-1000060340000.wav | 1 | 2 | goats
1-100006070500.wav | 1 | 3 | chickens
1-100007020800.wav | 1 | 3 | chickens
1-100007024000001.wav | 1 | 3 | chickens
1-1000070320400.wav | 1 | 3 | chickens
1-100007050800.wav | 1 | 3 | chickens
1-100007064000001.wav | 1 | 3 | chickens
1-100010620400.wav | 1 | 3 | chickens
1-100040620400.wav | 1 | 3 | chickens
1-10006020500.wav | 1 | 3 | chickens
1-10006030500.wav | 1 | 3 | chickens
1-100060520400.wav | 1 | 4 | hens
1-10007020500.wav | 1 | 4 | hens
2-100070420400.wav | 1 | 4 | hens
2-100070540000.wav | 1 | 4 | hens
2-1313131313004.wav | 1 | 4 | hens
2-1313131313043.wav | 1 | 4 | hens
2-1313131313044.wav | 1 | 5 | sheep
2-150002020500.wav | 1 | 5 | sheep
2-150002060800.wav | 1 | 5 | sheep
2-150004022040001.wav | 1 | 5 | sheep
2-15000406050001.wav | 1 | 5 | sheep
2-150006014000001.wav | 1 | 5 | sheep
2-150006024000001.wav | 1 | 5 | sheep
As a starting point, can we do a for loop that starts each class at 1 row, then increments each class in turn, checking the sum after each iteration, until the total across all classes = Train_size?
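That round-robin idea could be sketched as follows. The per-class caps are hardcoded here to mirror the example data; in practice they would come from the group_size series computed in the attempt above.

```python
from itertools import cycle

# Per-class row counts (caps); in practice:
# original_df.groupby('class_name').size().to_dict()
group_size = {'pigs': 5, 'goats': 3, 'chickens': 10, 'hens': 6, 'sheep': 7}
train_size = 24  # int(31 * 0.80)

# Start every class at 1 row, then add one row per class in round-robin
# order (skipping classes that are already exhausted) until train_size
# rows have been allocated in total.
alloc = {label: 1 for label in group_size}
for label in cycle(group_size):
    if sum(alloc.values()) >= train_size:
        break
    if alloc[label] < group_size[label]:
        alloc[label] += 1

# alloc -> {'pigs': 5, 'goats': 3, 'chickens': 6, 'hens': 5, 'sheep': 5}
```

alloc then holds how many rows to take from each class_name, and feeding it into the per-label head(...) loop above builds the training set without any hardcoded Label_N_size variables.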
I think this should help. Sample the dataframe twice: once using groupby to sample one row for each name, then sample from the rest of the dataframe randomly to complete the 80% training set.
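A minimal sketch of that two-step sampling, assuming pandas 1.1+ (for DataFrameGroupBy.sample); the toy dataframe here just mirrors the question's 31-row class distribution:

```python
import pandas as pd

# Toy dataframe with the same class distribution as the question's example.
df = pd.DataFrame({
    'slice_file_name': [f'file_{i}.wav' for i in range(31)],
    'class_name': (['pigs'] * 5 + ['goats'] * 3 + ['chickens'] * 10
                   + ['hens'] * 6 + ['sheep'] * 7),
})

train_size = int(len(df) * 0.80)  # 24

# Step 1: guarantee at least one row per class_name.
guaranteed = df.groupby('class_name').sample(n=1, random_state=0)

# Step 2: randomly top up from the rows not already taken.
rest = df.drop(guaranteed.index)
filler = rest.sample(n=train_size - len(guaranteed), random_state=0)

train_df = pd.concat([guaranteed, filler])
```

Because step 2 draws only from rows not chosen in step 1, no row is duplicated, and every class is represented even when a plain random 80% sample would have missed a rare class.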