Handling an imbalanced object-dtype dataset with SMOTE
Here I have an SDF file, which is my training data; each sample consists of 3 features, and the last feature is my output.
I read my dataset using this function:
    import numpy as np
    from tqdm import tqdm

    def read_sdf(file):
        with open(file, 'r') as rf:
            content = rf.read()
        samples = content.split('$$$$')  # SDF records are delimited by "$$$$"

        def parse_sample(s):
            lines = s.splitlines()
            links = []
            nodes = []
            label = 0
            for l in lines:
                if l.strip() == '1.0':
                    label = 1
                if l.strip() == '0.0':
                    label = 0
                # Atom lines are indented more deeply than bond lines; the
                # exact space counts below are assumptions, since the original
                # formatting collapsed the leading whitespace.
                if l.startswith('    '):
                    feature = l.split()
                    node = feature[3]   # atom symbol, e.g. 'C', 'N', 'O'
                    nodes.append(node)
                elif l.startswith(' '):
                    lnk = l.split()
                    # edge: (from, to), 1-based indices in the file
                    if int(lnk[0]) - 1 < len(nodes):
                        links.append((
                            int(lnk[0]) - 1,
                            int(lnk[1]) - 1,  # converted to zero-based indices
                            # int(lnk[2])     # edge weight, ignored
                        ))
            return nodes, np.array(links), label

        return [parse_sample(s) for s in tqdm(samples) if s.strip()]
    training_set = np.array(read_sdf('../input/gcn-data/train.sdf'), dtype=object)

    # print the first sample from the dataset
    print(training_set[0])
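As a side note, building the array with `dtype=object` means the label column is also object-dtype, which scikit-learn's target validation rejects. A minimal sketch with a toy stand-in for the parsed samples (hypothetical data, not the real SDF file):

```python
import numpy as np

# Toy stand-in for the parsed samples: (nodes, links, label) rows.
row = (['C', 'O'], np.array([[0, 1]]), 0)
training_set = np.array([row, row], dtype=object)

labels = training_set[:, -1]
print(labels.dtype)                   # object
print(labels.astype(int).dtype.kind)  # 'i' -- a valid classification target
```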
And the output was:
[list(['S', 'O', 'O', 'O', 'O', 'N', 'N', 'N', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'])
array([[ 0, 8],
[ 0, 14],
[ 1, 10],
[ 2, 11],
[ 3, 7],
[ 4, 7],
[ 5, 9],
[ 5, 14],
[ 6, 14],
[ 6, 17],
[ 7, 22],
[ 8, 9],
[ 8, 10],
[ 9, 11],
[10, 12],
[11, 13],
[12, 13],
[12, 15],
[13, 16],
[15, 18],
[16, 19],
[17, 20],
[17, 21],
[18, 19],
[20, 23],
[21, 24],
[22, 23],
[22, 24]]), 0]
My problem is that this dataset is imbalanced: it has 23806 samples of class 0 and only 1218 samples of class 1.
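The imbalance can be verified directly from the label column; a hypothetical check using synthetic labels with the same counts, since the real file isn't at hand:

```python
import numpy as np

# Synthetic labels with the same class counts as the real dataset.
labels = np.array([0] * 23806 + [1] * 1218)
values, counts = np.unique(labels, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))  # {0: 23806, 1: 1218}
```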
So I tried to solve this problem using the SMOTE technique:

    from imblearn.over_sampling import SMOTE

    oversample = SMOTE()
    training_set[:, 0:-1], training_set[:, -1] = oversample.fit_resample(
        training_set[:, 0:-1], training_set[:, -1])
But then I got this error, which I think is because the two input features here are of object type:

    ValueError: Unknown label type: 'unknown'

So, are there any solutions for oversampling this dataset?
Edit 1: Don't bother reading and understanding the read_sdf function; it does nothing but read the SDF file, and there isn't any problem with it.