Handling an imbalanced object dataset with the SMOTE technique

Published 2025-01-19 16:00:51


Here I have an SDF file with my training data. Each sample consists of 3 features, and the last feature is my output.

I read my dataset using this function:

import numpy as np
from tqdm import tqdm

def read_sdf(file):
   with open(file, 'r') as rf:
       content = rf.read()
   # SDF records are separated by a line containing '$$$$'
   samples = content.split('$$$$')

   def parse_sample(s):
       lines = s.splitlines()
       links = []
       nodes = []
       label = 0
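       for l in lines:
           # A line holding only '1.0' or '0.0' sets the sample's label
           if l.strip() == '1.0':
               label = 1
           if l.strip() == '0.0':
               label = 0
           # Lines starting with four spaces are treated as atom lines;
           # the 4th token is taken as the node (element symbol)
           if l.startswith('    '):
               feature = l.split()
               node = feature[3]
               nodes.append(node)
           # Other lines starting with a space are treated as bond lines
           elif l.startswith(' '):
               lnk = l.split()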
               # edge: (from, to,) (1-based index)
               if int(lnk[0]) - 1 < len(nodes):
                   links.append((
                       int(lnk[0])-1, 
                       int(lnk[1])-1, # zero-based index
                       # int(lnk[2]) ignore edge weight
                   ))
       return nodes, np.array(links), label

   # Skip empty chunks, such as the one after the final record delimiter
   return [parse_sample(s) for s in tqdm(samples) if s.strip()]

training_set = np.array(read_sdf('../input/gcn-data/train.sdf'), dtype=object)

# Print the first sample from the dataset
print(training_set[0])

The output was:

[list(['S', 'O', 'O', 'O', 'O', 'N', 'N', 'N', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'])
 array([[ 0,  8],
        [ 0, 14],
        [ 1, 10],
        [ 2, 11],
        [ 3,  7],
        [ 4,  7],
        [ 5,  9],
        [ 5, 14],
        [ 6, 14],
        [ 6, 17],
        [ 7, 22],
        [ 8,  9],
        [ 8, 10],
        [ 9, 11],
        [10, 12],
        [11, 13],
        [12, 13],
        [12, 15],
        [13, 16],
        [15, 18],
        [16, 19],
        [17, 20],
        [17, 21],
        [18, 19],
        [20, 23],
        [21, 24],
        [22, 23],
        [22, 24]]), 0]

My problem is that this dataset is imbalanced: it has 23806 samples with label 0 and 1218 samples with label 1.
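To put the imbalance in numbers, a quick calculation from the two counts above gives the majority-to-minority ratio and the number of extra label-1 samples an oversampler would have to produce for a 1:1 balance. The variable names are only illustrative.

```python
n_negative = 23806  # samples with label 0
n_positive = 1218   # samples with label 1

total = n_negative + n_positive
ratio = n_negative / n_positive          # majority samples per minority sample
positive_share = n_positive / total      # fraction of the dataset with label 1
extra_needed = n_negative - n_positive   # new label-1 samples needed for 1:1

print(f"total={total}, ratio={ratio:.1f}:1, "
      f"positive share={positive_share:.1%}, extra needed={extra_needed}")
```

That is roughly 19.5 samples of label 0 for every sample of label 1.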

So I tried to solve this problem using the SMOTE technique:

from imblearn.over_sampling import SMOTE

oversample = SMOTE()
training_set[:,0:-1],training_set[:,-1] = oversample.fit_resample(training_set[:,0:-1],training_set[:,-1])

But then I got this error, and I think it is because the 2 input features here have object dtype.

ValueError: Unknown label type: 'unknown'
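A likely contributor to this message is the label column itself: because the whole array was built with dtype=object, the slice training_set[:, -1] is also an object array, and as far as I understand scikit-learn's target-type check reports 'unknown' for object arrays whose elements are not strings. The sketch below uses a small stand-in array (an assumption, not the real data) to show the dtype and an integer cast.

```python
import numpy as np

# Stand-in for training_set[:, -1]: integer labels stored in an object array.
y_object = np.array([0, 1, 0, 0], dtype=object)

# Casting gives a plain integer array that label checks can interpret.
y_int = y_object.astype(int)

print(y_object.dtype, y_int.dtype)
```

Casting the labels alone probably will not make SMOTE work here, since the two remaining columns still hold lists and arrays rather than fixed-length numeric vectors, which SMOTE needs in order to interpolate between samples.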

Is there any way to oversample this dataset?

Edit 1: There is no need to read through the read_sdf function. It only reads the SDF file, and it is not the source of the problem.
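For reference, one approach that does not need numeric feature vectors is plain random oversampling, which duplicates minority-class samples instead of interpolating new ones the way SMOTE does. The sketch below is an assumption about what could work here, not a tested fix for this dataset: random_oversample and its seed parameter are hypothetical names, and the toy samples only mimic the (nodes, links, label) layout.

```python
import random
from collections import Counter

def random_oversample(samples, seed=0):
    """Duplicate minority-class samples until every class matches the largest.

    `samples` is a list of (nodes, links, label) tuples. This helper is
    hypothetical and not part of any library.
    """
    rng = random.Random(seed)
    by_label = {}
    for sample in samples:
        by_label.setdefault(sample[-1], []).append(sample)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw extra copies with replacement to reach the majority size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced

# Toy data in the same layout as the parsed SDF samples.
toy = [(['C', 'O'], [(0, 1)], 0)] * 6 + [(['C', 'N'], [(0, 1)], 1)] * 2
balanced = random_oversample(toy)
print(Counter(sample[-1] for sample in balanced))
```

If this route is taken, it is usually applied to the training split only, so that duplicated samples do not end up in the validation data.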
