Handling an imbalanced dataset of objects with SMOTE

Posted on 2025-01-19 16:00:51

I have an SDF file containing my training data. Each sample consists of 3 fields, and the last field is the output label.

I read the dataset with this function:

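import numpy as np
from tqdm import tqdm

def read_sdf(file):
   with open(file, 'r') as rf:
       content = rf.read()
   # Records in an SDF file are separated by a line containing '$$$$'
   samples = content.split('$$$$')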

   def parse_sample(s):
       lines = s.splitlines()
       links = []
       nodes = []
       label = 0
       for l in lines:
           if l.strip() == '1.0':
               label = 1
           if l.strip() == '0.0':
               label = 0
           if l.startswith('    '):
               feature = l.split()
               node = feature[3]
               nodes.append(node)
           elif l.startswith(' '):
               lnk = l.split()
               # edge: (from, to,) (1-based index)
               if int(lnk[0]) - 1 < len(nodes):
                   links.append((
                       int(lnk[0])-1, 
                       int(lnk[1])-1, # zero-based index
                       # int(lnk[2]) ignore edge weight
                   ))
       return nodes, np.array(links), label

   # Skip empty chunks, e.g. the text after the final delimiter
   return [parse_sample(s) for s in tqdm(samples) if s.strip()]

training_set = np.array(read_sdf('../input/gcn-data/train.sdf'),dtype=object)

# Print the first sample from the dataset
print(training_set[0])

The output is:

[list(['S', 'O', 'O', 'O', 'O', 'N', 'N', 'N', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'])
 array([[ 0,  8],
        [ 0, 14],
        [ 1, 10],
        [ 2, 11],
        [ 3,  7],
        [ 4,  7],
        [ 5,  9],
        [ 5, 14],
        [ 6, 14],
        [ 6, 17],
        [ 7, 22],
        [ 8,  9],
        [ 8, 10],
        [ 9, 11],
        [10, 12],
        [11, 13],
        [12, 13],
        [12, 15],
        [13, 16],
        [15, 18],
        [16, 19],
        [17, 20],
        [17, 21],
        [18, 19],
        [20, 23],
        [21, 24],
        [22, 23],
        [22, 24]]), 0]

My problem is that this dataset is imbalanced: it has 23806 samples with label 0 and only 1218 samples with label 1.

So I tried to solve this problem with SMOTE:

from imblearn.over_sampling import SMOTE

oversample = SMOTE()
training_set[:,0:-1],training_set[:,-1] = oversample.fit_resample(training_set[:,0:-1],training_set[:,-1])

But then I got the error below. I think it is because the 2 input features are of object type.

ValueError: Unknown label type: 'unknown'
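
One way to check that guess is to look at the dtype of the label column. Because `training_set` is built with `dtype=object`, the slice `training_set[:,-1]` is an object array too, and scikit-learn's target validation (which imbalanced-learn relies on) reports non-string object arrays as type 'unknown'. The sketch below uses a hypothetical two-sample stand-in called `toy_set`, not the real data, and only shows the dtype before and after a cast:

```python
import numpy as np

# Hypothetical stand-in for training_set: each row is (nodes, links, label).
rows = [
    (['C', 'O'], np.array([[0, 1]]), 0),
    (['C', 'N', 'C'], np.array([[0, 1], [1, 2]]), 1),
]

# Fill an object array cell by cell so NumPy stores each item as-is.
toy_set = np.empty((len(rows), 3), dtype=object)
for i, row in enumerate(rows):
    for j, item in enumerate(row):
        toy_set[i, j] = item

labels = toy_set[:, -1]
print(labels.dtype)    # the label column inherits the object dtype

y = labels.astype(int)  # cast to a plain integer array
print(y.dtype, y.tolist())
```

Casting the labels should remove this particular message, but it is unlikely to make SMOTE usable here: SMOTE interpolates between fixed-length numeric feature vectors, and the remaining two columns hold variable-length node lists and edge arrays.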

Is there any way to oversample this dataset?

Edit 1: There is no need to read through the read_sdf function. It does nothing but read the SDF file, and there is no problem with it.
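
Since the samples are graphs of different sizes, one option that avoids interpolation altogether is plain random oversampling: duplicate minority-class samples until the classes are the same size. This is a different technique from SMOTE, not a variant of it. The sketch below is a minimal standard-library version; the function name `random_oversample` and the `toy` data are made up for illustration, and it assumes the label is the last item of every sample, as in the output shown above.

```python
import random
from collections import Counter, defaultdict

def random_oversample(samples, seed=0):
    """Duplicate minority-class samples until every class matches the largest one."""
    rng = random.Random(seed)

    # Group sample indices by label (assumed to be the last item of each sample).
    by_label = defaultdict(list)
    for idx, sample in enumerate(samples):
        by_label[int(sample[-1])].append(idx)

    target = max(len(idxs) for idxs in by_label.values())

    chosen = []
    for idxs in by_label.values():
        chosen.extend(idxs)                                      # keep every original
        chosen.extend(rng.choices(idxs, k=target - len(idxs)))   # add random duplicates
    rng.shuffle(chosen)
    return [samples[i] for i in chosen]

# Hypothetical toy data: 6 samples of class 0 and 2 samples of class 1.
toy = [(['C', 'O'], [(0, 1)], 0)] * 6 + [(['N', 'C'], [(0, 1)], 1)] * 2
balanced = random_oversample(toy)
print(Counter(s[-1] for s in balanced))
```

The same call should work on the object array from the question, for example `random_oversample(training_set)`, because it only indexes rows and reads the last item of each. Duplicates add no new information, so it is usually safer to oversample only the training split and keep the validation data untouched.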
