I am trying to build a classification model, but I don't have enough data. What would be the most appropriate way to create synthetic data based on my existing dataset if I have numerical and categorical features?
I looked at using Vine copulas like here: https://sdv.dev/Copulas/tutorials/03_Multivariate_Distributions.html#Vine-Copulas but sampling from such copulas returns floats even for the columns that I would like to be integers (the label-encoded categorical features), and I don't know how to convert those floats back to categorical features.
A toy example of my problem is below:
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.datasets import fetch_openml
from copulas.multivariate import VineCopula, GaussianMultivariate
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
X['label'] = y
# reducing features and removing nulls to keep things simple
X = X[['sex', 'age', 'fare', 'embarked', 'label']]
row_keep = X.isnull().sum(axis=1) == 0
df = X.loc[row_keep, :].copy()
df.reset_index(drop=True, inplace=True)
# encoding columns
cat_cols = ['sex', 'embarked', 'label']
num_cols = ['age', 'fare']
label_encoders = {}
for c in cat_cols:
cat_proc = preprocessing.LabelEncoder()
col_proc = cat_proc.fit_transform(df[c])
df[c] = col_proc
label_encoders[c] = cat_proc
# Fit a copula
copula = VineCopula('regular')
copula.fit(df)
# Sample synthetic data
df_synthetic = copula.sample(1000)
All the columns of df_synthetic are floats. How would I convert those back to ints that I can map back to categorical features?
Is there another way to augment this sort of dataset? It would be even better if it's performant and I can sample 7,000-10,000 new synthetic entries. The toy problem with 5 columns above took ~1 min to sample 1,000 rows, but my real problem has 27 columns, which I imagine would take a lot longer.
To have your columns converted to ints, use round and then .astype(int):
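A minimal sketch of that conversion, reusing the cat_cols and label_encoders objects from the question; the clipping step is an added safeguard (my assumption, not part of the original answer) against out-of-range samples:

for c in cat_cols:
    n_classes = len(label_encoders[c].classes_)
    # round sampled floats to the nearest integer code, then clip to the
    # valid code range in case the copula sampled values outside it
    codes = df_synthetic[c].round().astype(int).clip(0, n_classes - 1)
    # map integer codes back to the original category labels
    df_synthetic[c] = label_encoders[c].inverse_transform(codes)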
You might have to adjust values manually (e.g., cap sex in [0, 1] if some larger/smaller value has been generated), but that will strongly depend on your data characteristics.
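The numerical columns (age, fare) can simply be left as floats; only the label-encoded columns need to be mapped back through their encoders.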