I am trying to build a classification model, but I don't have enough data. What would be the most appropriate way to create synthetic data based on my existing dataset if I have numerical and categorical features?
I looked at using Vine copulas like here: https://sdv.dev/Copulas/tutorials/03_Multivariate_Distributions.html#Vine-Copulas but sampling from such copulas returns floats even for the columns that I would like to be integers (the label-encoded categorical features), and I don't know how to convert those floats back to categorical features.
A toy example of my problem is below:
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.datasets import fetch_openml
from copulas.multivariate import VineCopula, GaussianMultivariate
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
X['label'] = y
# reducing features and removing nulls to keep things simple
X = X[['sex', 'age', 'fare', 'embarked', 'label']]
row_keep = X.isnull().sum(axis=1) == 0
df = X.loc[row_keep, :].copy()
df.reset_index(drop=True, inplace=True)
# encoding columns
cat_cols = ['sex', 'embarked', 'label']
num_cols = ['age', 'fare']
label_encoders = {}
for c in cat_cols:
cat_proc = preprocessing.LabelEncoder()
col_proc = cat_proc.fit_transform(df[c])
df[c] = col_proc
label_encoders[c] = cat_proc
# Fit a copula
copula = VineCopula('regular')
copula.fit(df)
# Sample synthetic data
df_synthetic = copula.sample(1000)
All the columns of df_synthetic are floats. How would I convert those back to ints that I can map back to categorical features?
Is there another way to augment this sort of dataset? It would be even better if it's performant and I can sample 7,000-10,000 new synthetic entries. The toy problem with 5 columns above took ~1 min to sample 1,000 rows, but my real problem has 27 columns, which I imagine would take a lot longer.
To have your columns converted to ints, use round and then .astype(int):
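A minimal sketch of that conversion, reusing the cat_cols and label_encoders objects from the question; the clipping step is an added safeguard (my assumption, not part of the original answer) against out-of-range samples:

for c in cat_cols:
    n_classes = len(label_encoders[c].classes_)
    # round sampled floats to the nearest integer code, then clip to the
    # valid code range in case the copula sampled values outside it
    codes = df_synthetic[c].round().astype(int).clip(0, n_classes - 1)
    # map integer codes back to the original category labels
    df_synthetic[c] = label_encoders[c].inverse_transform(codes)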
You might have to adjust values manually (e.g., cap sex in [0, 1] if some larger/smaller value has been generated), but that will strongly depend on your data characteristics.
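The numerical columns (age, fare) can simply be left as floats; only the label-encoded columns need to be mapped back through their encoders.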