How to create synthetic data with mixed data types based on an existing dataset, for a classification problem?

Published 01-22 17:36 · 1,407 characters · 1 view · 0 comments


I am trying to build a classification model, but I don't have enough data. What would be the most appropriate way to create synthetic data based on my existing dataset, given that I have both numerical and categorical features?
I looked at using Vine copulas, as described here: https://sdv.dev/Copulas/tutorials/03_Multivariate_Distributions.html#Vine-Copulas. However, sampling from such copulas gives floats even for the columns that I would like to be integers (label-encoded categorical features), and I don't know how to convert those floats back to categorical features.
Sample toy code for my problem is below:

import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.datasets import fetch_openml
from copulas.multivariate import VineCopula, GaussianMultivariate

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
X['label'] = y

# reducing features and removing nulls to keep things simple
X = X[['sex', 'age', 'fare', 'embarked', 'label']]
row_keep = X.isnull().sum(axis=1) == 0
df = X.loc[row_keep, :].copy()
df.reset_index(drop=True, inplace=True)

# encoding columns
cat_cols = ['sex', 'embarked', 'label']
num_cols = ['age', 'fare']

label_encoders = {}
for c in cat_cols:
    cat_proc = preprocessing.LabelEncoder()
    col_proc = cat_proc.fit_transform(df[c])
    df[c] = col_proc
    label_encoders[c] = cat_proc

# Fit a copula
copula = VineCopula('regular')
copula.fit(df)

# Sample synthetic data
df_synthetic = copula.sample(1000)

All the columns of df_synthetic are floats. How would I convert those back to ints that I can map back to categorical features?
Is there another way to augment this sort of dataset? It would be even better if it's performant and I can sample 7,000-10,000 new synthetic entries. The toy problem with 5 columns above took ~1 min to sample 1,000 rows, but my real problem has 27 columns, which I imagine would take a lot longer.


Comments (1)

天冷不及心凉 2025-01-29 17:36:55


To convert your columns to ints, use round and then .astype(int):

df_synthetic["sex"] = round(df_synthetic["sex"]).astype(int)
df_synthetic["embarked"] = round(df_synthetic["embarked"]).astype(int)
df_synthetic["label"] = round(df_synthetic["label"]).astype(int)

You might have to adjust values manually (e.g., clip sex to [0, 1] if some larger or smaller values were generated), but that will strongly depend on your data characteristics.
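One way to make that adjustment systematic, since the question keeps its fitted encoders in a label_encoders dict: clip the rounded integers into the encoder's valid label range and then map them back to category labels with inverse_transform. A minimal sketch (the encoder categories and the sampled float values here are illustrative, not from the Titanic run):

```python
import pandas as pd
from sklearn import preprocessing

# Illustrative fitted encoder, standing in for one entry of the
# question's label_encoders dict (e.g. the 'embarked' column).
enc = preprocessing.LabelEncoder()
enc.fit(["C", "Q", "S"])

# Illustrative floats as the copula might sample for that encoded column.
sampled = pd.Series([0.2, 1.7, 2.4, -0.3, 3.1])

# Round to the nearest integer, then clip into the encoder's valid
# range so out-of-range samples map to an existing category code.
codes = sampled.round().astype(int).clip(0, len(enc.classes_) - 1)

# Map the integer codes back to the original category labels.
categories = enc.inverse_transform(codes)
print(list(categories))  # ['C', 'S', 'S', 'C', 'S']
```

Looping this over every encoder in label_encoders would restore all the categorical columns of df_synthetic in one pass.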
