Pandas stratified splitting into train, test and validation sets based on the target variable
I have a dataframe with some features and a target column whose values belong to {0, 1}.
I need to split this dataset into training, test and validation sets. The validation set must be 20% of the dataset, and the remaining 80% must be split again so that 80% of it goes into the training set. This can easily be achieved with sklearn's train_test_split.
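For reference, a minimal sketch of that two-step split, assuming a dataframe df (20% validation, then 80/20 on the rest, i.e. 64% train / 16% test overall):

from sklearn.model_selection import train_test_split

# First hold out 20% of the data as the validation set
rest, val = train_test_split(df, test_size=0.20, random_state=42)
# Then split the remaining 80% so that 80% of it becomes the training set
train, test = train_test_split(rest, test_size=0.20, random_state=42)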
My problem is that the splitting must be done in a stratified way based on the clusters I computed for both target values.
To compute the clusters, I separated the entries for the two targets into two subsets, e.g.
ones = df[df_numerical['Target'] == 1].copy()
zeroes = df[df_numerical['Target'] == 0].copy()
Then for each subset I used KMeans to compute its clusters and added the cluster labels to the dataframe, e.g.:
# the number of clusters for both variables is not the same
clusters_1 = kmeans_1.predict(ones[NUMERICAL_FEATURES])
ones['Cluster'] = clusters_1
clusters_0 = kmeans_0.predict(zeroes[NUMERICAL_FEATURES])
zeroes['Cluster'] = clusters_0
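Note that kmeans_1 and kmeans_0 are assumed to be already-fitted models; a plausible (hypothetical) fitting step might look like this, with made-up cluster counts:

from sklearn.cluster import KMeans

# Hypothetical fitting step, one model per class; the cluster counts 4 and 3
# are placeholders, since the question only states that they differ
kmeans_1 = KMeans(n_clusters=4, random_state=42).fit(ones[NUMERICAL_FEATURES])
kmeans_0 = KMeans(n_clusters=3, random_state=42).fit(zeroes[NUMERICAL_FEATURES])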
Now, how can I split the datasets so that they are stratified with respect to the clusters, preserving the relative cluster sizes?
The splitting I need must be done in this way: assuming I have 100 records, 80 of class 1 and 20 of class 0, I need to split these records 70/30, so I need 56 (70% of 80) records of class 1 and 14 (70% of 20) records of class 0. I know this can be done using the stratify parameter of train_test_split, but my problem is that, in addition to this, the splitting must also be stratified w.r.t. the clusters of each target value.
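One way to express both requirements with a single stratify argument is a combined "target_cluster" key; a minimal sketch, assuming the ones and zeroes dataframes (with their Cluster columns) from above and the 70/30 split of the example:

import pandas as pd
from sklearn.model_selection import train_test_split

# Build a joint key so every (target, cluster) pair keeps its proportion;
# the target prefix keeps cluster 0 of class 1 distinct from cluster 0 of class 0
df_all = pd.concat([ones, zeroes])
strat_key = df_all['Target'].astype(str) + '_' + df_all['Cluster'].astype(str)
train_part, test_part = train_test_split(df_all, test_size=0.30,
                                         stratify=strat_key, random_state=42)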
One solution I thought of is to extract the indices of the elements of both classes, put them into lists, extract the right number of elements from each, and then recombine the dataframes:
cluster_indices_0 = zeroes.groupby(['Cluster']).apply(lambda x: x.index)
cluster_indices_1 = ones.groupby(['Cluster']).apply(lambda x: x.index)
But this way I'd have to manually compute, for each cluster, the number of elements to take, and I was looking for a way to do this automatically.
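For illustration, the manual bookkeeping described above might look like the following sketch (the 0.7 fraction and the names train_idx_0 / train_0 are hypothetical):

import numpy as np

# Take ~70% of each cluster's indices for class 0, counting by hand
train_idx_0 = []
for cluster, idx in cluster_indices_0.items():
    n_take = int(round(0.7 * len(idx)))  # per-cluster count, computed manually
    train_idx_0.extend(np.random.permutation(np.asarray(idx))[:n_take])
train_0 = zeroes.loc[train_idx_0]  # 70% of class 0, stratified by cluster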
Is there a function in sklearn or pandas that achieves what I'm looking for, without having to manually compute the number of elements to extract from each cluster?
1 Answer
Since your data is already split by target, you simply need to call train_test_split on each subset, using the cluster column for stratification; then do the same for the other target's subset and combine all the resulting subsets.
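A minimal sketch of this approach, assuming the ones and zeroes dataframes (each with its Cluster column) from the question; split_subset is a hypothetical helper, and the sizes follow the 20% validation / 80-20 train-test scheme described above:

from sklearn.model_selection import train_test_split
import pandas as pd

def split_subset(subset, random_state=42):
    # 20% validation, then 80/20 on the rest (64%/16%/20% overall),
    # each call stratified on the per-class cluster labels
    rest, val = train_test_split(subset, test_size=0.20,
                                 stratify=subset['Cluster'],
                                 random_state=random_state)
    train, test = train_test_split(rest, test_size=0.20,
                                   stratify=rest['Cluster'],
                                   random_state=random_state)
    return train, test, val

train_1, test_1, val_1 = split_subset(ones)
train_0, test_0, val_0 = split_subset(zeroes)

# Recombine: each final set keeps both the target and the cluster proportions
train = pd.concat([train_1, train_0])
test = pd.concat([test_1, test_0])
val = pd.concat([val_1, val_0])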