Connection between two dataframes created by splitting one dataframe in two

Posted on 2025-02-13 00:39:55


I'm sorry if my title is confusing, but I wasn't sure how to describe the situation I'm trying to understand. Basically, I stumbled upon this question while working with the train_test_split function from the sklearn module.

So, let me show you an example of what has been confusing me for a couple of hours already.

Let's create a simple dataframe with 3 columns:

'Letter' - a letter of the alphabet;
'Number' - the serial number of the letter;
'Type' - the type of the number (odd or even).

    import pandas as pd
    data = [['A', 1, 'Odd'], ['B', 2, 'Even'], ['C', 3, 'Odd'],
            ['D', 4, 'Even'], ['E', 5, 'Odd'], ['F', 6, 'Even'], ['G', 7, 'Odd']]
    df = pd.DataFrame(data, columns=['Letter', 'Number', 'Type'])

We can create 4 samples to work with using train_test_split:


    from sklearn.model_selection import train_test_split
    target = df['Type']
    features = df.drop('Type', axis=1)
    features_train, features_valid, target_train, target_valid = train_test_split(
        features, target, test_size=0.4, random_state=12)

Now, if we want to see the rows of features_train with the odd numbers, we can write the following code:


    features_odds = features_train[target_train == 'Odd']
    features_odds

And we get this:
Output

The result is correct: the new dataframe contains exactly the rows with the odd numbers.
How can features_train be filtered using information from target_train, even though the two are separate dataframes?
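
One way to probe this is to look at the row index that each piece carries after the split. The sketch below is only a diagnostic under the setup above: it rebuilds the same frame, repeats the split, and compares the index of features_train with the index of target_train. The names same_labels and mask are introduced here for illustration and are not part of the original code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

data = [['A', 1, 'Odd'], ['B', 2, 'Even'], ['C', 3, 'Odd'],
        ['D', 4, 'Even'], ['E', 5, 'Odd'], ['F', 6, 'Even'], ['G', 7, 'Odd']]
df = pd.DataFrame(data, columns=['Letter', 'Number', 'Type'])

target = df['Type']
features = df.drop('Type', axis=1)
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.4, random_state=12)

# Each training object keeps the row labels of the rows it was taken from.
print(features_train.index.tolist())
print(target_train.index.tolist())

# same_labels (illustrative name): True when both carry identical labels
# in identical order.
same_labels = features_train.index.equals(target_train.index)
print(same_labels)

# The comparison yields a boolean Series that carries those same labels.
mask = target_train == 'Odd'
features_odds = features_train[mask]
print(features_odds)
```

If the two indexes match, the boolean Series produced by the comparison lines up row for row with features_train, which would explain why the filter picks the expected rows.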

I think there should be an easy answer but for some reason I'm not able to understand the mechanics of this right now.

I have also tried a different approach (not using train_test_split), and it works just as well:

    target_dummy = df['Type']
    features_dummy = df.drop('Type', axis=1)
    
    features_dumb_odds = features_dummy[target_dummy == 'Odd']
    features_dumb_odds
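
A likely explanation for both cases is that pandas matches a boolean Series to the rows of a dataframe by index label, not by physical position. The sketch below tries to show this on a cut-down, four-row version of the data (an assumption made to keep the example short). The names shuffled_mask, by_label and by_position are hypothetical. It uses .loc for the label-based selection; plain square brackets are expected to behave the same way but may emit a reindexing warning.

```python
import pandas as pd

# Cut-down, illustrative version of the data from the question.
features_dummy = pd.DataFrame({'Letter': ['A', 'B', 'C', 'D'],
                               'Number': [1, 2, 3, 4]})
target_dummy = pd.Series(['Odd', 'Even', 'Odd', 'Even'], name='Type')

# Boolean Series with index labels 0..3, the same labels as features_dummy.
mask = target_dummy == 'Odd'

# Reverse the row order of the mask; each value keeps its own label.
shuffled_mask = mask.iloc[::-1]

# Selection with a boolean Series is aligned on labels, so order is irrelevant.
by_label = features_dummy.loc[shuffled_mask]

# Stripping the labels (plain NumPy array) makes the selection positional.
by_position = features_dummy[shuffled_mask.to_numpy()]

print(by_label['Letter'].tolist())     # ['A', 'C']
print(by_position['Letter'].tolist())  # ['B', 'D']
```

Here the label-aligned selection still returns the odd rows, while the positional one returns the wrong rows. That suggests the "connection" between the two objects is just the shared index inherited from df.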

I would really appreciate any help in understanding this!
