Connection between two dataframes created by splitting one dataframe into two
I'm sorry if my title is confusing, but I wasn't sure how to describe the situation I'm trying to understand. I stumbled upon this question while working with train_test_split from the sklearn module.
So let me show you an example of what has been confusing me for a couple of hours already.
Let's create a simple dataframe with 3 columns:
- 'Letter' - a letter of the alphabet;
- 'Number' - serial number of the letter;
- 'Type' - type of the number.
import pandas as pd
data = [['A', 1, 'Odd'], ['B', 2, 'Even'], ['C', 3, 'Odd'],
        ['D', 4, 'Even'], ['E', 5, 'Odd'], ['F', 6, 'Even'], ['G', 7, 'Odd']]
df = pd.DataFrame(data, columns=['Letter', 'Number', 'Type'])
We can create 4 samples to work with using train_test_split:
from sklearn.model_selection import train_test_split
target = df['Type']
features = df.drop('Type', axis=1)
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.4, random_state=12)
And now if we want to see the rows of features_train with the odd numbers we can write the following code:
features_odds = features_train[target_train == 'Odd']
features_odds
And we get this (the output image is not available in this copy), and it is correct: the new dataframe contains exactly the rows with the odd numbers.
How does that work? How can features_train get the information from target_train even though they are two separate objects?
I think there should be an easy answer, but for some reason I'm not able to understand the mechanics of this right now.
I have also tried a different approach (not using train_test_split), and it works just as well:
target_dummy = df['Type']
features_dummy = df.drop('Type', axis=1)
features_dumb_odds = features_dummy[target_dummy == 'Odd']
features_dumb_odds
I would really appreciate any help in understanding this!
Comments (1)
target_train == 'Odd' is a Series of boolean values. As a Series, it also has an index. That index is used to align with features_train, the dataframe you index into, and the two are compatible. As a first step of exploration, start with
print(target_train == 'Odd')
It's good to think about how the pieces fit together. In this case, the boolean Series and the dataframe you index into need to have matching index labels for it to not raise an exception.
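The alignment described above can be sketched with pandas alone. This is a minimal illustration, not the output of the question's actual split: the hand-picked `train_labels` list is a hypothetical stand-in for whatever rows train_test_split would select, and the names `mask`, `bad_mask`, `same_index`, `odd_letters`, and `raised` are introduced here only for the example.

```python
import pandas as pd

data = [['A', 1, 'Odd'], ['B', 2, 'Even'], ['C', 3, 'Odd'],
        ['D', 4, 'Even'], ['E', 5, 'Odd'], ['F', 6, 'Even'], ['G', 7, 'Odd']]
df = pd.DataFrame(data, columns=['Letter', 'Number', 'Type'])

target = df['Type']
features = df.drop('Type', axis=1)

# Hypothetical stand-in for a train split: the same row labels are taken
# from both objects, so both keep the original index labels of df.
train_labels = [4, 0, 3, 6]
features_train = features.loc[train_labels]
target_train = target.loc[train_labels]

# Both pieces carry the same index labels, in the same order.
same_index = features_train.index.equals(target_train.index)

# The comparison yields a boolean Series that keeps that index.
mask = target_train == 'Odd'

# Boolean indexing pairs each mask value with a row via its index label.
features_odds = features_train[mask]
odd_letters = features_odds['Letter'].tolist()

# Once the mask's labels no longer cover the frame's labels, pandas refuses.
bad_mask = mask.reset_index(drop=True)  # labels become 0, 1, 2, 3
try:
    features_train[bad_mask]
    raised = False
except Exception:  # pandas raises an indexing error for unalignable masks
    raised = True

print(same_index, odd_letters, raised)
# -> True ['E', 'A', 'G'] True
```

The last part shows why the index matters: the mask values are unchanged after `reset_index(drop=True)`, but labels 4 and 6 of features_train can no longer be found in the mask, so the lookup fails instead of silently matching by position.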