How do I resolve multicollinearity in linear regression using statsmodels?
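I don't understand why I am getting a multicollinearity error. I referred to this, this and this. My Source columns were one-hot encoded, so I dropped one of them, and I did the same for Test_source. There are three related architectures (Bart, Pegasus and Human), so I dropped the Human column too. There are no other related columns, so I don't understand what I am doing wrong. Please help!
Sample data: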
Sr.N | Num_of_texts | Length_of_Sentence | Cross_Domain | Repetition | Bart | Pegasus | Source_Xsum | Test_src_RCT | Test_src_Reddit | Test_src_SP | Test_src_Xsum
1 | 11332.0 | 56.0 | 0.0 | 112.0 | 1 | 0 | 0 | 0 | 0 | 0 | 1
2 | 11332.0 | 16.0 | 0.0 | 10.0 | 1 | 0 | 1 | 0 | 0 | 0 | 1
3 | 13368.0 | 40.0 | 0.0 | 78.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0
4 | 13368.0 | 47.0 | 0.0 | 3.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
5 | 13368.0 | 63.0 | 0.0 | 8.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
6 | 6440.0 | 53.0 | 1.0 | 204.0 | 0 | 1 | 0 | 0 | 0 | 1 | 0
7 | 31091.0 | 17.0 | 1.0 | 24.0 | 0 | 1 | 1 | 0 | 1 | 0 | 0
8 | 6440.0 | 14.0 | 1.0 | 743.0 | 1 | 0 | 1 | 0 | 0 | 1 | 0
9 | 11332.0 | 12.0 | 0.0 | 2.0 | 0 | 0 | 1 | 0 | 0 | 0 | 1
10 | 25146.0 | 26.0 | 1.0 | 141.0 | 0 | 1 | 0 | 0 | 1 | 0 | 0
11 | 13368.0 | 25.0 | 0.0 | 2.0 | 0 | 1 | 1 | 0 | 0 | 0 | 0
12 | 31091.0 | 46.0 | 1.0 | 13.0 | 0 | 1 | 1 | 0 | 1 | 0 | 0
13 | 11332.0 | 29.0 | 0.0 | 29.0 | 1 | 0 | 1 | 0 | 0 | 0 | 1
I got the following result after performing OLS: