ValueError: Input contains NaN, infinity or a value too large for dtype('float64') after make_column_transformer and make_pipeline
First post on here, go easy on me with formatting.
Some of the data in my CSV contains "?" instead of None or empty space:
54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, Husband, Asian-Pac-Islander, Male, 0, 0, 60, South, >50K
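import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
df = pd.read_csv("adult.data", names = ["age", "workclass", "fnlwgt",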
"education", "education-num", "marital", "occupation", "relationship", "race",
"sex", "capital-gain", "capital-loss", "hours/week", "native-ctry", "50k-income"])
df = pd.DataFrame(df)
test = pd.read_csv("adult.test", names = df.columns)
test = pd.DataFrame(test)
X_train = df.drop("50k-income", axis = 1)
X_train = X_train.replace("?",np.nan)
X_train = X_train.fillna("na")
#replace all ? values in object dtype cols with "na"
X_train["workclass"] = X_train["workclass"].str.replace("?", "na")
X_train["occupation"] = X_train["occupation"].str.replace("?", "na")
X_train["education"] = X_train["education"].str.replace("?", "na")
X_train["marital"] = X_train["marital"].str.replace("?", "na")
X_train["relationship"] = X_train["relationship"].str.replace("?", "na")
X_train["race"] = X_train["race"].str.replace("?", "na")
X_train["sex"] = X_train["sex"].str.replace("?", "na")
X_train["native-ctry"] = X_train["native-ctry"].str.replace("?", "na")
print(X_train.dtypes)
Y_train = df[["50k-income"]]
Y_train = pd.DataFrame(Y_train["50k-income"].str.replace("?", "na"))
X_test = test.drop("50k-income", axis = 1)
X_test = X_test.replace("?",np.nan)
X_test = X_test.fillna("na")
#replace all ? values in object dtype cols with empty space
X_test["workclass"] = X_test["workclass"].str.replace("?", "")
X_test["occupation"] = X_test["occupation"].str.replace("?", "")
X_test["education"] = X_test["education"].str.replace("?", "")
X_test["marital"] = X_test["marital"].str.replace("?", "")
X_test["relationship"] = X_test["relationship"].str.replace("?", "")
X_test["race"] = X_test["race"].str.replace("?", "")
X_test["sex"] = X_test["sex"].str.replace("?", "")
X_test["native-ctry"] = X_test["native-ctry"].str.replace("?", "")
Y_test = test[["50k-income"]]
Y_test = pd.DataFrame(Y_test["50k-income"].str.replace("50K.", "50K")) #remove the . following 50K in the test file
Y_test = Y_test["50k-income"].str.replace("?","")
Y_test = pd.DataFrame(Y_test)
features_to_encode = X_train.columns[X_train.dtypes==object].tolist()
print(features_to_encode)
income_map = {"<=50K":0, ">50K":1}
Y_train["50k-income"] = Y_train["50k-income"].map(income_map)
Y_test["50k-income"] = Y_test["50k-income"].map(income_map)
col_trans = make_column_transformer((OneHotEncoder(handle_unknown="ignore"), features_to_encode), remainder="passthrough")
rf_classifier = RandomForestClassifier(min_samples_leaf=50, oob_score=True, bootstrap=True, n_jobs=-1 ,random_state=50) #bootstrapping reduces variance, njobs = -1 uses all processor cores
clf = make_pipeline(col_trans, rf_classifier)
clf.fit(X_train, Y_train)
Since there are mixed dtypes in my DataFrame, I used this answer for objects and regular .replace()
for the other columns. The ?s get replaced with empty space successfully.
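As an alternative to replacing "?" column by column, the placeholder can be handled while the file is read. The sketch below is only an illustration under stated assumptions: the two rows and the shortened column list are made up, and they only mimic the comma-plus-space layout of the sample row. It uses the `skipinitialspace` and `na_values` parameters of `pd.read_csv`.

```python
import io

import pandas as pd

# Two made-up rows in the same comma-plus-space layout as the sample row
# (shortened to five columns for illustration).
raw = (
    "54, ?, 180211, Some-college, >50K\n"
    "41, Private, 120000, Bachelors, <=50K\n"
)
cols = ["age", "workclass", "fnlwgt", "education", "50k-income"]

# Default parsing keeps the space after each comma, so the cell holds " ?".
plain = pd.read_csv(io.StringIO(raw), names=cols)
print(repr(plain.loc[0, "workclass"]))   # ' ?'

# skipinitialspace drops that space; na_values marks "?" as missing.
clean = pd.read_csv(io.StringIO(raw), names=cols,
                    skipinitialspace=True, na_values="?")
clean["workclass"] = clean["workclass"].fillna("na")
print(clean["workclass"].tolist())       # ['na', 'Private']
```

Because the default parse leaves a leading space in every string cell, an exact-match `replace("?", np.nan)` would not match `" ?"` in this layout.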
I end up getting
File "/Users/vijay/Documents/CSCE 587/HW/Homework2/hw2.py", line 102, in <module>
salary_random_forest()
File "/Users/vijay/Documents/CSCE 587/HW/Homework2/hw2.py", line 95, in salary_random_forest
clf.fit(X_train, Y_train)
File "/usr/local/lib/python3.9/site-packages/sklearn/pipeline.py", line 394, in fit
self._final_estimator.fit(Xt, y, **fit_params_last_step)
File "/usr/local/lib/python3.9/site-packages/sklearn/ensemble/_forest.py", line 327, in fit
X, y = self._validate_data(
File "/usr/local/lib/python3.9/site-packages/sklearn/base.py", line 581, in _validate_data
X, y = check_X_y(X, y, **check_params)
File "/usr/local/lib/python3.9/site-packages/sklearn/utils/validation.py", line 979, in check_X_y
y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric)
File "/usr/local/lib/python3.9/site-packages/sklearn/utils/validation.py", line 989, in _check_y
y = check_array(
File "/usr/local/lib/python3.9/site-packages/sklearn/utils/validation.py", line 800, in check_array
_assert_all_finite(array, allow_nan=force_all_finite == "allow-nan")
File "/usr/local/lib/python3.9/site-packages/sklearn/utils/validation.py", line 114, in _assert_all_finite
raise ValueError(
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
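The traceback passes through `check_X_y` and then `_check_y`, which suggests the offending values are in the target `y` rather than in `X`. A minimal sketch of how that could be checked follows. The labels are hypothetical, and the leading space is an assumption based on the ", " separator in the sample row.

```python
import pandas as pd

# Hypothetical labels; the leading space mimics a file that uses ", " as separator.
y = pd.DataFrame({"50k-income": [" <=50K", " >50K", " <=50K"]})
income_map = {"<=50K": 0, ">50K": 1}

mapped = y["50k-income"].map(income_map)

# Series.map returns NaN for every value that is not a key of the dict.
print(mapped.isnull().sum())   # 3

# Printing the raw labels with repr exposes stray whitespace or punctuation.
print([repr(v) for v in y["50k-income"].unique()])
```

Running the same `isnull().sum()` check on `Y_train` and `Y_test` after the `.map(income_map)` step would show whether the mapping is the source of the NaN values.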
The dtypes for X_train are:
age int64
workclass object
fnlwgt int64
education object
education-num int64
marital object
occupation object
relationship object
race object
sex object
capital-gain int64
capital-loss int64
hours/week int64
native-ctry object
and running X_train.isnull().sum() gives:
age 0
workclass 0
fnlwgt 0
education 0
education-num 0
marital 0
occupation 0
relationship 0
race 0
sex 0
capital-gain 0
capital-loss 0
hours/week 0
native-ctry 0
I've been trying to figure this out for days and I'm getting nowhere. I've used this guide for RF classifier.
Comments (1)
Turns out the mapping of Y for testing and training was giving NaN values. I replaced the map with
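The replacement used in the answer is cut off in this copy. One possible approach, which is an assumption and not necessarily what the answer did, is to normalise the labels before mapping them. The helper name `encode_income` and the sample labels below are hypothetical. The trailing "." reflects the question's comment about the test file.

```python
import pandas as pd

# Hypothetical labels: leading spaces, plus the trailing "." seen in the test file.
y_train = pd.Series([" <=50K", " >50K"], name="50k-income")
y_test = pd.Series([" <=50K.", " >50K."], name="50k-income")
income_map = {"<=50K": 0, ">50K": 1}


def encode_income(labels: pd.Series) -> pd.Series:
    """Strip whitespace and a trailing period, then map labels to 0/1."""
    cleaned = labels.str.strip().str.rstrip(".")
    encoded = cleaned.map(income_map)
    # Fail early if any label is still unmapped instead of passing NaN to fit().
    if encoded.isnull().any():
        raise ValueError(f"unmapped labels: {cleaned[encoded.isnull()].unique()}")
    return encoded.astype(int)


print(encode_income(y_train).tolist())  # [0, 1]
print(encode_income(y_test).tolist())   # [0, 1]
```

Raising on unmapped labels makes the problem surface at the mapping step rather than as a NaN error from scikit-learn's input validation during `fit`.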