ValueError: Input contains NaN, infinity or a value too large for dtype('float64') after make_column_transformer and make_pipeline

Posted on 2025-01-09 01:35:58

First post on here, go easy on me with formatting.

Some of my data in a csv contains "?" instead of None or empty space:

54, ?, 180211, Some-college, 10, Married-civ-spouse, ?, Husband, Asian-Pac-Islander, Male, 0, 0, 60, South, >50K

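    import numpy as np
    import pandas as pd
    from sklearn.compose import make_column_transformer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder

    df = pd.read_csv("adult.data", names = ["age", "workclass", "fnlwgt", 
    "education", "education-num", "marital", "occupation", "relationship", "race", 
    "sex", "capital-gain", "capital-loss", "hours/week", "native-ctry", "50k-income"])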
    df = pd.DataFrame(df)
    test = pd.read_csv("adult.test", names = df.columns)
    test = pd.DataFrame(test)
    X_train = df.drop("50k-income", axis = 1)
    X_train = X_train.replace("?",np.nan)
    X_train = X_train.fillna("na")
    # replace any remaining ? in the object dtype cols with "na"
    X_train["workclass"] = X_train["workclass"].str.replace("?", "na")
    X_train["occupation"] = X_train["occupation"].str.replace("?", "na")
    X_train["education"] = X_train["education"].str.replace("?", "na")
    X_train["marital"] = X_train["marital"].str.replace("?", "na")
    X_train["relationship"] = X_train["relationship"].str.replace("?", "na")
    X_train["race"] = X_train["race"].str.replace("?", "na")
    X_train["sex"] = X_train["sex"].str.replace("?", "na")
    X_train["native-ctry"] = X_train["native-ctry"].str.replace("?", "na")
    print(X_train.dtypes)

    Y_train = df[["50k-income"]]
    Y_train = pd.DataFrame(Y_train["50k-income"].str.replace("?", "na"))
    
    X_test = test.drop("50k-income", axis = 1)
    X_test = X_test.replace("?",np.nan)
    X_test = X_test.fillna("na")
    #replace all ? values in object dtype cols with empty space
    X_test["workclass"] = X_test["workclass"].str.replace("?", "")
    X_test["occupation"] = X_test["occupation"].str.replace("?", "")
    X_test["education"] = X_test["education"].str.replace("?", "")
    X_test["marital"] = X_test["marital"].str.replace("?", "")
    X_test["relationship"] = X_test["relationship"].str.replace("?", "")
    X_test["race"] = X_test["race"].str.replace("?", "")
    X_test["sex"] = X_test["sex"].str.replace("?", "")
    X_test["native-ctry"] = X_test["native-ctry"].str.replace("?", "")
    Y_test = test[["50k-income"]]
    Y_test = pd.DataFrame(Y_test["50k-income"].str.replace("50K.", "50K"))    # remove the trailing . after 50K in the test file
    Y_test = Y_test["50k-income"].str.replace("?","")
    Y_test = pd.DataFrame(Y_test)


    
    features_to_encode = X_train.columns[X_train.dtypes==object].tolist()
    print(features_to_encode)
    income_map = {"<=50K":0, ">50K":1}
    Y_train["50k-income"] = Y_train["50k-income"].map(income_map)
    Y_test["50k-income"] = Y_test["50k-income"].map(income_map)

    col_trans = make_column_transformer((OneHotEncoder(handle_unknown="ignore"), features_to_encode), remainder="passthrough")
    rf_classifier = RandomForestClassifier(min_samples_leaf=50, oob_score=True, bootstrap=True, n_jobs=-1 ,random_state=50) #bootstrapping reduces variance, njobs = -1 uses all processor cores
    clf = make_pipeline(col_trans, rf_classifier)

    clf.fit(X_train, Y_train)

Since my DataFrame has mixed dtypes, I used the .str.replace() approach from this answer for the object columns and regular .replace() for the rest. The ?s are replaced successfully (with "na" in the training set and an empty string in the test set).

I end up getting

File "/Users/vijay/Documents/CSCE 587/HW/Homework2/hw2.py", line 102, in <module>
    salary_random_forest()
  File "/Users/vijay/Documents/CSCE 587/HW/Homework2/hw2.py", line 95, in salary_random_forest
    clf.fit(X_train, Y_train)
  File "/usr/local/lib/python3.9/site-packages/sklearn/pipeline.py", line 394, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/usr/local/lib/python3.9/site-packages/sklearn/ensemble/_forest.py", line 327, in fit
    X, y = self._validate_data(
  File "/usr/local/lib/python3.9/site-packages/sklearn/base.py", line 581, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/usr/local/lib/python3.9/site-packages/sklearn/utils/validation.py", line 979, in check_X_y
    y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric)
  File "/usr/local/lib/python3.9/site-packages/sklearn/utils/validation.py", line 989, in _check_y
    y = check_array(
  File "/usr/local/lib/python3.9/site-packages/sklearn/utils/validation.py", line 800, in check_array
    _assert_all_finite(array, allow_nan=force_all_finite == "allow-nan")
  File "/usr/local/lib/python3.9/site-packages/sklearn/utils/validation.py", line 114, in _assert_all_finite
    raise ValueError(
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The dtypes for X_train are

age               int64
workclass        object
fnlwgt            int64
education        object
education-num     int64
marital          object
occupation       object
relationship     object
race             object
sex              object
capital-gain      int64
capital-loss      int64
hours/week        int64
native-ctry      object

and running X_train.isnull().sum() gives

age              0
workclass        0
fnlwgt           0
education        0
education-num    0
marital          0
occupation       0
relationship     0
race             0
sex              0
capital-gain     0
capital-loss     0
hours/week       0
native-ctry      0

I've been trying to figure this out for days and I'm getting nowhere. I followed this guide for the RF classifier.
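The traceback passes through `_check_y`, which suggests that the value failing validation is the target rather than the features, so the `isnull()` check on `X_train` may be looking at the wrong frame. Below is a minimal sketch of the same check applied to the target after mapping. The three-row `Y_train` is a made-up stand-in, and its leading spaces are an assumption based on the space after each comma in the sample row.

```python
import pandas as pd

# Hypothetical stand-in for Y_train; the leading spaces are an assumption
# based on the "comma + space" layout of the sample row in the question.
Y_train = pd.DataFrame({"50k-income": [" <=50K", " >50K", " <=50K"]})
income_map = {"<=50K": 0, ">50K": 1}

mapped = Y_train["50k-income"].map(income_map)

# Count NaN in the mapped target and list the raw labels that matched no key.
n_missing = int(mapped.isna().sum())
unmatched = Y_train.loc[mapped.isna(), "50k-income"].unique().tolist()
print(n_missing, [repr(v) for v in unmatched])
```

Printing the unmatched labels with `repr` makes stray whitespace or punctuation visible, which a plain `print` would hide.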

Comments (1)

£冰雨忧蓝° 2025-01-16 01:35:58

It turns out that mapping Y for both the training and test sets was producing NaN values. I replaced the map with:

    Y_train = pd.DataFrame(Y_train["50k-income"].str.replace("<=50K", "0"))
    Y_train = pd.DataFrame(Y_train["50k-income"].str.replace(">50K", "1"))
    Y_train = Y_train.astype(str).astype(int)
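A likely reason the original `map` produced NaN: the sample row in the question has a space after every comma, so the labels probably arrive as `" >50K"` rather than `">50K"`, and a dictionary lookup on the unstripped string finds no key. The `str.replace` version works because it substitutes a substring and the integer conversion tolerates the surrounding whitespace. The sketch below shows two alternatives that keep the `map`, under that whitespace assumption. The two inline rows and the names `raw`, `labels` and `labels2` are made up for illustration.

```python
import io
import pandas as pd

# Two made-up rows in the same "comma + space" layout as the question's sample.
raw = "39, State-gov, <=50K\n54, ?, >50K\n"
names = ["age", "workclass", "50k-income"]
income_map = {"<=50K": 0, ">50K": 1}

# Option 1: strip whitespace from the labels, then map as before.
df = pd.read_csv(io.StringIO(raw), names=names)
labels = df["50k-income"].str.strip().map(income_map)

# Option 2: have the parser skip the space after each delimiter and
# turn "?" into NaN while reading, so no later cleanup is needed.
df2 = pd.read_csv(io.StringIO(raw), names=names,
                  skipinitialspace=True, na_values="?")
labels2 = df2["50k-income"].map(income_map)

print(labels.tolist(), labels2.tolist(), int(df2["workclass"].isna().sum()))
```

With the second option the `?` cells in the feature columns become NaN at read time, so they can be filled or imputed in one step instead of column by column.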