How to fill in missing values in categorical data?

Posted on 2025-02-05 02:00:33

I have a dataset of 20,000 employees in which the following three columns have missing values:

  1. Passing year of college
  2. College specialization
  3. Name of college

10,000 of these employees never went to college. My final aim is to predict their salary.

How can I fill in the missing values in this case?


Comments (2)

顾北清歌寒 2025-02-12 02:00:34

Here's one option to consider (there are many ways to handle this kind of problem).

import pandas as pd

# Load the data (the path points to the answerer's local file)
fruits = pd.read_csv('C:\\Fruit.csv')

# Quick look at the first rows, the dimensions, and the fruit names
print(fruits.head())
print(fruits.shape)
print(fruits['fruit_name'].unique())

# Count the missing values in each column
print(fruits.isnull().sum())

Result:

fruit_label       0
fruit_name        0
fruit_subtype    10
mass              0
width             0
height            0
color_score       0

# Key step: fill each column's missing values with that column's most frequent value (its mode)
fruits = fruits.fillna(fruits.mode().iloc[0])
print(fruits.isnull().sum())

Result:

fruit_label      0
fruit_name       0
fruit_subtype    0
mass             0
width            0
height           0
color_score      0
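
To make the `fillna(fruits.mode().iloc[0])` step easier to follow: `DataFrame.mode()` returns a DataFrame whose first row holds the most frequent value of each column, and `fillna` given that row fills each column with its matching entry. Below is a minimal sketch on a small made-up frame (the data is hypothetical and is not the answerer's Fruit.csv).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only (not the Fruit.csv from the answer)
df = pd.DataFrame({
    "fruit_name": ["apple", "apple", "lemon", "lemon", "lemon"],
    "fruit_subtype": ["braeburn", "braeburn", "spanish_belsan", np.nan, np.nan],
    "mass": [178, 172, 194, 132, 130],
})

# mode() returns one row per tied mode; row 0 holds the first mode of each column
most_frequent = df.mode().iloc[0]

# Passing a Series to fillna fills each column using the entry with the same label
filled = df.fillna(most_frequent)
print(filled)
```

Note that the mode is computed over the whole column, independently of the other columns, so the two lemons with a missing subtype receive `braeburn` here. Whether that is acceptable depends on the use case.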

Sample Data Set:

fruit_label  fruit_name  fruit_subtype     mass  width  height  color_score
1            apple       granny_smith      192   8.4    7.3     0.55
1            apple       granny_smith      180   8      6.8     0.59
1            apple       granny_smith      176   7.4    7.2     0.6
2            mandarin    mandarin          86    6.2    4.7     0.8
2            mandarin    mandarin          84    6      4.6     0.79
2            mandarin    mandarin          80    5.8    4.3     0.77
2            mandarin    mandarin          80    5.9    4.3     0.81
2            mandarin    mandarin          76    5.8    4       0.81
1            apple       braeburn          178   7.1    7.8     0.92
1            apple       braeburn          172   7.4    7       0.89
1            apple       braeburn          166   6.9    7.3     0.93
1            apple       braeburn          172   7.1    7.6     0.92
1            apple       braeburn          154   7      7.1     0.88
1            apple       golden_delicious  164   7.3    7.7     0.7
1            apple       golden_delicious  152   7.6    7.3     0.69
1            apple       golden_delicious  156   7.7    7.1     0.69
1            apple       golden_delicious  156   7.6    7.5     0.67
1            apple       golden_delicious  168   7.5    7.6     0.73
1            apple       cripps_pink       162   7.5    7.1     0.83
1            apple       cripps_pink       162   7.4    7.2     0.85
1            apple       cripps_pink       160   7.5    7.5     0.86
1            apple       cripps_pink       156   7.4    7.4     0.84
1            apple       cripps_pink       140   7.3    7.1     0.87
1            apple       cripps_pink       170   7.6    7.9     0.88
3            orange      spanish_jumbo     342   9      9.4     0.75
3            orange      spanish_jumbo     356   9.2    9.2     0.75
3            orange      spanish_jumbo     362   9.6    9.2     0.74
3            orange      selected_seconds  204   7.5    9.2     0.77
3            orange      selected_seconds  140   6.7    7.1     0.72
3            orange      selected_seconds  160   7      7.4     0.81
3            orange      selected_seconds  158   7.1    7.5     0.79
3            orange      selected_seconds  210   7.8    8       0.82
3            orange      selected_seconds  164   7.2    7       0.8
3            orange      turkey_navel      190   7.5    8.1     0.74
3            orange      turkey_navel      142   7.6    7.8     0.75
3            orange      turkey_navel      150   7.1    7.9     0.75
3            orange      turkey_navel      160   7.1    7.6     0.76
3            orange      turkey_navel      154   7.3    7.3     0.79
3            orange      turkey_navel      158   7.2    7.8     0.77
3            orange      turkey_navel      144   6.8    7.4     0.75
3            orange      turkey_navel      154   7.1    7.5     0.78
3            orange      turkey_navel      180   7.6    8.2     0.79
3            orange      turkey_navel      154   7.2    7.2     0.82
4            lemon       spanish_belsan    194   7.2    10.3    0.7
4            lemon       spanish_belsan    200   7.3    10.5    0.72
4            lemon       spanish_belsan    186   7.2    9.2     0.72
4            lemon       spanish_belsan    216   7.3    10.2    0.71
4            lemon       spanish_belsan    196   7.3    9.7     0.72
4            lemon       spanish_belsan    174   7.3    10.1    0.72
4            lemon                         132   5.8    8.7     0.73
4            lemon                         130   6      8.2     0.71
4            lemon                         116   6      7.5     0.72
4            lemon                         118   5.9    8       0.72
4            lemon                         120   6      8.4     0.74
4            lemon                         116   6.1    8.5     0.71
4            lemon                         116   6.3    7.7     0.72
4            lemon                         116   5.9    8.1     0.73
4            lemon                         152   6.5    8.5     0.72
4            lemon                         118   6.1    8.1     0.7

Take a look at this link for more info.

https://www.analyticsvidhya.com/blog/2021/04/how-to-handle-missing-values-of-categorical-variables/

已下线请稍等 2025-02-12 02:00:33

Missing values can be dealt with in a number of ways; which one to follow depends on the kind of data you have.

  • Deleting the rows with missing values

    Rows in which a large number of column values are null could be dropped. (Again, what exactly counts as a large number depends on the individual use case.)

  • Imputing the missing values with Mean / Median

    For numerical columns, you can try replacing the missing values with the mean / median of the column values.

  • Most frequent values: applicable to your scenario

    This method is suitable for categorical data, which I assume is your case. You can try replacing the missing values in each of the three columns with the most frequently occurring value in that column.
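
The three options listed above can be sketched in pandas roughly as follows. This is only an illustration: the DataFrame, its column names (`passing_year`, `specialization`, `college_name`, `salary`) and the `thresh=2` cutoff are assumptions made up for the example, not taken from the asker's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical employee data; column names and values are assumptions
df = pd.DataFrame({
    "passing_year": [2010.0, 2012.0, np.nan, 2012.0, np.nan],
    "specialization": ["CS", "EE", np.nan, "CS", np.nan],
    "college_name": ["A", "B", np.nan, "A", np.nan],
    "salary": [50, 60, 30, 55, 35],
})

# 1. Drop rows with too many nulls: keep only rows with at least 2 non-null values
dropped = df.dropna(thresh=2)

# 2. Numerical column: replace missing values with the column median
median_filled = df.copy()
median_filled["passing_year"] = df["passing_year"].fillna(df["passing_year"].median())

# 3. Categorical columns: replace missing values with the most frequent value
mode_filled = df.copy()
for col in ["specialization", "college_name"]:
    mode_filled[col] = df[col].fillna(df[col].mode().iloc[0])

print(dropped)
print(median_filled)
print(mode_filled)
```

The cutoff passed to `thresh` is the part that "depends on the individual use case": it is the minimum number of non-null values a row needs in order to be kept.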
