可视化时间序列热图中的缺失值

发布于 2025-01-17 03:41:16 字数 824 浏览 3 评论 0原文

我在大数据分析方面确实是新手。 假设我有一个具有以下特征的大数据。我想可视化特定小时内每个 id 的燃料参数缺失值(无值)的百分比。我想绘制一个图表,x 轴是时间序列(时间列),y 轴是“id”,颜色将指示其缺失的燃料百分比。 我根据“id”和“小时”对数据库进行分组,

我不知道如何以良好的方式为所有 id 可视化缺失值。例如,如果特定小时内特定id的缺失值燃料的百分比是100%,则该特定时间和该“id”的颜色可以是灰色的。如果燃料中缺失值的百分比为 50%,则颜色可为浅绿色。如果燃料中缺失值的百分比为 0%,则颜色可以为深绿色。 根据 ID 和时间分组后,颜色必须基于燃料中缺失值的百分比。

    id    time                   fuel
0   1     2022-02-26 19:08:33    100
2   1     2022-02-26 20:09:35    None
3   2     2022-02-26 21:09:35    70
4   3     2022-02-26 21:10:55    60
5   4     2022-02-26 21:10:55    None
6   5     2022-02-26 22:12:43    50
7   6     2022-02-26 23:10:50    None

例如,在下面的代码中,我计算了特定 id 每小时缺失值的百分比:

df.set_index('ts').groupby(['id', pd.Grouper(freq='H')])['fuell'].apply(lambda x: x.isnull().mean() * 100)

有什么解决方案吗?

I am really new in big data analysing.
Let's say I have a big data with the following features. I want to visualise the the percentage of missing values (None values) of fuel parameters for every id in specific hour. I want to draw a chart that x-axis is the time series (time column), y-axis is the 'id' and the colour will indicate its missing fuel percentage.
I grouped the data base on 'id' and 'hour'

I don't know how to visualise missing value in a good way for all ids. For example if the percentage of missing value fuel of specific id in specific hour is 100% then the colour in that specific time and for that 'id' can be gray. If percentage of missing value in fuel is 50%, the colour can be light green. If percentage of missing value in fuel is 0% then the colour can be dark green.
The colour must be based to the percentage of missing value in fuel, after grouping based on id and time.

    id    time                   fuel
0   1     2022-02-26 19:08:33    100
2   1     2022-02-26 20:09:35    None
3   2     2022-02-26 21:09:35    70
4   3     2022-02-26 21:10:55    60
5   4     2022-02-26 21:10:55    None
6   5     2022-02-26 22:12:43    50
7   6     2022-02-26 23:10:50    None

So for example, in the following code I computed the percentage of the missing value for every hour for specific id:

df.set_index('ts').groupby(['id', pd.Grouper(freq='H')])['fuell'].apply(lambda x: x.isnull().mean() * 100)

Is there any solution?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

两个我 2025-01-24 03:41:16

更新:热图现在绘制 id时间 与 null fuel 的百分比。我在这篇文章的末尾保留了对 id vs time vs fuel 的原始答案。


我想要类似 github 风格的日历的东西。

要模仿 GitHub 贡献矩阵,请将分组的空百分比重置为数据帧和 透视为每行 1 个 id、每列 1 小时。然后使用 sns.heatmap 为每个单元格着色基于空燃料的百分比。

# convert to proper dtypes
df['time'] = pd.to_datetime(df['time'])
df['fuel'] = pd.to_numeric(df['fuel'], errors='coerce')

# compute null percentage per (id, hour)
nulls = (df.set_index('time')
           .groupby(['id', pd.Grouper(freq='H')])['fuel']
           .apply(lambda x: x.isnull().mean() * 100))

# pivot into id vs time matrix
matrix = (nulls.reset_index(name='null (%)')
               .pivot(index='id', columns='time', values='null (%)'))

# plot time series heatmap
sns.heatmap(matrix, square=True, vmin=0, vmax=100, cmap='magma_r', cbar_kws={'label': 'null (%)'},
            linewidth=1, linecolor='lightgray', clip_on=False,
            xticklabels=matrix.columns.strftime('%b %d, %Y\n%H:%M:%S'))


原始:这是为了按 timefuel 可视化 id

  1. 将数据透视为 >id时间矩阵。通常 pivot 是很好,但由于您的真实数据包含重复索引,请使用 pivot_table
  2. 重新采样 < code>time 列转换为每小时的方式。
  3. 使用 sns.heatmap 绘制时间序列矩阵。
# convert to proper dtypes
df['time'] = pd.to_datetime(df['time'])
df['fuel'] = pd.to_numeric(df['fuel'], errors='coerce')

# pivot into id vs time matrix
matrix = df.pivot_table(index='id', columns='time', values='fuel', dropna=False)

# resample columns into hourly means
matrix = matrix.resample('H', axis=1).mean()

# plot time series heatmap
sns.heatmap(matrix, square=True, cmap='plasma_r', vmin=0, vmax=100, cbar_kws={'label': 'fuel (%)'},
            linewidth=1, linecolor='lightgray', clip_on=False,
            xticklabels=matrix.columns.strftime('%b %d, %Y\n%H:%M:%S'))

Update: The heatmap now plots id vs time vs percentage of null fuel. I've kept my original answer for id vs time vs fuel at the end of this post.


I want something almost like a github style calendar.

To mimic the GitHub contribution matrix, reset the grouped null percentages into a dataframe and pivot into 1 id per row and 1 hour per column. Then use sns.heatmap to color each cell based on percentage of null fuel.

# convert to proper dtypes
df['time'] = pd.to_datetime(df['time'])
df['fuel'] = pd.to_numeric(df['fuel'], errors='coerce')

# compute null percentage per (id, hour)
nulls = (df.set_index('time')
           .groupby(['id', pd.Grouper(freq='H')])['fuel']
           .apply(lambda x: x.isnull().mean() * 100))

# pivot into id vs time matrix
matrix = (nulls.reset_index(name='null (%)')
               .pivot(index='id', columns='time', values='null (%)'))

# plot time series heatmap
sns.heatmap(matrix, square=True, vmin=0, vmax=100, cmap='magma_r', cbar_kws={'label': 'null (%)'},
            linewidth=1, linecolor='lightgray', clip_on=False,
            xticklabels=matrix.columns.strftime('%b %d, %Y\n%H:%M:%S'))


Original: This is for visualizing id by time by fuel:

  1. Pivot into an id vs time matrix. Normally pivot is fine, but since your real data contains duplicate indexes, use pivot_table.
  2. resample the time columns into hourly means.
  3. Plot the time series matrix using sns.heatmap.
# convert to proper dtypes
df['time'] = pd.to_datetime(df['time'])
df['fuel'] = pd.to_numeric(df['fuel'], errors='coerce')

# pivot into id vs time matrix
matrix = df.pivot_table(index='id', columns='time', values='fuel', dropna=False)

# resample columns into hourly means
matrix = matrix.resample('H', axis=1).mean()

# plot time series heatmap
sns.heatmap(matrix, square=True, cmap='plasma_r', vmin=0, vmax=100, cbar_kws={'label': 'fuel (%)'},
            linewidth=1, linecolor='lightgray', clip_on=False,
            xticklabels=matrix.columns.strftime('%b %d, %Y\n%H:%M:%S'))

煮茶煮酒煮时光 2025-01-24 03:41:16

关于缺失值可视化没有正确的答案,我想这取决于您的用途、习惯......

但首先,为了使其发挥作用,我们需要预处理您的数据帧并使其可分析,也就是确保其数据类型。

首先让我们构建数据:

import pandas as pd
from io import StringIO
    
csvfile = StringIO(
"""id   time    fuel
1   2022-02-26 19:08:33 100
2   2022-02-26 19:09:35 70
3   2022-02-26 19:10:55 60
4   2022-02-26 20:10:55 None
5   2022-02-26 21:12:43 50
6   2022-02-26 22:10:50 None""")
df = pd.read_csv(csvfile, sep = '\t', engine='python')

df
Out[65]: 
   id                 time  fuel
0   1  2022-02-26 19:08:33   100
1   2  2022-02-26 19:09:35    70
2   3  2022-02-26 19:10:55    60
3   4  2022-02-26 20:10:55  None
4   5  2022-02-26 21:12:43    50
5   6  2022-02-26 22:10:50  None

在这个阶段,数据框中的几乎所有数据都是与字符串相关的,您需要将燃料和时间转换为非对象数据类型。

df.dtypes
Out[66]: 
id       int64
time    object
fuel    object
dtype: object

时间应转换为日期时间,id 应转换为 int,燃料应转换为 float。事实上,None 应该转换为 np.nan 来表示数值,这需要 float 数据类型。

通过地图,我们可以轻松地将所有 'None' 值更改为 np.nan。我不会在这里深入讨论,但为了简单起见,我将使用 dict 的自定义子类和 __missing__ 实现

df.time = pd.to_datetime(df.time, format = "%Y/%m/%d %H:%M:%S")

class dict_with_missing(dict):
    def __missing__(self, key):
        return key
map_dict = dict_with_missing({'None' : np.nan})
df.fuel = df.fuel.map(map_dict).astype(np.float32)

然后我们有一个干净的数据框:

df
Out[68]: 
   id                time   fuel
0   1 2022-02-26 19:08:33  100.0
1   2 2022-02-26 19:09:35   70.0
2   3 2022-02-26 19:10:55   60.0
3   4 2022-02-26 20:10:55    NaN
4   5 2022-02-26 21:12:43   50.0
5   6 2022-02-26 22:10:50    NaN

df.dtypes
Out[69]: 
id               int64
time    datetime64[ns]
fuel           float32
dtype: object

然后,您可以轻松使用 barmatrix 或来自热图 missingno 模块

msno.bar(df)
msno.matrix(df, sparkline=False)
msno.heatmap(df, cmap="RdYlGn")

这里附注,热图在这里有点无用,因为它比较具有缺失值的列。并且您只有一列缺少值。但对于更大的数据框(〜 5/6 列缺失值),它可能很有用。

为了快速而肮脏的可视化,您还可以打印缺失值的数量(又名 np.nan,在 pandas/numpy 公式中):

df.isna().sum()
Out[72]: 
id      0
time    0
fuel    2
dtype: int64

There is no right answer concerning missing values visualization, I guess it depends on your uses, habits ...

But first, to make it works, we need to preprocess your dataframe and make it analyzable, aka ensure its dtypes.

First let's build our data :

import pandas as pd
from io import StringIO
    
csvfile = StringIO(
"""id   time    fuel
1   2022-02-26 19:08:33 100
2   2022-02-26 19:09:35 70
3   2022-02-26 19:10:55 60
4   2022-02-26 20:10:55 None
5   2022-02-26 21:12:43 50
6   2022-02-26 22:10:50 None""")
df = pd.read_csv(csvfile, sep = '\t', engine='python')

df
Out[65]: 
   id                 time  fuel
0   1  2022-02-26 19:08:33   100
1   2  2022-02-26 19:09:35    70
2   3  2022-02-26 19:10:55    60
3   4  2022-02-26 20:10:55  None
4   5  2022-02-26 21:12:43    50
5   6  2022-02-26 22:10:50  None

At this stage almost all data in our dataframe is string related, you need to convert fuel and time into a non-object dtypes.

df.dtypes
Out[66]: 
id       int64
time    object
fuel    object
dtype: object

Time should be converted as datetime, id as int and fuel as float. Indeed, None should be convert as np.nan for numeric values, which needs the float dtype.

With a map, we can easily change all 'None' values into np.nan. I won't go deeper here, but for simplicity sake, I'll use a custom subclass of dict with a __missing__ implementation

df.time = pd.to_datetime(df.time, format = "%Y/%m/%d %H:%M:%S")

class dict_with_missing(dict):
    def __missing__(self, key):
        return key
map_dict = dict_with_missing({'None' : np.nan})
df.fuel = df.fuel.map(map_dict).astype(np.float32)

Then we have a clean dataframe :

df
Out[68]: 
   id                time   fuel
0   1 2022-02-26 19:08:33  100.0
1   2 2022-02-26 19:09:35   70.0
2   3 2022-02-26 19:10:55   60.0
3   4 2022-02-26 20:10:55    NaN
4   5 2022-02-26 21:12:43   50.0
5   6 2022-02-26 22:10:50    NaN

df.dtypes
Out[69]: 
id               int64
time    datetime64[ns]
fuel           float32
dtype: object

Then, you can easily use bar, matrix or heatmap from the missingno module

msno.bar(df)
msno.matrix(df, sparkline=False)
msno.heatmap(df, cmap="RdYlGn")

A side note here, heatmap is kind of useless here, since it compares columns having missing values. And you only have one column with missing value. But for a bigger dataframe (~ 5/6 columns with missing values) it can be useful.

For a quick and dirty visualization, you can also print the number of missing value (aka np.nan, in pandas/numpy formulation) :

df.isna().sum()
Out[72]: 
id      0
time    0
fuel    2
dtype: int64
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文