Visualise missing values in a time series heatmap
I am really new to big data analysis.
Let's say I have a large dataset with the following columns. I want to visualise the percentage of missing values (None values) of the fuel parameter for every id in a specific hour. I want to draw a chart where the x-axis is the time series (time column), the y-axis is the 'id', and the colour indicates the percentage of missing fuel values.
I grouped the data based on 'id' and 'hour'.
I don't know how to visualise the missing values in a good way for all ids. For example, if the percentage of missing fuel values for a specific id in a specific hour is 100%, the colour for that time and that 'id' can be gray. If the percentage of missing fuel values is 50%, the colour can be light green. If it is 0%, the colour can be dark green.
The colour must be based on the percentage of missing fuel values, after grouping by id and time.
   id  time                 fuel
0  1   2022-02-26 19:08:33  100
2  1   2022-02-26 20:09:35  None
3  2   2022-02-26 21:09:35  70
4  3   2022-02-26 21:10:55  60
5  4   2022-02-26 21:10:55  None
6  5   2022-02-26 22:12:43  50
7  6   2022-02-26 23:10:50  None
So, for example, in the following code I computed the percentage of missing values for every hour for each id:
df.set_index('time').groupby(['id', pd.Grouper(freq='H')])['fuel'].apply(lambda x: x.isnull().mean() * 100)
Is there any solution?
Update: The heatmap now plots id vs time vs percentage of null fuel. I've kept my original answer for id vs time vs fuel at the end of this post.
To mimic the GitHub contribution matrix, reset the grouped null percentages into a dataframe and pivot into 1 id per row and 1 hour per column. Then use sns.heatmap to color each cell based on the percentage of null fuel.
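The code that accompanied this step did not survive on this page, so the following is only a sketch of the approach described, not the answerer's original code. It assumes the question's sample columns (id, time, fuel); the three-colour palette is an assumption taken from the colours requested in the question.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap

# Sample rows from the question; None marks a missing fuel reading.
df = pd.DataFrame({
    "id": [1, 1, 2, 3, 4, 5, 6],
    "time": pd.to_datetime([
        "2022-02-26 19:08:33", "2022-02-26 20:09:35", "2022-02-26 21:09:35",
        "2022-02-26 21:10:55", "2022-02-26 21:10:55", "2022-02-26 22:12:43",
        "2022-02-26 23:10:50",
    ]),
    "fuel": [100, None, 70, 60, None, 50, None],
})

# Percentage of null fuel per id per hour, reset into a flat dataframe.
null_pct = (
    df.assign(hour=df["time"].dt.floor("h"))
      .groupby(["id", "hour"])["fuel"]
      .apply(lambda s: s.isna().mean() * 100)
      .reset_index(name="null_pct")
)

# One id per row, one hour per column; id/hour pairs without rows stay NaN.
matrix = null_pct.pivot(index="id", columns="hour", values="null_pct")

# Assumed palette: 0% dark green, 50% light green, 100% gray.
cmap = LinearSegmentedColormap.from_list(
    "null_pct", ["darkgreen", "lightgreen", "gray"]
)

fig, ax = plt.subplots(figsize=(8, 3))
sns.heatmap(
    matrix, cmap=cmap, vmin=0, vmax=100, linewidths=1, ax=ax,
    xticklabels=matrix.columns.strftime("%H:%M").tolist(),
    cbar_kws={"label": "% null fuel"},
)
fig.tight_layout()
```

In a script, call plt.show() or fig.savefig(...) afterwards to see the figure. Cells with no rows at all for an id/hour pair are left blank rather than coloured, which keeps "no data" distinct from "0% missing".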
Original: This is for visualizing id by time by fuel:
1. Pivot into an id vs time matrix. Normally pivot is fine, but since your real data contains duplicate indexes, use pivot_table.
2. resample the time columns into hourly means.
3. Plot the time series matrix using sns.heatmap.
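The three steps of the original approach could look roughly like this. It is a sketch on the question's sample rows, not the answerer's lost code; transposing before resample is my own choice, made so that the time axis is the index when resampling.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Sample rows from the question; None marks a missing fuel reading.
df = pd.DataFrame({
    "id": [1, 1, 2, 3, 4, 5, 6],
    "time": pd.to_datetime([
        "2022-02-26 19:08:33", "2022-02-26 20:09:35", "2022-02-26 21:09:35",
        "2022-02-26 21:10:55", "2022-02-26 21:10:55", "2022-02-26 22:12:43",
        "2022-02-26 23:10:50",
    ]),
    "fuel": [100, None, 70, 60, None, 50, None],
})

# 1. id vs time matrix. pivot_table averages duplicate (id, time) pairs,
#    and by default drops pairs whose fuel is entirely NaN.
matrix = df.pivot_table(index="id", columns="time", values="fuel")

# 2. Resample the time columns into hourly means
#    (transpose so that time is the index, then transpose back).
hourly = matrix.T.resample("h").mean().T

# 3. Plot the time series matrix.
fig, ax = plt.subplots(figsize=(8, 3))
sns.heatmap(
    hourly, cmap="Greens", linewidths=1, ax=ax,
    xticklabels=hourly.columns.strftime("%H:%M").tolist(),
    cbar_kws={"label": "mean fuel"},
)
fig.tight_layout()
```

Note that this version colours cells by the mean fuel value, so missing readings only show up as blank cells.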
There is no single right answer for visualizing missing values; I guess it depends on your use case and habits.
But first, to make it work, we need to preprocess your dataframe and make it analyzable, i.e. make sure its dtypes are correct.
First let's build our data:
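The original snippet is missing from this page. One way to rebuild the question's sample, assuming every value arrives as a string (as it would after pasting the table as text), is:

```python
import pandas as pd

# Every field is a string here, including the literal text 'None'.
rows = [
    ("1", "2022-02-26 19:08:33", "100"),
    ("1", "2022-02-26 20:09:35", "None"),
    ("2", "2022-02-26 21:09:35", "70"),
    ("3", "2022-02-26 21:10:55", "60"),
    ("4", "2022-02-26 21:10:55", "None"),
    ("5", "2022-02-26 22:12:43", "50"),
    ("6", "2022-02-26 23:10:50", "None"),
]
df = pd.DataFrame(rows, columns=["id", "time", "fuel"])

print(df.dtypes)  # none of the columns is numeric or datetime yet
```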
At this stage almost all the data in our dataframe is stored as strings, so you need to convert fuel and time to non-object dtypes.
Time should be converted to datetime, id to int and fuel to float. In particular, None should be converted to np.nan, which requires the float dtype for numeric values.
With a map, we can easily change all 'None' values into np.nan. I won't go deeper here, but for simplicity's sake, I'll use a custom subclass of dict with a __missing__ implementation.
Then we have a clean dataframe:
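A minimal sketch of that conversion follows. The class name NoneToNaN is hypothetical, since the answer's own code is not preserved here; it relies on the documented behaviour that Series.map uses a dict subclass's __missing__ for keys that are not in the mapping.

```python
import numpy as np
import pandas as pd


class NoneToNaN(dict):
    """Hypothetical helper: keys not in the dict are returned unchanged."""

    def __missing__(self, key):
        return key


# String-typed sample, as it looks before any cleaning.
df = pd.DataFrame({
    "id": ["1", "1", "2", "3", "4", "5", "6"],
    "time": [
        "2022-02-26 19:08:33", "2022-02-26 20:09:35", "2022-02-26 21:09:35",
        "2022-02-26 21:10:55", "2022-02-26 21:10:55", "2022-02-26 22:12:43",
        "2022-02-26 23:10:50",
    ],
    "fuel": ["100", "None", "70", "60", "None", "50", "None"],
})

df["id"] = df["id"].astype(int)
df["time"] = pd.to_datetime(df["time"])
# 'None' becomes NaN; every other string passes through and is cast to float.
df["fuel"] = df["fuel"].map(NoneToNaN({"None": np.nan})).astype(float)

print(df.dtypes)
print(df)
```

A shorter alternative for this particular case would be pd.to_numeric(df["fuel"], errors="coerce"), at the cost of silently turning any other unparsable string into NaN as well.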
Then, you can easily use bar, matrix or heatmap from the missingno module.
A side note here: heatmap is not very useful in this case, since it compares columns that have missing values, and you only have one such column. But for a bigger dataframe (around 5 or 6 columns with missing values) it can be useful.
For a quick and dirty check, you can also print the number of missing values (np.nan, in pandas/numpy terms) per column:
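The snippet itself is not preserved here; with plain pandas this is usually done with isna().sum(), sketched below on the cleaned sample data.

```python
import numpy as np
import pandas as pd

# Cleaned sample data: fuel is float, missing readings are NaN.
df = pd.DataFrame({
    "id": [1, 1, 2, 3, 4, 5, 6],
    "time": pd.to_datetime([
        "2022-02-26 19:08:33", "2022-02-26 20:09:35", "2022-02-26 21:09:35",
        "2022-02-26 21:10:55", "2022-02-26 21:10:55", "2022-02-26 22:12:43",
        "2022-02-26 23:10:50",
    ]),
    "fuel": [100, np.nan, 70, 60, np.nan, 50, np.nan],
})

# Number of missing values per column (fuel has 3, the others 0).
missing_counts = df.isna().sum()
print(missing_counts)

# Same information as a percentage of rows.
missing_pct = df.isna().mean() * 100
print(missing_pct)
```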