Is there a safe and efficient way to fill NaNs in pandas only within a specific time range of the day?

Posted on 2025-01-13 22:21:15

I know that chained assignment in pandas is a hot topic and that there are a huge number of questions about it, but I still cannot find a solution that works in my case.

I am working with irradiance and PV time series data (a pandas DataFrame with a DatetimeIndex). There are gaps in my series, some during the night and others during the day. I would like to replace all the NaNs during the night with zeros, because that is physically sensible (irradiance and PV production at night are zero).

What I came up with so far is something like:

hour_range = [*range(17, 24)] + [*range(0, 9)]
mask = df['irradiance'].isna() & df['irradiance'].index.hour.isin(hour_range)
df.loc[mask, 'irradiance'] = 0

I also tried other solutions, like combining between_time with fillna or using df.mask directly with the inplace option, but I keep getting the dreaded SettingWithCopyWarning. I decided not to use between_time because it does not return a boolean Series and does not make it easy to combine multiple conditions. Maybe I am wrong about this.
I would like to modify df in place for memory efficiency.
Is there a cleaner and safer solution to my problem?
Thanks.
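
For reference, the point about between_time not returning a boolean Series can be worked around with DatetimeIndex.indexer_between_time, which returns integer positions and handles windows that wrap past midnight. The following is only a minimal sketch on synthetic data: the sample values, the gap positions, and the names night_pos and night are assumptions made for illustration, and the window 17:00 to 08:59:59 is chosen to match the hour_range in the question for hourly data.

```python
from datetime import time

import numpy as np
import pandas as pd

# Synthetic hourly data (assumption): two days with a few gaps.
idx = pd.date_range("2024-01-01 00:00", periods=48, freq=pd.Timedelta(hours=1))
values = np.full(len(idx), 100.0)
values[[2, 12, 20, 36]] = np.nan  # 02:00, 12:00, 20:00 on day 1 and 12:00 on day 2
df = pd.DataFrame({"irradiance": values}, index=idx)

# Integer positions of the rows whose time of day falls in the night window.
# Because the start is later than the end, the window wraps past midnight.
night_pos = df.index.indexer_between_time(time(17, 0), time(8, 59, 59))

# Turn the positions into a boolean mask so it can be combined with other conditions.
night = np.zeros(len(df), dtype=bool)
night[night_pos] = True

# Fill only the night-time NaNs, with a single .loc assignment on df itself.
mask = df["irradiance"].isna().to_numpy() & night
df.loc[mask, "irradiance"] = 0.0
```

The day-time gaps stay NaN, so they can still be handled separately (for example by interpolation).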

Comments (1)

笑,眼淚并存 2025-01-20 22:21:15

Here is an example of how to create a time range (if needed), how to create an array of the times you wish to manipulate, and how to alter the 'Data' column based on that array:

import pandas as pd
import numpy as np
import datetime

# Build example data: 24 hourly timestamps, with NaN wherever the random value is even
start_date = datetime.datetime.now()
period_end_date = start_date + datetime.timedelta(hours=24)
dates = np.arange(np.datetime64(start_date), np.datetime64(period_end_date), np.timedelta64(1, 'h'), dtype='datetime64[h]')
data = np.random.randint(1, 100, 24)
df = pd.DataFrame(dates, columns = ['Dates'])
df['Data'] = data
df['Data'] = np.where(df['Data']%2 == 0, np.nan, df['Data'])

# Create a dynamic time range and replace NaN with "Something Else"
start_time = datetime.datetime.now() + datetime.timedelta(hours = 5)
end_time = start_time + datetime.timedelta(hours = 5)
# The time range you wish to manipulate (5 hourly timestamps)
time_range = np.arange(np.datetime64(start_time), np.datetime64(end_time), np.timedelta64(1, 'h'), dtype='datetime64[h]')
# Replace every NaN inside time_range with "Something Else"; leave all other values as they are.
# Note: filling a numeric column with a string turns it into an object column.
df['Data'] = np.where(df['Dates'].isin(time_range), df['Data'].fillna('Something Else'), df['Data'])
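
As a variation on the np.where pattern above, here is a minimal sketch that is closer to the question: the condition is based on the hour of day rather than on specific timestamps, and the fill value is the number 0 so the column stays float. The fixed start date, the deterministic example values, and the names night_hours and in_window are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Deterministic example data (assumption): 24 hourly rows, even values become NaN.
dates = pd.date_range("2024-01-01 00:00", periods=24, freq=pd.Timedelta(hours=1))
data = np.arange(1, 25, dtype=float)
data[data % 2 == 0] = np.nan
df = pd.DataFrame({"Dates": dates, "Data": data})

# Night window from the question: 17:00 up to, but not including, 09:00.
night_hours = list(range(17, 24)) + list(range(0, 9))
in_window = df["Dates"].dt.hour.isin(night_hours)

# Put 0 where the value is NaN and the row is in the night window; keep the rest.
df["Data"] = np.where(in_window & df["Data"].isna(), 0.0, df["Data"])
```

Because the replacement is numeric, NaNs outside the window remain NaN and later arithmetic on the column still works.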