Converting timestamps in a large dataset to multiple time zones

Posted on 2025-01-19 17:46:45

I have a large dataset with about 9 million rows and 4 columns, one of which is a UTC timestamp. The data was recorded at 507 sites across Australia, and there is a site ID column. I have another dataset that holds the time zone for each site ID in the format 'Australia/Brisbane'. I've written a function to create a new column in the main dataset containing the UTC timestamp converted to local time. However, the wrong local time is being matched up with the UTC timestamp: for example, 2019-01-05 12:10:00+00:00 ends up next to 2019-01-13 18:55:00+11:00 (a different date and the wrong time zone). I believe the sites are not mixed up in the data, but I've tried sorting the data in case that was the problem. Below are my code and images of the first row of each dataset. Any help is much appreciated!

import pytz
from dateutil import tz

def update_timezone(df):
    newtimes = []
    df = df.sort_values('site_id')
    sites = df['site_id'].unique().tolist()
    for site in sites:
        timezone = solarbom.loc[solarbom['site_id'] == site].iloc[0, 1]
        dfsub = df[df['site_id'] == site].copy()
        dfsub['utc_timestamp'] = dfsub['utc_timestamp'].dt.tz_convert(timezone)
        newtimes.extend(dfsub['utc_timestamp'].tolist())
    df['newtimes'] = newtimes

[Image: first row of the main large dataset]
[Image: first row of the site info dataset]


Comments (1)

空名 2025-01-26 17:46:45

IIUC, you want to group your data by site ID and then convert the timestamps using the time zone specific to each ID. You can do this with groupby, applying a converter function to each group. For example:

import pandas as pd

# dummy data:
df = pd.DataFrame({'utc_timestamp': [pd.Timestamp("2022-01-01 00:00 Z"),
                                     pd.Timestamp("2022-01-01 01:00 Z"),
                                     pd.Timestamp("2022-01-05 00:00 Z"),
                                     pd.Timestamp("2022-01-03 00:00 Z"),
                                     pd.Timestamp("2022-01-03 01:00 Z"),
                                     pd.Timestamp("2022-01-03 02:00 Z")],
                   'site_id': [1, 1, 5, 3, 3, 3],
                   'values': [11, 11, 55, 33, 33, 33]})

# time zone info for each ID:
timezdf = pd.DataFrame({'site_id': [1, 3, 5],
                        'timezone_id_x': ["Australia/Adelaide", "Australia/Perth", "Australia/Darwin"]})

### what we want:
# for row, data in timezdf.iterrows():
#     print(f"ID: {data['site_id']}, tz: {data['timezone_id_x']}")
#     print(pd.Timestamp("2022-01-01 00:00 Z"), "to", pd.Timestamp("2022-01-01 00:00 Z").tz_convert(data['timezone_id_x']))

# ID: 1, tz: Australia/Adelaide
# 2022-01-01 00:00:00+00:00 to 2022-01-01 10:30:00+10:30
# ID: 3, tz: Australia/Perth
# 2022-01-01 00:00:00+00:00 to 2022-01-01 08:00:00+08:00
# ID: 5, tz: Australia/Darwin
# 2022-01-01 00:00:00+00:00 to 2022-01-01 09:30:00+09:30
###

def converter(group, timezdf):
    # get the time zone by looking for the current group ID in timezdf
    z = timezdf.loc[timezdf["site_id"] == group["site_id"].iloc[0], 'timezone_id_x'].iloc[0]
    group["localtime"] = group["localtime"].dt.tz_convert(z)
    return group

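# start from a copy of the UTC column, then convert it group by group
df["localtime"] = df["utc_timestamp"]
# group_keys=False keeps the original row index; on pandas 2.0 and later the
# default would add site_id as an outer index level
df = df.groupby("site_id", group_keys=False).apply(lambda g: converter(g, timezdf))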

Now df looks like this:

df
Out[71]: 
              utc_timestamp  site_id  values                  localtime
0 2022-01-01 00:00:00+00:00        1      11  2022-01-01 10:30:00+10:30
1 2022-01-01 01:00:00+00:00        1      11  2022-01-01 11:30:00+10:30
2 2022-01-05 00:00:00+00:00        5      55  2022-01-05 09:30:00+09:30
3 2022-01-03 00:00:00+00:00        3      33  2022-01-03 08:00:00+08:00
4 2022-01-03 01:00:00+00:00        3      33  2022-01-03 09:00:00+08:00
5 2022-01-03 02:00:00+00:00        3      33  2022-01-03 10:00:00+08:00
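
With 507 sites but only a handful of distinct Australian time zones, it may be cheaper to group by time zone name rather than by site, so that tz_convert runs once per zone. The following is a minimal sketch of that variation, not part of the answer above. The names add_localtime, tz_by_site and zones are hypothetical, and it assumes that df has a unique index and that each site_id appears exactly once in timezdf. Because a single column cannot hold several tz-aware dtypes, the result is stored as an object column of Timestamp values.

```python
import pandas as pd

# Hypothetical sample data mirroring the dummy frames used in the answer.
df = pd.DataFrame({
    "utc_timestamp": pd.to_datetime(
        ["2022-01-01 00:00", "2022-01-01 01:00", "2022-01-05 00:00",
         "2022-01-03 00:00", "2022-01-03 01:00", "2022-01-03 02:00"],
        utc=True),
    "site_id": [1, 1, 5, 3, 3, 3],
    "values": [11, 11, 55, 33, 33, 33],
})
timezdf = pd.DataFrame({
    "site_id": [1, 3, 5],
    "timezone_id_x": ["Australia/Adelaide", "Australia/Perth", "Australia/Darwin"],
})


def add_localtime(df, timezdf):
    """Return a copy of df with a 'localtime' column (assumes a unique index)."""
    # Look up the time zone name for every row via its site_id.
    tz_by_site = timezdf.set_index("site_id")["timezone_id_x"]
    zones = df["site_id"].map(tz_by_site)

    # Convert once per distinct time zone instead of once per site.
    pieces = []
    for zone, idx in zones.groupby(zones).groups.items():
        converted = df.loc[idx, "utc_timestamp"].dt.tz_convert(zone)
        # Cast to object so pieces with different zones can share one column.
        pieces.append(converted.astype(object))

    out = df.copy()
    # Assignment aligns on the index, so every local time lands on its own row.
    out["localtime"] = pd.concat(pieces)
    return out


result = add_localtime(df, timezdf)
print(result)
```

Assigning an index-aligned Series, rather than a plain list, is what keeps each local time on the row it came from, regardless of the order in which the groups are processed. Rows whose site_id is missing from timezdf are left as NaN in this sketch.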