Python Polars: Read a column as datetime

Posted on 2025-01-17 02:17:16

How does one read a csv into a polars DataFrame and parse one of the columns as a datetime?

Alternatively, how does one convert a column to a pl.Datetime?

Comments (2)

痴骨ら 2025-01-24 02:17:17

I would first try try_parse_dates=True in the read_csv call.

For example, let's say we have the following data:

import polars as pl

my_csv = (
b"""
ID,start,last_updt,end
1,2008-10-31,2020-11-28 12:48:53,12/31/2008
2,2007-10-31,2021-11-29 01:37:20,12/31/2007
3,2006-10-31,2021-11-30 23:22:05,12/31/2006
"""
)

pl.read_csv(my_csv, try_parse_dates=True)
shape: (3, 4)
┌─────┬────────────┬─────────────────────┬────────────┐
│ ID  ┆ start      ┆ last_updt           ┆ end        │
│ --- ┆ ---        ┆ ---                 ┆ ---        │
│ i64 ┆ date       ┆ datetime[μs]        ┆ str        │
╞═════╪════════════╪═════════════════════╪════════════╡
│ 1   ┆ 2008-10-31 ┆ 2020-11-28 12:48:53 ┆ 12/31/2008 │
│ 2   ┆ 2007-10-31 ┆ 2021-11-29 01:37:20 ┆ 12/31/2007 │
│ 3   ┆ 2006-10-31 ┆ 2021-11-30 23:22:05 ┆ 12/31/2006 │
└─────┴────────────┴─────────────────────┴────────────┘

The start column parsed as a Date, and the last_updt column parsed as a Datetime. But notice that the end column did not parse as a date because it is not in ISO 8601 format. (I've come across plenty of csv files where Date/Datetime fields were non-standard.)

To parse this column, we can use the .str.to_date() function and supply the appropriate format.

pl.read_csv(my_csv, try_parse_dates=True).with_columns(pl.col('end').str.to_date('%m/%d/%Y'))
shape: (3, 4)
┌─────┬────────────┬─────────────────────┬────────────┐
│ ID  ┆ start      ┆ last_updt           ┆ end        │
│ --- ┆ ---        ┆ ---                 ┆ ---        │
│ i64 ┆ date       ┆ datetime[μs]        ┆ date       │
╞═════╪════════════╪═════════════════════╪════════════╡
│ 1   ┆ 2008-10-31 ┆ 2020-11-28 12:48:53 ┆ 2008-12-31 │
│ 2   ┆ 2007-10-31 ┆ 2021-11-29 01:37:20 ┆ 2007-12-31 │
│ 3   ┆ 2006-10-31 ┆ 2021-11-30 23:22:05 ┆ 2006-12-31 │
└─────┴────────────┴─────────────────────┴────────────┘
夜访吸血鬼 2025-01-24 02:17:17

Polars supports two CSV readers: a built-in one and one based on pyarrow. The pyarrow reader supports parsing dates directly; see also https://github.com/pola-rs/polars/issues/1330. You can set use_pyarrow=True in read_csv, but per the documentation, the pyarrow reader is only used when the other arguments passed to read_csv allow it.

Alternatively, read the column as Utf8 (string) and parse it to a Datetime with strptime: https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.internals.series.StringNameSpace.strptime.html?highlight=strptime#polars.internals.series.StringNameSpace.strptime. I typically find this method easier, but depending on the size of your data it may be relatively expensive, because the column is first stored as Utf8 and then parsed.
