Cast a string directly to IDateTime

Posted 2025-01-14 10:31:54 · 826 words · 0 views · 0 comments

I am using the new version of data.table, and especially the awesome fread function. My files contain dates that are loaded as strings (because I don't know how to do it otherwise), looking like 01APR2008:09:00:00.

I need to sort the data.table on those datetimes, and for the sort to be efficient I need to cast them to the IDateTime format (or anything else I don't know of yet).

> strptime("01APR2008:09:00:00","%d%b%Y:%H:%M:%S")
[1] "2008-04-01 09:00:00"

> IDateTime(strptime("01APR2008:09:00:00","%d%b%Y:%H:%M:%S"))
        idate    itime
1: 2008-04-01 09:00:00

> IDateTime("01APR2008:09:00:00","%d%b%Y:%H:%M:%S")
Error in charToDate(x) : 
character string is not in a standard unambiguous format 

It looks like I cannot do DT[ , newType := IDateTime(strptime(oldType, "%d%b%Y:%H:%M:%S"))].

My questions are then:

  1. Is there a way to cast directly to IDateTime from fread, such that I can sort afterward efficiently?
  2. If not, what is the most efficient way to go, given that I would like to be able to sort DT by this datetime column?

Comments (2)

猫烠⑼条掵仅有一顆心 2025-01-21 10:31:54

Unfortunately (for efficiency) strptime produces a POSIXlt type, which is unsupported by data.table and always will be due to its size (40 bytes per date!) and structure. Although as.POSIXct produces the much better POSIXct, it still gets there via POSIXlt when given a string. More info here:

http://stackoverflow.com/a/12788992/403310

Base functions such as as.Date use strptime too, creating an integer offset from the epoch that is (oddly) stored as double. The IDate class (and friends) in data.table aims for integer epoch offsets stored as, um, integer. That makes them suitable for fast sorting by base::sort.list(method = "radix") (which is really a counting sort). IDate doesn't really aim to be fast at the (usually one-off) conversion.
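The storage difference is easy to check at the console. This is only a small illustration, assuming data.table is installed; the variable names are made up for the example.

```r
library(data.table)

d_base  <- as.Date("2008-04-01")    # base R Date, parsed from a string
d_idate <- as.IDate("2008-04-01")   # data.table IDate
t_itime <- as.ITime("09:00:00")     # data.table ITime (seconds since midnight)

typeof(d_base)    # "double"
typeof(d_idate)   # "integer"
typeof(t_itime)   # "integer"

# Both date classes count days since 1970-01-01; only the storage type differs.
as.integer(d_base) == as.integer(d_idate)
```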

So to convert string dates/times, rightly or wrongly, I tend to roll my own helper function.

If the string date is "2012-12-24" I'd lean towards as.integer(gsub("-", "", col)) and proceed with YYYYMMDD integer dates. Similarly, times can be HHMMSS as an integer. Keeping date and time as two separate columns can be useful if you generally want to roll = TRUE within a day, but not to the previous day. Grouping by month is simple and fast: by = date %/% 100L. Adding and subtracting days is troublesome, but it is anyway, because you rarely want to add calendar days; you want weekdays or business days. So that's a lookup to your business day vector anyway.
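A minimal sketch of the YYYYMMDD idea, with made-up data and column names (date_chr, value and monthly are not from the question):

```r
library(data.table)

DT <- data.table(
  date_chr = c("2012-12-24", "2012-11-30", "2013-01-02", "2012-12-03"),
  value    = c(10, 20, 30, 40)
)

# "2012-12-24" -> 20121224L: strip the separators, then cast to integer
DT[, date := as.integer(gsub("-", "", date_chr, fixed = TRUE))]

# Sorting the integers gives chronological order
setkey(DT, date)

# Integer division by 100 drops the day, leaving YYYYMM
monthly <- DT[, .(total = sum(value)), by = .(month = date %/% 100L)]
```

The integers sort chronologically only because the year comes first; the same trick would not work for a DDMMYYYY layout.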

In your case the character month would need a conversion to 1:12. There isn't a separator in your dates ("01APR2008"), so substring would be one way, followed by a match (or fmatch, from the fastmatch package) on the month name. Are you in control of the file format? If so, numbers are better, in an unambiguous format that sorts naturally such as %Y-%m-%d or %Y%m%d.
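One way such a helper could look for the question's "01APR2008:09:00:00" layout. The function name parse_stamp and the column names are hypothetical, and the fixed character positions are an assumption about the file (two-digit day, three-letter English month, four-digit year):

```r
library(data.table)

# Hypothetical helper: split "DDMONYYYY:HH:MM:SS" by position and return
# YYYYMMDD and HHMMSS integers. month.abb is base R's English "Jan".."Dec",
# so this does not depend on the session locale.
parse_stamp <- function(x) {
  day   <- as.integer(substr(x, 1L, 2L))
  month <- match(toupper(substr(x, 3L, 5L)), toupper(month.abb))
  year  <- as.integer(substr(x, 6L, 9L))
  hh    <- as.integer(substr(x, 11L, 12L))
  mm    <- as.integer(substr(x, 14L, 15L))
  ss    <- as.integer(substr(x, 17L, 18L))
  list(date = year * 10000L + month * 100L + day,
       time = hh * 10000L + mm * 100L + ss)
}

DT <- data.table(
  stamp = c("03APR2008:09:00:00", "01APR2008:09:30:00", "01APR2008:09:00:00"),
  value = 1:3
)

DT[, c("date", "time") := parse_stamp(stamp)]  # two new integer columns
setkey(DT, date, time)                         # chronological order
```

An unrecognised month abbreviation comes back from match as NA, so the resulting date is NA rather than a silently wrong number.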

I haven't yet got to how best to do this in fread, so date/times are currently left as character, because I'm not yet sure how to detect the date format or which type to output. It does need to output either integer or double dates though, rather than inefficient character. I suspect that my use of YYYYMMDD integers is seen as unconventional, so I'm a little hesitant to make that the default. They have their place, and there are pros and cons of epoch-based dates too. All I'm suggesting is that dates don't always have to be epoch based.

What do you think? Btw, thanks for the encouragement on fread; it was nice to see.

装迷糊 2025-01-21 10:31:54

I don't know how your file is structured, but from your comment you want to use the date field as a key. Why not read it as a time series and format it while reading?

Here I use zoo to do it. (I assume that the date column is the first one; otherwise see the index.column argument.)

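library(zoo)

# Parse "01avril2008:09:00:00"-style stamps into POSIXct.
# %b matches month names in the current locale, so "avril" parses only under a French locale.
ff <- function(x) as.POSIXct(strptime(x, "%d%b%Y:%H:%M:%S"))

h <- read.zoo(text = "03avril2008:09:00:00  125
                      02avril2008:09:30:00  126
                      05avril2008:09:10:00  127
                      04avril2008:09:20:00  128
                      01avril2008:09:00:00  128"
                      , FUN = ff)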

You get your dates in the right format and sorted.

The conversion from POSIXct to IDateTime is natural:

    IDateTime(index(h))
        idate    itime
1: 2008-04-01 09:00:00
2: 2008-04-02 09:30:00
3: 2008-04-03 09:00:00
4: 2008-04-04 09:20:00
5: 2008-04-05 09:10:00

Sure, you still do two conversions here, but the first happens while reading the data, and the second involves no format handling.
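For comparison, the same two-step conversion can stay inside data.table without zoo. This is a sketch under a few assumptions: the column names stamp and value are invented, the timestamps are treated as UTC, and LC_TIME is switched to "C" so that %b matches English abbreviations such as APR. Since IDateTime returns two columns (idate and itime), it is assigned to two names at once.

```r
library(data.table)

# %b is locale-dependent; the "C" locale uses English month abbreviations.
invisible(Sys.setlocale("LC_TIME", "C"))

DT <- data.table(
  stamp = c("03APR2008:09:00:00", "01APR2008:09:00:00", "02APR2008:09:30:00"),
  value = c(125L, 128L, 126L)
)

# string -> POSIXct -> (idate, itime), assigned as two integer-backed columns
DT[, c("idate", "itime") := IDateTime(
  as.POSIXct(stamp, format = "%d%b%Y:%H:%M:%S", tz = "UTC")
)]

setkey(DT, idate, itime)  # sort by date, then by time within the day
```

The string parsing still goes through POSIXlt internally, so this is convenient rather than fast; only the stored columns end up as integers.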
