Convert strings directly to IDateTime
I am using the new version of data.table, and especially the awesome fread function. My files contain dates that are loaded as strings (because I don't know how to do it otherwise) and look like 01APR2008:09:00:00.

I need to sort the data.table on those datetimes, and for the sort to be efficient I would like to cast them to the IDateTime format (or anything else I don't know about yet).
> strptime("01APR2008:09:00:00","%d%b%Y:%H:%M:%S")
[1] "2008-04-01 09:00:00"
> IDateTime(strptime("01APR2008:09:00:00","%d%b%Y:%H:%M:%S"))
idate itime
1: 2008-04-01 09:00:00
> IDateTime("01APR2008:09:00:00","%d%b%Y:%H:%M:%S")
Error in charToDate(x) :
character string is not in a standard unambiguous format
It looks like I cannot do DT[ , newType := IDateTime(strptime(oldType, "%d%b%Y:%H:%M:%S"))].

My questions are then:
- Is there a way to cast directly to IDateTime from fread, such that I can sort afterward efficiently?
- If not, what is the most efficient way to go, knowing that I would like to be able to sort DT by this datetime column?
Comments (2)
Unfortunately (for efficiency), strptime produces a POSIXlt type, which is unsupported by data.table and always will be due to its size (40 bytes per date!) and structure. Although as.POSIXct produces the much better POSIXct, it still does it via POSIXlt. More info here:

Looking at base functions such as as.Date, it uses strptime too, creating an integer offset from the epoch that is (oddly) stored as double. The IDate (and friends) class in data.table aims to achieve integer epoch offsets stored as, um, integer. That makes it suitable for fast sorting by base::sort.list(method = "radix") (which is really a counting sort). IDate doesn't really aim to be fast at (usually one-off) conversion.

So to convert string dates/times, rightly or wrongly, I tend to roll my own helper function.
If the string date is "2012-12-24", I'd lean towards as.integer(gsub("-", "", col)) and proceed with YYYYMMDD integer dates. Similarly, times can be HHMMSS as an integer. Two separate columns, date and time, can be useful if you generally want to roll = TRUE within a day, but not to the previous day. Grouping by month is simple and fast: by = date %/% 100L. Adding and subtracting days is troublesome, but it is anyway, because rarely do you want to add calendar days; rather weekdays or business days. So that's a lookup to your business day vector anyway.

In your case the character month would need a conversion to 1:12. There isn't a separator in your dates ("01APR2008"), so a substring would be one way, followed by a match or fmatch on the month name. Are you in control of the file format? If so, numbers are better in an unambiguous format that sorts naturally, such as %Y-%m-%d or %Y%m%d.

I haven't yet got to how best to do this in fread, so dates/times are left as character currently, because I'm not yet sure how to detect the date format or which type to output. It does need to output either integer or double dates though, rather than inefficient character. I suspect that my use of YYYYMMDD integers is seen as unconventional, so I'm a little hesitant to make that the default. They have their place, and there are pros and cons of epoch-based dates too. Dates don't always have to be epoch based, is all I'm suggesting.

What do you think? Btw, thanks for the encouragement on fread; was nice to see.
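
The substring-plus-match idea from this answer can be sketched in base R as follows. This is only an illustration under stated assumptions: parse_stamp is a hypothetical helper name (it is not part of data.table), and it assumes every string has exactly the fixed-width layout DDMONYYYY:HH:MM:SS with English month abbreviations.

```r
# Hypothetical helper (not part of data.table): turn "01APR2008:09:00:00"
# into an integer date (YYYYMMDD) and an integer time (HHMMSS) without strptime.
parse_stamp <- function(x) {
  months <- c("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
              "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")
  day   <- as.integer(substring(x, 1, 2))
  month <- match(toupper(substring(x, 3, 5)), months)  # "APR" -> 4L
  year  <- as.integer(substring(x, 6, 9))
  hh    <- as.integer(substring(x, 11, 12))
  mm    <- as.integer(substring(x, 14, 15))
  ss    <- as.integer(substring(x, 17, 18))
  list(date = year * 10000L + month * 100L + day,
       time = hh * 10000L + mm * 100L + ss)
}

stamps <- c("01APR2008:09:00:00", "15JAN2008:17:30:05", "01APR2008:08:59:59")
parsed <- parse_stamp(stamps)

parsed$date                      # 20080401 20080115 20080401
parsed$time                      # 90000 173005 85959

# Integer columns sort chronologically with a plain order()
ord <- order(parsed$date, parsed$time)
stamps[ord]

# Month key for grouping (YYYYMM)
parsed$date %/% 100L             # 200804 200801 200804
```

Because the helper returns a two-element list, it should also be usable on the right-hand side of a two-column assignment in data.table, for example DT[, c("date", "time") := parse_stamp(oldType)], with the column names taken from the question.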
I don't know how your file is structured, but from your comment you want to use the date field as a key. Why not read it as a time series and format it while reading?

Here I use zoo to do it. (I assume that the date column is the first one; otherwise see the index.column argument.)

You get your dates in the right format and sorted.
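
The snippet that originally accompanied this step is not preserved here. A minimal sketch of what reading with zoo could look like, assuming a comma-separated file whose first column holds the timestamp; the sample data, the column names, and tz = "UTC" are all made up for illustration, and %b depends on an English locale:

```r
library(zoo)

# Made-up sample data standing in for the file; timestamp is the first column.
csv <- "stamp,value
01APR2008:09:00:00,1
15JAN2008:17:30:05,2
01APR2008:08:59:59,3"

# Supplying format and tz makes read.zoo parse the index as POSIXct.
# tz = "UTC" is an assumption; use the zone your data was recorded in.
z <- read.zoo(text = csv, header = TRUE, sep = ",",
              index.column = 1, format = "%d%b%Y:%H:%M:%S", tz = "UTC")

index(z)      # POSIXct timestamps, in increasing order
coredata(z)   # the values, reordered to match the sorted index
```

For a real file, the text argument would be replaced by the file path as the first argument.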
The conversion from POSIXct to IDateTime is natural.
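
As a sketch of that conversion without zoo, assuming a data.table DT with a character column oldType (the name used in the question) and timestamps that can be treated as UTC: IDateTime() returns a two-column data.table (idate and itime), so the result is assigned to two columns at once, which is probably why assigning it to a single newType column did not work.

```r
library(data.table)

# Made-up example data; the column name oldType follows the question.
DT <- data.table(
  oldType = c("01APR2008:09:00:00", "15JAN2008:17:30:05", "01APR2008:08:59:59"),
  value   = 1:3
)

# Parse once to POSIXct. %b is locale-dependent (English month names assumed),
# and tz = "UTC" is an assumption about the data.
stamp <- as.POSIXct(DT$oldType, format = "%d%b%Y:%H:%M:%S", tz = "UTC")

# IDateTime() yields two integer-backed columns, so assign both together.
DT[, c("idate", "itime") := IDateTime(stamp)]

# Sort (and key) the table by date, then time.
setkey(DT, idate, itime)
DT
```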
Sure, you still do two conversions here, but the first happens while reading the data, and the second doesn't have to deal with any format problem.