Does pandas.read_spss misread datetimes as Unix time?

Posted on 01-14 02:55

I have a sav file with a datetime column in %m/%d/%Y string format. When I read it in with pd.read_spss(), which doesn't seem to have any datetime-related arguments, it ends up in what looks like unix time, except that the time would be a few centuries from now with unique values including 13778726400, 13841884800, etc.

When I feed the read column into pd.to_datetime, however, it's not interpreted as the date I would expect, but rather a few seconds after the original unix date in 1970:

pd.to_datetime(df.col)

0      1970-01-01 00:00:13.778726400
1      1970-01-01 00:00:13.841798400
2      1970-01-01 00:00:13.778726400
3      1970-01-01 00:00:13.778035200
4      1970-01-01 00:00:13.841798400

Why is the datetime column being read in this odd format, and why is pd.to_datetime not able to convert it back?

(My current workaround is to set the date columns to string manually in SPSS. Then pyreadstat/pandas.read_spss has no trouble reading them as strings, and pandas.to_datetime can convert them.)

Comments (1)

灵芸 2025-01-21 02:55:18

Dates, times and datetimes are always stored in SPSS as a number, and a display format is then applied on top. SPSS continuously adds new formats while removing others. New formats have to be added manually to the pyreadstat code, while old formats stay in the code for backward compatibility. So the problem is that you have found a new date/datetime/time format that is not registered in pyreadstat.

Another workaround would have been to open the file in SPSS and store the column as a date/datetime/time, but with a different format that pyreadstat recognises, for example DATE11, DATETIME20, etc. (the current list that pyreadstat accepts is [https://github.com/Roche/pyreadstat/blob/master/pyreadstat/_readstat_parser.pyx#L52-L54]).

The best thing to do when you hit this is to submit a GitHub issue describing the new format so that it can be added. I just added a few that I found in the most recent SPSS documentation, so hopefully your problem will be solved in the next release (already available on dev). If not, please submit an issue with a reproducible example.

The numbers SPSS uses to store dates are not Unix time, but either the number of seconds (in the case of datetimes or times) or days (in the case of dates) since 1582-10-14 (the start of the Gregorian calendar). So you would need something like this to calculate it manually:

from datetime import datetime, date, timedelta
origin = date(1582, 10, 14)
myspssvalue = 13778726400
delta = timedelta(seconds=myspssvalue) # or days=myspssvalue if date
python_date = origin + delta
print(python_date)
# prints 2019-06-01
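
The same calculation can be sketched for several values at once. This is only an illustration: the helper name spss_seconds_to_datetime and the constant names are hypothetical (they are not part of pandas or pyreadstat), and it assumes the values are seconds since 1582-10-14, as the values in the question appear to be.

```python
from datetime import datetime, timedelta

# SPSS epoch as described above: 1582-10-14 00:00:00
SPSS_ORIGIN = datetime(1582, 10, 14)
# Seconds between the SPSS epoch and the Unix epoch (1970-01-01)
SPSS_TO_UNIX_OFFSET = int((datetime(1970, 1, 1) - SPSS_ORIGIN).total_seconds())


def spss_seconds_to_datetime(value):
    """Hypothetical helper: convert SPSS seconds to a naive datetime."""
    return SPSS_ORIGIN + timedelta(seconds=value)


# Values taken from the output shown in the question
values = [13778726400, 13841798400, 13778035200]
converted = [spss_seconds_to_datetime(v) for v in values]

for v, d in zip(values, converted):
    print(v, "->", d.date())
```

With pandas, the same shift can probably be done in one step by subtracting the offset and treating the result as Unix seconds, for example pd.to_datetime(df.col - SPSS_TO_UNIX_OFFSET, unit="s"), where df.col is the column from the question.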

Also, if you assume this number is the number of seconds since 1970, you get a date centuries in the future (the hour shown depends on your local time zone):

>>> datetime.fromtimestamp(13778726400)
datetime.datetime(2406, 8, 19, 2, 0)

What pandas is doing is treating the number you give it as the number of nanoseconds since 1970-01-01 (it converts it to datetime64[ns]), and that is why you get a date very close to 1970.
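
To make the two readings concrete, here is a small standard-library sketch (no pandas involved, variable names are illustrative) that interprets the same number once as nanoseconds and once as seconds since 1970-01-01. Python's timedelta has no nanosecond unit, so the nanosecond reading is truncated to microseconds here.

```python
from datetime import datetime, timedelta

spss_value = 13778726400
unix_epoch = datetime(1970, 1, 1)

# Read as nanoseconds since 1970 (truncated to whole microseconds,
# because timedelta does not support nanoseconds)
as_nanoseconds = unix_epoch + timedelta(microseconds=spss_value // 1000)

# Read as seconds since 1970 (timezone-naive, i.e. no local offset applied)
as_seconds = unix_epoch + timedelta(seconds=spss_value)

print(as_nanoseconds)  # about 13.78 seconds after the Unix epoch
print(as_seconds)      # a date in the year 2406
```

Neither reading gives the intended date, because the value is counted from 1582-10-14 rather than from 1970-01-01.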
