PyArrow timestamp keeps converting to 1970
I'm trying to store a timestamp alongside the other data in my dataframe, recording the time the data was written to disk in a Parquet file. Normally I'd just store the timestamp in the pandas dataframe itself, but pyarrow doesn't like the way pandas stores timestamps: whenever I run pa.Table.from_pandas() it complains that it will lose precision converting from nanoseconds to microseconds, no matter what I do. A workaround is to append the timestamp directly as a column of the table, but for some reason pyarrow keeps converting the timestamp to 1970. I have tried multiple workarounds, but nothing seems to work.
Below is a working code example that reproduces the issue. The column isn't actually appended to the table in this example, but it shows the problem: the timestamp returned by datetime.now().timestamp() is correct, but when it is converted to a pyarrow array it resets to 1970.
from datetime import datetime
import pyarrow as pa
import numpy as np
import pandas as pd
data = pd.DataFrame(np.random.uniform(size=(20,10)))
df = pd.DataFrame(data)
df.columns = [str(i) for i in range(data.shape[1])]
schema = [(str(i), pa.float32()) for i in range(data.shape[1])]
schema = pa.schema(schema)
ts = datetime.now().timestamp()
print('DateTime timestamp:', ts)
table = pa.Table.from_pandas(df, schema)
pa_ts = pa.array([ts] * len(table), pa.timestamp('us'))
print('PyArrow timestamp:', pa_ts)
And here's the output I get:
DateTime timestamp: 1650817852.093818
PyArrow timestamp: [
1970-01-01 00:27:30.817852,
1970-01-01 00:27:30.817852,
1970-01-01 00:27:30.817852,
1970-01-01 00:27:30.817852,
1970-01-01 00:27:30.817852,
1970-01-01 00:27:30.817852,
1970-01-01 00:27:30.817852,
1970-01-01 00:27:30.817852,
1970-01-01 00:27:30.817852,
1970-01-01 00:27:30.817852,
1970-01-01 00:27:30.817852,
1970-01-01 00:27:30.817852,
1970-01-01 00:27:30.817852,
1970-01-01 00:27:30.817852,
1970-01-01 00:27:30.817852,
1970-01-01 00:27:30.817852,
1970-01-01 00:27:30.817852,
1970-01-01 00:27:30.817852,
1970-01-01 00:27:30.817852,
1970-01-01 00:27:30.817852
]
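The output above is consistent with the whole-second part of the epoch value (1650817852) being counted as microseconds rather than seconds. A minimal standard-library sketch of that arithmetic follows. It assumes the fractional part is dropped before the conversion, and the names EPOCH, as_microseconds and as_seconds are illustrative only:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Value printed in the output above: seconds since the epoch, as a float.
ts = 1650817852.093818

# Assumption: the float is truncated to an integer, and that integer is
# then counted in the unit of the target type.
as_microseconds = EPOCH + timedelta(microseconds=int(ts))  # unit 'us'
as_seconds = EPOCH + timedelta(seconds=int(ts))            # unit 's'

print(as_microseconds)  # 1970-01-01 00:27:30.817852+00:00
print(as_seconds)       # 2022-04-24 16:30:52+00:00
```

The first result matches the 1970 values in the output, which suggests that the number and the declared unit disagree, not that the number itself is wrong.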
Comments (1)
As FObersteiner mentioned, the issue here was that I was telling pyarrow to treat the value as a microsecond-level timestamp, whereas datetime.now().timestamp() returns seconds since the epoch. In case anyone encounters this issue in the future, it's as simple as changing the 'us' above to 's'. And if you want millisecond-level timestamping, you can do it like so: