Python OLE2日期格式转换
我创建了一个 python 脚本,该脚本可以从 Word 文档中的 OLE 流中提取数据,但在将 OLE2 格式的时间戳转换为更人类可读的内容时遇到问题:(
提取的时间戳是 12760233021 但我不能我将其转换为 2007 年 3 月 12 日或类似的日期,
非常感谢
编辑: 好的,我已经在我的一个 Word 文档上运行了该脚本,该文档是在 2009 年 10 月 31 日 10:05:00 创建的。 OLE DocumentSummaryInformation 流中的创建日期是 12901417500。
另一个示例是在 2009 年 10 月 27 日 15:33:00 创建的 Word 文档,在 OLE DocumentSummaryInformation 流中给出的创建日期为 12901091580。
有关这些 OLE 流属性的 MSDN 文档为 http ://msdn.microsoft.com/en-us/library/aa380376%28VS.85%29.aspx
提取这些流的 def 如下:
import OleFileIO_PL as ole
def enumerateStreams(item):
# item is an arbitrary file
if ole.isOleFile('%s' % item):
loader = ole.OleFileIO('%s' % item)
# enumerate all the OLE streams in the office file
streams = loader.listdir()
streamProps = []
for stream in streams:
if stream[0] == '\x05SummaryInformation':
# get all the properties fro the SummaryInformation OLE stream
streamProps.append(loader.getproperties(stream))
elif stream[0] == '\x05DocumentSummaryInformation':
# get all the properties from the DocumentSummaryInformation stream
streamProps.append(loader.getproperties(stream))
return streamProps
I have created a python script which pulls data out of OLE streams in Word documents, but am having trouble converting the OLE2-formatted timestamp to something more human-readable :(
The timestamp which is pulled out is 12760233021 but I cannot for the life of me convert this to a date like 12 Mar 2007 or similar.
Any help is greatly appreciated.
EDIT:
OK I have ran the script over one of my word documents, which was created on 31/10/2009, 10:05:00. The Create Date in the OLE DocumentSummaryInformation stream is 12901417500.
Another example is a word doc created on 27/10/2009, 15:33:00, gives the Create Date of 12901091580 in the OLE DocumentSummaryInformation stream.
The MSDN documentation on the properties of these OLE streams is http://msdn.microsoft.com/en-us/library/aa380376%28VS.85%29.aspx
The def which pulls these streams out is given below:
import OleFileIO_PL as ole
def enumerateStreams(item):
# item is an arbitrary file
if ole.isOleFile('%s' % item):
loader = ole.OleFileIO('%s' % item)
# enumerate all the OLE streams in the office file
streams = loader.listdir()
streamProps = []
for stream in streams:
if stream[0] == '\x05SummaryInformation':
# get all the properties fro the SummaryInformation OLE stream
streamProps.append(loader.getproperties(stream))
elif stream[0] == '\x05DocumentSummaryInformation':
# get all the properties from the DocumentSummaryInformation stream
streamProps.append(loader.getproperties(stream))
return streamProps
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
(0) 请澄清“like 12 Mar 2007 或类似”:您的意思是您希望将 11 位 int 转换为 12 Mar 2007,还是“12 Mar 2007”只是为了传达您想要的格式显示日期?如果是后者,您不能通过使用 MS Word 或 OpenOffice.org 的文字处理小工具检查某些文件来提供预期的结果吗?您打算如何验证所提供的解决方案是否确实有效?
(1) 请提供多对(OLE,预期),以便可以更有保证地验证任何建议解决方案的正确操作。如果可能的话,您能否创建具有已知预期值的示例,例如 2000 年 1 月 1 日、2001 年 1 月 1 日、2001 年 1 月 2 日、2001 年 2 月 2 日?
(2) 从“从 OLE 流中提取数据”来看,您是否需要 OLE2 复合文档标头中的文件创建等时间戳,或者是否需要内容中存在的时间戳,这一点并不明显。请说明您在哪里寻找时间戳。如果您可以提供与您感兴趣的时间戳相关的 MS 文档的参考,这也会有很大帮助......当然它必须告诉您格式是什么,即使它通过一两个内部/额外间接这样做-文档跳跃。
(3) 请展示你如何将其拉出来——它是一根绳子吗?固定11字节?或者是 str(从 64 位字段转换而来的某个 int)?怎么转换??除了说明之外,还请显示您的转换代码。不要凭记忆重新输入代码;使用复制/粘贴。
请通过编辑您的问题来提供所需的信息,而不是作为评论。
等待信息时更新:
OLE 复合文档标头中的文件创建和修改时间戳似乎是 64 位小端整数,表示(自 1601-01-01T00:00:00 以来的秒数)* 10 ** 7
。 OLE2 数据中的数据使用的类型似乎是 64 位小端 IEEE 754 浮点,表示自 1899-12-30T00:00:00 以来的(天数及其一小部分)。是的,这一天是 30 号,而不是 31 号。
检查提供的 2 个示例后进行更新:
两个观察到的时间戳(将采用您的本地时间)之间的差异是 325920 秒:
这与两个幻数之间的区别:
所以幻数表示自某个纪元以来的秒数...
因此幻数表示自 1601-01-01T00:00:00Z 以来的秒数,并且您的 TZ 距离 UTC 为 11 小时。
这两个幻数不适合 32 位...看起来要么 (a) 它以 64 位的形式存储为自 1601 年以来的秒数(浪费了大约 29 位!),要么 (b) 它被存储为 (正如预期的那样,自 1601 年以来(100 纳秒单位),但在您看到它之前,有东西将其除以 10**7。
您提供的文档参考仅说明它是
VF_FILETIME (UTC)
类型。谷歌搜索后,我发现了一些关于调用 Windows 函数来操作时间戳的 MS 线索,但据我所知没有定义。然而,有两个第 3 方注释(来自 perlmonks 和 Apache POI 项目)说了很多相同的事情:“”“这看起来像 WindowsVT_FILETIME
数据类型,它是一个 64 位无符号整数,表示自 1601 年 1 月 1 日以来经过的 100 纳秒数"""犯罪现场更新:
看来您正在使用
OleFileIO_PL
读取文件。快速翻阅唯一的源文件可以发现这一点:(0) Please clarify "like 12 Mar 2007 or similar": do you mean that you expect the 11-digit int to convert to 12 Mar 2007, or is "12 Mar 2007" merely intended to convey the format in which you want to display the date? If the latter, can't you provide expected results by inspecting some files with MS Word or OpenOffice.org's word processing gadget? How do you intend to verify that any solution that is offered actually works?
(1) Please give more than one (OLE, expected) pair so that correct operation of any proposed solution can be verified with more assurance. If possible, can you create examples with known expected values like 01 Jan 2000, 01 Jan 2001, 02 Jan 2001, 02 Feb 2001?
(2) It is not obvious from "pulls data out of OLE streams" whether you want the file creation etc timestamps in the OLE2 compound document header, or whether you want timestamps that are present in the content. Please say WHERE you are trawling for timestamps. It would also help tremendously if you could give a reference to the MS documentation that relates to the timestamps you are interested in ... surely it must tell you what the format is, even if it does so indirectly by one or two intra/extra-document hops.
(3) Please show HOW you are pulling that out -- is it a string? fixed 11 bytes? Or is it str(some int that you have converted from a 64-bit field)? Converted HOW?? As well as a description, show your conversion code. Don't retype your code from memory; use copy/paste.
Please provide the requested info by editing your question, not as comments.
Update while waiting for info:
The file creation and modification timestamps in an OLE compound document header appear to be 64-bit little-endian integers representing (seconds since 1601-01-01T00:00:00) * 10 ** 7.
The DATE type used in data in OLE2 data appears to be 64-bit little-endian IEEE 754 float representing (days and a fraction thereof) since 1899-12-30T00:00:00. Yes the day is 30, not 31.
Update after examining the 2 examples supplied:
The difference between the two observed timestamps (which will be in your local time) is 325920 seconds:
This is the same as the difference between the two magic numbers:
So the magic numbers represent seconds since some epoch ...
So the magic numbers represent seconds since 1601-01-01T00:00:00Z and your TZ is 11 hours away from UTC.
Those two magic numbers won't fit in 32 bits ... looks like either (a) it is stored in 64 bits as seconds since 1601 (a waste of about 29 bits!) or (b) it is stored as (number of 100-nanosecond units) since 1601 as expected but something is dividing it by 10**7 before you see it.
The documentation reference that you gave merely says that it's a
VF_FILETIME (UTC)
type. Googling that, I find a couple of MS clues on calling Windows functions to manipulate the timestamps, but no definition as far as I looked. However there are two 3rd party notes (from perlmonks and the Apache POI project) which say much the same thing: """This looks like a WindowsVT_FILETIME
data type which is a 64 bit unsigned integer representing the number of elapsed 100 nanoseconds since 1 January 1601"""Update from the crime scene:
Seems you are using
OleFileIO_PL
to read the files. A quick rummage through the sole source file reveals this:这个问题很老了但仍然有用。我最近改进了 OleFileIO_PL,通过将日期自动转换为 Python 日期时间来解决该问题。
请参阅本页的文档,尤其是有关 get_metadata 和 get_properties 的部分:
https://bitbucket.org/decalage/olefileio_pl
使用 get_metadata 时,标准属性流中的所有时间戳,例如'\x05SummaryInformation' 被转换为 Python 日期时间。
如果您需要使用 get_properties,请使用 conversion_time 选项:
Philippe。
this question is quite old but still useful. I recently improved OleFileIO_PL to fix the issue by converting dates automatically to Python datetime.
See the documentation on this page, especially the parts about get_metadata and get_properties:
https://bitbucket.org/decalage/olefileio_pl
When using get_metadata, all timestamps in standard property streams such as '\x05SummaryInformation' are converted to Python datetime.
If you need to use get_properties instead, then use the convert_time option:
Philippe.