Python OLE2日期格式转换

发布于 2024-08-13 07:03:34 字数 1437 浏览 9 评论 0原文

我创建了一个 python 脚本,该脚本可以从 Word 文档中的 OLE 流中提取数据,但在将 OLE2 格式的时间戳转换为更人类可读的内容时遇到问题:(

提取的时间戳是 12760233021 但我不能我将其转换为 2007 年 3 月 12 日或类似的日期,

非常感谢

编辑: 好的,我已经在我的一个 Word 文档上运行了该脚本,该文档是在 2009 年 10 月 31 日 10:05:00 创建的。 OLE DocumentSummaryInformation 流中的创建日期是 12901417500

另一个示例是在 2009 年 10 月 27 日 15:33:00 创建的 Word 文档,在 OLE DocumentSummaryInformation 流中给出的创建日期为 12901091580。

有关这些 OLE 流属性的 MSDN 文档为 http ://msdn.microsoft.com/en-us/library/aa380376%28VS.85%29.aspx

提取这些流的 def 如下:

import OleFileIO_PL as ole

def enumerateStreams(item):
    # item is an arbitrary file
    if ole.isOleFile('%s' % item):
        loader = ole.OleFileIO('%s' % item)
        # enumerate all the OLE streams in the office file
        streams = loader.listdir()
        streamProps = []
        for stream in streams:
            if stream[0] == '\x05SummaryInformation':
                # get all the properties fro the SummaryInformation OLE stream
                streamProps.append(loader.getproperties(stream))
            elif stream[0] == '\x05DocumentSummaryInformation':
                # get all the properties from the DocumentSummaryInformation stream
                streamProps.append(loader.getproperties(stream))
     return streamProps

I have created a python script which pulls data out of OLE streams in Word documents, but am having trouble converting the OLE2-formatted timestamp to something more human-readable :(

The timestamp which is pulled out is 12760233021 but I cannot for the life of me convert this to a date like 12 Mar 2007 or similar.

Any help is greatly appreciated.

EDIT:
OK I have ran the script over one of my word documents, which was created on 31/10/2009, 10:05:00. The Create Date in the OLE DocumentSummaryInformation stream is 12901417500.

Another example is a word doc created on 27/10/2009, 15:33:00, gives the Create Date of 12901091580 in the OLE DocumentSummaryInformation stream.

The MSDN documentation on the properties of these OLE streams is http://msdn.microsoft.com/en-us/library/aa380376%28VS.85%29.aspx

The def which pulls these streams out is given below:

import OleFileIO_PL as ole

def enumerateStreams(item):
    # item is an arbitrary file
    if ole.isOleFile('%s' % item):
        loader = ole.OleFileIO('%s' % item)
        # enumerate all the OLE streams in the office file
        streams = loader.listdir()
        streamProps = []
        for stream in streams:
            if stream[0] == '\x05SummaryInformation':
                # get all the properties fro the SummaryInformation OLE stream
                streamProps.append(loader.getproperties(stream))
            elif stream[0] == '\x05DocumentSummaryInformation':
                # get all the properties from the DocumentSummaryInformation stream
                streamProps.append(loader.getproperties(stream))
     return streamProps

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

绅士风度i 2024-08-20 07:03:34

(0) 请澄清“like 12 Mar 2007 或类似”:您的意思是您希望将 11 位 int 转换为 12 Mar 2007,还是“12 Mar 2007”只是为了传达您想要的格式显示日期?如果是后者,您不能通过使用 MS Word 或 OpenOffice.org 的文字处理小工具检查某些文件来提供预期的结果吗?您打算如何验证所提供的解决方案是否确实有效?

(1) 请提供多对(OLE,预期),以便可以更有保证地验证任何建议解决方案的正确操作。如果可能的话,您能否创建具有已知预期值的示例,例如 2000 年 1 月 1 日、2001 年 1 月 1 日、2001 年 1 月 2 日、2001 年 2 月 2 日?

(2) 从“从 OLE 流中提取数据”来看,您是否需要 OLE2 复合文档标头中的文件创建等时间戳,或者是否需要内容中存在的时间戳,这一点并不明显。请说明您在哪里寻找时间戳。如果您可以提供与您感兴趣的时间戳相关的 MS 文档的参考,这也会有很大帮助......当然它必须告诉您格式是什么,即使它通过一两个内部/额外间接这样做-文档跳跃。

(3) 请展示你如何将其拉出来——它是一根绳子吗?固定11字节?或者是 str(从 64 位字段转换而来的某个 int)?怎么转换??除了说明之外,还请显示您的转换代码。不要凭记忆重新输入代码;使用复制/粘贴。

请通过编辑您的问题来提供所需的信息,而不是作为评论。

等待信息时更新:

OLE 复合文档标头中的文件创建和修改时间戳似乎是 64 位小端整数,表示(自 1601-01-01T00:00:00 以来的秒数)* 10 ** 7

。 OLE2 数据中的数据使用的类型似乎是 64 位小端 IEEE 754 浮点,表示自 1899-12-30T00:00:00 以来的(天数及其一小部分)。是的,这一天是 30 号,而不是 31 号。

检查提供的 2 个示例后进行更新:

两个观察到的时间戳(将采用您的本地时间)之间的差异是 325920 秒:

>>> import datetime
>>> t0 = datetime.datetime(2009,10,27,15,33,0)
>>> t1 = datetime.datetime(2009,10,31,10,5,0)
>>> t1-t0
datetime.timedelta(3, 66720)
>>> secs = 3 * 24 * 60 * 60 + 66720
>>> secs
325920

这与两个幻数之间的区别:

>>> 12901417500 - 1290191580
325920

所以幻数表示自某个纪元以来的秒数...

>>> m1 = 12901417500
>>> days, seconds = divmod(m1, 60*60*24)
>>> epoch = t1 - datetime.timedelta(days, seconds)
>>> epoch
datetime.datetime(1601, 1, 1, 11, 0)

因此幻数表示自 1601-01-01T00:00:00Z 以来的秒数,并且您的 TZ 距离 UTC 为 11 小时。

这两个幻数不适合 32 位...看起来要么 (a) 它以 64 位的形式存储为自 1601 年以来的秒数(浪费了大约 29 位!),要么 (b) 它被存储为 (正如预期的那样,自 1601 年以来(100 纳秒单位),但在您看到它之前,有东西将其除以 10**7。

您提供的文档参考仅说明它是 VF_FILETIME (UTC) 类型。谷歌搜索后,我发现了一些关于调用 Windows 函数来操作时间戳的 MS 线索,但据我所知没有定义。然而,有两个第 3 方注释(来自 perlmonks 和 Apache POI 项目)说了很多相同的事情:“”“这看起来像 Windows VT_FILETIME 数据类型,它是一个 64 位无符号整数,表示自 1601 年 1 月 1 日以来经过的 100 纳秒数"""

犯罪现场更新:

看来您正在使用 OleFileIO_PL 读取文件。快速翻阅唯一的源文件可以发现这一点:

    elif type == VT_FILETIME:
        value = long(i32(s, offset+4)) + (long(i32(s, offset+8))<<32)
        # FIXME: this is a 64-bit int: "number of 100ns periods
        # since Jan 1,1601".  Should map this to Python time
        value = value / 10000000L # seconds

(0) Please clarify "like 12 Mar 2007 or similar": do you mean that you expect the 11-digit int to convert to 12 Mar 2007, or is "12 Mar 2007" merely intended to convey the format in which you want to display the date? If the latter, can't you provide expected results by inspecting some files with MS Word or OpenOffice.org's word processing gadget? How do you intend to verify that any solution that is offered actually works?

(1) Please give more than one (OLE, expected) pair so that correct operation of any proposed solution can be verified with more assurance. If possible, can you create examples with known expected values like 01 Jan 2000, 01 Jan 2001, 02 Jan 2001, 02 Feb 2001?

(2) It is not obvious from "pulls data out of OLE streams" whether you want the file creation etc timestamps in the OLE2 compound document header, or whether you want timestamps that are present in the content. Please say WHERE you are trawling for timestamps. It would also help tremendously if you could give a reference to the MS documentation that relates to the timestamps you are interested in ... surely it must tell you what the format is, even if it does so indirectly by one or two intra/extra-document hops.

(3) Please show HOW you are pulling that out -- is it a string? fixed 11 bytes? Or is it str(some int that you have converted from a 64-bit field)? Converted HOW?? As well as a description, show your conversion code. Don't retype your code from memory; use copy/paste.

Please provide the requested info by editing your question, not as comments.

Update while waiting for info:

The file creation and modification timestamps in an OLE compound document header appear to be 64-bit little-endian integers representing (seconds since 1601-01-01T00:00:00) * 10 ** 7.

The DATE type used in data in OLE2 data appears to be 64-bit little-endian IEEE 754 float representing (days and a fraction thereof) since 1899-12-30T00:00:00. Yes the day is 30, not 31.

Update after examining the 2 examples supplied:

The difference between the two observed timestamps (which will be in your local time) is 325920 seconds:

>>> import datetime
>>> t0 = datetime.datetime(2009,10,27,15,33,0)
>>> t1 = datetime.datetime(2009,10,31,10,5,0)
>>> t1-t0
datetime.timedelta(3, 66720)
>>> secs = 3 * 24 * 60 * 60 + 66720
>>> secs
325920

This is the same as the difference between the two magic numbers:

>>> 12901417500 - 1290191580
325920

So the magic numbers represent seconds since some epoch ...

>>> m1 = 12901417500
>>> days, seconds = divmod(m1, 60*60*24)
>>> epoch = t1 - datetime.timedelta(days, seconds)
>>> epoch
datetime.datetime(1601, 1, 1, 11, 0)

So the magic numbers represent seconds since 1601-01-01T00:00:00Z and your TZ is 11 hours away from UTC.

Those two magic numbers won't fit in 32 bits ... looks like either (a) it is stored in 64 bits as seconds since 1601 (a waste of about 29 bits!) or (b) it is stored as (number of 100-nanosecond units) since 1601 as expected but something is dividing it by 10**7 before you see it.

The documentation reference that you gave merely says that it's a VF_FILETIME (UTC) type. Googling that, I find a couple of MS clues on calling Windows functions to manipulate the timestamps, but no definition as far as I looked. However there are two 3rd party notes (from perlmonks and the Apache POI project) which say much the same thing: """This looks like a Windows VT_FILETIME data type which is a 64 bit unsigned integer representing the number of elapsed 100 nanoseconds since 1 January 1601"""

Update from the crime scene:

Seems you are using OleFileIO_PL to read the files. A quick rummage through the sole source file reveals this:

    elif type == VT_FILETIME:
        value = long(i32(s, offset+4)) + (long(i32(s, offset+8))<<32)
        # FIXME: this is a 64-bit int: "number of 100ns periods
        # since Jan 1,1601".  Should map this to Python time
        value = value / 10000000L # seconds
怀里藏娇 2024-08-20 07:03:34

这个问题很老了但仍然有用。我最近改进了 OleFileIO_PL,通过将日期自动转换为 Python 日期时间来解决该问题。

请参阅本页的文档,尤其是有关 get_metadata 和 get_properties 的部分:
https://bitbucket.org/decalage/olefileio_pl

使用 get_metadata 时,标准属性流中的所有时间戳,例如'\x05SummaryInformation' 被转换为 Python 日期时间。
如果您需要使用 get_properties,请使用 conversion_time 选项:

p = ole.getproperties('specialprops', convert_time=True)

Philippe。

this question is quite old but still useful. I recently improved OleFileIO_PL to fix the issue by converting dates automatically to Python datetime.

See the documentation on this page, especially the parts about get_metadata and get_properties:
https://bitbucket.org/decalage/olefileio_pl

When using get_metadata, all timestamps in standard property streams such as '\x05SummaryInformation' are converted to Python datetime.
If you need to use get_properties instead, then use the convert_time option:

p = ole.getproperties('specialprops', convert_time=True)

Philippe.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文