Reading the CR2 (Canon raw image) header with Python

Posted on 2024-09-18 10:29:58

I'm trying to extract the date/time at which a picture was taken from a CR2 file (Canon's format for raw pictures).

I know the CR2 specification, and I know I can use Python's struct module to extract pieces from a binary buffer.

Briefly, the specification says that in tag 0x0132 / 306 I can find a string of length 20: the date and time.

I tried to get that tag by using:

struct.unpack_from(20*'s', buffer, 0x0132)

but I get

('\x00', '\x00', "'", '\x88', ...[and more crap])

Any ideas?

Edit

Many thanks for the thorough effort! The answers are phenomenal and I learned a lot about handling binary data.

Comments (3)

凉宸 2024-09-25 10:29:58

Have you taken into account the header which should (according to the spec) precede the IFD block you're talking about?

I looked through the spec and it says the first IFD block follows the 16-byte header. So if we read bytes 16 and 17 (at offset 0x10) we should get the number of entries in the first IFD block. Then we just have to search through the entries until we find the one with the matching tag ID; its value field (as I read it) gives us the byte offset of your date/time string.

This works for me:

from struct import unpack_from

def FindDateTimeOffsetFromCR2(buffer, ifd_offset):
    # Read the number of entries in IFD #0.
    # The '<' prefix assumes a little-endian ("II") file and forces standard
    # sizes (H = 2 bytes, L = 4 bytes) regardless of platform.
    (num_of_entries,) = unpack_from('<H', buffer, ifd_offset)
    print("ifd #0 contains %d entries" % num_of_entries)

    # Work out where the date time is stored
    datetime_offset = -1
    for entry_num in range(num_of_entries):
        # Each IFD entry is 12 bytes: tag, type, count, value/offset
        (tag_id, tag_type, num_of_value, value) = unpack_from('<HHLL', buffer, ifd_offset + 2 + entry_num * 12)
        if tag_id == 0x0132:
            print("found datetime at offset %d" % value)
            datetime_offset = value
    return datetime_offset

if __name__ == '__main__':
    with open("IMG_6113.CR2", "rb") as f:
        buffer = f.read(1024)  # the first 1 KB of the file should be enough to find the date/time
        datetime_offset = FindDateTimeOffsetFromCR2(buffer, 0x10)
        # '20s' reads one 20-byte string rather than 20 one-byte strings
        print(unpack_from('20s', buffer, datetime_offset))

Output for my example file is:

ifd #0 contains 14 entries
found datetime at offset 250
(b'2010:08:01 23:45:46\x00',)

[edit] - a revised / more thorough example

from struct import unpack_from

recognised_tags = {
    0x0100: 'imageWidth',
    0x0101: 'imageLength',
    0x0102: 'bitsPerSample',
    0x0103: 'compression',
    0x010f: 'make',
    0x0110: 'model',
    0x0111: 'stripOffset',
    0x0112: 'orientation',
    0x0117: 'stripByteCounts',
    0x011a: 'xResolution',
    0x011b: 'yResolution',
    0x0128: 'resolutionUnit',
    0x0132: 'dateTime',
    0x8769: 'EXIF',
    0x8825: 'GPS data',
}

def GetHeaderFromCR2(buffer):
    # The first two bytes give the byte order: "II" (Intel) or "MM" (Motorola)
    endian_flag = '>' if buffer[:2] == b'MM' else '<'

    # Unpack the 16-byte header into a tuple. An explicit byte-order prefix also
    # switches off native sizes and alignment, so the layout is platform independent.
    header = unpack_from(endian_flag + 'HHLHBBL', buffer)

    print("\nbyte_order = 0x%04X" % header[0])
    print("tiff_magic_word = %d" % header[1])
    print("tiff_offset = 0x%08X" % header[2])
    print("cr2_magic_word = %d" % header[3])
    print("cr2_major_version = %d" % header[4])
    print("cr2_minor_version = %d" % header[5])
    print("raw_ifd_offset = 0x%08X\n" % header[6])

    return header

def FindDateTimeOffsetFromCR2(buffer, ifd_offset, endian_flag):
    # Read the number of entries in IFD #0
    (num_of_entries,) = unpack_from(endian_flag + 'H', buffer, ifd_offset)
    print("Image File Directory #0 contains %d entries\n" % num_of_entries)

    # Work out where the date time is stored
    datetime_offset = -1

    # Go through all the entries looking for the datetime field
    print(" id  | type |  number  |  value   ")
    for entry_num in range(num_of_entries):

        # Grab this IFD entry (12 bytes: tag, type, count, value/offset)
        (tag_id, tag_type, num_of_value, value) = unpack_from(endian_flag + 'HHLL', buffer, ifd_offset + 2 + entry_num * 12)

        # Print out the entry for information, with the tag name if we know it
        print("%04X | %04X | %08X | %08X %s" % (tag_id, tag_type, num_of_value, value, recognised_tags.get(tag_id, "")))

        # If this is the datetime one we're looking for, make a note of the offset
        if tag_id == 0x0132:
            assert tag_type == 2
            assert num_of_value == 20
            datetime_offset = value

    return datetime_offset

if __name__ == '__main__':
    with open("IMG_6113.CR2", "rb") as f:
        # the first 1 KB of the file should be enough to find the date/time
        buffer = f.read(1024)

        # Grab the various parts of the header
        (byte_order, tiff_magic_word, tiff_offset, cr2_magic_word, cr2_major_version, cr2_minor_version, raw_ifd_offset) = GetHeaderFromCR2(buffer)

        # Set the endian flag
        if byte_order == 0x4D4D:
            # motorola format
            endian_flag = '>'
        elif byte_order == 0x4949:
            # intel format
            endian_flag = '<'
        else:
            raise ValueError("unknown byte order mark 0x%04X" % byte_order)

        # Search for the datetime entry offset; tiff_offset is where IFD #0 starts (0x10 in a CR2 file)
        datetime_offset = FindDateTimeOffsetFromCR2(buffer, tiff_offset, endian_flag)

        # Read the 20-byte ASCII value and drop the trailing NUL before printing
        (datetime_bytes,) = unpack_from('20s', buffer, datetime_offset)
        print("\nDatetime: " + datetime_bytes.rstrip(b'\x00').decode('ascii') + "\n")

小情绪 2024-09-25 10:29:58

0x0132 is not the offset, it's the tag number of the date. CR2, like TIFF, is a directory-based format. You have to look up the directory entry for the (known) tag you are looking for.
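To make the difference concrete, here is a minimal sketch on a synthetic buffer rather than a real CR2 file. The helper names `build_fake_tiff` and `find_tag` and the tiny one-entry layout are made up for illustration. Slicing at position 0x0132 returns whatever bytes happen to be there, while looking up the entry tagged 0x0132 and following its value offset returns the date string.

```python
import struct

def build_fake_tiff(date=b"2010:08:01 23:45:46\x00"):
    """Build a tiny little-endian TIFF-like buffer with one DateTime entry (illustrative only)."""
    ifd_offset = 8                            # IFD placed right after the 8-byte TIFF header
    value_offset = ifd_offset + 2 + 12 + 4    # after entry count, one entry, next-IFD offset
    header = struct.pack("<2sHI", b"II", 42, ifd_offset)
    ifd = struct.pack("<H", 1)                # the directory holds one entry
    ifd += struct.pack("<HHII", 0x0132, 2, len(date), value_offset)
    ifd += struct.pack("<I", 0)               # 0 means there is no further IFD
    data = header + ifd + date
    return data.ljust(512, b"\xAA")           # pad so that position 0x0132 exists


def find_tag(data, tag):
    """Return (type, count, value) of the first entry in IFD #0 with this tag, or None."""
    endian = "<" if data[:2] == b"II" else ">"
    (ifd_offset,) = struct.unpack_from(endian + "I", data, 4)
    (count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    for i in range(count):
        entry = struct.unpack_from(endian + "HHII", data, ifd_offset + 2 + i * 12)
        if entry[0] == tag:
            return entry[1:]
    return None


data = build_fake_tiff()
wrong = data[0x0132:0x0132 + 20]              # treats the tag number as a file position
field_type, value_count, value_offset = find_tag(data, 0x0132)
right = data[value_offset:value_offset + value_count]
print(wrong)   # padding bytes, not a date
print(right)   # the 20-byte date string
```

In a real file the bytes at position 0x0132 are simply whatever the camera stored there, which is why the original attempt returned garbage.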

Edit:
OK, first of all, you have to find out whether the file data is saved in little-endian or big-endian format. The first eight bytes form the header, and the first two bytes of that header specify the endianness. Python's struct module lets you handle little- and big-endian data by prefixing a format string with either '<' or '>'. So, assuming data is a buffer containing your CR2 image, you can handle endianness via

import struct

header = data[:8]
# data is a bytes object, so compare against a bytes literal
endian_flag = "<" if header[:2] == b"II" else ">"

The format specification states that the first image file directory begins at an offset relative to the beginning of the file, with the offset being specified in the last 4 bytes of the header. So, to get the offset to the first IFD, you can use a line similar to this one:

ifd_offset = struct.unpack("{0}I".format(endian_flag), header[4:])[0]

You can now go ahead and read the first IFD. The number of entries in the directory is stored as a two-byte value at that offset in the file. Thus, you would read the number of entries in the first IFD using:

number_of_entries = struct.unpack("{0}H".format(endian_flag), data[ifd_offset:ifd_offset+2])[0]

A field entry is 12 bytes long, so you can calculate the length of the IFD. After number_of_entries * 12 bytes, there will be another 4 byte long offset, telling you where to look for the next directory. That is basically how you work with TIFF and CR2 images.
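The chain of directories described above can be followed with a short loop. This is a sketch under assumptions: `iter_ifd_offsets` is a hypothetical helper name, and the buffer is a hand-built example with two directories rather than a real CR2 file.

```python
import struct

def iter_ifd_offsets(data):
    """Yield the offset of every IFD by following the next-IFD pointers."""
    endian = "<" if data[:2] == b"II" else ">"
    (offset,) = struct.unpack_from(endian + "I", data, 4)
    seen = set()
    while offset != 0 and offset not in seen:
        seen.add(offset)                      # guard against a chain that loops back
        yield offset
        (count,) = struct.unpack_from(endian + "H", data, offset)
        # The 4-byte pointer sits after the 2-byte count and the 12-byte entries
        (offset,) = struct.unpack_from(endian + "I", data, offset + 2 + count * 12)


# Synthetic example: two IFDs holding one dummy entry each
entry = struct.pack("<HHII", 0x0100, 3, 1, 640)
ifd0_offset = 8
ifd1_offset = ifd0_offset + 2 + 12 + 4
data = struct.pack("<2sHI", b"II", 42, ifd0_offset)
data += struct.pack("<H", 1) + entry + struct.pack("<I", ifd1_offset)
data += struct.pack("<H", 1) + entry + struct.pack("<I", 0)

print(list(iter_ifd_offsets(data)))
```

A next-IFD offset of zero marks the end of the chain, so the loop stops there.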

The "magic" here is to note that with each of the 12 byte field entries, the first two bytes will be the tag ID. And that is where you look for your tag 0x0132. So, given you know that the first IFD starts at ifd_offset in the file, you can scan the first directory via:

current_position = ifd_offset + 2
# The entries end number_of_entries * 12 bytes after the first entry
end_position = current_position + number_of_entries * 12
for field_offset in range(current_position, end_position, 12):
    field_tag = struct.unpack("{0}H".format(endian_flag), data[field_offset:field_offset+2])[0]
    field_type = struct.unpack("{0}H".format(endian_flag), data[field_offset+2:field_offset+4])[0]
    value_count = struct.unpack("{0}I".format(endian_flag), data[field_offset+4:field_offset+8])[0]
    value_offset = struct.unpack("{0}I".format(endian_flag), data[field_offset+8:field_offset+12])[0]

    if field_tag == 0x0132:
        # You are now reading a field entry containing the date and time
        assert field_type == 2 # Type 2 is ASCII
        assert value_count == 20 # You would expect a string length of 20 here
        (date_time,) = struct.unpack("20s", data[value_offset:value_offset+20])
        print(date_time)

You'd obviously want to refactor that unpacking into a common function and probably wrap the whole format into a nice class, but that is beyond the scope of this example. You can also shorten the unpacking by combining multiple format strings into one, yielding a larger tuple containing all the fields you can unpack into distinct variables, which I left out for clarity.
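One possible shape for that refactoring, as a sketch rather than a finished design: a single combined format string reads all four fields of an entry at once, and a small function returns the whole directory. The names `IfdEntry` and `read_ifd` are invented for this example, and the input is a synthetic big-endian directory.

```python
import struct
from collections import namedtuple

IfdEntry = namedtuple("IfdEntry", "tag field_type count value")

def read_ifd(data, ifd_offset, endian_flag):
    """Return all entries of the IFD at ifd_offset as IfdEntry tuples."""
    # One combined format: tag (H), type (H), count (I), value or offset (I)
    entry_struct = struct.Struct(endian_flag + "HHII")
    (number_of_entries,) = struct.unpack_from(endian_flag + "H", data, ifd_offset)
    first_entry = ifd_offset + 2
    return [
        IfdEntry(*entry_struct.unpack_from(data, first_entry + i * entry_struct.size))
        for i in range(number_of_entries)
    ]


# Synthetic big-endian IFD placed at offset 0, purely for demonstration
data = struct.pack(">H", 2)
data += struct.pack(">HHII", 0x0112, 3, 1, 0)      # orientation (dummy value)
data += struct.pack(">HHII", 0x0132, 2, 20, 250)   # dateTime, value stored at offset 250
entries = read_ifd(data, 0, ">")
for entry in entries:
    print(entry)
```

Using `struct.Struct` compiles the format once and exposes its `size` (12 bytes here), which avoids hard-coding the entry length.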

原野 2024-09-25 10:29:58

I found that EXIF.py from https://github.com/ianare/exif-py reads the EXIF data from .CR2 files. It seems that, because .CR2 files are based on .TIFF files, EXIF.py is compatible.

    import time

    # Current releases of exif-py install as the "exifread" package;
    # older releases shipped a single EXIF.py module instead.
    import exifread

    # Change the filename to be suitable for you
    with open('../DCIM/100CANON/IMG_3432.CR2', 'rb') as f:
        data = exifread.process_file(f)
    date_str = data['EXIF DateTimeOriginal'].values

    # We have the raw data
    print(date_str)

    # We can now convert it
    date = time.strptime(date_str, '%Y:%m:%d %H:%M:%S')
    print(date)

And this prints:

    2011:04:30 11:08:44
    time.struct_time(tm_year=2011, tm_mon=4, tm_mday=30, tm_hour=11, tm_min=8, tm_sec=44, tm_wday=5, tm_yday=120, tm_isdst=-1)
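
If a `datetime` object is more convenient than a time tuple, the same format string works with `datetime.strptime`. A small sketch, where `parse_exif_datetime` is a hypothetical helper name; it also accepts the raw bytes form with a trailing NUL, which is what you get when the value is read straight from the file:

```python
from datetime import datetime

def parse_exif_datetime(raw):
    """Parse an EXIF/TIFF DateTime value given as bytes or str."""
    if isinstance(raw, bytes):
        raw = raw.decode("ascii")
    raw = raw.rstrip("\x00 ")                 # drop the NUL terminator and any padding
    return datetime.strptime(raw, "%Y:%m:%d %H:%M:%S")


from_library = parse_exif_datetime("2011:04:30 11:08:44")
from_raw_bytes = parse_exif_datetime(b"2010:08:01 23:45:46\x00")
print(from_library)
print(from_raw_bytes)
```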