Detecting the MIME type of an Excel .xlsx file via PHP

Posted on 2024-12-03 01:32:50

I can't detect the MIME type of an .xlsx Excel file via PHP, because the file is a ZIP archive.

The file utility:

file file.xlsx
file.xlsx: Zip archive data, at least v2.0 to extract

PECL fileinfo

$finfo = finfo_open(FILEINFO_MIME_TYPE);
echo finfo_file($finfo, "file.xlsx");
// Output: application/zip

How can I validate it? Unpack it and inspect the structure? But what if it is an archive bomb?

Comments (3)

野稚 2024-12-10 01:32:50

Overview

PHP uses libmagic. When libmagic reports an XLSX file as "application/zip" instead of "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", the cause is the order of the entries in the ZIP archive: the detection heuristic expects certain files to come first.

This causes a problem when uploading files to services that require the file extension and the MIME type to match. For example, MediaWiki-based wikis (written in PHP) block certain XLSX files from being uploaded because they are detected as ZIP files.

The fix is to reorder the files written to the ZIP archive so that libmagic can detect the MIME type properly.

Analyzing files

For this example, we will analyze two XLSX files: one created with Openpyxl and one created with Excel.

The file list can be viewed using unzip:

$ unzip -l Openpyxl.xlsx
Archive:  Openpyxl.xlsx
  Length      Date    Time    Name
---------  ---------- -----   ----
      177  2019-12-21 04:34   docProps/app.xml
      452  2019-12-21 04:34   docProps/core.xml
    10140  2019-12-21 04:34   xl/theme/theme1.xml
    22445  2019-12-21 04:34   xl/worksheets/sheet1.xml
      586  2019-12-21 04:34   xl/tables/table1.xml
      238  2019-12-21 04:34   xl/worksheets/_rels/sheet1.xml.rels
      951  2019-12-21 04:34   xl/styles.xml
      534  2019-12-21 04:34   _rels/.rels
      552  2019-12-21 04:34   xl/workbook.xml
      507  2019-12-21 04:34   xl/_rels/workbook.xml.rels
     1112  2019-12-21 04:34   [Content_Types].xml
---------                     -------
    37694                     11 files

$ unzip -l Excel.xlsx
Archive:  Excel.xlsx
  Length      Date    Time    Name
---------  ---------- -----   ----
     1476  1980-01-01 00:00   [Content_Types].xml
      732  1980-01-01 00:00   _rels/.rels
      831  1980-01-01 00:00   xl/_rels/workbook.xml.rels
     1159  1980-01-01 00:00   xl/workbook.xml
      239  1980-01-01 00:00   xl/sharedStrings.xml
      293  1980-01-01 00:00   xl/worksheets/_rels/sheet1.xml.rels
     6796  1980-01-01 00:00   xl/theme/theme1.xml
     1540  1980-01-01 00:00   xl/styles.xml
     1119  1980-01-01 00:00   xl/worksheets/sheet1.xml
    39574  1980-01-01 00:00   docProps/thumbnail.wmf
      785  1980-01-01 00:00   docProps/app.xml
      169  1980-01-01 00:00   xl/calcChain.xml
      513  1980-01-01 00:00   xl/tables/table1.xml
      601  1980-01-01 00:00   docProps/core.xml
---------                     -------
    55827                     14 files

Notice that the file order is different.

The MIME types can be viewed using PHP:

<?php
echo mime_content_type('Openpyxl.xlsx') . "<br/>\n";
echo mime_content_type('Excel.xlsx');

or using python-magic:

pip install python-magic

or, on Windows:

pip install python-magic-bin==0.4.14

Code:

import magic
mime = magic.Magic(mime=True)
print(mime.from_file("Openpyxl.xlsx"))
print(mime.from_file("Excel.xlsx"))

Output:

application/zip
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet

Solution

@adrilo investigated this problem and developed a solution:

Hey @garak,

After pulling my hair out for a few hours, I finally figured out why the mime type is wrong. It turns out the order in which the XML files get added to the final ZIP file (an XLSX file being a ZIP file with the xlsx extension) matters for the heuristics used to detect types.

Currently, files are added in this order:

[Content_Types].xml
_rels/.rels
docProps/app.xml
docProps/core.xml
xl/_rels/workbook.xml.rels
xl/sharedStrings.xml
xl/styles.xml
xl/workbook.xml
xl/worksheets/sheet1.xml

The problem comes from inserting the "docProps" related files. It seems like the heuristic is to look at the first few bytes and check if it finds Content_Types and xl. By having the "docProps" files inserted in between, the first xl occurrence must happen outside of the first bytes the algorithm looks at and therefore concludes it's a simple zip file.

I'll try to fix this nicely
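
The heuristic described in the comment above works on the raw bytes at the start of the archive, so it can help to look at what is physically stored there. The following is a minimal sketch, not libmagic's actual rule: it only reads the file name from the first ZIP local file header (name length at offset 26, name at offset 30). The helper names `first_entry_name` and `make_zip` are hypothetical and exist only for this illustration.

```python
import io
import struct
import zipfile

LOCAL_HEADER_SIGNATURE = b"PK\x03\x04"


def first_entry_name(data):
    """Return the file name stored in the first local file header of a ZIP."""
    if data[:4] != LOCAL_HEADER_SIGNATURE:
        raise ValueError("data does not start with a ZIP local file header")
    # Name length and extra-field length are little-endian uint16 values
    # at offsets 26 and 28; the name itself starts at offset 30.
    name_length, _extra_length = struct.unpack_from("<HH", data, 26)
    return data[30:30 + name_length].decode("utf-8", errors="replace")


def make_zip(names):
    """Build a small in-memory ZIP with entries written in the given order (demo data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in names:
            archive.writestr(name, "<x/>")
    return buffer.getvalue()


if __name__ == "__main__":
    print(first_entry_name(make_zip(["[Content_Types].xml", "xl/workbook.xml"])))
    print(first_entry_name(make_zip(["docProps/app.xml", "[Content_Types].xml"])))
```

To inspect a real workbook, pass the first few hundred bytes of the file, for example `first_entry_name(open("Openpyxl.xlsx", "rb").read(512))`.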

Fixes #149

Heuristics to detect proper mime type for XLSX files expect to see
certain files at the beginning of the XLSX archive. The order in which
the XML files are added therefore matters. Specifically,
"[Content_Types].xml" should be added first, followed by the files
located in the "xl" folder (at least 1 file).

According to Spout's FileSystemHelper.php:

In order to have the file's mime type detected properly, files need to
be added to the zip file in a particular order. "[Content_Types].xml"
then at least 2 files located in "xl" folder should be zipped first.

The solution is to add "[Content_Types].xml", "xl/workbook.xml", and "xl/styles.xml" first, in that order, followed by the remaining files.

Code

This Python script rewrites an XLSX file in place so that its archive entries are in the proper order.

#!/usr/bin/env python3

from io import BytesIO
from zipfile import ZipFile, ZIP_DEFLATED

XL_FOLDER_NAME = "xl"

CONTENT_TYPES_XML_FILE_NAME = "[Content_Types].xml"
WORKBOOK_XML_FILE_NAME = "workbook.xml"
STYLES_XML_FILE_NAME = "styles.xml"

# Entries that must come first so that libmagic recognizes the archive as XLSX.
FIRST_NAMES = [
    CONTENT_TYPES_XML_FILE_NAME,
    f"{XL_FOLDER_NAME}/{WORKBOOK_XML_FILE_NAME}",
    f"{XL_FOLDER_NAME}/{STYLES_XML_FILE_NAME}",
]


def fix_workbook_mime_type(file_path):
    """Rewrite the XLSX at file_path in place with reordered ZIP entries."""
    buffer = BytesIO()

    with ZipFile(file_path) as zip_file:
        names = zip_file.namelist()

        # Skip any priority entry that this archive does not contain.
        first_names = [name for name in FIRST_NAMES if name in names]
        remaining_names = [name for name in names if name not in FIRST_NAMES]

        with ZipFile(buffer, "w", ZIP_DEFLATED, allowZip64=True) as buffer_zip_file:
            for name in first_names + remaining_names:
                buffer_zip_file.writestr(name, zip_file.read(name))

    # Overwrite the original only after the whole archive has been copied.
    with open(file_path, "wb") as file:
        file.write(buffer.getvalue())


def main():
    fix_workbook_mime_type("File.xlsx")


if __name__ == "__main__":
    main()
羅雙樹 2024-12-10 01:32:50

I know this works for zip files, but I'm not too sure about xlsx files. It's worth a try:

To list the files in a zip archive:

$zip = new ZipArchive;
$res = $zip->open('test.zip');
if ($res === TRUE) {
    for ($i=0; $i<$zip->numFiles; $i++) {
        print_r($zip->statIndex($i));
    }
    $zip->close();
} else {
    echo 'failed, code:' . $res;
}

This will print all the files like this:

Array
(
    [name] => file.png
    [index] => 2
    [crc] => -485783131
    [size] => 1486337
    [mtime] => 1311209860
    [comp_size] => 1484832
    [comp_method] => 8
)

As you can see, it gives the size and the comp_size of each entry in the archive. If the file is an archive bomb, the ratio between these two numbers will be astronomical. You can set a limit, in megabytes, on the maximum decompressed size: if an entry exceeds it, skip that file and return an error message to the user; otherwise proceed with the extraction. See the manual for more information.
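
The same size and ratio check can be sketched in Python with the standard `zipfile` module, whose `ZipInfo.file_size` and `ZipInfo.compress_size` correspond to `size` and `comp_size` above. This is only a sketch: the function names and the limits (50 MB total, ratio 100) are hypothetical values chosen for illustration, and the sizes are the ones declared in the archive's own metadata, so a limit on the bytes actually read during extraction is still advisable.

```python
import io
import zipfile


def declared_sizes(source):
    """Return (total uncompressed, total compressed) bytes as declared by the archive."""
    with zipfile.ZipFile(source) as archive:
        infos = archive.infolist()
    total_size = sum(info.file_size for info in infos)
    total_compressed = sum(info.compress_size for info in infos)
    return total_size, total_compressed


def looks_like_archive_bomb(source, max_total_bytes=50 * 1024 * 1024, max_ratio=100):
    """Flag archives whose declared size or compression ratio exceeds the limits."""
    total_size, total_compressed = declared_sizes(source)
    if total_size > max_total_bytes:
        return True
    # An archive with no compressed data cannot exceed the ratio limit.
    return total_compressed > 0 and total_size / total_compressed > max_ratio


def make_zip(entries):
    """Build an in-memory ZIP from a {name: bytes} mapping (demo data only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, payload in entries.items():
            archive.writestr(name, payload)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    suspicious = make_zip({"zeros.bin": b"\0" * 5_000_000})
    ordinary = make_zip({"a.bin": bytes(range(256)) * 4})
    print(looks_like_archive_bomb(suspicious))
    print(looks_like_archive_bomb(ordinary))
```

Because only the central directory is read, nothing is decompressed while the check runs.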

゛清羽墨安 2024-12-10 01:32:50

Here is a wrapper that will properly identify Microsoft Office 2007 documents by their file extension, falling back to fileinfo for everything else. It is straightforward to use, edit, and extend with more file extensions and MIME types.

function get_mimetype($filepath) {
    if(!preg_match('/\.[^\/\\\\]+$/',$filepath)) {
        return finfo_file(finfo_open(FILEINFO_MIME_TYPE), $filepath);
    }
    switch(strtolower(preg_replace('/^.*\./','',$filepath))) {
        // START MS Office 2007 Docs
        case 'docx':
            return 'application/vnd.openxmlformats-officedocument.wordprocessingml.document';
        case 'docm':
            return 'application/vnd.ms-word.document.macroEnabled.12';
        case 'dotx':
            return 'application/vnd.openxmlformats-officedocument.wordprocessingml.template';
        case 'dotm':
            return 'application/vnd.ms-word.template.macroEnabled.12';
        case 'xlsx':
            return 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet';
        case 'xlsm':
            return 'application/vnd.ms-excel.sheet.macroEnabled.12';
        case 'xltx':
            return 'application/vnd.openxmlformats-officedocument.spreadsheetml.template';
        case 'xltm':
            return 'application/vnd.ms-excel.template.macroEnabled.12';
        case 'xlsb':
            return 'application/vnd.ms-excel.sheet.binary.macroEnabled.12';
        case 'xlam':
            return 'application/vnd.ms-excel.addin.macroEnabled.12';
        case 'pptx':
            return 'application/vnd.openxmlformats-officedocument.presentationml.presentation';
        case 'pptm':
            return 'application/vnd.ms-powerpoint.presentation.macroEnabled.12';
        case 'ppsx':
            return 'application/vnd.openxmlformats-officedocument.presentationml.slideshow';
        case 'ppsm':
            return 'application/vnd.ms-powerpoint.slideshow.macroEnabled.12';
        case 'potx':
            return 'application/vnd.openxmlformats-officedocument.presentationml.template';
        case 'potm':
            return 'application/vnd.ms-powerpoint.template.macroEnabled.12';
        case 'ppam':
            return 'application/vnd.ms-powerpoint.addin.macroEnabled.12';
        case 'sldx':
            return 'application/vnd.openxmlformats-officedocument.presentationml.slide';
        case 'sldm':
            return 'application/vnd.ms-powerpoint.slide.macroEnabled.12';
        case 'one':
            return 'application/msonenote';
        case 'onetoc2':
            return 'application/msonenote';
        case 'onetmp':
            return 'application/msonenote';
        case 'onepkg':
            return 'application/msonenote';
        case 'thmx':
            return 'application/vnd.ms-officetheme';
            //END MS Office 2007 Docs

    }
    return finfo_file(finfo_open(FILEINFO_MIME_TYPE), $filepath);
}
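
Because the wrapper decides by extension alone, a renamed ZIP file would still be reported as a spreadsheet. A content-based check is sketched below in Python, under stated assumptions: it treats an archive as XLSX when "[Content_Types].xml" declares the SpreadsheetML workbook content type, it assumes that part is UTF-8 or ASCII encoded, and the 1 MB read cap and the function names are hypothetical choices for this illustration.

```python
import io
import zipfile

XLSX_MAIN_CONTENT_TYPE = (
    b"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml"
)
MAX_CONTENT_TYPES_BYTES = 1024 * 1024  # hypothetical cap on how much is decompressed


def looks_like_xlsx(source):
    """Return True if the ZIP declares a SpreadsheetML workbook part."""
    try:
        with zipfile.ZipFile(source) as archive:
            info = archive.getinfo("[Content_Types].xml")
            with archive.open(info) as part:
                # Read at most one byte past the cap so oversized parts are detected.
                content_types = part.read(MAX_CONTENT_TYPES_BYTES + 1)
    except (zipfile.BadZipFile, KeyError):
        # Not a ZIP archive, or no [Content_Types].xml entry.
        return False
    if len(content_types) > MAX_CONTENT_TYPES_BYTES:
        return False
    return XLSX_MAIN_CONTENT_TYPE in content_types


def make_zip(entries):
    """Build an in-memory ZIP from a {name: bytes} mapping (demo data only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, payload in entries.items():
            archive.writestr(name, payload)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    override = b'<Override PartName="/xl/workbook.xml" ContentType="%s"/>' % XLSX_MAIN_CONTENT_TYPE
    print(looks_like_xlsx(make_zip({"[Content_Types].xml": b"<Types>" + override + b"</Types>"})))
    print(looks_like_xlsx(make_zip({"readme.txt": b"plain zip"})))
```

This check does not depend on the order of the entries, so it also accepts workbooks that libmagic reports as "application/zip".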