Detecting the MIME type of an Excel .xlsx file with PHP
I can't detect the MIME type of an xlsx Excel file via PHP because the file is a ZIP archive.
file utility:
file file.xlsx
file.xlsx: Zip archive data, at least v2.0 to extract
PECL fileinfo:
$finfo = finfo_open(FILEINFO_MIME_TYPE);
echo finfo_file($finfo, "file.xlsx");
application/zip
How can I validate it? Unpack it and inspect the structure? But what if it is an archive bomb?
Overview
PHP uses libmagic. When libmagic reports the MIME type as "application/zip" instead of "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", it is because its detection depends on the order in which the files were added to the ZIP archive.
This causes a problem when uploading files to services that require the file extension and the MIME type to match. For example, MediaWiki-based wikis (written in PHP) block certain XLSX files from being uploaded because they are detected as ZIP files.
The fix is to reorder the files written to the ZIP archive so that libmagic can detect the MIME type properly.
Analyzing files
For this example, we will analyze two XLSX files, one created with openpyxl and one created with Excel.
The file list can be viewed using unzip:
Notice that the file order is different.
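The same comparison can also be made from Python. The following is a minimal sketch using only the standard zipfile module; the helper name list_entry_order is made up for illustration. Note that infolist() reports the order of the central directory, which for files written by ordinary tools matches the order in which the entries were added.

```python
import zipfile


def list_entry_order(path):
    """Return the archive member names in the order they are stored."""
    with zipfile.ZipFile(path) as archive:
        return [info.filename for info in archive.infolist()]


if __name__ == "__main__":
    import sys

    # Print the entry order of every file given on the command line.
    for xlsx_path in sys.argv[1:]:
        print(xlsx_path)
        for position, name in enumerate(list_entry_order(xlsx_path)):
            print(f"  {position:2d}  {name}")
```

Running it on both files side by side makes it easy to see which one starts with "[Content_Types].xml".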
The MIME types can be viewed using PHP:
or using python-magic:
on Windows:
Code:
Output:
Solution
@adrilo has investigated this problem and developed a solution.
According to Spout's FileSystemHelper.php (http://opensource.box.com/spout/), the solution is to add the files "[Content_Types].xml", "xl/workbook.xml", and "xl/styles.xml" in that order, followed by the remaining files.
Code
This Python script rewrites an XLSX file so that the archive entries are in the proper order.
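The original script is not reproduced here. As a rough sketch of the approach described above (not the author's code), the archive can be copied entry by entry with the standard zipfile module, writing the three priority files first. The names PRIORITY and reorder_xlsx are illustrative assumptions.

```python
import sys
import zipfile

# Entries that should come first, in the order described above.
PRIORITY = ["[Content_Types].xml", "xl/workbook.xml", "xl/styles.xml"]


def reorder_xlsx(src, dst):
    """Copy the archive at src to dst with the PRIORITY entries written first."""
    with zipfile.ZipFile(src) as zin:
        infos = zin.infolist()
        by_name = {info.filename: info for info in infos}
        # Priority entries that actually exist, in the required order.
        first = [by_name[name] for name in PRIORITY if name in by_name]
        # Everything else keeps its original relative order.
        rest = [info for info in infos if info.filename not in PRIORITY]
        with zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
            for info in first + rest:
                # Passing the ZipInfo keeps each entry's name, date and
                # compression method.
                zout.writestr(info, zin.read(info.filename))


if __name__ == "__main__":
    if len(sys.argv) == 3:
        reorder_xlsx(sys.argv[1], sys.argv[2])
    else:
        print("usage: reorder_xlsx.py INPUT.xlsx OUTPUT.xlsx")
```

Writing to a separate output file leaves the original untouched, so the result can be checked with the file utility or fileinfo before the original is replaced.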
I know this works for zip files, but I'm not too sure about xlsx files. It's worth a try:
To list the files in a zip archive:
This will print all the files like this:
As you can see here, it gives the size and the comp_size for each entry. If it is an archive bomb, the ratio between these two numbers will be astronomical. You could simply set a limit, in megabytes, on the maximum decompressed size; if an entry exceeds it, skip that file and return an error message to the user, otherwise proceed with the extraction. See the manual for more information.
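The size check described above could look roughly like this in Python, where zipfile exposes the same two numbers as ZipInfo.file_size and ZipInfo.compress_size. This is only a sketch: the function name and both limits are assumptions chosen for illustration, not recommended values.

```python
import zipfile

MAX_TOTAL_BYTES = 100 * 1024 * 1024  # assumed limit: 100 MiB uncompressed
MAX_RATIO = 100                      # assumed limit: 100:1 compression ratio


def looks_like_zip_bomb(path, max_total=MAX_TOTAL_BYTES, max_ratio=MAX_RATIO):
    """Return True if the declared sizes exceed the limits; nothing is extracted."""
    total = 0
    with zipfile.ZipFile(path) as archive:
        for info in archive.infolist():
            # Running total of the declared uncompressed sizes.
            total += info.file_size
            if total > max_total:
                return True
            # Per-entry ratio of uncompressed to compressed size.
            if info.compress_size > 0 and info.file_size / info.compress_size > max_ratio:
                return True
    return False
```

The sizes come from the archive's own headers, so a hostile file can misstate them. It is safer to also cap the number of bytes actually read during extraction.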
Here is a wrapper that will properly identify Microsoft Office 2007 documents. It is straightforward to use and edit, and to extend with more file extensions/MIME types.
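The wrapper itself is not included here. One way such a check might work (a sketch, not the author's code) is to look only at the archive's entry names: an Office Open XML package contains "[Content_Types].xml" plus a format-specific main part. The function name and the marker table below are assumptions. The main-part paths are the ones Office conventionally writes, not something the format strictly guarantees.

```python
import zipfile

# Conventional main part -> MIME type for the three main Office Open XML formats.
OOXML_TYPES = {
    "xl/workbook.xml":
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "word/document.xml":
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "ppt/presentation.xml":
        "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def detect_office_mime(path):
    """Return an OOXML MIME type, "application/zip", or None if not a ZIP file."""
    if not zipfile.is_zipfile(path):
        return None
    # Only the entry names are read; no entry is decompressed.
    with zipfile.ZipFile(path) as archive:
        names = set(archive.namelist())
    if "[Content_Types].xml" in names:
        for marker, mime in OOXML_TYPES.items():
            if marker in names:
                return mime
    return "application/zip"
```

Because the result does not depend on entry order, it can serve as a fallback when fileinfo reports "application/zip" for a file with an .xlsx extension.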