How to detect if a file is PDF or TIFF?
Please bear with me as I've been thrown into the middle of this project without knowing all the background. If you've got WTF questions, trust me, I have them too.
Here is the scenario: I've got a bunch of files residing on an IIS server. They have no file extension on them. Just naked files with names like "asda-2342-sd3rs-asd24-ut57" and so on. Nothing intuitive.
The problem is I need to serve up files on an ASP.NET (2.0) page and display the TIFF files as TIFF and the PDF files as PDF. Unfortunately I don't know which is which, and I need to be able to display them appropriately in their respective formats.
For example, let's say that there are 2 files I need to display: one is a TIFF and one is a PDF. The page should show up with a TIFF image, and perhaps a link that would open up the PDF in a new tab/window.
The problem:
As these files are all extension-less I had to force IIS to just serve everything up as TIFF. But if I do this, the PDF files won't display. I could change IIS to force the MIME type to be PDF for unknown file extensions but I'd have the reverse problem.
http://support.microsoft.com/kb/326965
Is this problem easier than I think or is it as nasty as I am expecting?
Comments (8)
OK, enough people are getting this wrong that I'm going to post some code I have to identify TIFFs:
I hacked apart some much more general code to get this.
For PDF, I have code that looks like this:
Now, GetToken() is a call into a scanner that tokenizes a Stream into PDF tokens. This is non-trivial, so I'm not going to paste it here. I'm using the tokenizer instead of looking at a substring to avoid a problem like this:
This code is marked as NOT a PDF by the above code snippet, whereas a more simplistic chunk of code will incorrectly mark it as a PDF.
I should also point out that the current ISO spec is devoid of the implementation notes that were in the previous Adobe-owned specification. Most importantly from the PDF Reference, version 1.6:
TIFF can be detected by peeking at the first bytes: http://local.wasp.uwa.edu.au/~pbourke/dataformats/tiff/
About PDF: http://www.adobe.com/devnet/livecycle/articles/lc_pdf_overview_format.pdf
Reading the specification for each file format will tell you how to identify files of that format.
TIFF files - Check bytes 0-1 for 0x4D4D ("MM", big-endian) or 0x4949 ("II", little-endian) and bytes 2-3 for the value 42, stored in that byte order.
Page 13 of the spec reads:
PDF files start with a "%PDF-" header giving the PDF version, followed by several binary bytes. (I think you now have to purchase the ISO spec for the current version.)
Section 7.5.2
Of course you could do a "deeper" check on each file by checking more file specific items.
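As an illustration of the TIFF check described above, here is a minimal sketch in Python (chosen only for brevity; the same byte comparisons port directly to a .NET page). The function name is_tiff_header is hypothetical, and the sketch only covers classic TIFF with the magic number 42, not variants such as BigTIFF.

```python
import struct


def is_tiff_header(header: bytes) -> bool:
    """Return True if the first 4 bytes look like a classic TIFF header."""
    if len(header) < 4:
        return False
    byte_order = header[0:2]
    if byte_order == b"II":      # 0x4949: little-endian
        fmt = "<H"
    elif byte_order == b"MM":    # 0x4D4D: big-endian
        fmt = ">H"
    else:
        return False
    # Bytes 2-3 hold the number 42, stored in the declared byte order.
    (magic,) = struct.unpack(fmt, header[2:4])
    return magic == 42
```

Reading the magic number through the declared byte order rejects mismatched headers such as "II" followed by a big-endian 42, which a plain comparison of bytes 2-3 against both encodings would accept.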
A very useful list of file signatures, aka "magic numbers", by Gary Kessler is available at http://www.garykessler.net/library/file_sigs.html
Internally, the file header information should help. If you do a low-level file open, such as StreamReader() or FOPEN(), look at the first few bytes of the file. Almost every file type has its own signature.
I've had to deal with this in the past too, also to help prevent unwanted files from being uploaded to a given site by aborting the upload as soon as the check fails.
EDIT -- Posted sample code to read and test file header types
If you go here, you will see that a TIFF usually starts with the "magic numbers" 0x49 0x49 0x2A 0x00 (some other definitions are also given), which are the first 4 bytes of the file.
So just use these first 4 bytes to determine whether the file is a TIFF or not.
EDIT: it is probably better to do it the other way around and detect PDF first. The magic numbers for PDF are more standardized: as Plinth kindly pointed out, "%PDF" (0x25 0x50 0x44 0x46) appears somewhere in the first 1024 bytes. (source)
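A rough sketch of that PDF-first order, written in Python purely for illustration. The names sniff_type and sniff_file are made up for this example. Note that this is the simple substring check: as another answer in this thread warns, a non-PDF file that happens to contain "%PDF" in its first 1024 bytes would be misreported, so treat it as a first-pass filter rather than a validator.

```python
def sniff_type(data: bytes) -> str:
    """Classify a byte string as 'pdf', 'tiff', or 'unknown' from its start."""
    head = data[:1024]
    # PDF first: the "%PDF" marker may sit anywhere in the first 1024 bytes.
    if b"%PDF" in head:
        return "pdf"
    # TIFF: little-endian (49 49 2A 00) or big-endian (4D 4D 00 2A) magic.
    if head[:4] in (b"II*\x00", b"MM\x00*"):
        return "tiff"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read only the first 1024 bytes of a file and classify it."""
    with open(path, "rb") as f:
        return sniff_type(f.read(1024))
```

Only the first 1024 bytes are read, so the check stays cheap even for large scans.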
You are going to have to write an ashx handler to serve the requested file.
Then your handler should read the first few bytes (or so) to determine what the file type really is. PDFs and TIFFs have "magic numbers" at the beginning of the file that you can use to determine this; then set your response headers accordingly.
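The handler idea can be sketched as follows. This is not an ashx handler: it is a small Python WSGI application used as a stand-in to show the flow (read the header bytes, pick a Content-Type, send the file). The names content_type_for and make_app are hypothetical, and the whole file is read into memory for simplicity.

```python
import os


def content_type_for(data: bytes) -> str:
    """Pick a MIME type from the leading bytes of the file."""
    if b"%PDF" in data[:1024]:
        return "application/pdf"
    if data[:4] in (b"II*\x00", b"MM\x00*"):
        return "image/tiff"
    return "application/octet-stream"


def make_app(root: str):
    """Build a WSGI app that serves extension-less files found in root."""
    def app(environ, start_response):
        # basename() drops any directory part, so requests stay inside root.
        name = os.path.basename(environ.get("PATH_INFO", ""))
        path = os.path.join(root, name)
        if not name or not os.path.isfile(path):
            start_response("404 Not Found", [("Content-Type", "text/plain")])
            return [b"not found"]
        with open(path, "rb") as f:
            body = f.read()
        start_response("200 OK", [
            ("Content-Type", content_type_for(body)),
            ("Content-Length", str(len(body))),
        ])
        return [body]
    return app
```

To try it locally, the standard library's wsgiref.simple_server.make_server("", 8000, make_app("/path/to/files")).serve_forever() would serve the directory. In ASP.NET the equivalent step is setting the response's content type before writing the bytes.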
You can use Myrmec to identify the file type; this library uses the file's header bytes. It is available on NuGet as "Myrmec", and this is the repo. Myrmec also supports MIME types, so you can try it. The code will look like this:
And to get the MIME type:
string mimeType = MimeTypes.GetMimeType(result.First());
But its TIFF support covers only two signatures, "49 49 2A 00" and "4D 4D 00 2A". If you have more, you can add them yourself; the Myrmec readme may help. Myrmec GitHub repo