How to detect whether a file is a PDF or a TIFF?

Posted on 2024-08-30 10:51:40 · 657 words · 6 views · 0 comments

Please bear with me as I've been thrown into the middle of this project without knowing all the background. If you've got WTF questions, trust me, I have them too.

Here is the scenario: I've got a bunch of files residing on an IIS server. They have no file extension on them. Just naked files with names like "asda-2342-sd3rs-asd24-ut57" and so on. Nothing intuitive.

The problem is I need to serve up files on an ASP.NET (2.0) page and display the tiff files as tiff and the PDF files as PDF. Unfortunately I don't know which is which and I need to be able to display them appropriately in their respective formats.

For example, let's say that there are 2 files I need to display: one is a TIFF and one is a PDF. The page should show the TIFF image, and perhaps a link that opens the PDF in a new tab/window.

The problem:

As these files are all extension-less I had to force IIS to just serve everything up as TIFF. But if I do this, the PDF files won't display. I could change IIS to force the MIME type to be PDF for unknown file extensions but I'd have the reverse problem.

http://support.microsoft.com/kb/326965

Is this problem easier than I think or is it as nasty as I am expecting?


Comments (8)

弃爱 2024-09-06 10:51:40

OK, enough people are getting this wrong that I'm going to post some code I have to identify TIFFs:

private const int kTiffTagLength = 12;
private const int kHeaderSize = 2;
private const int kMinimumTiffSize = 8;
private const byte kIntelMark = 0x49;
private const byte kMotorolaMark = 0x4d;
private const ushort kTiffMagicNumber = 42;


private bool IsTiff(Stream stm)
{
    stm.Seek(0, SeekOrigin.Begin);
    if (stm.Length < kMinimumTiffSize)
        return false;
    byte[] header = new byte[kHeaderSize];

    stm.Read(header, 0, header.Length);

    if (header[0] != header[1] || (header[0] != kIntelMark && header[0] != kMotorolaMark))
        return false;
    bool isIntel = header[0] == kIntelMark;

    ushort magicNumber = ReadShort(stm, isIntel);
    if (magicNumber != kTiffMagicNumber)
        return false;
    return true;
}

private ushort ReadShort(Stream stm, bool isIntel)
{
    // Read two bytes from the stream passed in and combine them in the file's byte order.
    byte[] b = new byte[2];
    stm.Read(b, 0, b.Length);
    return ToShort(isIntel, b[0], b[1]);
}

private static ushort ToShort(bool isIntel, byte b0, byte b1)
{
    if (isIntel)
    {
        return (ushort)(((int)b1 << 8) | (int)b0);
    }
    else
    {
        return (ushort)(((int)b0 << 8) | (int)b1);
    }
}

I hacked apart some much more general code to get this.

For PDF, I have code that looks like this:

public bool IsPdf(Stream stm)
{
    stm.Seek(0, SeekOrigin.Begin);
    PdfToken token;
    while ((token = GetToken(stm)) != null) 
    {
        if (token.TokenType == MLPdfTokenType.Comment) 
        {
            if (token.Text.StartsWith("%PDF-1.")) 
                return true;
        }
        if (stm.Position > 1024)
            break;
    }
    return false;
}

Now, GetToken() is a call into a scanner that tokenizes a Stream into PDF tokens. This is non-trivial, so I'm not going to paste it here. I'm using the tokenizer instead of looking for a substring to avoid a problem like this:

% the following is a PostScript file, NOT a PDF file
% you'll note that in our previous version, it started with %PDF-1.3,
% incorrectly marking it as a PDF
%
clippath stroke showpage

This file is marked as NOT a PDF by the above code snippet, whereas a more simplistic substring check would incorrectly mark it as a PDF.

I should also point out that the current ISO spec is devoid of the implementation notes that were in the previous Adobe-owned specification. Most importantly from the PDF Reference, version 1.6:

Acrobat viewers require only that the header appear somewhere within
the first 1024 bytes of the file.

在梵高的星空下 2024-09-06 10:51:40

TIFF can be detected by peeking at the first bytes: http://local.wasp.uwa.edu.au/~pbourke/dataformats/tiff/

The first 8 bytes form the header.
The first two bytes of which are either
"II" for little endian byte ordering
or "MM" for big endian byte ordering.

About PDF: http://www.adobe.com/devnet/livecycle/articles/lc_pdf_overview_format.pdf

The header contains just one line that
identifies the version of PDF.
Example: %PDF-1.6

Reading the specification for each file format will tell you how to identify files of that format.

TIFF files - Check bytes 0-1 for 0x4D4D or 0x4949 and bytes 2-3 for the value 42.

Page 13 of the spec reads:

A TIFF file begins with an 8-byte image file header, containing the following information:

Bytes 0-1: The byte order used within the file. Legal values are: “II” (4949.H) “MM” (4D4D.H) In the “II” format, byte order is always from the least significant byte to the most significant byte, for both 16-bit and 32-bit integers. This is called little-endian byte order. In the “MM” format, byte order is always from most significant to least significant, for both 16-bit and 32-bit integers. This is called big-endian byte order.

Bytes 2-3: An arbitrary but carefully chosen number (42) that further identifies the file as a TIFF file. The byte order depends on the value of Bytes 0-1.

PDF files start with the PDF version followed by several binary bytes. (I think you now have to purchase the ISO spec for the current version.)

Section 7.5.2

The first line of a PDF file shall be
a header consisting of the 5
characters %PDF- followed by a version
number of the form 1.N, where N is a
digit between 0 and 7. A conforming
reader shall accept files with any of
the following headers: %PDF-1.0,
%PDF-1.1, %PDF-1.2, %PDF-1.3, %PDF-1.4,
%PDF-1.5, %PDF-1.6, %PDF-1.7

Beginning with PDF 1.4, the Version
entry in the document’s catalog
dictionary (located via the Root entry
in the file’s trailer, as described in
7.5.5, "File Trailer"), if present,
shall be used instead of the version
specified in the Header.

If a PDF file contains binary data, as
most do (see 7.2, "Lexical
Conventions"), the header line shall
be immediately followed by a comment
line containing at least four binary
characters—that is, characters whose
codes are 128 or greater. This ensures
proper behaviour of file transfer
applications that inspect data near
the beginning of a file to determine
whether to treat the file’s contents
as text or as binary.

Of course you could do a "deeper" check on each file by checking more file specific items.
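
As one example of such a deeper check, the header rule quoted from section 7.5.2 can be turned into a small parser. This is only a sketch: the class and method names (PdfHeader, GetMinorVersion) are made up for illustration, and it assumes the header sits at offset 0, which real-world files do not always honor.

```csharp
using System;

public static class PdfHeader
{
    // ASCII bytes of "%PDF-1." (hypothetical helper constant).
    private static readonly byte[] Prefix = { 0x25, 0x50, 0x44, 0x46, 0x2D, 0x31, 0x2E };

    // Returns N from a leading "%PDF-1.N" header (0 to 7), or -1 if the
    // buffer does not start with such a header.
    public static int GetMinorVersion(byte[] head)
    {
        if (head == null || head.Length < Prefix.Length + 1)
            return -1;

        for (int i = 0; i < Prefix.Length; i++)
        {
            if (head[i] != Prefix[i])
                return -1;
        }

        // The byte after "%PDF-1." must be an ASCII digit between '0' and '7'.
        int digit = head[Prefix.Length] - '0';
        return (digit >= 0 && digit <= 7) ? digit : -1;
    }
}
```

A return value of -1 only means the file does not begin with a conforming header; it does not prove the file is something else.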

夜还是长夜 2024-09-06 10:51:40

A very useful list of file signatures, aka "magic numbers", by Gary Kessler is available at http://www.garykessler.net/library/file_sigs.html

傲鸠 2024-09-06 10:51:40

Internally, the file header information should help. If you do a low-level file open, such as StreamReader() or fopen(), look at the first few characters in the file... Almost every file type has its own signature.

PDF always starts with "%P" (more specifically, "%PDF")
TIFF starts with "II" or "MM"
Bitmap files with "BM"
Executable files with "MZ"

I've had to deal with this in the past too... also to help prevent unwanted files from being uploaded to a given site and immediately aborting it once checked.

EDIT -- Posted sample code to read and test file header types

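String fn = "Example.pdf";

// Decode as ISO-8859-1 so that every byte maps to exactly one char;
// the default UTF-8 decoding would mangle binary signatures.
char[] buf = new char[5];
int read;
using (StreamReader sr = new StreamReader(fn, Encoding.GetEncoding("ISO-8859-1")))
{
    read = sr.Read(buf, 0, buf.Length);
}
String Hdr = new String(buf, 0, read);

// Ordinal comparison: signatures are raw byte values, not culture-sensitive text.
String WhatType;
if (Hdr.StartsWith("%PDF", StringComparison.Ordinal))
   WhatType = "PDF";
else if (Hdr.StartsWith("MZ", StringComparison.Ordinal))
   WhatType = "EXE or DLL";
else if (Hdr.StartsWith("BM", StringComparison.Ordinal))
   WhatType = "BMP";
else if (Hdr.StartsWith("?_", StringComparison.Ordinal))
   WhatType = "HLP (help file)";
else if (Hdr.StartsWith("\0\0\u0001", StringComparison.Ordinal))
   WhatType = "Icon (.ico)";
else if (Hdr.StartsWith("\0\0\u0002", StringComparison.Ordinal))
   WhatType = "Cursor (.cur)";
else
   WhatType = "Unknown";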

苍暮颜 2024-09-06 10:51:40

If you go here, you will see that the TIFF usually starts with "magic numbers" 0x49 0x49 0x2A 0x00 (some other definitions are also given), which is the first 4 bytes of the file.

So just use these first 4 bytes to determine whether the file is a TIFF or not.

EDIT: It is probably better to do it the other way around and detect PDF first. The magic numbers for PDF are more standardized: as Plinth kindly pointed out, "%PDF" (0x25 0x50 0x44 0x46) appears somewhere in the first 1024 bytes.
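
A minimal sketch of that scan, assuming a readable Stream positioned at the start of the file. The names PdfSniffer and HasPdfMarker are hypothetical. Note that this is exactly the simple substring-style check that Plinth's answer warns about, so it will also accept a PostScript file that merely mentions "%PDF" in a comment.

```csharp
using System;
using System.IO;

public static class PdfSniffer
{
    // ASCII bytes of "%PDF".
    private static readonly byte[] Marker = { 0x25, 0x50, 0x44, 0x46 };

    // True if "%PDF" occurs entirely within the first 1024 bytes of the stream.
    public static bool HasPdfMarker(Stream stm)
    {
        byte[] buf = new byte[1024];
        int total = 0;

        // Stream.Read may return fewer bytes than requested, so loop until
        // the buffer is full or the stream ends.
        while (total < buf.Length)
        {
            int n = stm.Read(buf, total, buf.Length - total);
            if (n <= 0)
                break;
            total += n;
        }

        // Naive search for the marker at every offset of the bytes read.
        for (int i = 0; i + Marker.Length <= total; i++)
        {
            int j = 0;
            while (j < Marker.Length && buf[i + j] == Marker[j])
                j++;
            if (j == Marker.Length)
                return true;
        }
        return false;
    }
}
```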

走走停停 2024-09-06 10:51:40

You are going to have to write an .ashx handler to serve the requested file.

Your handler should then read the first few bytes (or so) to determine what the file type really is. PDF and TIFF files have "magic numbers" at the beginning of the file that you can use to determine this; then set your response headers accordingly.
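
The type-detection half of such a handler could look roughly like the sketch below. ContentTypeSniffer and Detect are illustrative names, not part of ASP.NET, and the check only looks at offset 0, so a PDF whose header appears later in the file would fall through to the generic type.

```csharp
using System;

public static class ContentTypeSniffer
{
    // Maps the first bytes of a file to a MIME type for the response header.
    // 'count' is the number of valid bytes in 'head'.
    public static string Detect(byte[] head, int count)
    {
        if (head != null && count >= 4)
        {
            // "II" + 42 stored little-endian, or "MM" + 42 stored big-endian.
            bool intel = head[0] == 0x49 && head[1] == 0x49 && head[2] == 0x2A && head[3] == 0x00;
            bool motorola = head[0] == 0x4D && head[1] == 0x4D && head[2] == 0x00 && head[3] == 0x2A;
            if (intel || motorola)
                return "image/tiff";

            // "%PDF"
            if (head[0] == 0x25 && head[1] == 0x50 && head[2] == 0x44 && head[3] == 0x46)
                return "application/pdf";
        }
        return "application/octet-stream";
    }
}
```

Inside the handler's ProcessRequest, the result would be assigned to context.Response.ContentType before sending the file, for example with context.Response.TransmitFile(path).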

花海 2024-09-06 10:51:40

You can use Myrmec to identify the file type; this library inspects the leading bytes of the file. It is available on NuGet as "Myrmec", and it also supports MIME types, so you can give it a try. The code will look like this:

// create a sniffer instance.
Sniffer sniffer = new Sniffer();

// populate with metadata.
sniffer.Populate(FileTypes.CommonFileTypes);

// get the file's head bytes; 20 bytes may be enough.
byte[] fileHead = ReadFileHead();

// start match.
List<string> results = sniffer.Match(fileHead);

And to get the MIME type:

List<string> result = sniffer.Match(fileHead);

string mimeType = MimeTypes.GetMimeType(result.First());

For TIFF, however, it supports only two signatures, "49 49 2A 00" and "4D 4D 00 2A"; if you have more, you can add them yourself. See the readme in the Myrmec GitHub repo for help.
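
ReadFileHead() is not defined in the snippet above. One possible shape for it, as a sketch only: the FileHead class, its Read method, and the path and maxBytes parameters are assumptions for illustration, not part of Myrmec.

```csharp
using System;
using System.IO;

public static class FileHead
{
    // Reads up to maxBytes from the start of the file. The returned array is
    // shorter than maxBytes when the file itself is smaller.
    public static byte[] Read(string path, int maxBytes)
    {
        using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            byte[] buf = new byte[maxBytes];
            int total = 0;

            // A single Read call is allowed to return fewer bytes than asked for.
            while (total < maxBytes)
            {
                int n = fs.Read(buf, total, maxBytes - total);
                if (n <= 0)
                    break;
                total += n;
            }

            // Trim the buffer to the bytes actually read.
            byte[] head = new byte[total];
            Array.Copy(buf, head, total);
            return head;
        }
    }
}
```

With this helper, the call in the snippet would become something like FileHead.Read(path, 20).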
