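How are file formats created? And if everything is binary, how does encoding change the file type?
I've read a few links on the topic of file formats and encoding, but how is it actually done?
If all data is binary, what separates the data into different file formats? What exactly does encoding data involve, and how is it done?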
Comments (4)
As per theatrus' response, it's all a matter of interpretation.
Typically the file extension (.txt, .jpg, .pdf etc.) provides enough information to determine which program should handle the file - and then the program will know how to handle the format it's given (or produce this format when saving to that particular file type).
Each file format has a (hopefully!) well-defined structure. For example, a PDF file always starts with a line that reads "%PDF-x.y", where x.y is the version number (e.g. 1.6). This lets the likes of Acrobat determine that this 'is most likely a PDF file' and decide how to handle it (different versions will have different internal structures).
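A minimal sketch of that header check in C, assuming the first bytes of the file have already been read into a buffer. The function name `pdf_version` is made up for this example, and it only accepts single-digit version numbers such as 1.6; real PDF readers are more tolerant than this.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 and fills *major / *minor if buf starts with "%PDF-x.y",
   otherwise returns 0. Hypothetical helper, single-digit versions only. */
int pdf_version(const unsigned char *buf, size_t len, int *major, int *minor)
{
    static const char sig[] = "%PDF-";
    const size_t sig_len = sizeof sig - 1;      /* 5 bytes, without the NUL */

    /* Need the signature plus three more bytes for "x.y". */
    if (len < sig_len + 3 || memcmp(buf, sig, sig_len) != 0)
        return 0;
    if (!isdigit(buf[5]) || buf[6] != '.' || !isdigit(buf[7]))
        return 0;

    *major = buf[5] - '0';
    *minor = buf[7] - '0';
    return 1;
}
```

A caller would read the first few bytes of the file with `fread` and pass them in; a return value of 0 only means "this does not look like a PDF", not that the file is invalid.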
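.txt files are usually just sequences of 'characters' encoded in a particular way. Plain English text is easily encoded (ASCII needs one byte per character), while languages with thousands of characters require more capable encodings. Unicode assigns a number (a code point) to every character, and UTF-8 is one way of storing those code points as bytes, using one to four bytes per character. It is a variable-length encoding rather than a 'compressed' form of Unicode.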
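To make "encoded in a particular way" a bit more concrete, here is a sketch of how a single Unicode code point is turned into UTF-8 bytes. The function name `utf8_encode` is my own; the bit layout follows the UTF-8 definition (1 to 4 bytes depending on the size of the code point).

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8.
   Writes 1-4 bytes to out and returns the count, or 0 if cp is not
   a valid code point (surrogate or above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F) {                       /* ASCII: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx + three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Plain English letters come out as the same single byte they have in ASCII, which is why ASCII text files are also valid UTF-8.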
Try opening up a few non-critical files in a hex-editor and get your hands on some format specifications and see what you can find!
File formats describe data in a specific representation. For example, jpeg, bmp, png and tiff all describe images whereas html and rtf describe text documents.
A file format typically starts with a header that describes the contained data (image dimensions, compressed file name, etc.). The header usually contains an identifying signature that marks the file as being of a specific type:
- JFIF in the first 20 bytes or so (can't remember the exact offset) for a JPEG image
- <html (upper or lower case) near the start of the document for an HTML file
This is the concept behind the unix file command and the libmagic API.
Text encoding is what character set the text is encoded in. This is because programs historically use single-byte arrays (char * in C/C++) to represent strings, and one byte per character is not enough to represent most human languages. The text encoding says, in effect, "these bytes are Simplified Chinese" or "these bytes are Cyrillic".
How text encodings are selected depends on the file format being used. Plain text formats (text, html, xml) can have a "byte-order mark" at the beginning that identifies the text as UTF-32 (little endian or big endian), UTF-16 (little endian or big endian), or UTF-8. These are different representations of Unicode characters.
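Byte-order-mark detection is simple enough to sketch. The function below is an illustration with a made-up name (`bom_encoding`), not how any particular tool does it; it only looks at the first bytes and returns NULL when no mark is present, which is common for UTF-8 and for legacy encodings.

```c
#include <stddef.h>
#include <string.h>

/* Return the encoding named by a byte-order mark at the start of buf,
   or NULL if there is none. Hypothetical helper for illustration. */
const char *bom_encoding(const unsigned char *buf, size_t len)
{
    /* UTF-32LE has to be tested before UTF-16LE, because its mark
       (FF FE 00 00) begins with the UTF-16LE mark (FF FE). */
    if (len >= 4 && memcmp(buf, "\xFF\xFE\x00\x00", 4) == 0) return "UTF-32LE";
    if (len >= 4 && memcmp(buf, "\x00\x00\xFE\xFF", 4) == 0) return "UTF-32BE";
    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)     return "UTF-8";
    if (len >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0)         return "UTF-16LE";
    if (len >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0)         return "UTF-16BE";
    return NULL;
}
```

Note the remaining ambiguity: a UTF-16LE file whose first character is U+0000 would be reported as UTF-32LE, so this is a heuristic like everything else in format detection.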
XML allows you to specify the encoding in the <?xml?> declaration, e.g. <?xml version="1.0" encoding="Shift_JIS"?>. HTML allows you to specify the encoding in a <meta> tag, e.g. <meta http-equiv="Content-Type" content="text/html; charset=utf-8">.
You can see examples where text is encoded in one form but decoded as another (the text is mangled) in some emails and other places. These will look like â€¢ (which is a bullet character (middle black dot) encoded in UTF-8 and then decoded as a Western single-byte encoding). You can see this in Firefox by going to the View > Character encoding menu and changing the encoding to Western (ISO-8859-1), especially on pages with non-Western characters.
You can also have other types of encoding. For example, email can be wrapped in base64 during transport.
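The mangled-bullet effect described above can be reproduced in a few lines. This sketch (the name `latin1_to_utf8` is mine) deliberately misreads each input byte as one ISO-8859-1 character and re-encodes the result as UTF-8, which is what happens when a program decodes UTF-8 text with the wrong charset and then saves or displays it.

```c
#include <stddef.h>

/* Treat every byte of `in` as one ISO-8859-1 character (U+0000..U+00FF)
   and write it back out as UTF-8. `out` needs room for 2 * len bytes.
   Returns the number of bytes written. */
size_t latin1_to_utf8(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (in[i] < 0x80) {
            out[n++] = in[i];                                   /* ASCII stays as-is */
        } else {
            out[n++] = (unsigned char)(0xC0 | (in[i] >> 6));    /* 110000xx */
            out[n++] = (unsigned char)(0x80 | (in[i] & 0x3F));  /* 10xxxxxx */
        }
    }
    return n;
}
```

Feeding it the three UTF-8 bytes of the bullet (E2 80 A2) produces six bytes, i.e. three separate characters where there used to be one. In strict ISO-8859-1 the middle one is an invisible control character; browsers usually substitute Windows-1252, where byte 0x80 is the euro sign, which is why the result is typically displayed as â€¢.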
The main ways to decide what format something is are by file extension or by MIME type - and less frequently by "magic numbers".
The file extension will be checked by an OS or Application to decide what to do with it (which app to run it in, or which part of code to execute for it).
MIME types are used where an extension (or filename) isn't always applicable. For example, when downloading a file over HTTP, the URI for a file might be something like ~.php?id=12973. The file type cannot be determined from this alone, but the HTTP response will include a "Content-Type" header to say what format the file is, and the browser will handle it accordingly. For example, Content-Type: image/png would lead the browser to pass the data to some PNG decoding function.
When the application knows what the file format is, it passes the data to code written specifically for that format. If the program doesn't have code to read a format, it will fail to read it.
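As a rough sketch of that dispatch step: the function below takes the value of a Content-Type header and picks a handler by media type. The handler names and the tiny lookup table are invented for the example; a real browser's table is far larger, and media types are case-insensitive, which this sketch ignores by assuming lowercase input.

```c
#include <stddef.h>
#include <string.h>

/* Map a Content-Type value such as "text/html; charset=utf-8" to the
   name of the code that should handle it, or NULL if no handler is known.
   Handler names are hypothetical. Assumes a lowercase media type. */
const char *handler_for(const char *content_type)
{
    static const struct { const char *mime; const char *handler; } table[] = {
        { "image/png",  "png_decoder"  },
        { "image/jpeg", "jpeg_decoder" },
        { "text/html",  "html_parser"  },
    };

    /* The media type ends at the first ';' or whitespace; parameters
       such as charset come after it. */
    size_t n = strcspn(content_type, "; \t");

    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (strlen(table[i].mime) == n &&
            strncmp(content_type, table[i].mime, n) == 0)
            return table[i].handler;
    }
    return NULL;   /* no code for this format: the program cannot read it */
}
```

The NULL case is the "fail to read it" situation: the bytes arrived fine, but nothing in the program knows how to interpret them.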
How a file is encoded is specific to its format. Most standard formats have a specification that describes their binary encoding, and any application reading that file type must implement code to match the specification (although this is usually done by using a library which already does the reading for you).
To give an example of how binary encodings work, consider an image. The specification might say that bytes 10-13 signify the width of the image, and bytes 14-17 signify the height of the image. In order to read those pieces of information from the file, the code must explicitly read data of the correct size at the correct locations indicated by the spec, e.g.:
fseek(f, 10, SEEK_SET); fread(&width, 4, 1, f); /* read 4 bytes at offset 10 into "width" */
I think your confusion is "what separates pieces of data in binary files?" (i.e. in text files, this can be done by new lines, spaces, comma-separated values (CSV), etc.). The answer is: usually the size of the data determines where it ends. A specification will say what the binary type of each field is (it may say int32, indicating 32 bits/4 bytes), and usually its byte order as well.
Other than that, there can be ambiguities in file formats, but this usually happens with text files, where the text inside can be read to determine the format. This isn't always applicable, because often a text file will simply have the extension ".txt", so the application may not know what the character encoding of the text is. (This was, and still is, a problem for applications which do not use Unicode.)
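Here is a slightly fuller sketch of the fixed-offset idea, working on a buffer that already holds the file contents. The layout (width at bytes 10-13, height at bytes 14-17) is the made-up one from the example above, and I am assuming the hypothetical spec says the fields are little-endian; assembling the value byte by byte gives the same result regardless of the byte order of the machine running the code.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode 4 bytes starting at p as an unsigned 32-bit little-endian integer. */
static uint32_t read_u32_le(const unsigned char *p)
{
    return  (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Hypothetical format: width at offset 10, height at offset 14.
   Returns 1 on success, 0 if the data is too short to hold both fields. */
int read_dimensions(const unsigned char *file, size_t len,
                    uint32_t *width, uint32_t *height)
{
    if (len < 18)
        return 0;
    *width  = read_u32_le(file + 10);
    *height = read_u32_le(file + 14);
    return 1;
}
```

Nothing in the data marks where "width" stops and "height" starts; the only separator is the agreement, written down in the spec, that each field is exactly 4 bytes long.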
All data is binary, including this web page you are viewing right now. It's the interpretation of the data that matters.
For instance, pretend you have four bytes:
That could be (in no particular order):
And this is only the unsigned numbers. Any of those bytes or bits could be markers, order indicators, strings, position indicators, etc.
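A small sketch of that point in C. The four byte values used in the note below (0x41 0x42 0x43 0x44) are my own pick for illustration, not the ones from the original answer, and the function names are made up. The same bytes give different numbers depending on which interpretation the reader applies.

```c
#include <stdint.h>

/* Interpret 4 bytes as one unsigned 32-bit number, most significant byte first. */
uint32_t as_u32_be(const unsigned char b[4])
{
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16)
         | ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

/* Interpret the same 4 bytes with the least significant byte first. */
uint32_t as_u32_le(const unsigned char b[4])
{
    return ((uint32_t)b[3] << 24) | ((uint32_t)b[2] << 16)
         | ((uint32_t)b[1] << 8)  |  (uint32_t)b[0];
}

/* Interpret 2 bytes as one unsigned 16-bit big-endian number. */
uint16_t as_u16_be(const unsigned char b[2])
{
    return (uint16_t)(((unsigned)b[0] << 8) | b[1]);
}
```

With the bytes 0x41 0x42 0x43 0x44, `as_u32_be` gives 0x41424344, `as_u32_le` gives 0x44434241, the two halves read as 16-bit values are 0x4142 and 0x4344, and read as ASCII text the same bytes spell "ABCD". None of these readings is more correct than the others until a format specification says which one applies.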