Since the file should consist of only ASCII characters, it normally doesn't matter if you save it as ANSI or UTF-8.
However, if you have the choice, pick ANSI: when you save a file as UTF-8, Notepad adds the Unicode Byte Order Mark (BOM) to the front of the file, which may make the file unreadable for interpreters that only understand ASCII.
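For illustration, here is a minimal sketch of writing the file with and without the BOM in Python; the sitemap URL is a placeholder, and the "utf-8-sig" codec is what produces Notepad-style UTF-8 with a BOM:

```python
# Minimal sketch: write robots.txt as plain UTF-8 without a BOM.
# The sitemap URL below is a placeholder, not a real endpoint.
rules = "User-agent: *\nDisallow:\nSitemap: https://example.com/sitemap.xml\n"

with open("robots.txt", "w", encoding="utf-8", newline="\n") as f:
    f.write(rules)

# For comparison, the "utf-8-sig" codec prepends the 3-byte BOM (EF BB BF),
# which is what Notepad's UTF-8 mode historically did:
print("x".encode("utf-8-sig"))  # b'\xef\xbb\xbfx'
```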
I believe robots.txt "should" be UTF-8 encoded.
However, Notepad and other programs may insert a 3-byte BOM (Byte Order Mark) at the beginning of the file, which keeps Google from reading the first line (it shows an "invalid syntax" error).
Either remove the BOM or, much easier, add a line break at the top so that the first line of directives lands on line two. The "invalid syntax" error caused by the BOM then only affects the first line, which is now empty; the rest of the lines will be read successfully.
As for the encoding: @Roland already nailed it. The file should contain only URLs. Non-ASCII characters in URLs are illegal, so saving the file as ASCII should be just fine.
If you need to serve UTF-8 for some reason, make sure this is specified correctly in the content-type header of the text file. You will have to set this in your web server's settings.
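The exact setting depends on your server, but as a rough stdlib sketch of what the response should look like (the handler name and port are made up for illustration):

```python
# Sketch: serve robots.txt with an explicit charset in the Content-Type header.
from http.server import BaseHTTPRequestHandler, HTTPServer

class RobotsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/robots.txt":
            self.send_error(404)
            return
        with open("robots.txt", "rb") as f:
            body = f.read()
        self.send_response(200)
        # This header line is the part that matters: it declares UTF-8.
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("", 8000), RobotsHandler).serve_forever()
```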
As to case sensitivity:
According to robotstxt.org, the robots.txt file needs to be lowercase:
Remember to use all lower case for the filename: "robots.txt", not "Robots.TXT".
The keywords are probably case-insensitive - I can't find a reference on that - but I would tend to do what all the others do: use the capitalized version (Sitemap).
I recommend either encoding robots.txt in UTF-8, without a BOM, or encoding it in ASCII.
For URLs that contain non-ASCII characters, I suggest either using UTF-8, which is fine in most cases, or URL-encoding to represent all of the characters in ASCII.
Take a look at Wikipedia's robots.txt file - it's UTF-8 encoded.
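To illustrate the URL-encoding option, a quick sketch (the path is a made-up example):

```python
# Sketch: percent-encode a non-ASCII path so the resulting URL is pure ASCII.
from urllib.parse import quote

print(quote("/wiki/Fußball", safe="/"))  # -> /wiki/Fu%C3%9Fball
```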
I suggest you use ANSI, because if your robots.txt is saved as UTF-8, it will be marked as faulty in Google's Search Console due to the Unicode Byte Order Mark added to its beginning (as mentioned by Roland Illig above).
Most answers seem to be outdated. As of 2022, Google specifies the robots.txt format as follows (source):
File format
The robots.txt file must be a UTF-8 encoded plain text file and the lines must be separated by CR, CR/LF, or LF.
Google ignores invalid lines in robots.txt files, including the Unicode Byte Order Mark (BOM) at the beginning of the robots.txt file, and uses only valid lines. For example, if the content downloaded is HTML instead of robots.txt rules, Google will try to parse the content and extract rules, and ignore everything else.
Similarly, if the character encoding of the robots.txt file isn't UTF-8, Google may ignore characters that are not part of the UTF-8 range, potentially rendering robots.txt rules invalid.
Google currently enforces a robots.txt file size limit of 500 kibibytes (KiB). Content which is after the maximum file size is ignored. You can reduce the size of the robots.txt file by consolidating directives that would result in an oversized robots.txt file. For example, place excluded material in a separate directory.
TL;DR to answer the question:
You can use Notepad to save a robots.txt file. Just use UTF-8 encoding.
It may or may not contain a BOM; it will be ignored anyway.
The file has to be named robots.txt exactly. No capital "R".
Field names are not case-sensitive (source). Therefore, both sitemap and Sitemap are fine.
Keep in mind that robots.txt is just a de facto standard. There is no guarantee that any crawler will read this file as Google proposes, nor is any crawler forced to respect the defined rules.
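As a closing sanity check against the rules quoted above, a sketch (it assumes robots.txt is in the current directory; the 500 KiB figure comes from Google's documentation):

```python
# Sketch: verify size and encoding of a local robots.txt.
from pathlib import Path

data = Path("robots.txt").read_bytes()
assert len(data) <= 500 * 1024, "exceeds Google's 500 KiB limit"
data.decode("utf-8")  # raises UnicodeDecodeError if the file is not valid UTF-8
if data.startswith(b"\xef\xbb\xbf"):
    print("File starts with a BOM - Google ignores it, other crawlers may not.")
```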
I think you're overthinking this. I always use lowercase, just because it's easier.
You can view SO's robots.txt: https://stackoverflow.com/robots.txt