robots.txt; what encoding?

Posted on 2024-09-25 23:04:39

Comments (7)

难理解 2024-10-02 23:04:39

Since the file should consist of only ASCII characters, it normally doesn't matter if you save it as ANSI or UTF-8.

However, you should choose ANSI if you have a choice, because when you save a file as UTF-8, Notepad adds the Unicode Byte Order Mark to the front of the file, which may make it unreadable for interpreters that only understand ASCII.
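
If you are unsure whether your editor has prepended a BOM, you can inspect the first three bytes of the file. A minimal Python sketch (it assumes a robots.txt in the current directory):

    import codecs

    # Read raw bytes so a BOM (EF BB BF) is visible instead of being decoded away.
    with open("robots.txt", "rb") as f:
        head = f.read(3)

    if head == codecs.BOM_UTF8:
        print("UTF-8 BOM found - ASCII-only parsers may choke on the first line")
    else:
        print("no UTF-8 BOM")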

冰雪梦之恋 2024-10-02 23:04:39

I believe Robots.txt "should" be UTF-8 encoded.

"The expected file format is plain text encoded in UTF-8. The file
consists of records (lines) separated by CR, CR/LF or LF."

(from https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt)

But Notepad and other programs will insert a 3-byte BOM (Byte Order Mark) at the beginning of the file, causing Google to be unable to read that first line (it shows an "invalid syntax" error).

Either remove the BOM or, much easier, add a line break on the first row so that the first line of instructions falls on line two.

The "invalid syntax" line caused by the BOM will only affect the first line which now is empty.

The rest of the lines will be read successfully.
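
If you prefer to strip the BOM rather than rely on the blank-first-line workaround, a small script can rewrite the file in place. A rough Python sketch, assuming the file is named robots.txt:

    # Remove a leading UTF-8 BOM, if present, and re-save the file without it.
    with open("robots.txt", "rb") as f:
        data = f.read()

    bom = b"\xef\xbb\xbf"
    if data.startswith(bom):
        with open("robots.txt", "wb") as f:
            f.write(data[len(bom):])  # identical content, minus the three BOM bytes

When reading such a file in Python, the "utf-8-sig" codec also drops a leading BOM automatically.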

两个我 2024-10-02 23:04:39

As for the encoding: @Roland already nailed it. The file should contain only URLs. Non-ASCII characters in URLs are illegal, so saving the file as ASCII should be just fine.

If you need to serve UTF-8 for some reason, make sure this is specified correctly in the content-type header of the text file. You will have to set this in your web server's settings.
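
To verify what your server actually sends, you can request the file and inspect the Content-Type header. A quick Python check, with example.com standing in for your own domain:

    from urllib.request import urlopen

    # Print the Content-Type the server declares for robots.txt,
    # e.g. "text/plain; charset=UTF-8".
    with urlopen("https://example.com/robots.txt") as resp:
        print(resp.headers.get("Content-Type"))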

As to case sensitivity:

  • According to robotstxt.org, the robots.txt file needs to be lowercase:

    Remember to use all lower case for the filename: "robots.txt", not "Robots.TXT".

  • The keywords are probably case-insensitive - I can't find a reference on that - but I would tend to do what everyone else does: use the capitalized versions (Sitemap).

过度放纵 2024-10-02 23:04:39

I think you're overthinking this. I always use lowercase, just because it's easier.

You can view SO's robots.txt. https://stackoverflow.com/robots.txt

掐死时间 2024-10-02 23:04:39

I recommend encoding robots.txt either in UTF-8 without a BOM, or in ASCII.

For URLs that contain non-ASCII characters, I suggest either using UTF-8, which is fine in most cases, or URL-encoding them so that all of the characters are represented in ASCII.
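
For the URL-encoding option, the percent-encoded form of a non-ASCII path is itself plain ASCII, so it can go into robots.txt safely. A small Python illustration (the path /wiki/Łódź is just a made-up example):

    from urllib.parse import quote

    # Percent-encode the non-ASCII characters; the result contains only ASCII.
    path = "/wiki/Łódź"
    print("Disallow: " + quote(path))  # Disallow: /wiki/%C5%81%C3%B3d%C5%BA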

Take a look at Wikipedia's robots.txt file - it's UTF-8 encoded.

静赏你的温柔 2024-10-02 23:04:39

I suggest you use ANSI, because if your robots.txt is saved as UTF-8, it will be marked as faulty in Google's Search Console due to the Unicode Byte Order Mark that gets added to its beginning (as mentioned by Roland Illig above).

維他命╮ 2024-10-02 23:04:39

Most answers seem to be outdated. As of 2022, Google specifies the robots.txt format as follows (source):

File format

The robots.txt file must be a UTF-8 encoded plain text file and the lines must be separated by CR, CR/LF, or LF.

Google ignores invalid lines in robots.txt files, including the Unicode Byte Order Mark (BOM) at the beginning of the robots.txt file, and uses only valid lines. For example, if the content downloaded is HTML instead of robots.txt rules, Google will try to parse the content and extract rules, and ignore everything else.

Similarly, if the character encoding of the robots.txt file isn't UTF-8, Google may ignore characters that are not part of the UTF-8 range, potentially rendering robots.txt rules invalid.

Google currently enforces a robots.txt file size limit of 500 kibibytes (KiB). Content which is after the maximum file size is ignored. You can reduce the size of the robots.txt file by consolidating directives that would result in an oversized robots.txt file. For example, place excluded material in a separate directory.
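
A simple sanity check along the lines of these rules could look like the following Python sketch (the 500 KiB limit and the UTF-8 check mirror the documentation quoted above; the file name is assumed):

    MAX_SIZE = 500 * 1024  # Google's documented limit: 500 KiB

    with open("robots.txt", "rb") as f:
        data = f.read()

    if len(data) > MAX_SIZE:
        print("warning: content beyond 500 KiB will be ignored")

    try:
        # A leading BOM still decodes fine; Google simply skips it as an invalid line.
        data.decode("utf-8")
    except UnicodeDecodeError:
        print("warning: not valid UTF-8, some rules may be dropped")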

TL;DR to answer the question:

  • You can use Notepad to save a robots.txt file. Just use UTF-8 encoding.
  • It may or may not contain a BOM; it will be ignored either way.
  • The file has to be named robots.txt exactly. No capital "R".
  • Field names are not case-sensitive (source). Therefore, both sitemap and Sitemap are fine.

Keep in mind that robots.txt is just a de facto standard. There is no guarantee that any crawler will read this file the way Google proposes, nor is any crawler forced to respect the rules it defines.
