Why do I get a "wrong status line" error with Nokogiri?

Posted on 2024-12-17 21:40:39

My Ruby/Nokogiri script is:

require 'rubygems'
require 'nokogiri'
require 'open-uri'

f = File.new("enterret" + ".txt", 'w')

1.upto(100) do |page|
  urltext = "http://xxxxxxx.com/" + "page/"
  urltext << page.to_s + "/"
  doc = Nokogiri::HTML(open(urltext))
  doc.css(".photoPost").each do |post|
    quote = post.css("h1 + p").text
    author = post.css("h1 + p + p").text
    f.puts "#{quote}" + "#{author}"
    f.puts "--------------------------------------------------------"
  end
end

When running this script I get the following error:

http.rb:2030:in `read_status_line': wrong status line: "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"" (Net::HTTPBadResponse)

However, my script writes to the file correctly; it's just that this error keeps coming up. What does the error mean?

Comments (1)

梦亿 2024-12-24 21:40:39

Without knowing what site you are accessing it is hard to say for sure, but I suspect that the problem isn't in Nokogiri.

The error is reported by http.rb, which is most likely complaining about the HTTP headers the server returned. http.rb handles the exchange with the HTTP server and complains about missing or malformed headers, but it doesn't care about the payload.

Nokogiri, on the other hand, is concerned with the payload, i.e., the HTML. The DOCTYPE is supposed to be part of the HTML payload, so I suspect their server is sending the HTML DOCTYPE where the HTTP status line and headers (including a Content-Type such as "text/html") should be.
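Because the exception comes from the HTTP layer rather than from Nokogiri, one option is to rescue it around each page fetch so a single bad response doesn't stop the loop. This is only a minimal sketch: `fetch_each_page` is a hypothetical helper, not part of open-uri or Nokogiri, and the block stands in for whatever actually fetches and parses a page.

```ruby
require 'net/http' # defines Net::HTTPBadResponse

# Hypothetical helper (not from any library): yields each page number to
# the given block, collects the block's results, and records the pages
# whose response net/http rejected instead of aborting the whole run.
def fetch_each_page(pages)
  results = {}
  failed = []
  pages.each do |page|
    begin
      results[page] = yield(page)
    rescue Net::HTTPBadResponse => e
      # Report the malformed response and move on to the next page.
      warn "page #{page}: #{e.message}"
      failed << page
    end
  end
  [results, failed]
end
```

In the script from the question, the block body would be the `Nokogiri::HTML(open(urltext))` call, and the list of failed pages shows which URLs returned the malformed response.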

In the Ruby 1.8.7 http.rb file you'll see the following lines at 2030 in the code:

def response_class(code)
  CODE_TO_OBJ[code] or
  CODE_CLASS_TO_OBJ[code[0,1]] or
  HTTPUnknownResponse
end

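That is close to where the message you're seeing is generated. The traceback itself names `read_status_line`, which raises Net::HTTPBadResponse when the first line of the response is not a valid HTTP status line.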
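To illustrate why a DOCTYPE line triggers this error, here is a rough sketch of the kind of check a status-line parser performs. The pattern is modeled on the one in net/http but is an approximation (the exact regular expression differs between Ruby versions), and `valid_status_line?` is a hypothetical name used only for this example.

```ruby
# Approximate shape of an HTTP status line: "HTTP/<version> <3-digit code> <reason>".
# This is an illustrative pattern, not a copy of any particular Ruby release.
STATUS_LINE = /\AHTTP(?:\/(\d+\.\d+))?\s+(\d\d\d)\s*(.*)\z/i

# Hypothetical helper: true when the first line of a response looks like
# an HTTP status line, false otherwise.
def valid_status_line?(line)
  !STATUS_LINE.match(line.chomp).nil?
end

puts valid_status_line?("HTTP/1.1 200 OK\r\n")
puts valid_status_line?('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"')
```

A line such as `HTTP/1.1 200 OK` passes the check, while the DOCTYPE line quoted in the error message does not, which is consistent with the server sending HTML where the status line was expected.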