Opening a wiki URL that contains a comma with `open-uri`

Posted on 2024-08-23 18:23:26

I am running into an OpenURI::HTTPError: 403 Forbidden error
when I try to open a URL that contains a comma (or other special characters, such as a period).
I can open the same URL in a browser.

require 'open-uri'
url = "http://en.wikipedia.org/wiki/Thor_Industries,_Inc."
f = open(url)
# raises OpenURI::HTTPError: 403 Forbidden

How do I escape such a URL?

I have tried escaping the URL with CGI::escape, but I get the same error.

f = open(CGI::escape(url))

Comments (1)

三生池水覆流年 2024-08-30 18:23:26

Typically, you would simply require the cgi module and then use CGI::escape(str) on the page title:

require 'cgi'
require 'open-uri'
escaped_page = CGI::escape("Thor_Industries,_Inc.")  # the comma becomes %2C
url = "http://en.wikipedia.org/wiki/#{escaped_page}"
f = URI.open(url)  # Kernel#open no longer accepts URLs as of Ruby 3.0

However, this does not work in your particular case and still returns a 403. I'll leave it here for reference regardless.
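To see what the escaping step actually produces, a minimal sketch follows. The helper name wiki_url is hypothetical and only for illustration. It escapes the page title alone; escaping the full URL, as in the question, also encodes the scheme and the slashes, so the result is no longer a usable URL.

```ruby
require 'cgi'

# Hypothetical helper: escape only the page title, never the whole URL.
def wiki_url(title)
  "http://en.wikipedia.org/wiki/#{CGI.escape(title)}"
end

title = "Thor_Industries,_Inc."

# Only the comma is percent-encoded; underscores and dots are left alone.
puts CGI.escape(title)
# prints: Thor_Industries%2C_Inc.

puts wiki_url(title)
# prints: http://en.wikipedia.org/wiki/Thor_Industries%2C_Inc.

# Escaping the full URL encodes ":" and "/" as well.
puts CGI.escape("http://en.wikipedia.org/wiki/#{title}")
# prints: http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FThor_Industries%2C_Inc.
```

Since a correctly escaped title still yields a 403, escaping is unlikely to be the cause of the error.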


Edit: Wikipedia is refusing your request because it suspects that you are a bot. It seems that pages that are clearly content are served to you, but those whose titles don't match its "safe" pattern (e.g. titles that contain dots or commas) are subject to its screening. If you print the body of the response (I did this with Net::HTTP), you get the following:

Scripts should use an informative User-Agent string with contact information, or they may be IP-blocked without notice.
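The answer read that message with Net::HTTP. As an alternative sketch that stays with open-uri, the body of a refused request can also be read from the exception itself, since OpenURI::HTTPError exposes the response through its io attribute. The helper names failure_details and fetch_or_explain are hypothetical.

```ruby
require 'open-uri'
require 'stringio'

# Hypothetical helper: return the status message and the response body
# carried by an OpenURI::HTTPError.
def failure_details(error)
  [error.message, error.io.read]
end

# Hypothetical wrapper: return the page body, or print why the server
# refused the request and return nil.
def fetch_or_explain(url)
  URI.open(url, &:read)
rescue OpenURI::HTTPError => e
  status, body = failure_details(e)
  warn "#{status}: #{body}"
  nil
end
```

Calling fetch_or_explain with the URL from the question should then show the server's explanation instead of only the bare 403.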

Providing a user-agent string, however, solves the issue:

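require 'open-uri'
# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs.
URI.open("http://en.wikipedia.org/wiki/Thor_Industries,_Inc.",
  "User-Agent" => "Ruby/#{RUBY_VERSION}")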
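The message quoted above asks for an informative User-Agent with contact information, which a bare "Ruby/x.y.z" string does not provide. Below is a minimal sketch of building such a header. The tool name, the contact address, and the helper names wikipedia_headers and fetch_page are all placeholders, not values Wikipedia prescribes.

```ruby
require 'open-uri'

# Hypothetical helper: build request headers with a descriptive User-Agent.
# The default tool name and contact address are placeholders to replace.
def wikipedia_headers(tool: "ExampleFetcher/0.1", contact: "you@example.com")
  { "User-Agent" => "#{tool} (#{contact}) Ruby/#{RUBY_VERSION}" }
end

# Hypothetical helper: fetch a page body, sending the headers above.
# open-uri treats string keys in the options hash as HTTP header fields.
def fetch_page(url, headers = wikipedia_headers)
  URI.open(url, headers, &:read)
end
```

A call might look like fetch_page("http://en.wikipedia.org/wiki/Thor_Industries,_Inc."), optionally passing wikipedia_headers(tool: ..., contact: ...) with real values.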