Unable to fetch XML data over HTTPS with Ruby

Posted 2024-08-08 14:50:02

I'm trying to download account transactions (an XML file) from a server. When I enter this URL from a browser:

https://secure.somesite.com:443/my/account/download_transactions.php?type=xml

it successfully downloads a correct XML file (assuming I've already logged in).

I want to do this programmatically with Ruby, and tried this code:

require 'open-uri'
require 'rexml/document'
require 'net/http' 
require 'net/https'
include REXML

url = URI.parse("https://secure.somesite.com:443/my/account/download_transactions.php?type=xml")
req = Net::HTTP::Get.new(url.path)
req.basic_auth 'userid', 'password'
req.content_type = 'text/xml'

http = Net::HTTP.new(url.host, url.port)
http.use_ssl = true
response = http.start { |http| http.request(req) }

root = Document.new(response.read_body).root

root.elements.each("transaction") do |t|
   id = t.elements["id"].text
   description = t.elements["description"].text
   puts "TRANSACTION ID='#{id}' DESCRIPTION='#{description}'"
end

Execution proceeds, but fails on the "Document.new":

RuntimeError: Illegal character '&' in raw string "??ࡱ?;??

The returned body is clearly not XML when printed. It appears to be a long string of mostly unreadable characters, with the occasional visible word indicating it has something to do with the intended content. I also see the string "Arial1" mixed in with the unreadable bytes several times, which makes me think I'm receiving a format other than XML.

My question is, what am I doing wrong here? The XML file is definitely available (and correct if you examine the browser-obtained copy). Am I specifying something wrong with the SSL? The HTTPS request? Is there a different and proper way to reveal the correct body? Thanks in advance for your assistance!

Interesting idea to check headers. The successful browser sequence shows this from HttpLiveHeaders:

https://secure.somesite.com/my/account/download_transactions.php?&type=xml

GET /my/account/download_transactions.php?type=xml HTTP/1.1
Host: secure.somesite.com
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cookie: <obscured>

HTTP/1.x 200 OK
Date: Wed, 21 Oct 2009 13:13:08 GMT
Server: Apache/2.2
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: must-revalidate, post-check=0,pre-check=0
Pragma: public
Content-Disposition: attachment; filename=stuff.xml
Connection: close
Transfer-Encoding: chunked
Content-Type: application/xml

I've tried to match all the HTTP headers by literally cutting and pasting the "Accept" headers from the capture above into my request, but the file returned is still not the expected XML.
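One detail that may be worth ruling out, independent of authentication: the browser's GET line above includes "?type=xml", while the Ruby code builds the request from url.path, which does not carry the query string. The sketch below only demonstrates the difference between URI#path and URI#request_uri for the URL in the question. Whether the server falls back to a different export format when "type" is missing is an assumption, not something confirmed in this thread.

```ruby
require 'uri'

# URL copied from the question; the host is the question's placeholder.
url = URI.parse("https://secure.somesite.com:443/my/account/download_transactions.php?type=xml")

path_only  = url.path         # path without the query string
with_query = url.request_uri  # path plus "?type=xml"

puts path_only
puts with_query
```

Passing url.request_uri to Net::HTTP::Get.new would make the Ruby request line match the one the browser sent.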

A hexdump of the response returned to my code shows a lot of 0x00 and 0xFF bytes, and the words "root" and "entry" near each other. A Wireshark dump of the unsuccessful Ruby sequence is less helpful, since it only shows the SSL-encrypted application data. But clearly a chunk of data is being returned.

START DUMP
00000000: d0 cf 11 e0 a1 b1 1a e1 - 00 00 00 00 00 00 00 00  ................
00000010: 00 00 00 00 00 00 00 00 - 3b 00 03 00 fe ff 09 00  ........;.......
00000020: 06 00 00 00 00 00 00 00 - 00 00 00 00 01 00 00 00  ................
00000030: 04 00 00 00 00 00 00 00 - 00 10 00 00 00 00 00 00  ................
00000040: 01 00 00 00 fe ff ff ff - 00 00 00 00 05 00 00 00  ................
00000050: ff ff ff ff ff ff ff ff - ff ff ff ff ff ff ff ff  ................
00000060: ff ff ff ff ff ff ff ff - ff ff ff ff ff ff ff ff  ................
00000070: ff ff ff ff ff ff ff ff - ff ff ff ff ff ff ff ff  ................
... and so on... non 00 and FF's appear much further down.

I'm not sure what to try next. Any suggestions?
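A note on the dump: the first eight bytes, d0 cf 11 e0 a1 b1 1a e1, are the signature of an OLE2 compound file, the container used by legacy Office documents. Together with the "Arial1" strings this suggests, but does not prove, that the server returned something like a spreadsheet export rather than XML. The sketch below is one way to check a response body before handing it to REXML; sniff_format is a hypothetical helper name, and the sample bytes are taken from the dump.

```ruby
# Signature seen in the first eight bytes of the dump in the question.
OLE2_MAGIC = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1".b

# Hypothetical helper: guess the body's format from its leading bytes.
def sniff_format(body)
  bytes = body.b
  return :ole2_compound_file if bytes.start_with?(OLE2_MAGIC)
  return :xml if bytes.lstrip.start_with?("<")
  :unknown
end

# First 16 bytes of the dump, rebuilt from their hex representation.
dump_prefix = ["d0cf11e0a1b11ae1" + "00" * 8].pack("H*")

puts sniff_format(dump_prefix)
puts sniff_format("<?xml version=\"1.0\"?><root/>")
```

Running such a check on response.body before parsing turns the confusing "Illegal character" error into a clear statement that the body is not XML at all.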

物价感观 2024-08-15 14:50:02

Fixed the problem myself. Turns out this particular site does not seem to use "basic authentication", and I was required to execute a specific login screen to produce a usable cookie. I also simplified the solution by using "Mechanize", a gem that handles much of the leg-work of HTTP activity.

require 'mechanize'

login_username = "theusername"
login_password = "thepassword"

# get login page
# (older Mechanize releases used WWW::Mechanize.new; the WWW:: namespace was later dropped)
agent = Mechanize.new
agent.user_agent_alias = 'Mac Safari'
page = agent.get('https://somesite.com/login.php')

# fill out login form and submit
form = page.forms[0] # use first form on page
form['form[username]'] = login_username
form['form[password]'] = login_password
page = agent.submit(form)

# process returned page: still being on a "login" URL means the login was rejected
if page.uri.to_s.include?("login")
  puts '---- LOGIN FAILED ----'
else
  puts '---- LOGIN SUCCESSFUL ----'
  # the agent sends the session cookie from the login along with this request
  xml_data = agent.get('https://secure.somesite.com:443/download_transactions.php?type=xml')
  puts xml_data.body
end

The thing that threw me was the way the form fields had to be set, which for some reason was different from the examples I've seen.
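For readers who want to stay with plain Net::HTTP instead of Mechanize, the part that matters is carrying the session cookie from the login response into the download request. The following is a minimal sketch of that one step, not the method used above: cookie_header is a hypothetical helper, the cookie values are made up, and in real code the input would come from something like login_response.get_fields('set-cookie').

```ruby
require 'net/http'

# Hypothetical helper: turn Set-Cookie header values into one Cookie header value.
# Only the leading name=value pair of each entry is kept; attributes such as
# Path or Expires are dropped, and no domain or expiry rules are enforced.
def cookie_header(set_cookie_values)
  Array(set_cookie_values).map { |value| value.split(';', 2).first.strip }.join('; ')
end

# Made-up values standing in for the login response's Set-Cookie headers.
set_cookie = ["PHPSESSID=abc123; path=/; secure", "remember=1; HttpOnly"]

# Build the download request (path taken from the question) and attach the cookies.
req = Net::HTTP::Get.new("/my/account/download_transactions.php?type=xml")
req['Cookie'] = cookie_header(set_cookie)
puts req['Cookie']
```

Mechanize does this bookkeeping (plus redirects and cookie scoping) automatically, which is the main reason it simplifies the solution.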
