Parsing HTML with HTTParty
I'm looking for a way to pull specific content from a website whose markup is fairly well formed, but not quite valid XML:
<html>
<head>
<title>title</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<META HTTP-EQUIV="expires" CONTENT="now">
<meta http-equiv=refresh content=300>
</head>
<body bgcolor="#FFFFFF">
<p><font face="Arial, Helvetica, sans-serif" size="2"><img src="pict.gif" width="503" height="43"><br></font></p>
<p><font face="Arial, Helvetica, sans-serif" size="2">Please Note: ...<br></font></p>
<font face="Arial, Helvetica, sans-serif" size="3"><B>The Schedule</B></font><p></p>
<table border=0 width="100%">
<tr>
<td><font face="Arial, Helvetica, sans-serif" size="2"><B>CONTENT A</B></font> </td>
<td><font face="Arial, Helvetica, sans-serif" size="2"><B>CONTENT B</B></font> </td>
<td><font face="Arial, Helvetica, sans-serif" size="2"><B>CONTENT C</B></font> </td>
<td><font face="Arial, Helvetica, sans-serif" size="2"><B>CONTENT D</B></font> </td>
<td><font face="Arial, Helvetica, sans-serif" size="2"><B>CONTENT E</B></font></td>
</tr>
...
Note the unterminated <br> tags.
I'm trying to use HTTParty, and would like to do something like this:
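require 'httparty'

# Hypothetical class name; the original post shows only the class body.
class ScheduleClient
  include HTTParty
  base_uri "http://website.com/"
  basic_auth "name", "pw"
  format :xml

  def download_and_process_index_file
    # Leading slash so the path joins cleanly onto base_uri.
    s = self.class.get("/theurl.html")
    thehtml = s.parsed_response
    # print CONTENT A
    puts thehtml['html']['body']['table']['tr']['td']['font']['b']
  end
end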
But the response won't parse as XML, and if I switch to format :html I don't seem to get any parsed structure back. Am I even close in my thinking here?
Thanks,
Peter
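One possible direction, sketched here under assumptions rather than taken from the original post: as far as I know, HTTParty's :html format hands back the response body as a plain string, so the extraction has to happen in your own code. The example below uses only Ruby's standard library and a hypothetical helper named extract_cell_texts. SAMPLE_HTML is made-up markup modeled on the snippet above and stands in for the string you would get from the response's body. The regular expression is only a stand-in; it ignores nested tables and other edge cases, so a real HTML parser such as Nokogiri would likely be more robust on a page like this.

```ruby
# Made-up sample modeled on the page in the question: unclosed <br>,
# unquoted attribute values, uppercase tags.
SAMPLE_HTML = <<~HTML
  <html>
  <body bgcolor="#FFFFFF">
  <p><font size="2">Please Note: ...<br></font></p>
  <table border=0 width="100%">
  <tr>
  <td><font face="Arial, Helvetica, sans-serif" size="2"><B>CONTENT A</B></font> </td>
  <td><font face="Arial, Helvetica, sans-serif" size="2"><B>CONTENT B</B></font> </td>
  </tr>
  </table>
  </body>
  </html>
HTML

# Hypothetical helper (not part of HTTParty): returns the visible text of
# every <td> cell. Assumes cells are not nested inside one another.
def extract_cell_texts(html)
  html.scan(%r{<td\b[^>]*>(.*?)</td>}mi).map do |(inner)|
    # Drop any tags inside the cell, then trim surrounding whitespace.
    inner.gsub(/<[^>]+>/, "").strip
  end
end

# With HTTParty, the input string would come from the response body, e.g.
#   html = self.class.get("/theurl.html").body
puts extract_cell_texts(SAMPLE_HTML).first # prints "CONTENT A"
```

Because the helper takes a plain string, it can be tried against a saved copy of the page before wiring it into the HTTParty class.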