python-beautifulsoup 误报了我的 html 吗?

发布于 2024-07-18 04:02:14 字数 3239 浏览 8 评论 0原文

据我所知,我每台机器都有两台,运行 python 2.5 和 BeautifulSoup 3.1.0.1。

我正在尝试抓取 http://utahcritseries.com/RawResults.aspx,使用

from BeautifulSoup import BeautifulSoup
import urllib2

base_url = "http://www.utahcritseries.com/RawResults.aspx"

data=urllib2.urlopen(base_url)
soup=BeautifulSoup(data)
i = 0
table=soup.find("table",id='ctl00_ContentPlaceHolder1_gridEvents')
#table=soup.table
print "begin table"
for row in table.findAll('tr')[1:10]:
    i=i + 1
    col = row.findAll('td')
    date = col[0].string
    event = col[1].a.string
    confirmed = col[2].string
    print '%s - %s' % (date, event)
print "end table"
print "%s rows processed" % i

: windows机器,我得到了正确的结果,它是日期和事件名称的列表。 在我的 Mac 上,我没有。 相反,我注意到

3/2/2002 - Rocky Mtn Raceway Criterium
None - Rocky Mtn Raceway Criterium
3/23/2002 - Rocky Mtn Raceway Criterium
None - Rocky Mtn Raceway Criterium
4/2/2002 - Rocky Mtn Raceway Criterium
None - Saltair Time Trial
4/9/2002 - Rocky Mtn Raceway Criterium
None - DMV Criterium
4/16/2002 - Rocky Mtn Raceway Criterium

,当我

print row

在 Windows 机器上时,tr 数据看起来与源 html 完全相同。 请注意第二个表格行上的样式标签。 这是前两行:

<tr>
<td>
 3/2/2002
</td>
<td>
 <a href="Event.aspx?id=226">
  Rocky Mtn Raceway Criterium
 </a>
</td>
<td>
 Confirmed
</td>
<td>
 <a href="Event.aspx?id=226">
  Points
 </a>
</td>
<td>
 <a disabled="disabled">
  Results
 </a>
</td>
</tr>

<tr style="color:#333333;background-color:#EFEFEF;">
<td>
 3/16/2002
</td>
<td>
 <a href="Event.aspx?id=227">
  Rocky Mtn Raceway Criterium
 </a>
</td>
<td>
 Confirmed
</td>
<td>
 <a href="Event.aspx?id=227">
  Points
 </a>
</td>
<td>
 <a disabled="disabled">
  Results
 </a>
</td>
</tr>

在我的 Mac 上,当我打印前两行时,样式信息将从 tr 标记中删除,并移至每个 td 字段中。 我不明白为什么会发生这种情况。 对于每个其他日期值,我都没有得到任何值,因为 BeautifulSoup 在每隔一个日期周围放置一个字体标签。 这是 Mac 的输出:

<tr>
<td>
 3/2/2002
</td>
<td>
 <a href="Event.aspx?id=226">
  Rocky Mtn Raceway Criterium
 </a>
</td>
<td>
 Confirmed
</td>
<td>
 <a href="Event.aspx?id=226">
  Points
 </a>
</td>
<td>
 <a disabled="disabled">
  Results
 </a>
</td>
</tr>

<tr bgcolor="#EFEFEF">
<td>
 <font color="#333333">
  3/16/2002
 </font>
</td>
<td>
 <font color="#333333">
  <a href="Event.aspx?id=227">
   Rocky Mtn Raceway Criterium
  </a>
 </font>
</td>
<td>
 <font color="#333333">
  Confirmed
 </font>
</td>
<td>
 <font color="#333333">
  <a href="Event.aspx?id=227">
   Points
  </a>
 </font>
</td>
<td>
 <font color="#333333">
  <a disabled="disabled">
   Results
  </a>
 </font>
</td>
</tr>

我的脚本在 Windows 下显示正确的结果 - 我需要做什么才能让我的 Mac 正常工作?

I have two machines each, to the best of my knowledge, running python 2.5 and BeautifulSoup 3.1.0.1.

I'm trying to scrape http://utahcritseries.com/RawResults.aspx, using:

from BeautifulSoup import BeautifulSoup
import urllib2

base_url = "http://www.utahcritseries.com/RawResults.aspx"

data=urllib2.urlopen(base_url)
soup=BeautifulSoup(data)
i = 0
table=soup.find("table",id='ctl00_ContentPlaceHolder1_gridEvents')
#table=soup.table
print "begin table"
for row in table.findAll('tr')[1:10]:
    i=i + 1
    col = row.findAll('td')
    date = col[0].string
    event = col[1].a.string
    confirmed = col[2].string
    print '%s - %s' % (date, event)
print "end table"
print "%s rows processed" % i

On my windows machine,I get the correct result, which is a list of dates and event names. On my mac, I don't. instead, I get

3/2/2002 - Rocky Mtn Raceway Criterium
None - Rocky Mtn Raceway Criterium
3/23/2002 - Rocky Mtn Raceway Criterium
None - Rocky Mtn Raceway Criterium
4/2/2002 - Rocky Mtn Raceway Criterium
None - Saltair Time Trial
4/9/2002 - Rocky Mtn Raceway Criterium
None - DMV Criterium
4/16/2002 - Rocky Mtn Raceway Criterium

What I'm noticing is that when I

print row

on my windows machine, the tr data looks exactly the same as the source html. Note the style tag on the second table row. Here's the first two rows:

<tr>
<td>
 3/2/2002
</td>
<td>
 <a href="Event.aspx?id=226">
  Rocky Mtn Raceway Criterium
 </a>
</td>
<td>
 Confirmed
</td>
<td>
 <a href="Event.aspx?id=226">
  Points
 </a>
</td>
<td>
 <a disabled="disabled">
  Results
 </a>
</td>
</tr>

<tr style="color:#333333;background-color:#EFEFEF;">
<td>
 3/16/2002
</td>
<td>
 <a href="Event.aspx?id=227">
  Rocky Mtn Raceway Criterium
 </a>
</td>
<td>
 Confirmed
</td>
<td>
 <a href="Event.aspx?id=227">
  Points
 </a>
</td>
<td>
 <a disabled="disabled">
  Results
 </a>
</td>
</tr>

On my mac when I print the first two rows, the style information is removed from the tr tag and it's moved into each td field. I don't understand why this is happening. I'm getting None for every other date value, because BeautifulSoup is putting a font tag around every other date. Here's the mac's output:

<tr>
<td>
 3/2/2002
</td>
<td>
 <a href="Event.aspx?id=226">
  Rocky Mtn Raceway Criterium
 </a>
</td>
<td>
 Confirmed
</td>
<td>
 <a href="Event.aspx?id=226">
  Points
 </a>
</td>
<td>
 <a disabled="disabled">
  Results
 </a>
</td>
</tr>

<tr bgcolor="#EFEFEF">
<td>
 <font color="#333333">
  3/16/2002
 </font>
</td>
<td>
 <font color="#333333">
  <a href="Event.aspx?id=227">
   Rocky Mtn Raceway Criterium
  </a>
 </font>
</td>
<td>
 <font color="#333333">
  Confirmed
 </font>
</td>
<td>
 <font color="#333333">
  <a href="Event.aspx?id=227">
   Points
  </a>
 </font>
</td>
<td>
 <font color="#333333">
  <a disabled="disabled">
   Results
  </a>
 </font>
</td>
</tr>

My script is displaying the correct result under windows-what do I need to do in order to get my Mac to work correctly?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

┼── 2024-07-25 04:02:14

BeautifulSoup 3.1 版存在已记录的问题

您可能需要仔细检查您实际上正在使用的版本,如果是,则降级。

There are documented problems with version 3.1 of BeautifulSoup.

You might want to double check that is the version you in fact are using, and if so downgrade.

嘿嘿嘿 2024-07-25 04:02:14

我怀疑问题出在 urlib2 请求中,而不是 BeautifulSoup:

如果您向我们展示两台机器上此命令返回的原始数据的同一部分,可能会有所帮助:

urllib2.urlopen(base_url)

此页面看起来可能会有所帮助:
http://bytes.com/groups/python/635923- Building-browser-like-get-request

最简单的解决方案可能只是检测脚本运行的环境并相应地更改解析逻辑。

>>> import os
>>> os.uname() 
('Darwin', 'skom.local', '9.6.0', 'Darwin Kernel Version 9.6.0: Mon Nov 24 17:37:00 PST 2008; root:xnu-1228.9.59~1/RELEASE_I386', 'i386')

或者让微软使用网络标准:)

另外,你没有使用 mechanize 来获取页面吗? 如果是这样,问题可能就在那里。

I suspect the problem is in the urlib2 request, not BeautifulSoup:

It might help if you show us the same section of the raw data as returned by this command on both machines:

urllib2.urlopen(base_url)

This page looks like it might help:
http://bytes.com/groups/python/635923-building-browser-like-get-request

The simplest solution is probably just to detect which environment the script is running in and change the parsing logic accordingly.

>>> import os
>>> os.uname() 
('Darwin', 'skom.local', '9.6.0', 'Darwin Kernel Version 9.6.0: Mon Nov 24 17:37:00 PST 2008; root:xnu-1228.9.59~1/RELEASE_I386', 'i386')

Or get microsoft to use web standards :)

Also, didn't you use mechanize to fetch the pages? If so, the problem may be there.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文