Mechanize not recognizing anchor tags via CSS selector methods

Posted 2024-08-20 11:49:12

(Hope this isn't a breach of etiquette: I posted this on RailsForum, but I haven't been getting much response from there recently.)

Has anyone else had problems with Mechanize not recognizing anchor tags via CSS selectors?

The HTML looks like this (snippet with white space removed for clarity):

<td class='calendarCell' align='left'>
<a href="http://www.mysite.org/index.php/site/ActivitiesCalendar/2010/02/10/">10</a>
<p style="margin-bottom:15px; line-height:14px; text-align:left;">
<span class="sidenavHeadType">
 Current Events</span><br />
<b><a href="http://www.mysite.org/index.php/site/Clubs/banks_and_the_fed" class="a2">Banks and the Fed</a></b>
<br />
10:30am- 11:45am
</p>

I'm trying to collect the data from these events. Everything is working except getting the anchor within the <p>. There's clearly an <a> tag inside the <b>, and I'm going to need to follow that link to get further details on this event.

In my rake task, I have:

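# Each calendar cell is one day; the first <a> in the cell is the day number.
agent.page.search(".calendarCell,.calendarToday").each do |item|
  day = item.at("a").text

  # Each <p> inside the cell describes one event.
  item.search("p").each do |e|
    anchor = e.at("a")  # expected: the event link; actual: nil
    puts anchor
    puts e.inner_html
  end
end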

What's interesting is that item.at("a") always returns the anchor, but e.at("a") returns nil. And when I call inner_html on the p element, the anchor is missing from the output entirely. Example output:

nil

<span class="sidenavHeadType">
 Photo Club</span><br><b>Indexing Slide Collections</b>
<br>
2:00pm- 3:00pm

However, when I run the same scrape directly with Nokogiri:

doc.css(".calendarCell,.calendarToday").each do |item|
  day = item.at_css("a").text
  item.css("p").each do |e|
    link = e.at_css("a")[:href]
    puts e.inner_html
  end
end

It recognizes the <a> inside the <p>, and it will return the href, etc.

<span class="sidenavHeadType">
 Bridge Party</span><br><b><a href="http://www.mysite.org/index.php/site/Clubs/party_bridge_51209" class="a2">Party Bridge</a></b>
<br>
7:00pm- 9:00pm

Mechanize is supposed to use Nokogiri, so I'm wondering if I have a bad version or if this affects others as well.

Thanks for any leads.

Comments (1)

活雷疯 2024-08-27 11:49:12

Never mind. False alarm. In my Nokogiri task, I was pointing to a local copy of the page that included the anchors. The live page requires a login, so I could see the <a> tags when I browsed to it, but the rake task was not logged in and got the page without them. Adding the login to the rake task solved it.
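
For anyone hitting the same symptom, a quick sanity check is to compare the HTML each task is actually reading before suspecting the parser. Below is a minimal sketch, not part of my rake task: the helper names anchor_hrefs and missing_hrefs are made up for illustration, the two sample strings stand in for the saved local copy and the logged-out response, and the regex is only a rough diagnostic scan (Nokogiri remains the right tool for the real scrape).

```ruby
require "set"

# Hypothetical helper: collect href values from <a> tags with a rough
# regex scan. This is a diagnostic aid only, not an HTML parser.
def anchor_hrefs(html)
  html.scan(/<a\b[^>]*\bhref\s*=\s*["']([^"']*)["']/i).flatten.to_set
end

# Hypothetical helper: hrefs present in the reference HTML (e.g. a saved
# local copy) but absent from the HTML the scraper actually fetched.
def missing_hrefs(reference_html, fetched_html)
  anchor_hrefs(reference_html) - anchor_hrefs(fetched_html)
end

# Stand-in for the saved local copy (anchors present).
local_copy = '<p><b><a href="http://www.mysite.org/index.php/site/Clubs/party_bridge_51209" class="a2">Party Bridge</a></b></p>'

# Stand-in for what a logged-out request receives (anchors stripped).
logged_out = '<p><b>Party Bridge</b></p>'

missing = missing_hrefs(local_copy, logged_out)
unless missing.empty?
  puts "Anchors missing from the fetched page: #{missing.to_a.inspect}"
end
```

If the set is non-empty, the two tasks are not parsing the same document, which points at the request (login, cookies, redirects) rather than at Mechanize or Nokogiri.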
