Trouble scraping .HTM files
I have just begun scraping basic text off web pages, and am currently using the HTMLAgilityPack C# library. I had some success with box scores from rivals.yahoo.com (sports is my thing, so why not scrape something interesting?), but I am stuck on the NHL's game summary pages. I think this is an interesting problem, so I thought I would post it here.
The page I am testing is:
http://www.nhl.com/scores/htmlreports/20102011/GS020079.HTM
At first glance, it looks like basic text, with no AJAX or anything else that would trip up a basic scraper. Then I realized I couldn't right-click because of some JavaScript, so I worked around that. I right-clicked in Firefox, looked up the XPath of the home team using XPather, and got:
/html/body/table[@id='MainTable']/tbody/tr[1]/td/table[@id='StdHeader']/tbody/tr/td/table/tbody/tr/td[3]/table[@id='Home']/tbody/tr[3]/td
When I try to grab that node or its inner text, HTMLAgilityPack can't find it. Does anyone see anything strange in the page's source code that might be stopping me?
I am new to this and still learning how sites might stop me from scraping, so any tips or tricks are appreciated!
P.S. I observe all site rules regarding bots, etc., but I noticed this strange behavior and saw it as a challenge.
Comments (2)
OK, so it appears that my XPaths have tbody steps in them. When I manually remove these tbody steps from the XPath, HTMLAgilityPack handles it fine.
I'd still like to know why I am getting invalid XPaths, but for now I have answered my question.
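A likely explanation for the "invalid" XPaths: browsers insert an implied tbody element into every table when they build the DOM, and tools such as XPather read their paths from that DOM, whereas HTMLAgilityPack works from the markup as served, which here appears to contain no tbody. A minimal sketch of automating the manual cleanup follows. `XPathCleaner` and `StripTbody` are hypothetical names made up for this example, and the approach assumes the page source really has no tbody tags (if it does, stripping them would break the path).

```csharp
using System;
using System.Text.RegularExpressions;

public static class XPathCleaner
{
    // Hypothetical helper: removes every "/tbody" location step, with an
    // optional positional predicate such as "[1]", from an XPath string.
    // The lookahead ensures only whole steps named exactly "tbody" match.
    public static string StripTbody(string xpath)
    {
        return Regex.Replace(xpath, @"/tbody(\[\d+\])?(?=/|$)", "");
    }

    public static void Main()
    {
        // The XPath reported by the browser tool in the question.
        string fromBrowser =
            "/html/body/table[@id='MainTable']/tbody/tr[1]/td/table[@id='StdHeader']" +
            "/tbody/tr/td/table/tbody/tr/td[3]/table[@id='Home']/tbody/tr[3]/td";

        // Prints the same path with the four tbody steps removed.
        Console.WriteLine(StripTbody(fromBrowser));
    }
}
```

The cleaned string can then be passed to HTMLAgilityPack's `SelectSingleNode` in place of the browser-generated one.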
Unless my XPath knowledge is badly flawed (quite possible), I think the problem is the /tbody nodes in your XPath expression.
When I run the query without them, it works fine and returns
"COLUMBUS BLUE JACKETSGame 5 Home Game 3"
which I hope is the string you wanted.
Examining the HTML, I couldn't find a tbody anywhere in the source.
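To illustrate the behavior both answers describe, here is a small sketch that compares the two styles of XPath against the same markup. It assumes the HtmlAgilityPack NuGet package and uses its `HtmlDocument.LoadHtml` and `SelectSingleNode` calls. The inline HTML is a made-up stand-in for the NHL report page, not its real source, and `TbodyDemo` is a hypothetical name.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class TbodyDemo
{
    // Made-up stand-in for the report page: a table written without <tbody>,
    // which is how the answers describe the real page source.
    public const string Html =
        "<html><body><table id='Home'>" +
        "<tr><td>row 1</td></tr>" +
        "<tr><td>row 2</td></tr>" +
        "<tr><td>COLUMBUS BLUE JACKETS</td></tr>" +
        "</table></body></html>";

    // Returns the inner text of the first node matching xpath, or null if none matches.
    public static string Find(string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Html);
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : node.InnerText;
    }

    public static void Main()
    {
        // Browser-style path, including the tbody step the browser DOM would show.
        Console.WriteLine(Find("//table[@id='Home']/tbody/tr[3]/td") ?? "(not found)");

        // Same path with the tbody step removed, matching the markup as written.
        Console.WriteLine(Find("//table[@id='Home']/tr[3]/td") ?? "(not found)");
    }
}
```

If HTMLAgilityPack behaves as reported in this thread, the first lookup should come back empty and the second should return the cell text. Checking `SelectSingleNode` for null before reading `InnerText` avoids a `NullReferenceException` when a path does not match.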