Trouble scraping .HTM files

Posted 2024-09-28 02:19:38

I have just begun scraping basic text off web pages, and am currently using the HtmlAgilityPack C# library. I had some success with box scores from rivals.yahoo.com (sports is my thing, so why not scrape something interesting?), but I am stuck on the NHL's game summary pages. I think this is an interesting problem, so I thought I would post it here.

The page I am testing is:
http://www.nhl.com/scores/htmlreports/20102011/GS020079.HTM

At first glance, it looks like basic text with no AJAX or anything else that would trip up a basic scraper. Then I realized I couldn't right-click on the page because of some JavaScript, so I worked around that. I right-clicked in Firefox, used XPather to get the XPath of the home team, and got:

/html/body/table[@id='MainTable']/tbody/tr[1]/td/table[@id='StdHeader']/tbody/tr/td/table/tbody/tr/td[3]/table[@id='Home']/tbody/tr[3]/td

When I try to grab that node or its inner text, HtmlAgilityPack can't find it. Does anyone see anything strange in the page's source code that might be stopping me?

I am new to this and still learning how sites might stop me from scraping, so any tips or tricks are greatly appreciated!

P.S. I observe all site rules regarding bots, etc., but I noticed this strange behavior and saw it as a challenge.

Comments (2)

不如归去 2024-10-05 02:19:38

OK, so it appears that my XPaths have tbody steps in them. When I remove these tbody steps from the XPath by hand, HtmlAgilityPack handles it fine.

I'd still like to know why I am getting invalid xpaths, but for now I have answered my question.
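
A likely explanation for the extra steps: browsers add an implied tbody element to the DOM when a table's markup leaves it out, and a tool like XPather builds its path from that DOM, while HtmlAgilityPack works from the markup as written. Under that assumption, the manual cleanup described above could be automated with a small helper. This is only a sketch: `XPathCleaner` and `StripTbody` are hypothetical names, not part of HtmlAgilityPack, and the pattern does not handle predicates with nested brackets.

```csharp
using System;
using System.Text.RegularExpressions;

public static class XPathCleaner
{
    // Matches a "/tbody" location step, with or without a simple predicate
    // such as [1], when it is followed by another step or ends the path.
    private static readonly Regex TbodyStep =
        new Regex(@"/tbody(\[[^\]]*\])?(?=/|$)");

    // Returns the XPath with every tbody step removed.
    // Assumes the page source itself contains no explicit <tbody> tags.
    public static string StripTbody(string xpath)
    {
        if (xpath == null) throw new ArgumentNullException(nameof(xpath));
        return TbodyStep.Replace(xpath, string.Empty);
    }
}
```

The cleaned string can then be passed to `doc.DocumentNode.SelectSingleNode`. If a page does contain explicit tbody tags in its source, stripping them from the XPath would break the match, so it is worth checking the raw HTML first.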

玉环 2024-10-05 02:19:38

Unless my XPath knowledge is badly flawed (which it probably is), I think the problem is the /tbody steps in your XPath expression.

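When I run this:

    using System.IO; // StreamReader

    // Load a locally saved copy of the page.
    var doc = new HtmlAgilityPack.HtmlDocument();
    using (var sr = new StreamReader(@"C:\gs.htm"))
    {
        doc.Load(sr);
    }
    // Same target cell as in the question, but without any tbody steps.
    string xpath = @"//table[@id='Home']/tr[3]/td";
    // SelectSingleNode returns null when nothing matches, so guard before reading InnerText.
    var node = doc.DocumentNode.SelectSingleNode(xpath);
    string test = node != null ? node.InnerText : string.Empty;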

it works fine and returns
"COLUMBUS BLUE JACKETSGame 5 Home Game 3"
which I hope is the string you wanted.

Examining the HTML, I couldn't find a tbody element anywhere.
