HTML parser to get blog posts
I need to create an HTML parser that, given a blog URL, returns a list of all the posts on the page.
- I.e. if a page has 10 posts, it should return a list of 10 divs, where each div contains an h1 and a p
I can't use its RSS feed, because I need to know exactly how the page looks to the user: whether it has any ads, images, etc. Also, some blogs show only a summary of each post on the page while the feed has the full content, and vice versa.
Anyway, I've made one that downloads the feed and searches the HTML for similar content. It works very well for some blogs, but not for others.
I don't think I can make a parser that works for 100% of the blogs it parses, but I want to make the best one possible.
What would be the best approach? Looking for tags whose id attribute equals "post" or "content"? Looking for p tags? Something else?
Thanks in advance for any help!
Comments (4)
I don't think you will be successful with that. You might be able to parse one blog, but if the blog engine changes its markup, your parser won't work any more. I also don't think you'll be able to write a generic parser. You might even be partially successful, but it will be an ephemeral success, because everything is so error-prone in this context. If you need the content, you should go with RSS. If you need to store (simply store) how the page looks, you can also do that. But parsing a page by the way it looks? I don't see concrete success in that.
"Best possible" turns out to be "best reasonable," and you get to define what is reasonable. You can get a very large number of blogs by looking at how common blogging tools (WordPress, LiveJournal, etc.) generate their pages, and code specially for each one.
The general case turns out to be a very hard problem because every blogging tool has its own format. You might be able to infer things using "standard" identifiers like "post", "content", etc., but it's doubtful.
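Since each tool has its own format, per-tool code first needs to tell which tool produced a page. A minimal sketch in Python, assuming the page carries a `<meta name="generator">` tag (WordPress emits one by default, but themes and plugins can remove it, and many engines never add one). The `ENGINE_HANDLERS` table and the handler names in it are hypothetical placeholders, not part of any library:

```python
from html.parser import HTMLParser


class GeneratorSniffer(HTMLParser):
    """Records the content of the first <meta name="generator"> tag."""

    def __init__(self):
        super().__init__()
        self.generator = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.generator is not None:
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "generator":
            self.generator = attr.get("content") or ""


# Hypothetical table: engine keyword -> name of a parser you would write yourself.
ENGINE_HANDLERS = {
    "wordpress": "parse_wordpress",
    "blogger": "parse_blogger",
}


def detect_engine(html):
    """Return a key of ENGINE_HANDLERS, or None if the engine is not recognised."""
    sniffer = GeneratorSniffer()
    sniffer.feed(html)
    generator = (sniffer.generator or "").lower()
    for engine in ENGINE_HANDLERS:
        if engine in generator:
            return engine
    return None
```

A page with no generator tag, or an unknown one, returns `None`, which is the point where a generic heuristic would have to take over.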
You'll also have difficulty with ads. A lot of ads are generated with JavaScript. So downloading the page will give you just the JavaScript code rather than the HTML that gets generated. If you really want to identify the ads, you'll have to identify the JavaScript code that generates them. Or, your program will have to execute the JavaScript to create the final DOM. And then you're faced with a problem similar to that above: figuring out if some particular bit of HTML is an ad.
There are heuristic methods that are somewhat successful. Check out Identifying a Page's Primary Content for answers to a similar question.
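As an illustration of the simplest such heuristic, the identifier-based one mentioned earlier, here is a rough Python sketch using only the standard library. The hint names in `POST_HINTS` are guesses rather than a standard, and the code assumes reasonably well-formed markup (for example, every `<p>` is closed), so it will misfire on plenty of real blogs:

```python
from html.parser import HTMLParser

# Guessed id/class tokens; real blogs use many other names.
POST_HINTS = {"post", "hentry", "entry", "blog-post"}
CONTAINER_TAGS = {"div", "article", "section"}
# Elements that never have a closing tag, so they must not affect depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PostCollector(HTMLParser):
    """Collects the heading and paragraph text of each post-like container."""

    def __init__(self):
        super().__init__()
        self.posts = []      # finished posts: {"title": str, "paragraphs": [str]}
        self._post = None    # post currently being read, if any
        self._depth = 0      # open-tag depth inside the current container
        self._field = None   # "title" or "p" while inside a heading or paragraph

    @staticmethod
    def _looks_like_post(attrs):
        attr = dict(attrs)
        tokens = f"{attr.get('id') or ''} {attr.get('class') or ''}".lower().split()
        return any(token in POST_HINTS for token in tokens)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._post is None:
            if tag in CONTAINER_TAGS and self._looks_like_post(attrs):
                self._post = {"title": "", "paragraphs": []}
                self._depth = 1
            return
        self._depth += 1
        if tag in ("h1", "h2") and not self._post["title"]:
            self._field = "title"
        elif tag == "p":
            self._post["paragraphs"].append("")
            self._field = "p"

    def handle_endtag(self, tag):
        if self._post is None or tag in VOID_TAGS:
            return
        if tag in ("h1", "h2", "p"):
            self._field = None
        self._depth -= 1
        if self._depth == 0:  # the container itself was just closed
            self.posts.append(self._post)
            self._post = None

    def handle_data(self, data):
        if self._post is None or self._field is None:
            return
        if self._field == "title":
            self._post["title"] += data
        else:
            self._post["paragraphs"][-1] += data


def extract_posts(html):
    """Return one dict per container whose id or class matches POST_HINTS."""
    collector = PostCollector()
    collector.feed(html)
    collector.close()
    return collector.posts
```

Calling `extract_posts(html)` on a downloaded page returns one dictionary per matched container. It only sees the static HTML, so anything generated by JavaScript, ads included, is invisible to it.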
Use the HTML Agility Pack. It is an HTML parser made for this.
I just did something like this for our company's blog, which uses WordPress. This works for us because our WordPress blog hasn't changed in years, but the others are right that if your HTML changes a lot, parsing becomes a cumbersome solution.
Here is what I recommend:
Using NuGet, install RestSharp and HtmlAgilityPack. Then download Fizzler and include those references in your project (http://code.google.com/p/fizzler/downloads/list).
Here is some sample code I used to implement the blog's search on my site.
Good luck,
Eric