Parsing HTML with PHP to get data for multiple articles of the same type
I'm working on a website that parses coupon sites and lists their coupons. Some sites provide their listings as an XML file, and those are no problem. But other sites do not provide XML. I'm thinking of parsing those sites and getting the coupon information from the page content, that is, grabbing the data from the HTML with PHP. As an example, you can see the following site:
I'm working with PHP. So my question is: is there a relatively easy way to parse HTML and get the data for each coupon listed on that site, just as I do when parsing XML?
Thanks for the help.
You can always use a DOM parser, but scraping content from sites is unreliable at best.
If their layout changes ever so slightly, your app could fail. Oh, and in most cases it's also against the site's terms of service to do so.
While using a DOM parser might seem like a good idea, I usually prefer good old regular expressions for scraping. It's much less work, and if the site changes its layout you're screwed anyway, whatever your approach is. But if you use a smart enough regex, your code should be immune to changes that do not directly affect the part you're interested in.
One thing to remember is to include class names in the regex when the markup provides them, but to assume that anything can appear between the pieces of information you need. For example:
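A minimal sketch of that idea, not the answerer's original example. It assumes hypothetical markup in which each coupon has a `coupon-title` and a `coupon-code` class; those class names and the sample HTML are invented for illustration and would have to be replaced with whatever the target site actually uses.

```php
<?php
// Hypothetical markup: the class names "coupon", "coupon-title", and
// "coupon-code" are assumptions for illustration, not taken from a real site.
$html = <<<'HTML'
<div class="coupon featured">
  <h3 class="coupon-title">10% off shoes</h3>
  <p>Some promotional text</p>
  <span class="coupon-code">SHOES10</span>
</div>
<div class="coupon">
  <h3 class="coupon-title">Free shipping</h3>
  <span class="coupon-code">SHIPFREE</span>
</div>
HTML;

// Anchor on the class names, and use lazy ".*?" with the "s" modifier so
// that any markup (including newlines) may sit between the two fields.
$pattern = '~class="coupon-title"[^>]*>(?<title>.*?)<'
         . '.*?class="coupon-code"[^>]*>(?<code>.*?)<~s';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$coupons = [];
foreach ($matches as $m) {
    $coupons[] = [
        'title' => trim(html_entity_decode($m['title'])),
        'code'  => trim(html_entity_decode($m['code'])),
    ];
}

print_r($coupons);
```

The pattern tolerates extra attributes, wrapper elements, and unrelated text between the title and the code. One weakness worth knowing: if a coupon has no code element, the lazy match runs on into the next coupon and pairs the wrong fields.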
If you prefer working with PHP, the most reliable method is the PHP DOM parser.
Here is an example of parsing only the elements.
I am also providing some more information about parsing other HTML elements.
I hope that will be useful to you.
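As a rough illustration of the DOM approach, here is a sketch using PHP's built-in `DOMDocument` and `DOMXPath` classes (this requires the DOM extension, which is enabled by default in most builds). The sample markup and the class names `coupon`, `coupon-title`, and `coupon-code` are hypothetical; a real site will need its own queries.

```php
<?php
// Hypothetical markup: class names are assumptions for illustration only.
$html = <<<'HTML'
<div class="coupon featured">
  <h3 class="coupon-title">10% off shoes</h3>
  <p>Some promotional text</p>
  <span class="coupon-code">SHOES10</span>
</div>
<div class="coupon">
  <h3 class="coupon-title">Free shipping</h3>
  <span class="coupon-code">SHIPFREE</span>
</div>
HTML;

// Real-world HTML is rarely valid, so collect libxml warnings internally
// instead of letting them be emitted as PHP warnings.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Select every <div> whose class attribute contains the token "coupon".
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " coupon ")]';

$coupons = [];
foreach ($xpath->query($query) as $node) {
    // Look up each field relative to the current coupon node.
    $title = $xpath->query('.//*[@class="coupon-title"]', $node)->item(0);
    $code  = $xpath->query('.//*[@class="coupon-code"]', $node)->item(0);

    $coupons[] = [
        'title' => $title ? trim($title->textContent) : null,
        'code'  => $code ? trim($code->textContent) : null,
    ];
}

print_r($coupons);
```

Because each field is looked up inside its own coupon node, a missing code yields `null` for that coupon rather than borrowing a value from the next one. In practice the `$html` string would come from the fetched page rather than an inline sample.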