Parsing HTML with PHP to get data for multiple articles of the same type

Posted on 2024-12-22 01:55:38

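I am building a website that parses coupon sites and lists their coupons. Some sites provide their listings as XML files, and those are no problem. Others do not provide XML, so I am thinking of parsing their pages and extracting the coupon information from the HTML with PHP. For example, see the following site: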

http://www.biglion.ru/moscow/

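I am using PHP. So my question is: is there a relatively simple way to parse HTML and get the data for each coupon listed on that site, just as I get it when parsing XML?

Thanks for your help.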


时光瘦了 2024-12-29 01:55:38

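You can always use a DOM parser, but scraping content from a website is unreliable at best.

If their layout changes even slightly, your application may break. Also, in most cases scraping goes against the site's terms of service.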
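
As a rough illustration of the DOM-parser route, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath classes. The sample markup, the class names (coupon, title, discount), and the keys of the result array are all made up for the example; a real page will need its own XPath expressions.

```php
<?php
// Hypothetical sample markup; a real coupon page will use different tags and classes.
$html = <<<'HTML'
<div class="coupon">
  <h2 class="title"><a href="/deal/1">Sushi set</a></h2>
  <span class="discount">50%</span>
</div>
<div class="coupon">
  <h2 class="title"><a href="/deal/2">Spa day</a></h2>
  <span class="discount">70%</span>
</div>
HTML;

$doc = new DOMDocument();
// Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$coupons = [];

// Select every <div> whose class list contains the (assumed) "coupon" class.
$blocks = $xpath->query('//div[contains(concat(" ", normalize-space(@class), " "), " coupon ")]');
foreach ($blocks as $node) {
    // Look up the first link and the discount label inside this block only.
    $link = $xpath->query('.//a', $node)->item(0);
    $discount = $xpath->query('.//span[@class="discount"]', $node)->item(0);

    $coupons[] = [
        'title'    => $link ? trim($link->textContent) : null,
        'url'      => $link ? $link->getAttribute('href') : null,
        'discount' => $discount ? trim($discount->textContent) : null,
    ];
}

print_r($coupons);
```

Each coupon ends up as a plain array, much like a record read from an XML feed, so the rest of the site can treat both sources the same way. The XPath expressions are the part that breaks when the layout changes.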


不及他 2024-12-29 01:55:38

While using a DOM parser might seem a good idea, I usually prefer good old regular expressions for scraping. It's much less work, and if the site changes its layout you're screwed anyway, whatever your approach is. But with a smart enough regex, your code should be immune to changes that do not directly affect the part you're interested in.

One thing to remember is to anchor the regex on class names when the markup provides them, but to assume that anything can appear between the pieces of information you need. E.g.

// Anchor on the class name, then lazily skip ahead to the first link after it
$html = file_get_contents('http://www.biglion.ru/moscow/');
preg_match_all('#class="actionsItemHeadding".*?<a[^>]*href="([^"]*)"[^>]*>(.*?)</a>#s', $html, $matches, PREG_SET_ORDER);
print_r($matches);
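
To see what "anything can be between" buys you, here is a small sketch that runs the same pattern against an inline sample instead of the live site. The sample markup is invented for the illustration (only the actionsItemHeadding class name comes from the snippet above); the second listing wraps its link in extra tags and attributes, and the pattern still picks it up.

```php
<?php
// Hypothetical markup: two listings with different wrappers around the link.
$html = <<<'HTML'
<div class="actionsItemHeadding"><a href="/deal/1">Sushi set</a></div>
<div class="actionsItemHeadding">
  <span class="badge">new</span>
  <h3><a class="title" href="/deal/2">Spa day</a></h3>
</div>
HTML;

// Same pattern as above: class name as the anchor, lazy ".*?" up to the next link.
$pattern = '#class="actionsItemHeadding".*?<a[^>]*href="([^"]*)"[^>]*>(.*?)</a>#s';
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] is the href, $m[2] is the link's inner HTML (any nested tags are stripped here).
    echo $m[1], ' => ', trim(strip_tags($m[2])), PHP_EOL;
}
```

With PREG_SET_ORDER each element of $matches is one listing, which makes the loop straightforward. Note that the pattern would also stop at any other tag starting with "<a" (such as <abbr>), so it is worth checking against the real markup.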


我们只是彼此的过ke 2024-12-29 01:55:38

If you prefer working with PHP, the most reliable method is the PHP Simple HTML DOM Parser (simple_html_dom.php).
Here is an example that parses only the <a> elements.

// Include the library
include('simple_html_dom.php');

// Retrieve the DOM from a given URL
$html = file_get_html('http://mypage.com/');
// Find all "A" tags and print their HREFs
foreach ($html->find('a') as $e) {
    echo $e->href . '<br>';
}

I am providing some more information about parsing the other html elements too.
I hope that will be useful to you.
