PHP Magento screen scraping

Posted on 2024-10-10 16:09:23

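I am trying to scrape a supplier's Magento site to save some time, since I need to collect information on about 2000 products. I am perfectly comfortable writing a screen scraper for almost anything, but I have hit one major problem. I am using file_get_contents to fetch the HTML of the product pages.

The problem:

You need to be logged in to view the product pages. It is a standard Magento login, so how do I get around this in a screen scraper? I don't need a complete script, just advice on the approach.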

Answer by 怕倦, 2024-10-17 16:09:23

With stream_context_create you can specify the headers to send when you call file_get_contents.

What I'd suggest: open your browser and log in to the site. Then open Firebug (or your favorite cookie viewer), copy the session cookies, and send them with your request.

Edit: Here's an example from PHP.net:

<?php
// Create a stream
$opts = array(
  'http'=>array(
    'method'=>"GET",
    'header'=>"Accept-language: en\r\n" .
              "Cookie: foo=bar\r\n"
  )
);

$context = stream_context_create($opts);

// Open the file using the HTTP headers set above
$file = file_get_contents('http://www.example.com/', false, $context);
?>
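To apply this to the logged-in Magento pages, the cookies copied from the browser can be joined into a single Cookie header. The following is only a rough sketch: build_cookie_header and fetch_with_cookies are hypothetical helper names, and the cookie name `frontend` and the product URL in the usage comment are assumptions, so use whatever names and values your cookie viewer actually shows.

```php
<?php
// Hypothetical helper: join name => value pairs into one Cookie header line.
function build_cookie_header(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return 'Cookie: ' . implode('; ', $pairs);
}

// Hypothetical helper: fetch a page while sending the given cookies.
// Returns the HTML, or null if the request failed.
function fetch_with_cookies(string $url, array $cookies): ?string
{
    $context = stream_context_create([
        'http' => [
            'method'  => 'GET',
            'header'  => build_cookie_header($cookies) . "\r\n",
            'timeout' => 30,
        ],
    ]);
    $html = @file_get_contents($url, false, $context);
    return $html === false ? null : $html;
}

// Usage (assumed cookie name and placeholder URL):
// $html = fetch_with_cookies(
//     'http://www.example.com/some-product.html',
//     ['frontend' => 'value-copied-from-the-browser']
// );
```

Keep in mind that a copied session cookie expires eventually, so a long run over 2000 products may need a fresh cookie partway through.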

Edit (2): This is outside the scope of your question, but if you are wondering how to scrape the pages afterwards, look into the DOMDocument::loadHTML method. Once the HTML is loaded, you have the functions you need (e.g. XPath queries via DOMXPath, getElementsByTagName, getElementById) to extract what you need.
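As an illustration of the DOMDocument approach, here is a minimal sketch that loads a small HTML string and pulls out two values with XPath. The markup, the class names `product-name` and `price`, and the sample values are made up for the example; a real Magento theme will need its own expressions.

```php
<?php
// Sample markup standing in for a fetched product page (made up for illustration).
$html = '<html><body>
  <div id="product">
    <h1 class="product-name">Sample Widget</h1>
    <span class="price">$19.99</span>
  </div>
</body></html>';

$doc = new DOMDocument();
// Real-world HTML is rarely valid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Query the parsed document with XPath expressions (assumed class names).
$xpath = new DOMXPath($doc);
$nameNode  = $xpath->query('//h1[@class="product-name"]')->item(0);
$priceNode = $xpath->query('//span[@class="price"]')->item(0);

// Fall back to null when an element is missing from the page.
$product = [
    'name'  => $nameNode ? trim($nameNode->textContent) : null,
    'price' => $priceNode ? trim($priceNode->textContent) : null,
];
print_r($product);
```

Checking for a missing node before reading textContent matters in practice, because some product pages may lack a field that others have.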

If you only need to scrape something simple, you can also use a regular expression with preg_match_all.
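For the simple case, a sketch along these lines may be enough. The `price` span markup is an assumption made for illustration, and a pattern like this breaks as soon as the markup varies, which is why the DOM approach is usually the safer choice.

```php
<?php
// Made-up fragment standing in for part of a fetched page.
$html = '<span class="price">$19.99</span> <span class="price">$5.00</span>';

// Capture the text inside every <span class="price">...</span>.
preg_match_all('/<span class="price">\s*([^<]+?)\s*<\/span>/', $html, $matches);

// $matches[1] holds the contents of the first capture group for each match.
$prices = $matches[1];
print_r($prices);
```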
