Scraping with wildcards and PHP

Posted 2024-11-05 12:32:51 · 695 characters · 0 views · 0 comments

I'm having a hard time working out how to scrape this page for the words themselves: http://www.morewords.com/ends-with/aw. Given a URL, I'd like to fetch the contents and then build a PHP array of all the words listed, which in the source look like

<a href="/word/word1/">word1</a><br />
<a href="/word/word2/">word2</a><br />
<a href="/word/word3/">word3</a><br />
<a href="/word/word4/">word4</a><br />

I have been considering a few ways of doing this, and I'd appreciate help deciding which is the most efficient. I'd also appreciate any advice or examples on how to achieve it. I understand it's not incredibly complicated, but I could use the help of you advanced hackers.

  • Use some sort of jQuery $.each() loop to collect the words into a JS array, then transfer them to PHP (probably heavily taxing)
  • Use some sort of cURL request (I don't really have much experience with cURL)
  • Use some sophisticated find-and-replace with regex

Comments (1)

甜尕妞 2024-11-12 12:32:51

You tagged it as PHP, so here is a PHP solution :)

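// Load the page straight from the URL (requires allow_url_fopen).
libxml_use_internal_errors(true); // suppress warnings from malformed HTML
$dom = new DOMDocument;
$dom->loadHTMLFile('http://www.morewords.com/ends-with/aw');

// Keep the text of every link whose href looks like /word/<word>/.
$words = array();
foreach ($dom->getElementsByTagName('a') as $anchor) {
    if ($anchor->hasAttribute('href') && preg_match('~/word/\w+/~', $anchor->getAttribute('href'))) {
        $words[] = $anchor->nodeValue;
    }
}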

CodePad.

If allow_url_fopen is disabled in php.ini, you could use cURL to get the HTML.

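$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'http://www.morewords.com/ends-with/aw');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
$html = curl_exec($curl); // false on failure
curl_close($curl);
// Then call $dom->loadHTML($html) in place of $dom->loadHTMLFile(...).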
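
The two snippets above can be combined into one sketch. The helper names `fetch_html()` and `extract_words()` are made up for illustration and are not part of the original answer; the link pattern is the same one used above, and the sample markup is shaped like the lines quoted in the question so the parsing step can be checked without touching the network.

```php
<?php
// Hypothetical helper: fetch a page with cURL; returns null if the request fails.
function fetch_html(string $url): ?string
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($curl);
    curl_close($curl);
    return is_string($html) ? $html : null;
}

// Hypothetical helper: return the text of every <a> whose href looks like /word/<word>/.
function extract_words(string $html): array
{
    $previous = libxml_use_internal_errors(true); // real pages are rarely valid HTML
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $words = array();
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        if (preg_match('~/word/\w+/~', $anchor->getAttribute('href'))) {
            $words[] = trim($anchor->nodeValue);
        }
    }
    return $words;
}

// Offline check against markup shaped like the question's sample.
$sample = '<a href="/word/word1/">word1</a><br />'
        . '<a href="/word/word2/">word2</a><br />'
        . '<a href="/about/">About</a>';
$words = extract_words($sample);
print_r($words); // contains word1 and word2, but not "About"
```

Against the live site this would presumably be called as `extract_words(fetch_html($url))` after checking that `fetch_html()` did not return null. Keeping the fetch and the parse separate means either the `loadHTMLFile` route or the cURL route can feed the same extraction code.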