Scraping with wildcards and PHP

Posted on 2024-11-05 12:32:51

I have a hard time visualizing and conceiving a way to scrape this page: http://www.morewords.com/ends-with/aw for the words themselves. Given a URL, I'd like to get the contents and then generate a PHP array with all the words listed, which in the source look like

<a href="/word/word1/">word1</a><br />
<a href="/word/word2/">word2</a><br />
<a href="/word/word3/">word3</a><br />
<a href="/word/word4/">word4</a><br />

There are a few ways I have been thinking about doing this, and I'd appreciate it if you could help me decide on the most efficient one. I'd also appreciate any advice or examples on how to achieve this. I understand it's not incredibly complicated, but I could use the help of you advanced hackers.

Use some sort of jQuery $.each() to loop through and somehow cast them into a JS array, and then transcribe it (probably heavily taxing)
Use some sort of cURL (I don't really have much experience with cURL)
Use some sophisticated find and replace with regex.
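
For reference, the third option could look roughly like the sketch below. It is only a minimal illustration run against a hard-coded copy of the markup shown above; the `/about/` link is an invented extra to show that non-word links are skipped, and the pattern assumes the anchors are written exactly as in the sample (same attribute order, double quotes).

```php
<?php
// Sample markup modelled on the snippet in the question; a real page would be fetched first.
// The /about/ link is a made-up example of a link that should NOT be captured.
$html = <<<HTML
<a href="/word/word1/">word1</a><br />
<a href="/word/word2/">word2</a><br />
<a href="/about/">About</a><br />
HTML;

// Capture the link text of every anchor whose href has the form /word/<something>/.
preg_match_all('~<a href="/word/[^/"]+/">([^<]+)</a>~', $html, $matches);

// $matches[1] holds the contents of the first capture group, one entry per match.
$words = $matches[1];
print_r($words);
```

A pattern like this breaks as soon as the site adds an attribute or changes its quoting, which is the usual argument for a DOM parser over regex.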

甜尕妞 2024-11-12 12:32:51

You tagged it as PHP, so here is a PHP solution :)

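// Collect libxml warnings internally; real-world HTML is rarely valid.
libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTMLFile('http://www.morewords.com/ends-with/aw');

$anchors = $dom->getElementsByTagName('a');

$words = array();

foreach ($anchors as $anchor) {
    // Keep only links whose href contains /word/<something>/.
    if ($anchor->hasAttribute('href') && preg_match('~/word/\w+/~', $anchor->getAttribute('href'))) {
        $words[] = $anchor->nodeValue;
    }
}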

If allow_url_fopen is disabled in php.ini, you could use cURL to get the HTML.

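$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'http://www.morewords.com/ends-with/aw');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
$html = curl_exec($curl);
curl_close($curl);

// Parse the fetched string with loadHTML() in place of loadHTMLFile().
$dom = new DOMDocument;
$dom->loadHTML($html);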
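
Putting the two pieces together, a minimal sketch might look like the following. The helper names `fetchHtml()` and `extractWords()` are hypothetical (they are not part of the answer or of any library), the timeout and redirect options are assumptions, and the sample string stands in for the real page so the parsing step can be tried offline.

```php
<?php
// Hypothetical helper: fetch a page with cURL, returning null on failure.
function fetchHtml(string $url): ?string
{
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true); // return the body as a string
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); // assumption: follow redirects
    curl_setopt($curl, CURLOPT_TIMEOUT, 10);          // assumption: 10 second limit
    $html = curl_exec($curl);                         // false on failure
    // The handle is released when $curl goes out of scope.
    return is_string($html) ? $html : null;
}

// Hypothetical helper: the same href filter as the loop above, applied to an HTML string.
function extractWords(string $html): array
{
    if ($html === '') {
        return array(); // loadHTML() rejects an empty string
    }

    $previous = libxml_use_internal_errors(true); // collect warnings instead of printing them
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $words = array();
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        // getAttribute() returns '' when href is missing, which never matches.
        if (preg_match('~/word/\w+/~', $anchor->getAttribute('href'))) {
            $words[] = trim($anchor->nodeValue);
        }
    }
    return $words;
}

// Offline check against a made-up fragment; /about/ is an invented non-word link.
$sample = '<a href="/word/word1/">word1</a><br /><a href="/about/">About</a>';
print_r(extractWords($sample));
```

Against the live site this would be called as `extractWords(fetchHtml($url) ?? '')`, so a failed request yields an empty array rather than an error.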
