strpos issue: getting the return value UBLIC
I am making a class to open a webpage and store the href values of all outbound links on the page. For some reason it works for the first 3, then goes weird. Below is my code:
    class Crawler {
        var $url;

        function construct($url) {
            $this->url = 'http://'.$url;
            $this->crawl();
        }

        function crawl() {
            $str = file_get_contents($this->url);
            $start = 0;
            for($i=0; $i<10; $i++) {
                $beg = strpos($str, '<a href="http://',$start)+16;
                $end = strpos($str,'"',$beg);
                $diff = $end - $beg;
                $links[$i] = substr($str,$beg, $diff);
                $start = $start + $beg;
            }
            print_r($links);
        }
    }

    $crawler = new Crawler;
    $crawler->construct('www.yahoo.com');
Ignore the for loop for the time being; I know this will only return the first 10 and won't cover the whole document. But if you run this code, the first 3 work fine, and then all the other values are UBLIC.
Can anyone help? Thanks
Comments (2)
Instead of:

try:

That's likely why you are only seeing the first three matches.

Also, you need to insert a check that `$beg` is not `FALSE`:

Note, however, that you really should be using `DOMDocument` to find all tags in a document with a given tag name (`a` here). In particular, because this is HTML that might not be valid XHTML, you should consider using the `loadHTML` method.
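As a rough illustration of the `DOMDocument` suggestion, here is a minimal sketch. The sample markup, the `extractHttpLinks()` helper name, and the choice to keep only `http://` links (mirroring the question's search string) are assumptions made for this example, not part of the original posts.

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$html = '<html><body>'
      . '<a href="http://example.com/a">A</a>'
      . '<a href="/relative">B</a>'
      . '<a name="anchor">no href</a>'
      . '<a href="http://example.org/b">C</a>'
      . '</body></html>';

/**
 * Collect href values starting with http:// from every <a> element.
 * extractHttpLinks() is an illustrative helper, not from the original post.
 */
function extractHttpLinks(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is often malformed; buffer parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        // getAttribute() returns an empty string when href is absent.
        $href = $anchor->getAttribute('href');
        if (strpos($href, 'http://') === 0) {
            $links[] = $href;
        }
    }
    return $links;
}

$links = extractHttpLinks($html);
print_r($links);
```

For a live page, the string passed in would come from something like `file_get_contents()`, as in the question. Unlike the `strpos` approach, this does not depend on `href` being the first attribute or on the exact quoting used in the markup.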
I think you have a problem in your logic:

You use `$start` to mark the place where to start looking for the href, but the resulting `$beg` will still be an index into the complete string. So when you update `$start` by adding `$beg`, you get values that are too high. You should try `$start = $beg + 1` instead of `$start = $start + $beg`.
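Putting the two suggestions together (treat `$start` as an absolute index, and test the `strpos` result against `false` before adding anything to it), the loop might look like the sketch below. The `findLinks()` helper and the sample page are assumptions for illustration. The sketch also hints at a plausible source of `UBLIC`: once `strpos` fails, `false + 16` evaluates to 16, and if the page begins with a `<!DOCTYPE html PUBLIC "...` declaration, index 16 is the `U` of `PUBLIC`.

```php
<?php
/**
 * Illustrative helper (not from the original post): collect up to $limit
 * link targets that follow '<a href="http://', scanning with strpos().
 */
function findLinks(string $str, int $limit = 10): array
{
    $needle = '<a href="http://';
    $links  = [];
    $start  = 0;

    while (count($links) < $limit) {
        $pos = strpos($str, $needle, $start);
        if ($pos === false) {
            break; // no more links; never add an offset to false
        }
        $beg = $pos + strlen($needle);   // same as the question's +16
        $end = strpos($str, '"', $beg);  // closing quote of the attribute
        if ($end === false) {
            break; // unterminated attribute
        }
        $links[] = substr($str, $beg, $end - $beg);
        $start = $end + 1; // absolute index: resume right after this link
    }
    return $links;
}

// Hypothetical page content; the DOCTYPE line is an assumption about
// what the fetched page looked like.
$page = '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">'
      . '<a href="http://a.example/">A</a> <a href="http://b.example/x">B</a>';

$links = findLinks($page);
print_r($links);
```

Like the question's code, this stores each target without the leading `http://`. Resuming from `$end + 1` rather than `$beg + 1` is a small variation on the suggestion above; both keep `$start` as an index into the complete string.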