Easy way to test a URL for 404 in PHP?
I'm teaching myself some basic scraping and I've found that sometimes the URLs that I feed into my code return 404, which gums up all the rest of my code.
So I need a test at the top of the code to check if the URL returns 404 or not.
This would seem like a pretty straightforward task, but Google's not giving me any answers. I worry I'm searching for the wrong stuff.
One blog recommended I use this:
$valid = @fsockopen($url, 80, $errno, $errstr, 30);
and then test to see if $valid is empty or not.
But I think the URL that's giving me problems has a redirect on it, so $valid is coming up empty for all values. Or perhaps I'm doing something else wrong.
I've also looked into a "head request" but I've yet to find any actual code examples I can play with or try out.
Suggestions? And what's this about curl?
As strager suggests, look into using cURL. You may also be interested in setting CURLOPT_NOBODY with curl_setopt to skip downloading the whole page (you just want the headers).
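A minimal sketch of that approach (the URL is just a placeholder); CURLOPT_NOBODY turns the request into a HEAD, so only the headers are transferred:

```php
<?php
// HEAD request with cURL; read the status code from curl_getinfo().
$url = 'http://example.com/some/page'; // placeholder URL

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_NOBODY, true);          // HEAD: skip downloading the body
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // don't echo the (empty) response
curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if ($status == 404) {
    echo "404 Not Found\n";
}
```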
If you are looking for the easiest solution, and one you can try in one go on PHP 5, do:
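Something like this (a sketch; the URL is a placeholder). get_headers() fetches just the response headers, and the first array element is the status line:

```php
<?php
// get_headers() (PHP >= 5) returns the response headers in one call.
// Element 0 is the status line, e.g. "HTTP/1.1 404 Not Found".
$headers = @get_headers('http://example.com/missing-page');

if ($headers && strpos($headers[0], '404') !== false) {
    echo "404 Not Found\n";
}
```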
This function returns the status code of a URL in PHP without downloading the body (the HTML) by using a HEAD request:
Example:
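One possible shape for such a function (a sketch; the function name and URL are mine, not from the original answer). It makes every http stream use HEAD, then parses the status line returned by get_headers():

```php
<?php
// Return the numeric HTTP status code for $url, or false on failure,
// without downloading the body.
function getHttpStatus($url)
{
    // Make http stream wrappers issue a HEAD request (headers only).
    stream_context_set_default(array('http' => array('method' => 'HEAD')));
    $headers = @get_headers($url);
    if ($headers === false) {
        return false;
    }
    // Status line looks like "HTTP/1.1 404 Not Found".
    return (int) substr($headers[0], 9, 3);
}

// Usage:
if (getHttpStatus('http://example.com/missing') === 404) {
    echo "Got a 404\n";
}
```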
I found this answer here:
Essentially, you use the "file get contents" method to retrieve the URL, which automatically populates the http response header variable with the status code.
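A sketch of that technique (placeholder URL; `ignore_errors` makes PHP hand back the body on 4xx instead of failing). After the call, the magic local variable $http_response_header holds the response headers:

```php
<?php
// file_get_contents() populates $http_response_header in the calling
// scope with the raw response headers, status line included.
$context = stream_context_create(array('http' => array('ignore_errors' => true)));
$body = @file_get_contents('http://example.com/maybe-missing', false, $context);

if (isset($http_response_header)) {
    // $http_response_header[0] is e.g. "HTTP/1.1 404 Not Found"
    preg_match('{HTTP/\S+ (\d{3})}', $http_response_header[0], $m);
    echo "Status: " . $m[1] . "\n";
}
```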
This will give you true if the URL does not return 200 OK:
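For instance, a one-function sketch of that check (my naming; it relies on the status line being the first element of get_headers()):

```php
<?php
// True when the URL does NOT answer 200 OK (or is unreachable).
function isNotOk($url)
{
    $headers = @get_headers($url);
    return $headers === false || strpos($headers[0], '200') === false;
}
```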
Addendum: I tested those 3 methods with performance in mind.
The result, at least in my testing environment:
cURL wins.
This test was done under the assumption that only the headers (noBody) are needed.
Test it yourself:
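The original timing code was lost; a rough harness might look like this (the three checker functions are assumed wrappers around cURL with CURLOPT_NOBODY, get_headers(), and a hand-written HEAD over fsockopen() — they are not defined here):

```php
<?php
// Time $runs header-only checks per method and print the elapsed seconds.
function benchmark($label, $fn, $url, $runs = 100)
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $fn($url);
    }
    printf("%-12s %.3fs\n", $label, microtime(true) - $start);
}

// The checker function names below are hypothetical.
benchmark('curl',        'checkWithCurl',       'http://example.com/');
benchmark('get_headers', 'checkWithGetHeaders', 'http://example.com/');
benchmark('fsockopen',   'checkWithFsockopen',  'http://example.com/');
```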
Here is a short solution.
In your case, you can change application/rdf+xml to whatever you use.
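The snippet itself was lost; a sketch of what such a short check might look like (placeholder URL; the Accept header mirrors the application/rdf+xml mentioned above and can be swapped for whatever content type you request):

```php
<?php
// HEAD request with a custom Accept header via a stream context.
$context = stream_context_create(array(
    'http' => array(
        'method' => 'HEAD',
        'header' => "Accept: application/rdf+xml\r\n",
    ),
));
// The third (context) argument of get_headers() needs PHP >= 7.1.
$headers = @get_headers('http://example.com/data.rdf', 0, $context);

$is404 = ($headers !== false && strpos($headers[0], '404') !== false);
```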
As an additional hint to the great accepted answer:
When using a variation of the proposed solution, I got errors because of the PHP setting 'max_execution_time'. So what I did was the following:
First I set the time limit to a higher number of seconds; at the end I set it back to the value defined in the PHP settings.
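In outline, that workaround might look like this (the 120-second limit is an arbitrary example value):

```php
<?php
// Raise the execution limit, run the slow URL checks, then restore
// the value configured in php.ini.
$configured = ini_get('max_execution_time');

set_time_limit(120);                 // allow up to 120s for the checks
// ... run the URL status checks here ...
set_time_limit((int) $configured);   // back to the configured value
```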
You can use this code, too, to see the status of any link:
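The code itself is missing; a lower-level sketch in the same spirit sends a HEAD request over a raw socket and reads back the status line (host and path are placeholders):

```php
<?php
// Open the socket yourself, send a HEAD request, read the status line.
$host = 'example.com';
$path = '/some/page';

$fp = @fsockopen($host, 80, $errno, $errstr, 30);
if ($fp) {
    fwrite($fp, "HEAD $path HTTP/1.1\r\nHost: $host\r\nConnection: close\r\n\r\n");
    $statusLine = fgets($fp);   // e.g. "HTTP/1.1 404 Not Found"
    fclose($fp);
    echo $statusLine;
}
```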
Here's a way!
This simple script just makes a request to the URL for its source code. If the request completes successfully, it will output "URL Exists!". If not, it will output "URL Doesn't Exist!".
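Reconstructed as a sketch (placeholder URL; note file_get_contents() returns false on a 404 by default, which is what makes this work):

```php
<?php
// Try to fetch the page source; any successful fetch counts as "exists".
$url = 'http://example.com/';

if (@file_get_contents($url) !== false) {
    echo "URL Exists!";
} else {
    echo "URL Doesn't Exist!";
}
```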
This is just a slice of code;
hope it works for you.
If you are using PHP's curl bindings, you can check the error code using curl_getinfo as such:
If you're running PHP 5 you can use:
Alternatively with PHP 4 a user has contributed the following:
Both would have a result similar to:
Therefore you could just check that the header response was OK, e.g.:
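Pieced together as a sketch (placeholder URL; the sample output is illustrative, not captured from a real run): the PHP 5 call, the shape of its result, and the check on the status line:

```php
<?php
// PHP 5: get_headers() returns the raw response headers as an array.
$headers = get_headers('http://example.com/');

print_r($headers);
/* Result similar to:
Array
(
    [0] => HTTP/1.1 200 OK
    [1] => Date: ...
    [2] => Content-Type: text/html
    ...
)
*/

// Therefore, just check the status line:
if (strpos($headers[0], '200 OK') !== false) {
    echo "Page is OK\n";
}
```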
W3C Codes and Definitions
With strager's code, you can also check CURLINFO_HTTP_CODE for other codes. Some websites do not report a 404; rather, they simply redirect to a custom 404 page and return 302 (redirect) or something similar. I used this to check whether an actual file (e.g. robots.txt) existed on the server or not. Clearly this kind of file would not cause a redirect if it existed, but if it didn't, it would redirect to a 404 page, which as I said before may not have a 404 code.
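A sketch of that variation (placeholder URL): by not following redirects, a site that 302-redirects missing files to a custom error page is still caught:

```php
<?php
// HEAD request without following redirects, so a 302 stays a 302.
$ch = curl_init('http://example.com/robots.txt');
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
curl_exec($ch);
$code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

// Anything other than 200 (404, 302, ...) means the file isn't there.
$fileExists = ($code == 200);
```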