cURL and file_get_contents() not working on some websites
I am trying to scrape this website: https://bartleby.com. I wrote the code with Python requests and it works. However, I need to convert it to PHP because I want the result printed on my website and my cPanel host does not run Python, so I am forced to use cURL. The PHP version does not work; the code below returns:
Not Found
This page you were trying to reach at this address doesn't seem to exist.
What can I do now?
Sign up for your own free account.
So I am wondering how this website blocks cURL in PHP but not Requests in Python. Are there any undetectable alternatives to cURL in PHP? Thanks.
My PHP Code (Not Working):
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.bartleby.com/questions-and-answers/1.-a-given-the-lines-l-7-124-tk-13k-1-k-3-and-l-x2-3s-y-1-10s-z-3-5s-determine-the-values-of-k-if-po/b88e3e3d-bfd6-4158-8335-6a3ca420430e');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
curl_setopt($ch, CURLOPT_HTTPHEADER, [
'authority' => 'www.bartleby.com',
'accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-language' => 'en-US;q=0.6',
'cache-control' => 'max-age=0',
'sec-fetch-dest' => 'document',
'sec-fetch-mode' => 'navigate',
'sec-fetch-site' => 'same-origin',
'sec-fetch-user' => '?1',
'sec-gpc' => '1',
'upgrade-insecure-requests' => '1',
'user-agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.115 Safari/537.36',
'Accept-Encoding' => 'gzip',
]);
curl_setopt($ch, CURLOPT_COOKIE, 'G_ENABLED_IDPS=google; refreshToken=330bb387263aa6673c3e39e975d729f723b38002; userId=4c28bc2c-1eec-4d2c-b44d-7bfa78216ba3; userStatus=A1; promotionId=; sku=bb999_bookstore; endCycleWhenQuestionsRemainingWasClosed=2022-06-19T07:00:00.000Z; btbHomeDashboardTooltipAnimationCount=0; isNoQuestionAskedModalClosed=true; accessToken=34ceed9609a07bd0238a74b5650d5c5362990498; bartlebyRefreshTokenExpiresAt=2022-07-16T12:37:57.217Z; btbHomeDashboardAnimationTriggerDate=2022-06-17T12:39:25.907Z; OptanonConsent=isGpcEnabled=1&datestamp=Thu+Jun+16+2022+20%3A39%3A43+GMT%2B0800+(China+Standard+Time)&version=6.32.0&isIABGlobal=false&hosts=&consentId=9432e357-0639-4883-9f99-39bed0bb5cd9&interactionCount=0&landingPath=NotLandingPage&groups=C0001%3A1%2CC0003%3A1%2CBG142%3A0%2CC0002%3A0%2CC0005%3A0%2CC0004%3A0&AwaitingReconsent=false');
$response = curl_exec($ch);
echo $response;
curl_close($ch);
I also tried file_get_contents(), but it returns an error: Warning: file_get_contents(https://bartleby.com): Failed to open stream: HTTP request failed! HTTP/1.1 503 Service Temporarily Unavailable in D:\xampp\htdocs\bartleby\index.php on line 11
Line 11 is: $response = file_get_contents($url, false, stream_context_create($arrContextOptions));
Full code (Not Working):
<?php
$url= 'https://bartleby.com';
$arrContextOptions=array(
"ssl"=>array(
"verify_peer"=>false,
"verify_peer_name"=>false,
),
);
$response = file_get_contents($url, false, stream_context_create($arrContextOptions));
echo $response;
My Python Code (Working):
import requests
cookies = {
'G_ENABLED_IDPS': 'google',
'refreshToken': '330bb387263aa6673c3e39e975d729f723b38002',
'userId': '4c28bc2c-1eec-4d2c-b44d-7bfa78216ba3',
'userStatus': 'A1',
'promotionId': '',
'sku': 'bb999_bookstore',
'endCycleWhenQuestionsRemainingWasClosed': '2022-06-19T07:00:00.000Z',
'btbHomeDashboardTooltipAnimationCount': '0',
'isNoQuestionAskedModalClosed': 'true',
'accessToken': '34ceed9609a07bd0238a74b5650d5c5362990498',
'bartlebyRefreshTokenExpiresAt': '2022-07-16T12:37:57.217Z',
'btbHomeDashboardAnimationTriggerDate': '2022-06-17T12:39:25.907Z',
'OptanonConsent': 'isGpcEnabled=1&datestamp=Thu+Jun+16+2022+20%3A39%3A43+GMT%2B0800+(China+Standard+Time)&version=6.32.0&isIABGlobal=false&hosts=&consentId=9432e357-0639-4883-9f99-39bed0bb5cd9&interactionCount=0&landingPath=NotLandingPage&groups=C0001%3A1%2CC0003%3A1%2CBG142%3A0%2CC0002%3A0%2CC0005%3A0%2CC0004%3A0&AwaitingReconsent=false',
}
headers = {
'authority': 'www.bartleby.com',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-language': 'en-US;q=0.6',
'cache-control': 'max-age=0',
# Requests sorts cookies= alphabetically
# 'cookie': 'G_ENABLED_IDPS=google; refreshToken=330bb387263aa6673c3e39e975d729f723b38002; userId=4c28bc2c-1eec-4d2c-b44d-7bfa78216ba3; userStatus=A1; promotionId=; sku=bb999_bookstore; endCycleWhenQuestionsRemainingWasClosed=2022-06-19T07:00:00.000Z; btbHomeDashboardTooltipAnimationCount=0; isNoQuestionAskedModalClosed=true; accessToken=34ceed9609a07bd0238a74b5650d5c5362990498; bartlebyRefreshTokenExpiresAt=2022-07-16T12:37:57.217Z; btbHomeDashboardAnimationTriggerDate=2022-06-17T12:39:25.907Z; OptanonConsent=isGpcEnabled=1&datestamp=Thu+Jun+16+2022+20%3A39%3A43+GMT%2B0800+(China+Standard+Time)&version=6.32.0&isIABGlobal=false&hosts=&consentId=9432e357-0639-4883-9f99-39bed0bb5cd9&interactionCount=0&landingPath=NotLandingPage&groups=C0001%3A1%2CC0003%3A1%2CBG142%3A0%2CC0002%3A0%2CC0005%3A0%2CC0004%3A0&AwaitingReconsent=false',
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'same-origin',
'sec-fetch-user': '?1',
'sec-gpc': '1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.115 Safari/537.36',
}
response = requests.get('https://www.bartleby.com/questions-and-answers/1.-a-given-the-lines-l-7-124-tk-13k-1-k-3-and-l-x2-3s-y-1-10s-z-3-5s-determine-the-values-of-k-if-po/b88e3e3d-bfd6-4158-8335-6a3ca420430e', cookies=cookies, headers=headers)
print(response.text)
Comments (1)
You did not set a user agent.
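It looks like the website requires a user agent from a real browser, such as Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0.
With the user agent actually sent, the request works for me.
Your code sets the request headers using the wrong array format. Passing an associative array ('user-agent' => '...') to CURLOPT_HTTPHEADER is wrong: the keys are ignored, so the header names never reach the request.
It should be a plain list of strings, each in the form 'Name: value'.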
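To illustrate the header format that CURLOPT_HTTPHEADER expects, here is a minimal sketch. It is not the answerer's original code: build_header_lines() and fetch_page() are hypothetical helper names introduced only for this example, and the header values are shortened versions of the ones in the question.

```php
<?php
// Hypothetical helper (an assumption for this sketch): turn a
// name => value map into the list of "Name: value" strings
// that CURLOPT_HTTPHEADER expects.
function build_header_lines(array $headers): array
{
    $lines = [];
    foreach ($headers as $name => $value) {
        $lines[] = $name . ': ' . $value;
    }
    return $lines;
}

// Hypothetical helper: perform a GET request with the given header lines.
// Returns the response body, or false on failure.
function fetch_page(string $url, array $headerLines)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // A list of strings, not a key => value array.
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headerLines);
    return curl_exec($ch);
}

$headers = [
    'accept' => 'text/html,application/xhtml+xml',
    'accept-language' => 'en-US;q=0.6',
    'user-agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0',
];

$headerLines = build_header_lines($headers);
// $headerLines now holds three strings such as "accept-language: en-US;q=0.6".

// Performs a real network request, so it is left commented out:
// echo fetch_page('https://www.bartleby.com/', $headerLines);
```

Keeping the headers in a map and converting them at the last moment makes it easy to port a Python requests headers dict without hand-editing every line.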
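You can use
$reqHeaders = curl_getinfo($ch, CURLINFO_HEADER_OUT);
to debug the request headers (for this to work, CURLINFO_HEADER_OUT must be set to true with curl_setopt() before the request is sent). Your current code does not send a user-agent header at all, which is why it doesn't work.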
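The debugging step can be sketched as follows. This is an illustration under assumptions rather than the answerer's code: fetch_with_header_debug() is a hypothetical name, and the shape of the returned array is chosen only for this example.

```php
<?php
// Sketch: send a GET request and report what cURL actually put on the wire.
// fetch_with_header_debug() is a hypothetical name used for illustration.
function fetch_with_header_debug(string $url, array $headerLines): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headerLines);
    // Must be enabled before curl_exec(); otherwise curl_getinfo()
    // has no request headers to report.
    curl_setopt($ch, CURLINFO_HEADER_OUT, true);

    $body = curl_exec($ch);

    return [
        'body'   => $body,                                  // false on failure
        'error'  => curl_error($ch),                        // empty string when no error occurred
        'status' => curl_getinfo($ch, CURLINFO_HTTP_CODE),  // HTTP status code of the response
        'sent'   => curl_getinfo($ch, CURLINFO_HEADER_OUT), // raw request headers as sent
    ];
}

// Example call (performs a real network request, so it is left commented out):
// $result = fetch_with_header_debug('https://www.bartleby.com/', [
//     'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0',
// ]);
// echo $result['sent'];   // check whether a "User-Agent:" line is present
```

If the printed request shows no User-Agent line, the header array is still in the wrong format.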