cURL and file_get_contents() do not work on some websites

Posted on 2025-02-07 19:04:09

I am trying to scrape this website: https://bartleby.com. I wrote a script using Python requests and it works. However, I am trying to convert it to PHP because I want the result printed on my website and my cPanel host does not run Python, so I am forced to use cURL. That did not work; the code below returns:

Not Found
This page you were trying to reach at this address doesn't seem to exist.
What can I do now?
Sign up for your own free account.

So I am wondering how this website blocks cURL in PHP but not Requests in Python. Are there any undetectable alternatives to cURL in PHP? Thanks.

My PHP code (not working):

<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.bartleby.com/questions-and-answers/1.-a-given-the-lines-l-7-124-tk-13k-1-k-3-and-l-x2-3s-y-1-10s-z-3-5s-determine-the-values-of-k-if-po/b88e3e3d-bfd6-4158-8335-6a3ca420430e');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
curl_setopt($ch, CURLOPT_HTTPHEADER, [
    'authority' => 'www.bartleby.com',
    'accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-language' => 'en-US;q=0.6',
    'cache-control' => 'max-age=0',
    'sec-fetch-dest' => 'document',
    'sec-fetch-mode' => 'navigate',
    'sec-fetch-site' => 'same-origin',
    'sec-fetch-user' => '?1',
    'sec-gpc' => '1',
    'upgrade-insecure-requests' => '1',
    'user-agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.115 Safari/537.36',
    'Accept-Encoding' => 'gzip',
]);
curl_setopt($ch, CURLOPT_COOKIE, 'G_ENABLED_IDPS=google; refreshToken=330bb387263aa6673c3e39e975d729f723b38002; userId=4c28bc2c-1eec-4d2c-b44d-7bfa78216ba3; userStatus=A1; promotionId=; sku=bb999_bookstore; endCycleWhenQuestionsRemainingWasClosed=2022-06-19T07:00:00.000Z; btbHomeDashboardTooltipAnimationCount=0; isNoQuestionAskedModalClosed=true; accessToken=34ceed9609a07bd0238a74b5650d5c5362990498; bartlebyRefreshTokenExpiresAt=2022-07-16T12:37:57.217Z; btbHomeDashboardAnimationTriggerDate=2022-06-17T12:39:25.907Z; OptanonConsent=isGpcEnabled=1&datestamp=Thu+Jun+16+2022+20%3A39%3A43+GMT%2B0800+(China+Standard+Time)&version=6.32.0&isIABGlobal=false&hosts=&consentId=9432e357-0639-4883-9f99-39bed0bb5cd9&interactionCount=0&landingPath=NotLandingPage&groups=C0001%3A1%2CC0003%3A1%2CBG142%3A0%2CC0002%3A0%2CC0005%3A0%2CC0004%3A0&AwaitingReconsent=false');

$response = curl_exec($ch);
echo $response;
curl_close($ch);

I also tried to use file_get_contents() but it returns an error: Warning: file_get_contents(https://bartleby.com): Failed to open stream: HTTP request failed! HTTP/1.1 503 Service Temporarily Unavailable in D:\xampp\htdocs\bartleby\index.php on line 11

Line 11 is $response = file_get_contents($url, false, stream_context_create($arrContextOptions));

Full code (Not Working):

<?php
$url= 'https://bartleby.com';

$arrContextOptions=array(
      "ssl"=>array(
            "verify_peer"=>false,
            "verify_peer_name"=>false,
        ),
    );  

$response = file_get_contents($url, false, stream_context_create($arrContextOptions));
echo $response;

My Python Code (Working):

import requests

cookies = {
    'G_ENABLED_IDPS': 'google',
    'refreshToken': '330bb387263aa6673c3e39e975d729f723b38002',
    'userId': '4c28bc2c-1eec-4d2c-b44d-7bfa78216ba3',
    'userStatus': 'A1',
    'promotionId': '',
    'sku': 'bb999_bookstore',
    'endCycleWhenQuestionsRemainingWasClosed': '2022-06-19T07:00:00.000Z',
    'btbHomeDashboardTooltipAnimationCount': '0',
    'isNoQuestionAskedModalClosed': 'true',
    'accessToken': '34ceed9609a07bd0238a74b5650d5c5362990498',
    'bartlebyRefreshTokenExpiresAt': '2022-07-16T12:37:57.217Z',
    'btbHomeDashboardAnimationTriggerDate': '2022-06-17T12:39:25.907Z',
    'OptanonConsent': 'isGpcEnabled=1&datestamp=Thu+Jun+16+2022+20%3A39%3A43+GMT%2B0800+(China+Standard+Time)&version=6.32.0&isIABGlobal=false&hosts=&consentId=9432e357-0639-4883-9f99-39bed0bb5cd9&interactionCount=0&landingPath=NotLandingPage&groups=C0001%3A1%2CC0003%3A1%2CBG142%3A0%2CC0002%3A0%2CC0005%3A0%2CC0004%3A0&AwaitingReconsent=false',
}

headers = {
    'authority': 'www.bartleby.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-language': 'en-US;q=0.6',
    'cache-control': 'max-age=0',
    # Requests sorts cookies= alphabetically
    # 'cookie': 'G_ENABLED_IDPS=google; refreshToken=330bb387263aa6673c3e39e975d729f723b38002; userId=4c28bc2c-1eec-4d2c-b44d-7bfa78216ba3; userStatus=A1; promotionId=; sku=bb999_bookstore; endCycleWhenQuestionsRemainingWasClosed=2022-06-19T07:00:00.000Z; btbHomeDashboardTooltipAnimationCount=0; isNoQuestionAskedModalClosed=true; accessToken=34ceed9609a07bd0238a74b5650d5c5362990498; bartlebyRefreshTokenExpiresAt=2022-07-16T12:37:57.217Z; btbHomeDashboardAnimationTriggerDate=2022-06-17T12:39:25.907Z; OptanonConsent=isGpcEnabled=1&datestamp=Thu+Jun+16+2022+20%3A39%3A43+GMT%2B0800+(China+Standard+Time)&version=6.32.0&isIABGlobal=false&hosts=&consentId=9432e357-0639-4883-9f99-39bed0bb5cd9&interactionCount=0&landingPath=NotLandingPage&groups=C0001%3A1%2CC0003%3A1%2CBG142%3A0%2CC0002%3A0%2CC0005%3A0%2CC0004%3A0&AwaitingReconsent=false',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'sec-gpc': '1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.115 Safari/537.36',
}

response = requests.get('https://www.bartleby.com/questions-and-answers/1.-a-given-the-lines-l-7-124-tk-13k-1-k-3-and-l-x2-3s-y-1-10s-z-3-5s-determine-the-values-of-k-if-po/b88e3e3d-bfd6-4158-8335-6a3ca420430e', cookies=cookies, headers=headers)
print(response.text)


Comments (1)

负佳期 2025-02-14 19:04:09

You did not set a user agent.

It looks like the website requires a user agent from a real browser, such as Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0.

Here is my code, which works.

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.bartleby.com/questions-and-answers/1.-a-given-the-lines-l-7-124-tk-13k-1-k-3-and-l-x2-3s-y-1-10s-z-3-5s-determine-the-values-of-k-if-po/b88e3e3d-bfd6-4158-8335-6a3ca420430e');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
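// A user agent is required. This example forwards the visitor's browser user agent
// and falls back to a fixed browser string when none is available (e.g. on the CLI).
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT'] ?? 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0');
curl_setopt($ch, CURLOPT_HEADERFUNCTION, 'headerFunction'); // Only for debugging response headers.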

$response = curl_exec($ch);

if (curl_errno($ch)) {
    echo 'cURL error: ' . curl_error($ch);
    echo '<br>';
    exit();
}

echo '<hr>' . PHP_EOL;
echo '<h4>cURL response body</h4>' . PHP_EOL;
echo $response;
curl_close($ch);

unset($ch, $response);


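/**
 * Header callback for debugging: prints each response header line.
 * cURL expects the number of bytes handled, so use strlen(), not mb_strlen().
 */
function headerFunction($ch, $header)
{
    echo $header;
    echo '<br>';
    return strlen($header);
}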
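
If echoing headers into the page is inconvenient, the same callback mechanism can collect them into an array instead. This is only a minimal sketch: the `$responseHeaders` variable and the `$collectHeader` closure are hypothetical names, not part of the code above. The callback has to return the number of bytes it was given, otherwise cURL treats it as an error and aborts the transfer.

```php
<?php
// Hypothetical store for the response header lines.
$responseHeaders = [];

// Callback for CURLOPT_HEADERFUNCTION: cURL calls it once per header line.
$collectHeader = function ($ch, string $line) use (&$responseHeaders): int {
    $trimmed = trim($line);
    if ($trimmed !== '') {
        // Keep non-empty lines such as "Content-Type: text/html".
        $responseHeaders[] = $trimmed;
    }
    // Return the byte length of the raw line, as cURL requires.
    return strlen($line);
};
```

It would be registered with curl_setopt($ch, CURLOPT_HEADERFUNCTION, $collectHeader); before calling curl_exec().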

Your code sets the request headers using the wrong array format.

curl_setopt($ch, CURLOPT_HTTPHEADER, [
    'authority' => 'www.bartleby.com',
    //...
]);

This is WRONG!
It should be...

curl_setopt($ch, CURLOPT_HTTPHEADER, [
    'authority: www.bartleby.com',
    //...
]);
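
When the headers already live in an associative array (for example, after converting a Python dict), a small helper can turn them into the "Name: value" strings that CURLOPT_HTTPHEADER expects. The function name `formatCurlHeaders` is hypothetical; this is a sketch, not part of cURL's API.

```php
<?php
/**
 * Converts ['name' => 'value'] pairs into a list of "name: value" strings.
 * Hypothetical helper for use with CURLOPT_HTTPHEADER.
 */
function formatCurlHeaders(array $headers): array
{
    $lines = [];
    foreach ($headers as $name => $value) {
        $lines[] = $name . ': ' . $value;
    }
    return $lines;
}
```

For example, curl_setopt($ch, CURLOPT_HTTPHEADER, formatCurlHeaders(['authority' => 'www.bartleby.com'])); passes the single line 'authority: www.bartleby.com'.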

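You can use $reqHeaders = curl_getinfo($ch, CURLINFO_HEADER_OUT); to debug the request headers. This only returns data if you call curl_setopt($ch, CURLINFO_HEADER_OUT, true); before curl_exec().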
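
A minimal sketch of that debugging step, wrapped in a function so the request headers and the body come back together. The function name `fetchWithRequestHeaders` and the shape of its return value are assumptions made for illustration.

```php
<?php
/**
 * Fetches a URL and returns the body plus the request headers cURL actually sent.
 * The function name and return shape are hypothetical.
 */
function fetchWithRequestHeaders(string $url, string $userAgent): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    // Must be enabled before curl_exec(), or curl_getinfo() has nothing to report.
    curl_setopt($ch, CURLINFO_HEADER_OUT, true);

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL error: ' . curl_error($ch));
    }

    return [
        'request_headers' => curl_getinfo($ch, CURLINFO_HEADER_OUT),
        'body' => $body,
    ];
}
```

Printing the 'request_headers' entry shows whether a User-Agent line was really sent.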

Your current code did not send a user-agent at all, which is why it doesn't work.
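
The file_get_contents() attempt in the question has the same gap: its stream context sets only SSL options, so no browser-like user agent is sent. Below is a sketch of a context that does send one, using the http wrapper's user_agent and header options. The function name `buildContextOptions` is hypothetical, and whether this alone avoids the 503 on this particular site is an assumption that was not tested here.

```php
<?php
/**
 * Builds stream context options that send a browser-like User-Agent.
 * The function name is hypothetical.
 */
function buildContextOptions(string $userAgent): array
{
    return [
        'http' => [
            'method' => 'GET',
            'user_agent' => $userAgent,
            // Extra request headers, each as a "Name: value" string.
            'header' => [
                'Accept: text/html,application/xhtml+xml',
                'Accept-Language: en-US;q=0.6',
            ],
        ],
    ];
}
```

It would be used as $response = file_get_contents($url, false, stream_context_create(buildContextOptions($userAgent)));. Certificate verification stays at its default here instead of being disabled.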
