How can I make sure cURL in PHP fetches UTF-8 characters correctly?

Posted 2024-07-29 17:37:01

I am scraping webpages (using PHP's cURL) that have accented characters (like "é").
In the source of those webpages, those characters are written in UTF-8 (they are not HTML-encoded).

However, when the result is produced using the following code, I get question marks instead of the accented characters.

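$ch = curl_init();
$timeout = 5; // connection timeout, in seconds
curl_setopt($ch, CURLOPT_URL, $website);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string instead of printing it
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$file = curl_exec($ch); // raw response bytes, or false on failure
curl_close($ch);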

The header info returned by the scraped webpage reports the content as "html/text", with no indication that it is UTF-8 encoded. I've tried using the CURLOPT_HTTPHEADER option to change the text encoding, but that has no effect.

What am I missing?
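
For reference, here is a minimal sketch of one way to check what encoding the response actually declares and to normalize it to UTF-8. It is not a confirmed fix for the problem above. The helper names `detect_charset`, `to_utf8`, and `fetch_utf8` are hypothetical, and falling back to ISO-8859-1 when nothing is declared and the bytes are not valid UTF-8 is an assumption. Note that CURLOPT_HTTPHEADER only sets request headers, so it would not be expected to change how the response body is encoded.

```php
<?php
// Hypothetical helper: read the charset from a Content-Type header value,
// falling back to a <meta ... charset=...> declaration in the HTML itself.
function detect_charset(?string $contentType, string $body): ?string
{
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([\w.:-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)/i', $body, $m)) {
        return strtoupper($m[1]);
    }
    return null; // nothing declared anywhere
}

// Hypothetical helper: return the body as UTF-8.
function to_utf8(string $body, ?string $charset): string
{
    if ($charset === null) {
        // preg_match with the /u flag succeeds only on valid UTF-8 input.
        if (preg_match('//u', $body) === 1) {
            return $body;
        }
        $charset = 'ISO-8859-1'; // assumption: legacy Western European page
    }
    if (strtoupper($charset) === 'UTF-8') {
        return $body;
    }
    $converted = iconv($charset, 'UTF-8', $body);
    // iconv() returns false on failure; keep the original bytes in that case.
    return $converted === false ? $body : $converted;
}

// Hypothetical wrapper around the request from the question.
function fetch_utf8(string $url): string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    $body = curl_exec($ch);
    $error = curl_error($ch);
    // Content-Type header of the response, if the server sent one.
    $contentType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    curl_close($ch);

    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . $error);
    }
    return to_utf8($body, detect_charset(is_string($contentType) ? $contentType : null, $body));
}
```

A call such as `$file = fetch_utf8($website);` would then stand in for the original request. If the fetched bytes are already valid UTF-8 and question marks still appear, the cause may be on the output side instead, for example the script's own page not declaring UTF-8; sending `header('Content-Type: text/html; charset=utf-8');` before echoing is one thing to try.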
