How can I make sure UTF-8 characters are scraped correctly with cURL in PHP?
I am scraping webpages (using PHP's cURL functions) that contain accented characters (like "é").
In the source of those webpages, the characters are written as raw UTF-8 (they are not HTML-encoded).
However, when I output the result of the following code, I get question marks instead of the accented characters.
$ch = curl_init();
$timeout = 5; // connection timeout in seconds
curl_setopt($ch, CURLOPT_URL, $website);
// Return the response body as a string instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$file = curl_exec($ch);
curl_close($ch);
The headers returned by the scraped webpage report the content type as "html/text", with no indication that the body is UTF-8 encoded. I've tried using the CURLOPT_HTTPHEADER option to change the text encoding, but it has no effect.
What am I missing?
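For anyone debugging the same symptom, a first step is probably to check what the server actually declares and whether the bytes cURL returns are valid UTF-8. The following is only a minimal sketch: the helper names (charsetFromContentType, isValidUtf8, fetchWithCharsetInfo) are made up for illustration, and it assumes the curl extension is loaded.

```php
<?php
// Hypothetical helper: pull the charset parameter out of a Content-Type value.
function charsetFromContentType(?string $contentType): ?string
{
    if ($contentType !== null
        && preg_match('/charset\s*=\s*"?([\w.:-]+)/i', $contentType, $m)) {
        return strtolower($m[1]);
    }
    return null; // no charset was declared
}

// Hypothetical helper: PCRE's /u modifier rejects invalid UTF-8 subjects,
// so an empty pattern with /u works as a validity check.
function isValidUtf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

// Hypothetical helper: fetch a page and report what was declared and received.
function fetchWithCharsetInfo(string $url): array
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    $body = curl_exec($ch);
    // The Content-Type header of the response, if the server sent one
    $contentType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);

    return [
        'body'             => $body,
        'declared_charset' => charsetFromContentType($contentType ?: null),
        'valid_utf8'       => is_string($body) && isValidUtf8($body),
    ];
}
```

If 'valid_utf8' comes back true, cURL delivered the bytes intact and the question marks are most likely introduced later, when the string is converted or displayed.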
As per the answer to my own question, have a look at
"characters changed in a Curl request".
Dominic Rodger's answer there just saved my day.
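The linked answer is not reproduced here, so the following is only a guess at the kind of fix that often resolves this symptom: cURL returns the UTF-8 bytes untouched, but the page that prints them does not declare UTF-8, so the browser decodes them with another charset. A minimal sketch, assuming the scraped text is valid UTF-8 (renderUtf8Page is a made-up name):

```php
<?php
// Hypothetical helper: wrap scraped UTF-8 text in a page that declares UTF-8.
function renderUtf8Page(string $scraped): string
{
    // Tell the browser how to decode the bytes we are about to send
    if (!headers_sent()) {
        header('Content-Type: text/html; charset=utf-8');
    }

    // Escape markup characters; with 'UTF-8' given, "é" passes through unchanged
    return '<!DOCTYPE html><html><head><meta charset="utf-8"></head><body>'
        . htmlspecialchars($scraped, ENT_QUOTES, 'UTF-8')
        . '</body></html>';
}
```

Usage would be something like echo renderUtf8Page($file); right after curl_exec(). If the question marks persist, the string is probably being converted somewhere between the fetch and the output.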