simple_html_dom garbled characters problem

Posted 2022-09-01 23:10:58 · 1829 characters · 25 views · 0 comments

I am scraping a web page with simple_html_dom. The source page is encoded as gb2312, so I convert the text to utf-8 with mb_convert_encoding:

mb_convert_encoding($innertext, 'UTF-8', 'GB2312');

URL being scraped

http://www.cba.gov.cn/cbastats/teamdetail.aspx?id=Te013

Core code

$shd = new simple_html_dom();
$shd->load_file('http://www.cba.gov.cn/cbastats/teamdetail.aspx?id=Te013');
$playerNodes = $shd->find('table#DataGrid2 tr');
unset($playerNodes[0]);
foreach ($playerNodes as $ky => $playerNode) {
    // Locate the nodes
    $playerNodeTds = $playerNode->children();
    $playerNodeA = $playerNodeTds[0]->children();
    $playerId = explode('=', $playerNodeA[0]->href);
    
    // Extract the content
    $player['team_id'] = $team['team_id'];
    $player['player_id'] = $playerId[1];
    $player['player_name'] = mb_convert_encoding($playerNodeA[0]->innertext, 'UTF-8', 'GB2312');
    $player['number'] = $playerNodeTds[1]->innertext;
    $player['birthday'] = mb_convert_encoding($playerNodeTds[2]->innertext, 'UTF-8', 'GB2312');
    $player['position']  = mb_convert_encoding($playerNodeTds[3]->innertext, 'UTF-8', 'GB2312');
    $player['height'] = $playerNodeTds[4]->innertext;
    $player['weight'] = $playerNodeTds[5]->innertext;

    var_dump($player['player_name']);
}

Result


string(6) "李航"
string(9) "周启新"
string(6) "郭磊"
string(9) "赵泰隆"
string(4) "孙?"
string(9) "孙伟博"
string(9) "谢亚财"
string(9) "贾俊龙"
string(9) "陈林坚"
string(9) "王哲林"
string(9) "黄毅超"
string(9) "王增杰"
string(17) "法迪·哈提布"
string(16) "杰里米-泰勒"
string(20) "德怀特·拜克斯"

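As you can see, string(4) "孙?" was not converted; the correct name is 孙喆.

Most of the text converts correctly, but some characters fail and come out as a question mark (?). Does anyone know a good way to fix this?

Thanks a lot!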


Comments (1)

浅唱々樱花落 2022-09-08 23:10:58

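Try it like this. 喆 is a rare character: it is not in GB2312, but it does exist in GBK. The GB18030 character set is compatible with GBK.

Noting down a collection of rare characters for reference: 劼,晅,虞,崟,珺,祎,鏐,勍,璟,芃,夐,昱,昉,昳,旸,睿,崑,翀,弋,嬿,贇,喆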
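To make the point about GB2312 versus GBK concrete, here is a minimal sketch. It assumes the file is saved as UTF-8 and the mbstring extension is loaded, and it simulates the page's bytes rather than fetching them; the variable names are only for illustration.

```php
<?php
// Assumption: this source file is UTF-8 and mbstring is available.
$name = '孙喆';

// Simulate the bytes a GBK-encoded page would send for this name.
$gbkBytes = mb_convert_encoding($name, 'GBK', 'UTF-8');

// Decoding with the narrower GB2312 label cannot recover 喆 ...
$viaGb2312 = mb_convert_encoding($gbkBytes, 'UTF-8', 'GB2312');

// ... while decoding the same bytes as GBK restores the original string.
$viaGbk = mb_convert_encoding($gbkBytes, 'UTF-8', 'GBK');

var_dump($viaGb2312 === $name); // bool(false)
var_dump($viaGbk === $name);    // bool(true)
```

The only difference between the two calls is the source-encoding label, which is why changing 'GB2312' to 'GBK' in the original question is likely enough on its own.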


<?php
set_time_limit(0);

require_once('simple_html_dom.php');

//$page = file_get_contents('http://www.cba.gov.cn/cbastats/teamdetail.aspx?id=Te013');
//$page=iconv('GBK', 'UTF-8', $page);
//echo $page;exit();
$shd = new simple_html_dom();
//$shd->load($page);
$shd->load_file('http://www.cba.gov.cn/cbastats/teamdetail.aspx?id=Te013');
$playerNodes = $shd->find('table#DataGrid2 tr');
unset($playerNodes[0]);
foreach ($playerNodes as $ky => $playerNode) {
    // Locate the nodes
    $playerNodeTds = $playerNode->children();
    $playerNodeA = $playerNodeTds[0]->children();
    $playerId = explode('=', $playerNodeA[0]->href);
    
    // Extract the content
    $player['team_id'] = $team['team_id'];
    $player['player_id'] = $playerId[1];
    // Decode text fields as GBK (a superset of GB2312) so rare characters such as 喆 survive
    $player['player_name'] = mb_convert_encoding($playerNodeA[0]->innertext, 'UTF-8', 'GBK');
    $player['number'] = $playerNodeTds[1]->innertext;
    $player['birthday'] = mb_convert_encoding($playerNodeTds[2]->innertext, 'UTF-8', 'GBK');
    $player['position']  = mb_convert_encoding($playerNodeTds[3]->innertext, 'UTF-8', 'GBK');
    $player['height'] = $playerNodeTds[4]->innertext;
    $player['weight'] = $playerNodeTds[5]->innertext;

    var_dump($player['player_name']);
}
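
The commented-out lines in the code above hint at an alternative: convert the whole page to UTF-8 once, then hand it to simple_html_dom, so no per-field conversion is needed. A rough sketch of that idea follows. `pageToUtf8` is a hypothetical helper, not part of simple_html_dom, and the sample HTML is a stand-in for the real page so the snippet runs without network access.

```php
<?php
/**
 * Hypothetical helper: convert a downloaded page to UTF-8 in one step.
 * The default source encoding of GBK is an assumption based on the answer above.
 */
function pageToUtf8(string $html, string $from = 'GBK'): string
{
    return mb_convert_encoding($html, 'UTF-8', $from);
}

// Stand-in for the downloaded page: a small fragment encoded as GBK.
$rawPage = mb_convert_encoding(
    '<table id="DataGrid2"><tr><td>孙喆</td></tr></table>',
    'GBK',
    'UTF-8'
);
$utf8Page = pageToUtf8($rawPage);

// With the real page, the flow from the commented-out lines would be roughly:
//   $page = file_get_contents($url);
//   $shd = new simple_html_dom();
//   $shd->load(pageToUtf8($page));
var_dump(strpos($utf8Page, '孙喆') !== false); // bool(true)
```

Converting once up front keeps the scraping loop free of encoding calls, at the cost of holding the full page in memory as a string before parsing.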