Special characters when fetching HTML table data
I have a PHP scraping script that fetches HTML table content from another website. The script does not fetch the HTML special characters (tags), which causes the content to look unformatted.
How can I modify the following code so that it fetches the HTML special characters, including all tags?
Complete code:
<?php
error_reporting(E_ERROR);
set_time_limit(0);
function createRSSFile($tag,$value,$data)
{
    # Returns the value wrapped in an element named after the tag.
    $tag=strtolower(str_replace(" ","_",$tag));
    $tag=strtolower(str_replace(":","",$tag));
    $tag=strtolower(str_replace("&","and",$tag));
    // $returnITEM = "<".$tag.">".htmlspecialchars(str_replace(" 00:00:00","",$value))."</".$tag.">";
    $returnITEM = "<".$tag.">".htmlspecialchars(str_replace("â¢","<br/><br/> ",$value))."</".$tag.">";
    return $returnITEM;
}
// function extraFields($data){
//print_r($data);
// $returnITEM = "<".strtolower(str_replace(" ","_",$data[18][0])).">".htmlspecialchars($data[18][1])."</".strtolower(str_replace(" ","_",$data[18][0])).">";
// $returnITEM = "<".strtolower(str_replace("&","or",$data[19][0])).">".htmlspecialchars($data[19][1])."</".strtolower(str_replace("&","or",$data[19][0])).">";
// $returnITEM .= "<".strtolower(str_replace(" ","_",$data[20][0])).">".htmlspecialchars($data[20][1])."</".strtolower(str_replace(" ","_",$data[20][0])).">";
// $returnITEM .= "<".strtolower(str_replace(" ","_",$data[22][0])).">".htmlspecialchars($data[23][0])."</".strtolower(str_replace(" ","_",$data[22][0])).">";
// $returnITEM .= "<".strtolower(str_replace(" ","_",$data[24][0])).">".htmlspecialchars($data[25][0])."</".strtolower(str_replace(" ","_",$data[24][0])).">";
// $returnITEM .= "<".strtolower(str_replace(" ","_",$data[26][0])).">".htmlspecialchars($data[26][1])."</".strtolower(str_replace(" ","_",$data[26][0])).">";
// preg_match('/[a-z0-9]+([_\\.-][a-z0-9]+)*@([a-z0-9]+([\.-][a-z0-9]+)*)+\\.[a-z]{2,}/i',$data[25][0],$email);
// $email=$email[0];
// $returnITEM .= "<email>".$email."</email>";
// return $returnITEM;
// }
function fileRead(){
    $filename = "count.txt";
    $handle = fopen($filename, "r");
    $contents = fread($handle, filesize($filename));
    fclose($handle);
    return $contents;
}
function fileWrite ($val) {
    $filename = 'count.txt';
    $somecontent = $val;
    if (is_writable($filename)) {
        if (!$handle = fopen($filename, 'w')) {
            echo "Cannot open file ($filename)";
            exit;
        }
        if (fwrite($handle, $somecontent) === FALSE) {
            echo "Cannot write to file ($filename)";
            exit;
        }
        fclose($handle);
    } else {
        echo "The file $filename is not writable";
    }
}
function fetchData($jobid) {
    $html=file_get_contents('http://acbar.org/JobDetail.aspx?id='.$jobid);
    $html=str_replace("<td></td>", "",$html);
    $html=str_replace("<td style=\"font-size:8pt;font-weight:bold;\"></td>","<td style=\"font-size:8pt;font-weight:bold;\">Null</td>",$html);
    $html=str_replace("<td style=\"font-size:8pt;font-weight:bold;\" colspan=\"2\" ></td>","<td style=\"font-size:8pt;font-weight:bold;\" colspan=\"2\" >Null</td>",$html);
    $html=str_replace("&nbsp;", " ",$html);
    $html=str_replace("", "<br>",$html);
    $html=str_replace("<br>", "_br_",$html);
    // $html=str_replace("\â\u","'",$html);
    $dom = new DOMDocument;
    $dom->loadHTML( $html );
    //echo $dom->saveHTML();
    //exit;
    $rows = array();
    foreach( $dom->getElementsByTagName( 'tr' ) as $tr ) {
        $cells = array();
        foreach( $tr->getElementsByTagName( 'td' ) as $td ) {
            if(trim($td->nodeValue)!='')
                $cells[] = str_replace("br","<br>",trim($td->nodeValue));
        }
        if(sizeof($cells)>0)
            $rows[] = $cells;
    }
    for($i=0;$i<0;$i++)
        array_shift ($rows);
    // echo "<pre>"; print_r($rows); echo "</pre>";
    // exit;
    if($rows[0][1]=="")
        return false;
    else
        return $rows;
}
// Let's build the page
$latestBuild = date("r");
// Define the type of document we're creating.
$createXML ="<?xml version=\"1.0\" encoding=\"UTF-8\" ?>";
$createXML .= "<rss version=\"0.92\">";
$createXML .= "<channel>
<title>Job List</title>
<link>http://acbar.org</link>
<description>Job List</description>
<lastBuildDate>$latestBuild</lastBuildDate>
<language>en</language>";
$startFrom=fileRead();
$startFrom=$startFrom+1;
$endWith=$startFrom+3;
for($jid=$startFrom;$jid<$endWith;$jid++) {
    $data=fetchData($jid);
    if(!$data)
        break;
    $srcurl='http://acbar.org/JobDetail.aspx?id='.$jid;
    $createXML .= '<item><sourceurl>'.htmlspecialchars($srcurl).'</sourceurl>';
    for($i=0;$i<23;$i++)
    {
        $tag=$data[$i][0];
        $value=$data[$i][1];
        $createXML .= createRSSFile($tag,$value,$data);
    }
    // $extra=extraFields($data);
    // $createXML .= $extra;
    $createXML .= "</item>";
    // fileWrite($jid);
}
// preg_match('/[a-z0-9]+([_\\.-][a-z0-9]+)*@([a-z0-9]+([\.-][a-z0-9]+)*)+\\.[a-z]{2,}/i',$data[26][1],$email);
// $email=$email[0];
header("content-type: text/xml");
echo $createXML .= "</channel></rss>";
?>
Comments (1)
This works for me...
EDIT: added
$html = mb_convert_encoding($html, "HTML-ENTITIES", "UTF-8");
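For readers who also want the tags inside each cell: `$td->nodeValue` returns only the text of a cell, so any `<br>` or `<b>` inside it is dropped before the feed is built. The sketch below is one possible way to combine the encoding step above with a tag-preserving read; it is not the only approach. `innerHTML()` and `tableCells()` are hypothetical helper names, the sample markup is made up, and the fetched page is assumed to be UTF-8. `mb_encode_numericentity()` is swapped in for the `HTML-ENTITIES` conversion because that target is deprecated as of PHP 8.2; for `loadHTML()` the effect should be the same.

```php
<?php
// Hypothetical helper: returns the markup inside a node, tags included.
function innerHTML(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML($child) serializes a single node and keeps its tags.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return trim($html);
}

// Hypothetical helper: parses UTF-8 HTML and returns rows of cell markup.
function tableCells(string $html): array
{
    // Encode non-ASCII characters as numeric entities so loadHTML()
    // does not misread the UTF-8 bytes as ISO-8859-1.
    $html = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();

    $rows = [];
    foreach ($dom->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->getElementsByTagName('td') as $td) {
            $cell = innerHTML($td);     // markup, not just text
            if ($cell !== '') {
                $cells[] = $cell;
            }
        }
        if (count($cells) > 0) {
            $rows[] = $cells;
        }
    }
    return $rows;
}

// Made-up sample standing in for the fetched page.
$sample = "<table><tr><td>Duties</td><td>\u{2022} Plan<br>\u{2022} <b>Report</b></td></tr></table>";
$rows = tableCells($sample);
echo $rows[0][1], "\n";
```

In `fetchData()` this would take the place of the `nodeValue` line and would make the `_br_` placeholder replacements unnecessary. `createRSSFile()` already runs every value through `htmlspecialchars()`, so the kept tags end up escaped in the XML output.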