Strange characters when extracting text from HTML

Posted on 2024-11-19 17:47:44

I am extracting some text from HTML that is passed in as a string. The extracted text comes out garbled: it should be Arabic, but it shows strange characters instead. I have commented the code to make it easier to follow. Overall, the code detects the character set of the HTML it is given (e.g. UTF-8 or windows-1256) and then loads the document accordingly. It then walks the DOM nodes in a loop to find the required HTML elements and extracts the text needed from each one.

The problem is that the two statements inside the if block work:

    $html = @iconv('windows-1256', 'windows-1256', $html);
    @$doc->loadHTML($this->metaUtf8 . $html);

The following statement, which is commented out right after them, produces gibberish text instead. It should not, and it should work without the two statements above. What could be the reason?

    //@$doc->loadHTML($this->metaWindows1256 . $html);

The code:

    // Strings that are prepended to the HTML when loading the document
    public $metaWindows1256 = '<meta http-equiv="Content-Type" content="text/html; charset=windows-1256"/>';
    public $metaUtf8 = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>';

    // Extract the character set of the HTML passed in $html
    preg_match('@<meta\s+http-equiv="Content-Type"\s+content="([\w/]+)(;\s+charset=([^\s"]+))?@i', $html, $matches);
    if (isset($matches[3]))
    {
        $charset = $matches[3];
    }

    $doc = new DOMDocument();
    if (!($charset == 'UTF-8') && !($charset == 'utf-8'))
    {
        $html = @iconv('windows-1256', 'windows-1256', $html);
        @$doc->loadHTML($this->metaUtf8 . $html);
        //@$doc->loadHTML($this->metaWindows1256 . $html);
    }
    else
    {
        echo 'LOADING UTF';
        @$doc->loadHTML($this->metaUtf8 . $html);
    }

    foreach ($doc->getElementsByTagName($element_tagname) as $element)
    {
        if (substr_count($element->getAttribute($attribute), $value) != 0) // if the title of the div contains 'post_message'
        {
            $found_element[] = $element->getAttribute('href');
            $found_element[] = $element->nodeValue;
            $found_elements[] = $found_element;
            unset($found_element);
        }
    }

Comments (1)

2024-11-26 17:47:44

I found out that I was already converting the HTML from windows-1256 to UTF-8 in another part of the code. When I then checked the character set again using the HTML's meta tag, it of course still said windows-1256, even though the content had already been converted to UTF-8. So I was converting it to UTF-8 a second time, which produced the strange characters.
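
For anyone hitting the same thing, here is a minimal sketch of one way to guard against converting twice. The helper name `toUtf8Once` and the sample string are hypothetical and not part of the code above. It assumes the mbstring and iconv extensions are available, and that bytes which are already valid UTF-8 can be treated as "already converted" (Arabic text stored as windows-1256 bytes is practically never valid UTF-8).

```php
<?php
// Hypothetical helper: convert to UTF-8 only when the bytes are not already
// valid UTF-8, so calling it a second time leaves the string unchanged.
function toUtf8Once(string $html, string $declaredCharset): string
{
    // Assumption: valid UTF-8 means an earlier step already converted it,
    // even if the <meta> tag in the markup still names the old charset.
    if (mb_check_encoding($html, 'UTF-8')) {
        return $html;
    }
    $converted = iconv($declaredCharset, 'UTF-8', $html);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert from $declaredCharset to UTF-8");
    }
    return $converted;
}

// Sample Arabic word written with code point escapes, so it is UTF-8 here.
$arabic = "\u{0645}\u{0631}\u{062D}\u{0628}\u{0627}";
// Simulate the bytes of the original windows-1256 page.
$legacy = iconv('UTF-8', 'windows-1256', $arabic);

$once  = toUtf8Once($legacy, 'windows-1256'); // converted to UTF-8
$twice = toUtf8Once($once, 'windows-1256');   // detected as UTF-8, left alone

// Converting unconditionally a second time is what garbles the text.
$garbled = iconv('windows-1256', 'UTF-8', $once);
```

With a guard like this, `$once` and `$twice` hold the same UTF-8 string, while `$garbled` shows the effect of the double conversion. Since the meta tag inside the markup goes stale after the first conversion, the check looks at the bytes themselves rather than at the declared charset.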

Thanks anyway
