Converting smart quotes and other entities to their exact form

Posted on 2025-01-04 15:24:14

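I have been working on converting strings to PDFs. A typical problem is the occasional 'smart quote' or other UTF-8 character that ends up as a sequence of ISO characters such as â€“, â€™, â€œ, â€, etc. The function defined below solves that problem by encoding such characters as HTML entities; however, a PDF is of course not HTML. When given a string containing ’ (which would otherwise show up as â€™) in place of an apostrophe ', the function converts it to &#8217;. That would be great if we were dealing with HTML, but the PDF treats it as a literal string, so it is never converted to its exact form. How, then, does one convert the HTML entity to the character's exact form?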

// Walks a UTF-8 string byte by byte and replaces every multi-byte
// character with its numeric HTML entity (&#NNNN;).
function htmlallentities($str){
    $res = '';
    $strlen = strlen($str);
    for($i=0; $i<$strlen; $i++){
        $byte = ord($str[$i]);
        if($byte < 128) { // 1-byte char: plain ASCII, copied unchanged
            $res .= $str[$i];
        } elseif($byte < 192) { // invalid utf8: stray continuation byte, dropped
        } elseif($byte < 224) { // 2-byte char
            $res .= '&#'.((63&$byte)*64 + (63&ord($str[++$i]))).';';
        } elseif($byte < 240) { // 3-byte char
            $res .= '&#'.((15&$byte)*4096 + (63&ord($str[++$i]))*64 + (63&ord($str[++$i]))).';';
        } elseif($byte < 248) { // 4-byte char
            $res .= '&#'.((15&$byte)*262144 + (63&ord($str[++$i]))*4096 + (63&ord($str[++$i]))*64 + (63&ord($str[++$i]))).';';
        }
    }
    return $res;
}

(With thanks to @Floern, https://stackoverflow.com/a/4583465/810821)
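
To make the round trip in question concrete, here is a minimal sketch, assuming PHP's built-in html_entity_decode() is an acceptable way to get from the entity back to the character. The sample string is hypothetical, and the encoded form is written out by hand rather than produced by calling htmlallentities(). Whether the PDF library then renders the resulting UTF-8 bytes correctly depends on its font and encoding support, which is not covered here.

```php
<?php
// Hypothetical input: "It's" written with a right single quote (U+2019),
// which is the three UTF-8 bytes E2 80 99.
$input = "It\u{2019}s";

// What htmlallentities() yields for that input:
// (15 & 0xE2) * 4096 + (63 & 0x80) * 64 + (63 & 0x99) = 8217.
$encoded = 'It&#8217;s';

// html_entity_decode() turns numeric entities back into characters;
// the third argument asks for the result as UTF-8.
$decoded = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');

var_dump($decoded === $input); // bool(true)
```

The decoded string is byte-for-byte the original UTF-8 input, so this only helps if the PDF step can accept UTF-8 (or the string is converted to whatever encoding that step expects).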

My apologies if I have used incorrect terminology.

Thank you in advance.

