Converting smart quotes and other entities to their exact form
I have been working on converting strings to PDFs. A typical problem is the occasional 'smart quote' or other UTF-8 character that turns into a sequence of ISO characters such as –, â€™, â€œ, â€, etc. The function defined below solves this by encoding such characters as HTML entities, but of course a PDF is not HTML. When the input string contains â€™ in place of an apostrophe ’, the function converts it to &#8217;. That would be great if we were dealing with HTML, but a PDF treats the entity as a literal string, so it is never converted to its exact form. How, then, does one convert the HTML entity to the character's exact form?
// Encode every non-ASCII UTF-8 character in $str as a numeric HTML entity.
function htmlallentities($str){
    $res = '';
    $strlen = strlen($str);
    for($i=0; $i<$strlen; $i++){
        $byte = ord($str[$i]);
        if($byte < 128) { // 1-byte char: plain ASCII, copy as-is
            $res .= $str[$i];
        } elseif($byte < 192) { // invalid utf8: stray continuation byte, skipped
        } elseif($byte < 224) { // 2-byte char
            $res .= '&#'.((63&$byte)*64 + (63&ord($str[++$i]))).';';
        } elseif($byte < 240) { // 3-byte char
            $res .= '&#'.((15&$byte)*4096 + (63&ord($str[++$i]))*64 + (63&ord($str[++$i]))).';';
        } elseif($byte < 248) { // 4-byte char
            $res .= '&#'.((15&$byte)*262144 + (63&ord($str[++$i]))*4096 + (63&ord($str[++$i]))*64 + (63&ord($str[++$i]))).';';
        }
    }
    return $res;
}
(With thanks to @Floern, https://stackoverflow.com/a/4583465/810821)
If I have used incorrect terminology, my apologies.
Thank you in advance.
Comments (1)
If the smart apostrophe (’) becomes â€™, then the problem is that UTF-8 encoded data is being interpreted as a byte sequence in windows-1252 encoding. Instead of trying to fix this after the data has been messed up, you should find and fix the part of the code that causes the wrong interpretation.
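To make the mechanism concrete, here is a minimal sketch of how the misinterpretation arises and how a numeric entity can be turned back into the actual character. It assumes PHP 7 or later with the mbstring extension enabled; the variable names are only for illustration, and it does not show where in a particular PDF pipeline the wrong encoding is applied.

```php
<?php
// Sketch only: assumes PHP 7+ with the mbstring extension loaded.

// U+2019 RIGHT SINGLE QUOTATION MARK; in UTF-8 this is the bytes E2 80 99.
$apostrophe = "\u{2019}";

// Reproduce the problem: read those UTF-8 bytes as if they were windows-1252
// and re-encode the result as UTF-8. Each byte becomes its own character.
$garbled = mb_convert_encoding($apostrophe, 'UTF-8', 'Windows-1252');
echo $garbled, PHP_EOL; // three characters: U+00E2, U+20AC, U+2122

// Reversing the same conversion recovers the original bytes. This is a
// repair of already damaged data, not a substitute for fixing the source.
$repaired = mb_convert_encoding($garbled, 'Windows-1252', 'UTF-8');
var_dump($repaired === $apostrophe);

// A numeric entity such as the one htmlallentities() produces can be
// decoded back to the real character with html_entity_decode().
$decoded = html_entity_decode('&#8217;', ENT_QUOTES | ENT_HTML5, 'UTF-8');
var_dump($decoded === $apostrophe);
```

If the PDF library expects a single-byte encoding rather than UTF-8, the cleaner fix is probably to convert the string once, explicitly, at the point where it is handed to the library, instead of encoding it as entities and decoding it again.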