â�� after my HTML has been purified

Posted on 2024-08-26 11:47:12

I have a database whose table structure was poor, so I am rebuilding it and porting some of the data from one table to another. The data appears to have been copy-pasted from a Microsoft Office product, so as I fetch it I clean it up with HTMLPurifier and some str_replace calls in PHP. Here is the clean function:

   function clean_html($html) {
    $config = HTMLPurifier_Config::createDefault();
    $config->set('AutoFormat','RemoveEmpty',true);
    $config->set('HTML','AllowedAttributes','href,src');
    $config->set('HTML','AllowedElements','p,em,strong,a,ul,li,ol,img');
    $purifier = new HTMLPurifier($config);

    $html = $purifier->purify($html);

    $html = str_replace(' ',' ',$html);
    $html = str_replace("\r",'',$html);
    $html = str_replace("\n",'',$html);
    $html = str_replace("\t",'',$html);
    $html = str_replace('  ',' ',$html);
    $html = str_replace('<p> </p>','',$html);
    $html = str_replace(chr(160),' ',$html);

    return trim($html);
   }

However, when I put the results into my new table and output them in CKEditor, I still get those three characters.

I also have a JavaScript function that is called to remove special characters from the CKEditor content. It doesn't clean them up either:

  function remove_special(str) {
    var rExps=[ /[\xC0-\xC2]/g, /[\xE0-\xE2]/g,
    /[\xC8-\xCA]/g, /[\xE8-\xEB]/g,
    /[\xCC-\xCE]/g, /[\xEC-\xEE]/g,
    /[\xD2-\xD4]/g, /[\xF2-\xF4]/g,
    /[\xD9-\xDB]/g, /[\xF9-\xFB]/g,
    /\xD1/,/\xF1/g,
    "/[\u00a0|\u1680|[\u2000-\u2009]|u200a|\u200b|\u2028|\u2029|\u202f|\u205f|\u3000|\xa0]/g", 
    /\u000b/g,'/[\u180e|\u000c]/g',
    /\u2013/g, /\u2014/g,
    /\xa9/g,/\xae/g,/\xb7/g,/\u2018/g,/\u2019/g,/\u201c/g,/\u201d/g,/\u2026/g];
    var repChar=['A','a','E','e','I','i','O','o','U','u','N','n',' ','\t','','-','--','(c)','(r)','*',"'","'",'"','"','...'];

    for(var i=0; i<rExps.length; i++) {
        str=str.replace(rExps[i],repChar[i]);
    }

    for (var x = 0; x < str.length; x++) {
        charcode = str.charCodeAt(x);
        if ((charcode < 32 || charcode > 126) && charcode !=10 && charcode != 13) {
            str = str.replace(str.charAt(x), "");
        }
    }
    return str;
  }

Does anyone know offhand what I need to do to get rid of them? I think they may be some sort of quote character.

Comments (2)

在你怀里撒娇 2024-09-02 11:47:12

Your character encodings are all out of whack. To me, â�� indicates a three-byte UTF-8 encoded character that is being read with the wrong encoding.

Some things you need to find out:

  • What was the encoding of the old table?
  • What is the encoding of the new table?
  • What is the encoding of the page that displays CKEditor?

It looks like HTMLPurifier's default is UTF-8, so you really need to be aware of the encoding of your data!
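
To see why a single pasted character can turn into three, here is a minimal JavaScript sketch. It assumes the stray characters come from UTF-8 bytes being read one byte per character (ISO-8859-1 style), which may or may not be what happened in this particular database. The function names `misreadUtf8AsLatin1` and `repairLatin1Mojibake` are hypothetical helpers for illustration, not part of HTMLPurifier or CKEditor.

```javascript
// Hypothetical helper: encode text as UTF-8, then misread every byte
// as one ISO-8859-1 character (byte value == code point).
function misreadUtf8AsLatin1(text) {
  const bytes = new TextEncoder().encode(text); // TextEncoder always emits UTF-8
  return Array.from(bytes, (b) => String.fromCharCode(b)).join('');
}

// Hypothetical helper: reverse the damage, assuming every character in the
// garbled string really is a single byte (code point <= 0xFF).
function repairLatin1Mojibake(garbled) {
  const bytes = Uint8Array.from(garbled, (ch) => ch.charCodeAt(0) & 0xff);
  return new TextDecoder('utf-8').decode(bytes);
}

// A right single quotation mark (U+2019) is the bytes E2 80 99 in UTF-8.
const garbled = misreadUtf8AsLatin1('\u2019');
console.log(garbled.length);                          // 3 characters instead of 1
console.log(garbled.charCodeAt(0).toString(16));      // e2, which displays as "â"
console.log(repairLatin1Mojibake(garbled) === '\u2019');
```

The first byte, 0xE2, displays as "â", and the following two bytes land in a range with no printable glyph, which is consistent with the "â" plus two unknown characters in the question. The repair only works if the bytes were never altered after being misread, so fixing the encoding at the source (tables, connection, page) is the more reliable route.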

来日方长 2024-09-02 11:47:12

I had a similar issue: "php remove/identify this symbol �".

The character � is the REPLACEMENT CHARACTER (U+FFFD, see http://unicode.org/charts/PDF/UFFF0.pdf). It is used when there is an error in a UTF-encoded sequence:

FFFD � REPLACEMENT CHARACTER

 - used to replace an incoming character whose value 
   is unknown or unrepresentable in Unicode

In most cases it means that some data is being interpreted as a UTF encoding when it was actually encoded with a different one.
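
As an illustration of that mismatch, the sketch below decodes bytes that are valid Windows-1252 as if they were UTF-8. The byte values and the helper names `decodeAsUtf8` and `isValidUtf8` are assumptions chosen for the example; they are not taken from the original data.

```javascript
// "hi" wrapped in smart quotes, as Windows-1252 bytes: 0x93 h i 0x94.
const cp1252Bytes = new Uint8Array([0x93, 0x68, 0x69, 0x94]);

// Hypothetical helper: lenient UTF-8 decoding, which substitutes U+FFFD
// for every byte sequence that is not valid UTF-8.
function decodeAsUtf8(bytes) {
  return new TextDecoder('utf-8').decode(bytes);
}

// Hypothetical helper: strict check, using the decoder's fatal mode,
// which throws instead of substituting U+FFFD.
function isValidUtf8(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(bytes);
    return true;
  } catch (e) {
    return false;
  }
}

console.log(decodeAsUtf8(cp1252Bytes) === '\ufffdhi\ufffd'); // quotes became U+FFFD
console.log(isValidUtf8(cp1252Bytes));                       // false
```

Once a byte has been replaced by U+FFFD the original character is gone, so a strict check like this is most useful before the data is written to the new table.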

My problem was pasting text from Microsoft Office products into HTML or into a database. The biggest offenders seem to be the em dash and smart quotes.
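
If the goal is simply to replace that Office punctuation with plain ASCII, one possible approach is a single lookup table, sketched here in JavaScript. The function name `normalizeOfficePunctuation` and the choice of replacements are assumptions for illustration (the replacements mirror the ones listed in the question's `repChar` array), and the sketch only helps when the text has been decoded correctly in the first place.

```javascript
// Hypothetical mapping from common word-processor punctuation to ASCII.
const PUNCTUATION_MAP = {
  '\u2018': "'",   // left single quotation mark
  '\u2019': "'",   // right single quotation mark
  '\u201c': '"',   // left double quotation mark
  '\u201d': '"',   // right double quotation mark
  '\u2013': '-',   // en dash
  '\u2014': '--',  // em dash
  '\u2026': '...', // horizontal ellipsis
  '\u00a0': ' ',   // no-break space
};

// Hypothetical helper: replace mapped punctuation, then drop any U+FFFD
// left over from an earlier decoding error (that data cannot be recovered).
function normalizeOfficePunctuation(text) {
  return text
    .replace(/[\u2018\u2019\u201c\u201d\u2013\u2014\u2026\u00a0]/g, (ch) => PUNCTUATION_MAP[ch])
    .replace(/\ufffd/g, '');
}

console.log(normalizeOfficePunctuation('\u201cIt\u2019s here\u201d \u2014 ok\u2026'));
```

Using real regular expression literals with the `g` flag matters here: a pattern written as a quoted string is treated by `replace` as literal text, so it never matches the characters it appears to describe.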
