PHP to clean up pasted Microsoft input

Posted 2024-07-11 08:04:34

I have a site where users can post stuff (as in forums, comments, etc) using a customised implementation of TinyMCE. A lot of them like to copy & paste from Word, which means their input often comes with a plethora of associated MS inline formatting.

I can't just get rid of <span whatever> as TinyMCE relies on the span tag for some of its formatting, and I can't (and don't want to) force said users to use TinyMCE's "Paste From Word" feature (which doesn't seem to work that well anyway).

Anyone know of a library/class/function that would take care of this for me? It must be a common problem, though I can't find anything definitive. I've been thinking recently that a series of brute-force regexes looking for MS-specific patterns might do the trick, but I don't want to re-write something that may already be available unless I must.

Also, fixing of curly quotes, em-dashes, etc would be good. I have my own stuff to do this now, but I'd really just like to find one MS-conversion filter to rule them all.
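
For the curly-quote and dash part, a minimal sketch of one way to do it, assuming the input is already UTF-8 and that plain ASCII replacements are acceptable. The function name and the replacement table are illustrative assumptions, not the asker's existing code.

```php
<?php
// Hypothetical helper: maps common "smart" punctuation from Word to ASCII.
// Assumes UTF-8 input; Windows-1252 input would need converting first.
function normalize_word_punctuation(string $text): string
{
    return strtr($text, [
        "\u{2018}" => "'",   // left single quote
        "\u{2019}" => "'",   // right single quote / apostrophe
        "\u{201C}" => '"',   // left double quote
        "\u{201D}" => '"',   // right double quote
        "\u{2013}" => '-',   // en dash
        "\u{2014}" => '--',  // em dash
        "\u{2026}" => '...', // horizontal ellipsis
        "\u{00A0}" => ' ',   // non-breaking space
    ]);
}
```

strtr() with an array replaces every key in one pass, so a replacement is never re-scanned and replaced again.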

Comments (4)

夏雨凉 2024-07-18 08:04:34

HTML Purifier will create standards-compliant markup and filter out many possible attacks (such as XSS).

For faster cleanups that don't require XSS filtering, I use the PECL extension Tidy, which is a binding for the Tidy HTML utility.

If those don't help you, I suggest you switch to FCKEditor which has this feature built-in.
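
A rough sketch of what a Tidy-based cleanup could look like, assuming the tidy extension is loaded. The wrapper name and the choice of options are assumptions for illustration, not the answerer's actual configuration.

```php
<?php
// Hypothetical wrapper around the tidy extension.
// Returns null when the extension is missing or the repair fails.
function tidy_clean_fragment(string $html): ?string
{
    if (!function_exists('tidy_repair_string')) {
        return null; // tidy extension not loaded
    }

    $config = [
        'show-body-only' => true, // return a fragment rather than a full document
        'word-2000'      => true, // enable Tidy's Word-specific cleanup
        'output-xhtml'   => true, // emit well-formed XHTML
        'wrap'           => 0,    // do not wrap long lines
    ];

    $clean = tidy_repair_string($html, $config, 'utf8');

    return $clean === false ? null : $clean;
}
```

Before relying on this, it is worth checking on real pasted input whether the Word cleanup leaves the span tags that TinyMCE needs in place.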

清风疏影 2024-07-18 08:04:34

In my case, this worked just fine:

$text = strip_tags($text, '<p><a><em><span>');

Rather than trying to pull out stuff you don't want, such as embedded Word XML, you can just specify your allowed tags.
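
One caveat: strip_tags() keeps the attributes of the tags it allows, so Word's class="MsoNormal" and mso-* style declarations would survive on <p> and <span>. A minimal sketch of a follow-up pass is below. The function name and the regexes are assumptions about what Word typically emits, and the naive split on ";" assumes no semicolons inside quoted CSS values.

```php
<?php
// Hypothetical helper: whitelist tags, then drop Word-specific attributes.
function clean_word_html(string $html): string
{
    // Keep only the tags the editor needs; HTML comments are stripped too.
    $html = strip_tags($html, '<p><a><em><span>');

    // Remove class attributes whose value starts with "Mso".
    $html = preg_replace('/\sclass\s*=\s*(["\'])Mso.*?\1/i', '', $html);

    // Inside style attributes, drop declarations whose property starts with "mso-".
    $html = preg_replace_callback(
        '/\sstyle\s*=\s*(["\'])(.*?)\1/is',
        function (array $m): string {
            // Decode entities such as &quot; before splitting on ";".
            $css = html_entity_decode($m[2], ENT_QUOTES | ENT_HTML5, 'UTF-8');
            $kept = [];
            foreach (explode(';', $css) as $decl) {
                $decl = trim($decl);
                if ($decl !== '' && stripos($decl, 'mso-') !== 0) {
                    $kept[] = $decl;
                }
            }
            if (!$kept) {
                return ''; // nothing left, drop the whole attribute
            }
            $value = htmlspecialchars(implode('; ', $kept), ENT_QUOTES | ENT_HTML5, 'UTF-8');
            return ' style="' . $value . '"';
        },
        $html
    );

    return $html;
}
```

This keeps formatting such as font-weight on spans while removing the Word-only parts. It does not filter for XSS, so untrusted input still needs a proper sanitizer.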

天生の放荡 2024-07-18 08:04:34

The website http://word2cleanhtml.com/ does a good job of converting from Word. I'm using it from PHP by scraping, to process some legacy HTML, and so far it's working pretty well (the result is very clean <p>, <b> code). Of course, being an external service, it's not a good fit for online processing like your case.

If you try it and it brings many 400 errors, try filtering the HTML with Tidy first.

山人契 2024-07-18 08:04:34

In my case, there was a pattern. The unwanted part always started with

<!-- [if gte mso 9]>

and ended with

<![endif]-->

So my solution was to keep only what comes before and after this block:

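// strrchr() only uses the first character of its needle, so find the full marker with strrpos().
$marker = '[endif]-->';
$start = strpos($string, '<!--');
$stop = strrpos($string, $marker);
if ($start !== false && $stop !== false && $stop > $start) {
    $string = substr($string, 0, $start) . substr($string, $stop + strlen($marker));
}
echo $string;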
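
If the input contains several such blocks with ordinary text between them, cutting from the first comment opener to the last [endif]--> would also drop that text. A minimal sketch that removes each block on its own with a non-greedy regex is below. The function name is hypothetical, and the pattern assumes the blocks have the <!--[if ...]> ... <![endif]--> shape shown above.

```php
<?php
// Hypothetical helper: removes each Word conditional-comment block separately.
function remove_conditional_comments(string $html): string
{
    // Match "<!--[if ...]>" (optional space after "<!--"), then lazily
    // everything up to the nearest "<![endif]-->".
    $pattern = '/<!--\s*\[if\s[^\]]*\]>.*?<!\[endif\]\s*-->/is';

    $result = preg_replace($pattern, '', $html);

    // preg_replace() returns null on error; fall back to the original input.
    return $result ?? $html;
}
```

Ordinary comments such as <!-- note --> are left alone, since the pattern requires the [if ...] condition right after the opener.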