有没有办法从 FCKEditor 中删除所有不必要的 MS Word 格式

发布于 2024-08-03 11:00:51 字数 136 浏览 8 评论 0原文

我已经安装了 fckeditor，当从 MS Word 粘贴时，它添加了很多不必要的格式。我想保留某些内容，例如粗体、斜体、项目符号等。我在网上搜索并提出了一些解决方案，可以去掉所有内容，甚至是我想保留的内容，例如粗体和斜体。有没有办法去掉不必要的文字格式？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

月朦胧 2024-08-10 11:00:51

以防万一有人想要已接受答案的 ac# 版本：

public string CleanHtml(string html)
    {
        //Cleans all manner of evils from the rich text editors in IE, Firefox, Word, and Excel
        // Only returns acceptable HTML, and converts line breaks to <br />
        // Acceptable HTML includes HTML-encoded entities.

        html = html.Replace("&" + "nbsp;", " ").Trim(); //concat here due to SO formatting
        // Does this have HTML tags?

        if (html.IndexOf("<") >= 0)
        {
            // Make all tags lowercase
            html = Regex.Replace(html, "<[^>]+>", delegate(Match m){
                return m.ToString().ToLower();
            });
            // Filter out anything except allowed tags
            // Problem: this strips attributes, including href from a
            // http://stackoverflow.com/questions/307013/how-do-i-filter-all-html-tags-except-a-certain-whitelist
            string AcceptableTags = "i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote";
            string WhiteListPattern = "</?(?(?=" + AcceptableTags + @")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>";
            html = Regex.Replace(html, WhiteListPattern, "", RegexOptions.Compiled);
            // Make all BR/br tags look the same, and trim them of whitespace before/after
            html = Regex.Replace(html, @"\s*<br[^>]*>\s*", "<br />", RegexOptions.Compiled);
        }


         // No CRs
         html = html.Replace("\r", "");
         // Convert remaining LFs to line breaks
         html = html.Replace("\n", "<br />");
         // Trim BRs at the end of any string, and spaces on either side
         return Regex.Replace(html, "(<br />)+$", "", RegexOptions.Compiled).Trim();
    }

Just in case someone wants a c# version of the accepted answer:

public string CleanHtml(string html)
    {
        //Cleans all manner of evils from the rich text editors in IE, Firefox, Word, and Excel
        // Only returns acceptable HTML, and converts line breaks to <br />
        // Acceptable HTML includes HTML-encoded entities.

        html = html.Replace("&" + "nbsp;", " ").Trim(); //concat here due to SO formatting
        // Does this have HTML tags?

        if (html.IndexOf("<") >= 0)
        {
            // Make all tags lowercase
            html = Regex.Replace(html, "<[^>]+>", delegate(Match m){
                return m.ToString().ToLower();
            });
            // Filter out anything except allowed tags
            // Problem: this strips attributes, including href from a
            // http://stackoverflow.com/questions/307013/how-do-i-filter-all-html-tags-except-a-certain-whitelist
            string AcceptableTags = "i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote";
            string WhiteListPattern = "</?(?(?=" + AcceptableTags + @")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>";
            html = Regex.Replace(html, WhiteListPattern, "", RegexOptions.Compiled);
            // Make all BR/br tags look the same, and trim them of whitespace before/after
            html = Regex.Replace(html, @"\s*<br[^>]*>\s*", "<br />", RegexOptions.Compiled);
        }


         // No CRs
         html = html.Replace("\r", "");
         // Convert remaining LFs to line breaks
         html = html.Replace("\n", "<br />");
         // Trim BRs at the end of any string, and spaces on either side
         return Regex.Replace(html, "(<br />)+$", "", RegexOptions.Compiled).Trim();
    }

回复收藏 0 原文

听不够的曲调 2024-08-10 11:00:51

这是我用来清理从富文本编辑器传入的 HTML 的解决方案...它是用 VB.NET 编写的，我没有时间转换为 C#，但它非常简单：

 Public Shared Function CleanHtml(ByVal html As String) As String
     '' Cleans all manner of evils from the rich text editors in IE, Firefox, Word, and Excel
     '' Only returns acceptable HTML, and converts line breaks to <br />
     '' Acceptable HTML includes HTML-encoded entities.
     html = html.Replace("&" & "nbsp;", " ").Trim() ' concat here due to SO formatting
     '' Does this have HTML tags?
     If html.IndexOf("<") >= 0 Then
         '' Make all tags lowercase
         html = RegEx.Replace(html, "<[^>]+>", AddressOf LowerTag)
         '' Filter out anything except allowed tags
         '' Problem: this strips attributes, including href from a
         '' http://stackoverflow.com/questions/307013/how-do-i-filter-all-html-tags-except-a-certain-whitelist
         Dim AcceptableTags      As String   = "i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote"
         Dim WhiteListPattern    As String   = "</?(?(?=" & AcceptableTags & ")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>"
         html = Regex.Replace(html, WhiteListPattern, "", RegExOptions.Compiled)
         '' Make all BR/br tags look the same, and trim them of whitespace before/after
         html = RegEx.Replace(html, "\s*<br[^>]*>\s*", "<br />", RegExOptions.Compiled)
     End If
     '' No CRs
     html = html.Replace(controlChars.CR, "")
     '' Convert remaining LFs to line breaks
     html = html.Replace(controlChars.LF, "<br />")
     '' Trim BRs at the end of any string, and spaces on either side
     Return RegEx.Replace(html, "(<br />)+$", "", RegExOptions.Compiled).Trim()
 End Function

 Public Shared Function LowerTag(m As Match) As String
   Return m.ToString().ToLower()
 End Function

在您的情况下，您需要修改“AcceptableTags”中“已批准”HTML 标记的列表 - 代码仍会去除所有无用的属性（不幸的是，有用的属性，如 HREF 和 SRC，希望这些对您来说并不重要）。

当然，这需要访问服务器。如果您不希望这样，则需要在工具栏上添加某种“清理”按钮，该按钮调用 JavaScript 来弄乱编辑器的当前文本。不幸的是，“粘贴”不是一个可以捕获以自动清理标记的事件，并且每次 OnChange 后进行清理都会导致编辑器无法使用（因为更改标记会更改文本光标位置）。

Here's a solution I use to scrub incoming HTML from rich text editors... it's written in VB.NET and I don't have time to convert to C#, but it's pretty straightforward:

 Public Shared Function CleanHtml(ByVal html As String) As String
     '' Cleans all manner of evils from the rich text editors in IE, Firefox, Word, and Excel
     '' Only returns acceptable HTML, and converts line breaks to <br />
     '' Acceptable HTML includes HTML-encoded entities.
     html = html.Replace("&" & "nbsp;", " ").Trim() ' concat here due to SO formatting
     '' Does this have HTML tags?
     If html.IndexOf("<") >= 0 Then
         '' Make all tags lowercase
         html = RegEx.Replace(html, "<[^>]+>", AddressOf LowerTag)
         '' Filter out anything except allowed tags
         '' Problem: this strips attributes, including href from a
         '' http://stackoverflow.com/questions/307013/how-do-i-filter-all-html-tags-except-a-certain-whitelist
         Dim AcceptableTags      As String   = "i|b|u|sup|sub|ol|ul|li|br|h2|h3|h4|h5|span|div|p|a|img|blockquote"
         Dim WhiteListPattern    As String   = "</?(?(?=" & AcceptableTags & ")notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:([""']?).*?\1?)?)*\s*/?>"
         html = Regex.Replace(html, WhiteListPattern, "", RegExOptions.Compiled)
         '' Make all BR/br tags look the same, and trim them of whitespace before/after
         html = RegEx.Replace(html, "\s*<br[^>]*>\s*", "<br />", RegExOptions.Compiled)
     End If
     '' No CRs
     html = html.Replace(controlChars.CR, "")
     '' Convert remaining LFs to line breaks
     html = html.Replace(controlChars.LF, "<br />")
     '' Trim BRs at the end of any string, and spaces on either side
     Return RegEx.Replace(html, "(<br />)+$", "", RegExOptions.Compiled).Trim()
 End Function

 Public Shared Function LowerTag(m As Match) As String
   Return m.ToString().ToLower()
 End Function

In your case, you'll want to modify the list of "approved" HTML tags in "AcceptableTags"--the code will still strip all the useless attributes (and, unfortunately, the useful ones like HREF and SRC, hopefully those aren't important to you).

Of course, this requires a trip to the server. If you don't want that, you'll need to add some sort of "clean up" button to the toolbar that calls JavaScript to mess with the editor's current text. Unfortunately, "pasting" is not an event that can be trapped to clean up the markup automatically, and cleaning after every OnChange would make for an unusable editor (since changing the markup changes the text cursor position).

回复收藏 0 原文

苦妄 2024-08-10 11:00:51

尝试了接受的解决方案，但它没有清除单词生成的标签。

但是这段代码对我有用

静态字符串CleanWordHtml（字符串html）{

StringCollection sc = new StringCollection();
// 去掉不必要的标签范围（评论和标题）
sc.Add(@"");
sc.Add(@"<标题>(\w|\W)+?");
// 摆脱类和样式
sc.Add(@"\s?class=\w+");
sc.Add(@"\s+style='[^']+'");
// 去掉不必要的标签
sc.添加（
@"<(meta|link|/?o:|/?style|/?div|/?st\d|/?head|/?html|body|/?body|/?span|!\[) [^>]*?>");
// 去掉空段落标签
sc.Add(@"(<[^>]+>)+()+");
// 删除附加到  的奇怪 v: 元素标签
sc.Add(@"\s+v:\w+=""[^""]+""");
// 删除多余的行
sc.Add(@"(\n\r){2,}");
foreach（sc 中的字符串 s）
{
    html = Regex.Replace(html, s, "", RegexOptions.IgnoreCase);
}
返回 html； 
}

Tried the accepted solution but it didn't clean the word generated tags.

But this code worked for me

static string CleanWordHtml(string html) {

StringCollection sc = new StringCollection();
// get rid of unnecessary tag spans (comments and title)
sc.Add(@"<!--(\w|\W)+?-->");
sc.Add(@"<title>(\w|\W)+?</title>");
// Get rid of classes and styles
sc.Add(@"\s?class=\w+");
sc.Add(@"\s+style='[^']+'");
// Get rid of unnecessary tags
sc.Add(
@"<(meta|link|/?o:|/?style|/?div|/?st\d|/?head|/?html|body|/?body|/?span|!\[)[^>]*?>");
// Get rid of empty paragraph tags
sc.Add(@"(<[^>]+>)+ (</\w+>)+");
// remove bizarre v: element attached to <img> tag
sc.Add(@"\s+v:\w+=""[^""]+""");
// remove extra lines
sc.Add(@"(\n\r){2,}");
foreach (string s in sc)
{
    html = Regex.Replace(html, s, "", RegexOptions.IgnoreCase);
}
return html; 
}

回复收藏 0 原文