Programmatically clean Word-generated HTML while preserving styles?

Posted 2024-09-01 04:50:59 · 737 words · 3 views · 0 comments

In my current company, we have this decade old...let's call it a "Hello World" application.

While wanting to create a newer version of it, we also want to preserve older entries. These older entries contain hideous Word-generated HTML which was never filtered before.

If and when we move to a newer system, I'd prefer to have that HTML cleaned and filtered in order to have the site comply with HTML standards as much as possible.
However, just cleaning that code like Jeff Atwood described in his blog or in any other way I know of would also ruin the style and formatting.

Now, that just might cause our users to revolt and then all hell will break loose - not a very good idea.

So the question is: Can Word's HTML be cleaned while preserving basic formatting? (e.g: coloring, italicized, bold text and so on)

Preferably using publicly available code or library, such as HTML Tidy, examples in C# would be much appreciated.

Comments (8)

巾帼英雄 2024-09-08 04:50:59

There are a couple of options available, but you can certainly use Jeff Atwood's code as a good starting point for writing your own. If you do, you'll likely get fine-tuned control over the result. Note, though, that the results will never be 100% accurate, because all that extra MS-specific markup is there to ensure as much fidelity with the original document as possible (at least in IE, for round-tripping purposes). But most code out there does preserve most formatting.

Here are some code libraries that could be helpful:

If you just want batch processing (and don't care about owning a code base), the Office 2000 HTML Filter 2.0 is probably your best bet - read more about it on TechRepublic.

等待我真够勒 2024-09-08 04:50:59

tidy works fine for cleaning up and regularizing html syntax.

It's very configurable, so for a batch cleanup, it's likely
the command line tool will do what you need. You don't have
to program tidylib yourself.

If you need to do more involved cleanup of the content -
not just the syntax - some xslt processors ( xsltproc, for one )
have an '--html' option: input files are parsed by the html parser instead
of an xml parser. You can then use xslt to transform or rearrange the
content, then output with the html serializer.
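A batch run with the command-line tool could be driven from a small script. The following is a minimal sketch in Python, assuming the `tidy` executable is on the PATH; the `tidy_command` and `clean_folder` helpers and the folder layout are hypothetical names made up for illustration. `word-2000`, `bare` and `drop-proprietary-attributes` are HTML Tidy configuration options, but how much of the original styling survives depends on the Tidy version and the input, so try it on a few sample entries first.

```python
import subprocess
from pathlib import Path


def tidy_command(src, dst):
    """Build a tidy invocation aimed at Word-generated HTML (hypothetical helper)."""
    return [
        "tidy",
        "-q",                                      # suppress non-essential output
        "--word-2000", "yes",                      # strip surplus Word 2000 markup
        "--bare", "yes",                           # strip Microsoft-specific HTML
        "--drop-proprietary-attributes", "yes",    # drop non-standard attributes
        "--show-warnings", "no",
        "-o", str(dst),                            # write the result to dst
        str(src),
    ]


def clean_folder(src_dir, dst_dir):
    """Run tidy over every .htm/.html file in src_dir; return the files that failed."""
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for src in sorted(Path(src_dir).glob("*.htm*")):
        result = subprocess.run(tidy_command(src, dst_dir / src.name))
        # tidy exits with 0 (ok), 1 (warnings) or 2 (errors)
        if result.returncode >= 2:
            failed.append(src)
    return failed
```

Calling `clean_folder("old_entries", "cleaned")` (both folder names are placeholders) would leave the originals untouched and report the files tidy could not repair.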

乖乖 2024-09-08 04:50:59

This SO question poses a similar problem, although there, programmatic cleanup is not required.

One of the answers mentions that Office 2007 has a Publish->Blog menu item that reportedly produces good results and is fast. You could create a macro in Word to invoke this command, and then invoke the macro programmatically. You can use COM or VBScript to start Word and run the macro, or run winword.exe with the /m switch. Command line switches to winword.exe are given here.

羅雙樹 2024-09-08 04:50:59

Do have a budget for it. This might work. Try before you buy.

谢绝鈎搭 2024-09-08 04:50:59

Take a look at FCKEditor. It's a JavaScript-based editor, so looking at the source might give you lots of hints as to what to look for when removing Word HTML.

In particular, take a look at the file /editor/dialog/fck_paste.html. There's a function, "CleanWord", that does it all. I've modified it for use in my own applications (slight modifications, i.e. different replacements, etc.), and it does a great job of getting rid of ugly Word HTML.

It uses regular expressions to find and replace, which means you can easily extract the regexes and port them to another programming language of your choice to run the batch job.
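To give an idea of what such a regex-based pass looks like once ported, here is a minimal sketch in Python. The patterns are illustrative guesses at typical Word debris (conditional comments, `<o:p>` and other namespaced tags, `Mso*` classes, `mso-*` style declarations); they are not FCKEditor's actual CleanWord rules, and `clean_word_html` is a made-up name. Ordinary declarations such as `color:red` and tags such as `<b>` and `<i>` are left alone.

```python
import re

# Illustrative patterns only; NOT the rules used by FCKEditor's CleanWord.
# Order matters: conditional comments go first so their contents vanish whole.
WORD_PATTERNS = [
    # <!--[if gte mso 9]> ... <![endif]--> blocks
    (re.compile(r"<!--\[if.*?<!\[endif\]-->", re.S | re.I), ""),
    # namespaced tags such as <o:p>, </o:p>, <w:WordDocument>
    (re.compile(r"</?\w+:\w+[^>]*>"), ""),
    # class=MsoNormal, class="MsoListParagraph", ...
    (re.compile(r"\s+class=[\"']?Mso\w*[\"']?", re.I), ""),
    # mso-* declarations inside style attributes
    # (caveat: would also hit body text that happens to look like "mso-x: y")
    (re.compile(r"\s*mso-[^:;\"']+:[^;\"']+;?", re.I), ""),
    # style attributes that ended up empty after the previous step
    (re.compile(r"\s+style=([\"'])\s*\1", re.I), ""),
]


def clean_word_html(html):
    """Apply each pattern in turn and return the cleaned markup."""
    for pattern, replacement in WORD_PATTERNS:
        html = pattern.sub(replacement, html)
    return html
```

Because the patterns are plain data, adding a rule for a quirk found in your own entries is a one-line change. The usual caveat applies: regexes do not understand nesting, so this suits predictable Word output better than arbitrary HTML.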

零度℉ 2024-09-08 04:50:59

PSPad includes tidy, which has a customizable "Clean Microsoft Word 2000" option that I've used on Word documents before.

泪眸﹌ 2024-09-08 04:50:59

The HtmlRuleSanitizer (available on NuGet) can do this for you out of the box.

It uses the HTML Agility Pack to parse the HTML code and applies a set of whitelist-based rules to preserve formatting. The default rule sets will get rid of virtually all the verbose MS Word HTML code while preserving basic document structure such as heading tags, bold, italic, etc.

If you want to preserve specific MS Word styling you'll have to create or adapt a rule set for your use case.

It will for example easily convert the hundreds of lines of HTML code which MS Word would generate for a document containing the following:

Heading one

Paragraph

Heading two

Bold

Italic

A Link

To only the following set of relatively clean HTML:

<html>
<body>
<h1><span>Heading</span> <span>one</span></h1>
<p><span>Paragraph</span></p>
<h2><span>Heading</span> <span>two</span></h2>
<p><span><strong>Bold</strong></span><strong></strong></p>
<p><span><i>Italic</i></span><i></i></p>
<p><i><a href="http://www.google.com/" target="_blank" rel="nofollow">Link</a></i></p>
</body>
</html>

Note that some of the annoying things MS Word does, such as opening and closing tags very often (see the span elements in the example), are not fully cleaned out.
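The whitelist idea itself is independent of any particular library. Below is a minimal sketch of it in Python using only the standard library's `html.parser`. It is not HtmlRuleSanitizer's API; the class name, the `sanitize` function and the allowed tag, attribute and style sets are all assumptions chosen for illustration. Tags outside the whitelist are dropped while their text is kept, and `style` attributes are reduced to a few basic formatting properties such as `color`.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelists for illustration; adapt them to your own content.
ALLOWED_TAGS = {"h1", "h2", "h3", "p", "b", "strong", "i", "em", "u",
                "a", "ul", "ol", "li", "br", "span"}
ALLOWED_ATTRS = {"a": {"href"}}   # a real filter should also vet href schemes
ALLOWED_STYLES = {"color", "font-weight", "font-style", "text-decoration"}
VOID_TAGS = {"br"}
SKIP_CONTENT = {"script", "style", "title"}   # drop these tags and their text


class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self._skip += 1
            return
        if tag not in ALLOWED_TAGS:
            return                          # drop the tag, keep its text
        kept = []
        for name, value in attrs:
            value = value or ""
            if name == "style":
                value = self._filter_style(value)
                if value:
                    kept.append(("style", value))
            elif name in ALLOWED_ATTRS.get(tag, ()):
                kept.append((name, value))
        attr_text = "".join(f' {n}="{escape(v, quote=True)}"' for n, v in kept)
        self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self._skip = max(0, self._skip - 1)
        elif tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip:
            self.out.append(escape(data, quote=False))

    @staticmethod
    def _filter_style(style):
        """Keep only whitelisted CSS declarations (drops all mso-* ones)."""
        kept = []
        for decl in style.split(";"):
            prop, sep, val = decl.partition(":")
            if sep and prop.strip().lower() in ALLOWED_STYLES:
                kept.append(f"{prop.strip().lower()}:{val.strip()}")
        return ";".join(kept)


def sanitize(html):
    parser = WhitelistSanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Comments, including Word's conditional comments, disappear because the parser's default comment handler does nothing. Like the library output shown above, this sketch does not merge or remove the redundant `span` elements Word produces.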

时光病人 2024-09-08 04:50:59

Here is a set of PowerShell scripts that will clean Word-Filtered HTML and correctly tag super/subscripts about 95% of the time. (No, you can't get better than that, Word is made for print.)

https://github.com/suzumakes/replaceit

Basic formatting is kept intact, tags become tags and tags become tags. I think this is what you're looking for, and even though you shouldn't use Regex to parse HTML, Word-Filtered HTML is hardly filtered, but it is clean after these powershell scripts are run on it.

Instructions are there in the ReadMe and if you happen to encounter any additional characters that need to be caught or come up with any tweaks/improvements, I'd be happy to see your pull request.
