Detecting encoding conversion problems
The majority of content on my company's website starts life as a Word document (Windows-1252 encoded) and is eventually copied and pasted into our UTF-8-encoded content management system. The conversion usually chokes on a few characters (special break characters, smart quotes, scientific notation), which have to be cleaned up manually, but of course a few always slip through.
What do you think the best way would be to detect these?
Comments (3)
How exactly are you doing the conversion?
The whole copying-from-Word problem is something I've come across quite often, but it should really be easy to solve.
The characters you mention are all in the 0x80-0x9F range, where the Windows-1252 code page differs from the ISO-8859-1 code page. ISO-8859-1 defines no printable characters in that range. You must be doing the conversion from ISO-8859-1 (or perhaps ISO-8859-15) instead of Windows-1252, causing it to choke on characters in that range.
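To make the difference concrete, here is a minimal Python sketch (Python is used only for brevity; it is not the asker's stack, and the sample bytes are invented for illustration). It decodes the same Windows-1252 bytes once with the right source encoding and once as ISO-8859-1, then lists the characters that landed in the 0x80-0x9F range.

```python
# Sample bytes as Windows-1252 would store them (made up for this sketch):
# 0x93 / 0x94 are the curly double quotes, 0x96 is the en dash.
raw = b"\x93quoted\x94 \x96 done"

# Correct source encoding: the bytes become real typographic characters.
as_cp1252 = raw.decode("cp1252")

# Wrong source encoding: Python's "latin-1" codec maps every byte to the
# code point with the same number, so 0x93 becomes U+0093 (a control code).
as_latin1 = raw.decode("latin-1")

# Collect whatever ended up in the problem range U+0080..U+009F.
stray = [hex(ord(ch)) for ch in as_latin1 if 0x80 <= ord(ch) <= 0x9F]

print(as_cp1252)
print(stray)
```

With the wrong source encoding nothing fails outright; the quotes and dash are silently turned into invisible control codes, which is why they are easy to miss in a visual check.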
You should either adjust the source encoding of your conversion or, if that's somehow not possible (I'm not familiar with C#, but I doubt it), use the code page chart to fix the 32 problem characters separately from the main conversion.
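As a rough sketch of the second option, the snippet below scans already-converted text for characters in U+0080-U+009F and re-maps each one to the character Windows-1252 assigns to that byte. It is written in Python for brevity, the function names `find_suspects` and `repair` are hypothetical, and it assumes the text was mis-decoded as ISO-8859-1 somewhere upstream rather than damaged in some other way.

```python
import re

# Characters in this range almost never belong in web content; finding one
# suggests a Windows-1252 byte was decoded as ISO-8859-1.
C1_RANGE = re.compile("[\u0080-\u009f]")


def find_suspects(text):
    """Return (index, code point) pairs for every character in U+0080..U+009F."""
    return [(m.start(), f"U+{ord(m.group()):04X}") for m in C1_RANGE.finditer(text)]


def repair(text):
    """Replace each suspect with the character Windows-1252 assigns to that byte."""
    def fix(match):
        byte = bytes([ord(match.group())])
        try:
            return byte.decode("cp1252")
        except UnicodeDecodeError:
            # 0x81, 0x8D, 0x8F, 0x90 and 0x9D are unassigned in Python's
            # cp1252 codec, so those are left untouched for manual review.
            return match.group()
    return C1_RANGE.sub(fix, text)


# Invented example: curly quotes, an em dash and an ellipsis that were mis-decoded.
damaged = "\x93Hello\x94 5\x9710\x85"
print(find_suspects(damaged))
print(repair(damaged))
```

Running `find_suspects` over stored content gives a list of positions to review, while `repair` can be applied automatically when the cause is known to be the wrong source encoding.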
Can you save the text as .rtf and then parse it using some other program?
Can you use Word's VBA to save the text as something sane?
As already mentioned, it would be best to export the Word contents to a parsable format (either RTF or XML would do).
There might be a specific reason for using copy-and-paste to add the material to your CMS, but with copying and pasting you will probably always end up with a round of visual checking and fixing unless you create a tool that monitors the clipboard.
When copying and pasting from a recent version of Word, the clipboard holds several different formats that can be used; one of them is XML based.
It would be possible to create something that cleans up the Word XML on the clipboard and "sets" the text version (the one you probably paste into the CMS) to the cleaned-up content.
You could use the Word interop library that comes with Office and the standard C# clipboard functions to create this. The tool could run in the background alongside Word while you add content to the CMS.