Combining HTML documents with different character sets
I saved an MS Word document using the "Save As" option "Web Page, Filtered". I want to insert the generated HTML and CSS into an HTML5 document that has my header, menu, footer, etc. My first question concerns the charset and header info:
MS-Word generated HTML (Saved as "Web Page, Filtered"):
<html>
<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=Generator content="Microsoft Word 12 (filtered)">
My HTML5 template:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
The main issue I see is the two different character sets (UTF-8 vs. windows-1252). Additionally, I am guessing the meta tag <meta name=Generator content="Microsoft Word 12 (filtered)"> will not be a problem and can perhaps just be removed (?).
I can sort out the CSS with one exception. I do not know what the '@' symbol means. Example:
@font-face
{font-family:"Book Antiqua";
panose-1:2 4 6 2 5 3 5 3 3 4;}
I looked through the document and do not see any "font-face" IDs or classes, so I am guessing this rule might change all of the fonts in the document. That might be a problem (if true); as stated, the new document will have my menu, header, footer, etc.
Comments (2)
You should not copy and paste anything that MS Office pukes out into a website, mostly because your code becomes a big mess and it will most likely only look right in IE. This is just my experience after I got a lot of "Your website is broken!!!" complaints when someone pasted MS Word "HTML" into Joomla pages.
Anyway, the charset on your website must be UTF-8.
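One way to get the Word export into UTF-8 is sketched below in Python. It assumes the file really is windows-1252, as its meta tag claims. The function name word_html_to_utf8 is made up for illustration. It decodes the bytes, swaps the old Content-Type meta for the HTML5 charset declaration, and drops the Generator meta, which is only informational.

```python
import re


def word_html_to_utf8(raw: bytes) -> str:
    """Decode Word's windows-1252 bytes and modernise the charset metadata.

    Hypothetical helper, not part of Word or of any library.
    """
    # Python's cp1252 codec is strict: it raises UnicodeDecodeError on the
    # few byte values that windows-1252 leaves undefined.
    text = raw.decode("cp1252")

    # Replace the legacy Content-Type meta with the HTML5 declaration.
    text = re.sub(
        r'<meta\s+http-equiv=["\']?Content-Type["\']?[^>]*>',
        '<meta charset="UTF-8">',
        text,
        flags=re.IGNORECASE,
    )

    # The Generator meta carries no behaviour, so remove it entirely.
    text = re.sub(
        r'<meta\s+name=["\']?Generator["\']?[^>]*>\s*',
        "",
        text,
        flags=re.IGNORECASE,
    )
    return text
```

The returned string still has to be written out as UTF-8, for example with Path("page.html").write_text(text, encoding="utf-8") (the file name is a placeholder), so that the bytes on disk match the charset the template declares.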
Your @font-face looks broken to me; I only know it in a slightly different syntax. The declaration alone won't do anything until you apply the font it declares (say, "Awesomefont") somewhere else.
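For context, the "@" marks a CSS at-rule, and an @font-face rule only describes a font under a name. Nothing on the page changes until some font-family declaration refers to that name. If Word's @font-face rules are not wanted in the merged page, one option is to strip them from the stylesheet before pasting it in. Below is a minimal Python sketch of that idea. The function name is hypothetical, and it assumes the rule bodies contain no nested braces, which holds for the example in the question.

```python
import re

# Matches one complete @font-face rule plus any whitespace that follows it.
# Assumption: the rule body has no nested braces.
FONT_FACE_RULE = re.compile(r"@font-face\s*\{[^}]*\}\s*", re.IGNORECASE)


def strip_font_face(css: str) -> str:
    """Return the stylesheet text with every @font-face rule removed."""
    return FONT_FACE_RULE.sub("", css)
```

Rules that use the font by name, such as font-family:"Book Antiqua" on a paragraph class, are left untouched, and the browser falls back to a locally installed font of that name or to its default.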
Here is a set of PowerShell scripts that will clean Word-Filtered HTML and correctly tag super/subscripts about 95% of the time. (No, you can't get better than that; Word is made for print.)
https://github.com/suzumakes/replaceit
This also changes the characters that M$ barfs out in the windows-1252 range to their appropriate UTF-8 counterparts. It removes all the styling and classes so that you can drop the HTML straight into your template with minimal fuss. Depending on how crazy the person who made your Word doc went with justified text and funky layouts, you may have just a few minutes of cleanup, or you may have to fix M$'s propensity to insert soft hyphens all over the place.
Instructions are in the ReadMe, and if you happen to encounter any additional characters that need to be caught or come up with any tweaks/improvements, I'd be happy to see your pull request.
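To give a rough idea of what this kind of cleanup involves, here is a small Python sketch. It is not the logic of the linked scripts, only an illustration with made-up names. It strips class and style attributes and removes soft hyphens. It uses regular expressions, so it assumes reasonably regular markup; a real HTML parser would be safer on messy input.

```python
import re

SOFT_HYPHEN = "\u00ad"

# Matches class=... or style=... whether the value is double-quoted,
# single-quoted, or unquoted (Word's filtered HTML uses all three forms).
STYLING_ATTR = re.compile(
    r"""\s+(?:class|style)\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""",
    re.IGNORECASE,
)


def clean_fragment(html: str) -> str:
    """Drop class/style attributes and soft hyphens from an HTML fragment."""
    html = STYLING_ATTR.sub("", html)
    # Soft hyphens can appear as the raw character or as the &shy; entity.
    return html.replace(SOFT_HYPHEN, "").replace("&shy;", "")
```

For instance, clean_fragment applied to <p class=MsoNormal style='text-align:justify'> leaves a bare <p>, which then picks up whatever styling the surrounding template defines.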