Converting Word documents to usable HTML in PHP

Posted 2024-07-07 08:36:32

I have a set of Word documents which I want to publish using a PHP tool I've written. I copy and paste the Word documents into a text box and then save them into MySQL using the PHP program. The problem I have arises from all the non-standard characters that Word documents contain, like curly quotes and ellipses ("…"). What I do at the moment is manually search and replace these kinds of things (and also foreign symbols such as e-acute) with either plain text or HTML entities (&eacute; etc.). Is there a function in PHP I can call that will take the output of a Word document, convert everything that should be an entity into an entity, and convert other symbols that don't display properly in Firefox into symbols that do?

Thanks!

Comments (5)

笨笨の傻瓜 2024-07-14 08:36:32

This has served me well in the past:

$str = mb_convert_encoding($str, 'HTML-ENTITIES', 'UTF-8');
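
Note that the 'HTML-ENTITIES' target of mb_convert_encoding() is deprecated as of PHP 8.2. A minimal sketch of one possible replacement, assuming the input is already valid UTF-8; the helper name to_numeric_entities is made up for illustration and is not part of the answer above:

```php
<?php
// Hypothetical helper (not from the answer above): encode every non-ASCII
// character as a numeric HTML entity. Assumes $str is valid UTF-8.
function to_numeric_entities(string $str): string
{
    // Conversion map: [first code point, last code point, offset, mask].
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($str, $map, 'UTF-8');
}

// e-acute and a horizontal ellipsis become numeric entities.
echo to_numeric_entities("caf\u{E9} \u{2026}"), "\n";
```

Unlike htmlentities(), this produces numeric rather than named entities, and it leaves ASCII characters such as < and & untouched.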
萌酱 2024-07-14 08:36:32

A better solution would be to ensure that your database is set up to support UTF-8 characters. The additional characters available in the extended set should cover all the "non-standard" characters that you're talking about.

Otherwise, if you really must convert these characters into HTML entities, use htmlentities().
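
As a rough sketch of the htmlentities() route, assuming the pasted text reaches PHP as valid UTF-8 (the sample string and variable names are only illustrative):

```php
<?php
// Assumes $text is valid UTF-8, e.g. because the form page is served as UTF-8.
// Sample input: curly quotes around "cafe" with e-acute, then an ellipsis.
$text = "\u{201C}caf\u{E9}\u{201D}\u{2026}";

// Pass the charset explicitly so the result does not depend on php.ini.
$html = htmlentities($text, ENT_QUOTES | ENT_HTML401, 'UTF-8');

echo $html, "\n";
```

If the input is not actually UTF-8 (for example raw Windows-1252 bytes), htmlentities() will not produce these entities, so the encoding of the form submission matters.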

蔚蓝源自深海 2024-07-14 08:36:32

I think that all these answers miss one vital point. Windows itself uses a Windows flavour of latin1, so if you paste some special characters (like asymmetrical quotes) into a form on a Windows machine and that gets sent to a unix (or anything non-muckrosoft) box (be that to a database or whatever), some of the characters do not get matched to anything the unix system comprehends, hence the confused and garbled characters. What this means is that even if you have a UTF-8 database and use htmlentities, some nasties are still going to get through because they are characters the OS doesn't recognise: they aren't even part of UTF-8, they are Microsoft-only inventions. I would love to know of a slick solution. What I do is manually blacklist the character codes of the Microsoft-only chars I have encountered against an (also manual) list of UTF-8 characters, do a str_replace for all of these, and THEN you can do whatever you want with them: iconv, htmlentities, save straight into a utf8 database, it matters not anymore.

My grasp of all this is a little shaky. Check out http://www.cs.tut.fi/~jkorpela/www/windows-chars.html for an excellent explanation, which I have mutilated into short form above. If someone has a better solution (surely there is one out there!) for how to PHPify what this article explains, I would love to hear it!
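
One way this might be PHPified, as a hedged sketch: treat anything that is not valid UTF-8 as Windows-1252 and convert it. The function name word_paste_to_utf8 is hypothetical, and the "not valid UTF-8 means Windows-1252" rule is an assumption that will not hold for every input:

```php
<?php
// Hypothetical helper: normalise pasted text to UTF-8.
// Assumption: input that is not valid UTF-8 is Windows-1252.
function word_paste_to_utf8(string $input): string
{
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input; // already UTF-8 (or plain ASCII), leave untouched
    }
    return mb_convert_encoding($input, 'UTF-8', 'Windows-1252');
}

// Windows-1252 bytes 0x93, 0x94 and 0x85 are the curly double quotes
// and the ellipsis.
$utf8 = word_paste_to_utf8("\x93Hi\x94\x85");
echo bin2hex($utf8), "\n";
```

After this step the text can go through htmlentities() or straight into a UTF-8 database without a hand-maintained replacement list.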

痞味浪人 2024-07-14 08:36:32

htmlspecialchars() will get you a long way, but watch out because Word documents are messy.
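
For illustration, a small sketch of what htmlspecialchars() does and does not touch, assuming UTF-8 input (the sample string is made up):

```php
<?php
// htmlspecialchars() escapes only &, <, > and quotes; curly quotes and
// other non-ASCII characters pass through unchanged.
$text = "<b>\u{201C}A & B\u{201D}</b>";
$escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

echo $escaped, "\n";
```

So it makes the markup characters safe, but the Word-specific characters still need a correct encoding (or htmlentities()) to display properly.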

故笙诉离歌 2024-07-14 08:36:32

Here's a solution I cooked up for the problem with the non-portable Windows character set. This replaces the offending almost-Latin-1 characters with their equivalent HTML entities.

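// The wrapper name is hypothetical; the snippet originally showed only the body.
// Assumes $input is a single-byte Windows-1252 string. Do not run this on
// UTF-8 text: bytes 0x80-0x9F also occur inside UTF-8 multi-byte sequences.
function windows_chars_to_entities($input)
{
    $translation = array(
        // reference from http://www.cs.tut.fi/~jkorpela/www/windows-chars.html
        "\x82" => "&sbquo;",
        "\x83" => "&fnof;",
        "\x84" => "&bdquo;",
        "\x85" => "&hellip;",
        "\x86" => "&dagger;",
        "\x87" => "&Dagger;",
        "\x88" => "&circ;",
        "\x89" => "&permil;",
        "\x8a" => "&Scaron;",
        "\x8b" => "&lsaquo;",
        "\x8c" => "&OElig;",
        "\x91" => "&lsquo;",
        "\x92" => "&rsquo;",
        "\x93" => "&ldquo;",
        "\x94" => "&rdquo;",
        "\x95" => "&bull;",
        "\x96" => "&ndash;",
        "\x97" => "&mdash;",
        "\x98" => "&tilde;",
        "\x99" => "&trade;",
        "\x9a" => "&scaron;",
        "\x9b" => "&rsaquo;",
        "\x9c" => "&oelig;",
        "\x9f" => "&Yuml;",
    );
    // Replace each Windows-1252 byte with its named HTML entity.
    return str_replace(array_keys($translation), array_values($translation), $input);
}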

It Works For Me™
