Quickly convert (.rtf|.doc) files to Markdown syntax with PHP
I've been manually converting articles into Markdown syntax for a few days now, and it's getting rather tedious. Some of them are 3 or 4 pages long, with italics and other emphasized text throughout. Is there a faster way to convert (.rtf|.doc) files to clean Markdown syntax that I can take advantage of?
If you happen to be on a Mac, textutil does a good job of converting doc, docx, and rtf to HTML, and pandoc does a good job of converting the resulting HTML to Markdown.
I have a script (https://gist.github.com/1181510) that I threw together a while back that tries to use textutil, pdf2html, and pandoc to convert whatever I throw at it to Markdown.
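
The two-step conversion described in this answer can be sketched as a small Python wrapper. This is only an illustration, not the linked gist: the function names `build_commands` and `convert` are hypothetical, it assumes macOS with `textutil` and `pandoc` on the PATH, and it writes the intermediate HTML next to the input file.

```python
import subprocess
from pathlib import Path


def build_commands(source: str) -> list[list[str]]:
    """Return the textutil and pandoc command lines for one input file.

    Hypothetical helper: output files sit next to the input and reuse
    its name with .html and .md extensions.
    """
    src = Path(source)
    html = src.with_suffix(".html")
    md = src.with_suffix(".md")
    return [
        # Step 1 (macOS only): convert doc/docx/rtf to HTML.
        ["textutil", "-convert", "html", str(src), "-output", str(html)],
        # Step 2: convert the intermediate HTML to Markdown.
        ["pandoc", "-f", "html", "-t", "markdown", "-o", str(md), str(html)],
    ]


def convert(source: str) -> None:
    """Run both steps in order, raising if either tool fails."""
    for cmd in build_commands(source):
        subprocess.run(cmd, check=True)
```

Calling `convert("article.rtf")` would leave `article.html` and `article.md` beside the original. The HTML that textutil produces carries a lot of inline styling, so the Markdown will probably still need some manual cleanup.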
ProgTips has a possible solution with a Word macro (source download).
Source: ProgTips
The answer covered installation and kept a copy of the macro source for safekeeping, in case ProgTips deletes the post or the site gets wiped out; see ProgTips for both.
If you're open to using the .docx format, you could use this PHP script that I put together. It extracts the XML, runs some XSL transformations, and outputs a pretty decent Markdown equivalent: https://github.com/matb33/docx2md
Note that it is meant to be run from the command line, and its interface is rather basic. However, it will get the job done!
If the script doesn't work well enough for you, I encourage you to send me your .docx files so I can reproduce your problem and fix it. Log an issue on GitHub or contact me directly if you prefer.
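
To give a feel for the "extract the XML" step, here is a minimal Python sketch. It is not how docx2md works (that script is PHP and uses XSL transformations); it only relies on the fact that a .docx file is a ZIP archive whose main body lives in `word/document.xml`. The function names are hypothetical, and the sketch handles only paragraphs with bold and italic runs.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def run_to_markdown(run: ET.Element) -> str:
    """Wrap one text run in Markdown emphasis markers if it is bold or italic.

    Simplification: any <w:b> or <w:i> element counts as "on", and
    whitespace at the edges of an emphasized run is not moved outside
    the markers.
    """
    text = "".join(t.text or "" for t in run.iter(W + "t"))
    props = run.find(W + "rPr")
    if not text.strip() or props is None:
        return text
    bold = props.find(W + "b") is not None
    italic = props.find(W + "i") is not None
    marker = ("**" if bold else "") + ("*" if italic else "")
    return f"{marker}{text}{marker}"


def document_xml_to_markdown(xml_bytes: bytes) -> str:
    """Convert the bytes of word/document.xml to Markdown paragraphs."""
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W + "p"):
        line = "".join(run_to_markdown(r) for r in para.iter(W + "r"))
        if line.strip():
            paragraphs.append(line)
    return "\n\n".join(paragraphs)


def docx_to_markdown(path: str) -> str:
    """Read word/document.xml out of the .docx archive and convert it."""
    with zipfile.ZipFile(path) as archive:
        return document_xml_to_markdown(archive.read("word/document.xml"))
```

Headings, lists, links, and tables are all ignored here, which is part of why a dedicated tool is worth using for real documents.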
Pandoc is a good command-line conversion tool, but again, you will first need to get the input into a format that Pandoc can read.
We had the same problem of having to convert Word documents to Markdown. Some were more complicated and (very) large documents, with math equations, images, and so on. So I made this script, which converts using a number of different tools: https://github.com/Versal/word2markdown
Because it uses a chain of several tools it is a bit more error-prone, but it can be a good starting point if you have more complicated documents. Hope it can be helpful! :)
Update:
It currently only works on Mac OS X, and you need to have some requirements installed (Word, Pandoc, HTML Tidy, git, node/npm). For it to work properly, you also need to open an empty Word document and do: File->Save As Webpage->Compatibility->Encoding->UTF-8. This encoding is then saved as the default. See the README for more details on how to set it up.
Then run the script from the console. You can then find the Markdown in document.md and the images in the document_files directory.
It's perhaps a bit complicated now, so I would welcome any contributions that make this easier or make it work on other operating systems! :)
Have you tried this one? Not sure about feature richness, but it works for simple texts.
http://markitdown.medusis.com/
As part of a university Ruby course, I developed a tool that can convert OpenOffice Writer files (.odt) to Markdown.
A lot of assumptions have to be made in order to arrive at the correct formatting. For example, it is hard to determine what text size should be treated as a heading.
However, the only thing you can lose with this conversion is the formatting: any text the tool encounters is always appended to the Markdown document.
The tool I've developed supports lists, bold and italic text, and it has syntax for tables.
http://github.com/bostko/doc2text
Give it a try and please give me your feedback.
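
The heading problem mentioned above can be illustrated with one possible heuristic. This is a sketch in Python, not the approach doc2text takes (that tool is written in Ruby and its rules may differ): it assumes the most common font size in the document is body text and ranks every larger size as a heading level. The function name and the `(text, font_size)` input shape are hypothetical.

```python
from collections import Counter


def assign_heading_levels(paragraphs: list[tuple[str, float]]) -> list[str]:
    """Turn (text, font_size) pairs into Markdown lines.

    Assumption: the most frequent font size is body text; each distinct
    larger size becomes a heading, largest first, capped at level 6.
    """
    if not paragraphs:
        return []
    body_size = Counter(size for _, size in paragraphs).most_common(1)[0][0]
    heading_sizes = sorted(
        {size for _, size in paragraphs if size > body_size}, reverse=True
    )
    levels = {size: index + 1 for index, size in enumerate(heading_sizes)}
    lines = []
    for text, size in paragraphs:
        level = min(levels.get(size, 0), 6)
        lines.append("#" * level + " " + text if level else text)
    return lines
```

A guess like this breaks down on documents that use large text for emphasis rather than structure, which is the kind of assumption the answer is warning about.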