I've just got my hands on a Stack Overflow data dump, and I'm disappointed to see that the Body field of the posts is in HTML rather than Markdown. I suspect the original database stores Markdown, because that's what I see if I try to edit an answer.
I want to recover Markdown from a large set of answers. I will be processing hundreds of entries in batch mode, using either command-line tools or some kind of Lua or C library, so an interactive tool like the wmd Markdown editor is not suitable. Can people say what tools are available to help me recover Markdown from a Stack Overflow data dump?
(Related question, not a duplicate: Convert HTML back to Markdown within wmd.)
Markdownify converts HTML to Markdown.
See Also: MetaSO / Can Markdown be recovered from the SO data dump?
Take a look at pandoc: http://johnmacfarlane.net/pandoc/
There is an html2markdown tool included with pandoc that works pretty well, and because it runs from the command line, batch conversion is straightforward.
Here is the man page: http://johnmacfarlane.net/pandoc/html2markdown.1.html
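For the batch case, something along these lines might work. It is only a rough sketch, and it rests on a few assumptions that are not stated in the thread: that the dump is an XML file of `<row>` elements carrying `Id`, `PostTypeId` (with `2` meaning an answer) and `Body` attributes, that a `pandoc` binary is on the `PATH`, and that calling `pandoc --from html --to markdown` directly is an acceptable stand-in for the html2markdown wrapper. The function names are made up for illustration. The output will be pandoc's rendering of the HTML, not necessarily the Markdown the author originally typed.

```python
import subprocess
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Iterator, List, Tuple

# Assumption: pandoc is installed and reads HTML from stdin when no file is given.
PANDOC_CMD: List[str] = ["pandoc", "--from", "html", "--to", "markdown"]


def iter_answers(xml_source) -> Iterator[Tuple[str, str]]:
    """Yield (post id, HTML body) for each answer row in the dump.

    Assumed layout: <row Id="..." PostTypeId="2" Body="..."/> for answers.
    xml_source may be a file name or a binary file object.
    """
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "row" and elem.get("PostTypeId") == "2":
            yield elem.get("Id"), elem.get("Body", "")
        elem.clear()  # drop the row's data once it has been handed out


def html_to_markdown(html: str) -> str:
    """Pipe one HTML fragment through pandoc and return the Markdown text."""
    result = subprocess.run(
        PANDOC_CMD,
        input=html,
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise if pandoc exits with an error
    )
    return result.stdout


def convert_dump(xml_source, out_dir: str) -> int:
    """Write one <post id>.md file per answer and return how many were written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for post_id, html in iter_answers(xml_source):
        (out / f"{post_id}.md").write_text(html_to_markdown(html), encoding="utf-8")
        count += 1
    return count
```

Calling `convert_dump("posts.xml", "markdown_out")` (both names are placeholders) would then write one `.md` file per answer. Starting one pandoc process per post is slow but simple, which should be fine for a few hundred entries.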