How would you go about parsing Markdown?

Posted on 2024-07-14 14:17:56

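Edit: I recently learned about a project called CommonMark, which correctly identifies and resolves the ambiguities in the original Markdown specification. http://commonmark.org/ It has great C# library support.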

You can find the syntax here.

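The source code that comes with the download is written in Perl, which I have no intention of honoring. It is riddled with regular expressions and relies on MD5 hashes to escape certain characters. Something is just wrong about that!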

I'm about to hard-code a parser for Markdown. Any experience with this?

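If you don't have anything meaningful to say about the actual parsing of Markdown, spare me. (This might sound harsh, but yes, I'm looking for insight, not a solution, that is, a third-party library.)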

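To help steer the answers: regular expressions are meant to identify patterns, not to parse an entire grammar. People who consider doing so are foobar.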

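If you think about it, Markdown is fundamentally based on the concept of paragraphs.
As such, a reasonable approach might be to split the input into paragraphs.
There are many kinds of paragraphs, for example heading, text, list, blockquote, and code.
The challenge is thus to identify these paragraphs and the context in which they occur.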
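As a rough illustration of that idea, here is a minimal Python sketch that splits the input on blank lines and guesses the kind of each block from its first line. The names `split_blocks` and `classify` are made up for this example, the rule set is deliberately naive, and context (a list inside a blockquote, for instance) is not handled at all.

```python
import re

def split_blocks(text):
    """Split the input on blank lines into paragraph-like blocks."""
    return [b for b in re.split(r"\n\s*\n", text.strip("\n")) if b.strip()]

def classify(block):
    """Guess the block kind from its first line (naive, context-free rules)."""
    first = block.split("\n", 1)[0]
    if first.startswith("    ") or first.startswith("\t"):
        return "code"          # indented by four spaces or a tab
    if re.match(r"#{1,6}\s", first):
        return "heading"       # ATX-style heading: "# Title"
    if first.startswith(">"):
        return "blockquote"
    if re.match(r"([*+-]|\d+\.)\s", first):
        return "list"          # bullet or numbered item
    return "text"

doc = "# Title\n\nSome text\nmore text\n\n- a\n- b\n\n> quote\n\n    code()\n"
for block in split_blocks(doc):
    print(classify(block), repr(block))
```

The regular expressions here only recognize line-level patterns; deciding how blocks nest would still need a real parser on top of this.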

I'll be back with a solution once I find one worth sharing.


人事已非 2024-07-21 14:17:56

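The only Markdown implementation I know of that uses an actual parser is John MacFarlane's peg-markdown. Its parser is based on a Parsing Expression Grammar parser generator called peg.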

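Edit: Mauricio Fernandez recently released his Simple Markup Markdown parser, which he wrote as part of his OcsiBlog blog engine. Because the parser is written in OCaml, it is extremely simple and short (268 SLOC for the parser, plus the HTML emitter), yet very fast (faster than discount, which is written in hand-optimized C, and six hundred times faster than BlueCloth (http://Deveiate.Org/projects/BlueCloth/), which is written in Ruby), even though it has not been optimized for performance yet. Because it is intended only for Mauricio's own blog, it deviates from the official Markdown specification in a few places, but Mauricio has created a branch that reverts most of those changes.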


↘人皮目录ツ 2024-07-21 14:17:56

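Last week I released a new parser-based Markdown implementation for Java called pegdown.
pegdown uses a PEG parser to first build an abstract syntax tree, which is subsequently written out as HTML. As a result it is quite clean and much easier to read, maintain, and extend than a regex-based approach.
The PEG grammar is based on John MacFarlane's C implementation "peg-markdown".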
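To make the two-phase idea (parse into a tree, then emit HTML) concrete, here is a small Python sketch. It is not pegdown's grammar or API; `parse_inline` and `render` are hypothetical names, and the parser only imitates PEG-style ordered choice with backtracking for `**strong**` and `*emphasis*`.

```python
import html

def parse_inline(s, i=0, stop=None):
    """Parse s[i:] into nodes; return (nodes, next_index), or None if `stop` is never found."""
    nodes = []
    while i < len(s):
        if stop is not None and s.startswith(stop, i):
            return nodes, i + len(stop)
        # Ordered choice: try Strong first, then Emph, then fall back to plain text.
        for delim, kind in (("**", "strong"), ("*", "emph")):
            if s.startswith(delim, i):
                attempt = parse_inline(s, i + len(delim), delim)
                if attempt is not None:          # alternative succeeded
                    children, i = attempt
                    nodes.append((kind, children))
                    break
        else:
            # No alternative matched: consume one character as text.
            if nodes and nodes[-1][0] == "text":
                nodes[-1] = ("text", nodes[-1][1] + s[i])
            else:
                nodes.append(("text", s[i]))
            i += 1
    return (nodes, i) if stop is None else None

TAGS = {"emph": "em", "strong": "strong"}

def render(nodes):
    """Second phase: walk the tree and write HTML."""
    out = []
    for kind, payload in nodes:
        if kind == "text":
            out.append(html.escape(payload))
        else:
            out.append(f"<{TAGS[kind]}>{render(payload)}</{TAGS[kind]}>")
    return "".join(out)

tree, _ = parse_inline("a *b* **c**")
print(render(tree))  # a <em>b</em> <strong>c</strong>
```

An unmatched `*` simply fails its alternative and is kept as literal text. A real PEG parser would also memoize results to avoid repeated backtracking, which this sketch does not do.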

Maybe there is something of interest to you...


你的呼吸 2024-07-21 14:17:56

If I were to try to parse Markdown (and its extension, Markdown Extra), I think I would use a state machine and parse it one character at a time, linking together internal structures representing bits of text as I go along and then, once everything is parsed, generating the output from the objects strung together.

Basically, I'd build a mini-DOM-like tree as I read the input file.
To generate the output, I would just traverse the tree and emit HTML or anything else (PS, LaTeX, RTF, ...).
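A minimal Python sketch of what I mean, under heavy assumptions: the only syntax handled is backtick code spans and blank-line paragraph breaks, and `Node`, `parse`, `to_html`, and `to_latex` are names invented for this illustration.

```python
import html
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                      # "doc", "para", "text" or "code"
    text: str = ""
    children: list = field(default_factory=list)

def parse(source):
    """Read one character at a time; states are TEXT and CODE."""
    doc = Node("doc")
    para = None                    # paragraph currently being filled, if any
    buf = []                       # characters of the span being read
    state, prev = "TEXT", ""

    def flush(kind):
        nonlocal para
        if not buf:
            return
        if para is None:
            para = Node("para")
            doc.children.append(para)
        para.children.append(Node(kind, "".join(buf)))
        buf.clear()

    for ch in source:
        if ch == "`":              # a backtick toggles TEXT <-> CODE
            flush("text" if state == "TEXT" else "code")
            state = "CODE" if state == "TEXT" else "TEXT"
        elif state == "TEXT" and ch == "\n" and prev == "\n":
            if buf:
                buf.pop()          # drop the newline that ended the last line
            flush("text")
            para = None            # blank line: the next span opens a new paragraph
        else:
            buf.append(ch)
        prev = ch
    flush("text" if state == "TEXT" else "code")
    return doc

def to_html(node):
    if node.kind == "text":
        return html.escape(node.text)
    if node.kind == "code":
        return f"<code>{html.escape(node.text)}</code>"
    inner = "".join(to_html(c) for c in node.children)
    return f"<p>{inner}</p>" if node.kind == "para" else inner

def to_latex(node):
    # Note: LaTeX special characters are not escaped in this sketch.
    if node.kind == "text":
        return node.text
    if node.kind == "code":
        return "\\texttt{" + node.text + "}"
    sep = "\n\n" if node.kind == "doc" else ""
    return sep.join(to_latex(c) for c in node.children)

tree = parse("Use `x < 1` here.\n\nSecond para.")
print(to_html(tree))   # <p>Use <code>x &lt; 1</code> here.</p><p>Second para.</p>
print(to_latex(tree))
```

The point is only that the tree is built once and each output format is a separate traversal.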

Things that can increase complexity:

The fact that you can mix HTML and markdown, although the rule could be easy to implement: just ignore anything that's between two balanced tags and output it verbatim.
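A hedged Python sketch of that rule: if the text starts with an HTML element, count opening and closing tags of the same name until they balance, and hand the raw span back untouched. `split_raw_html` is a hypothetical helper; self-closing tags, comments, and attributes containing `>` are not handled.

```python
import re

def split_raw_html(text):
    """Return (raw_html, rest) if text starts with a balanced element, else (None, text)."""
    opening = re.match(r"<([A-Za-z][A-Za-z0-9]*)\b[^>]*>", text)
    if not opening:
        return None, text
    name = opening.group(1)
    # Matches both <name ...> and </name>; group 1 is "/" for a closing tag.
    tag = re.compile(rf"<(/?){name}\b[^>]*>", re.IGNORECASE)
    depth = 0
    for m in tag.finditer(text):
        depth += -1 if m.group(1) else 1
        if depth == 0:
            return text[:m.end()], text[m.end():]
    return None, text              # never balanced: treat it all as Markdown

raw, rest = split_raw_html("<div><div>*raw*</div></div>\n\n*md*")
print(repr(raw))   # '<div><div>*raw*</div></div>'
print(repr(rest))  # '\n\n*md*'
```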

URLs and notes can have their reference at the bottom of the text. A data structure for hyperlinks could simply record something like:

[my text to a link][linkkey]
results in a structure like: 
    URLStructure: 
    |  InnerText : "my text to a link"
    |  Key       : "linkkey"
    |  URL       : <null>
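In Python, that could look roughly like the sketch below. The `Link` class and `collect_links` function are assumptions made for illustration: links are recorded with an empty URL, and the URL is filled in once the `[key]: url` definitions have been collected. Keys are compared case-insensitively, and an empty key such as `[text][]` falls back to the link text, as in the Markdown syntax.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Link:
    inner_text: str
    key: str
    url: Optional[str] = None      # unknown until the definition is seen

LINK = re.compile(r"\[([^\]]+)\]\[([^\]]*)\]")
DEFINITION = re.compile(r"^ {0,3}\[([^\]]+)\]:\s*(\S+)", re.MULTILINE)

def collect_links(text):
    """Record every reference-style link, then resolve its URL from the definitions."""
    definitions = {m.group(1).lower(): m.group(2) for m in DEFINITION.finditer(text)}
    links = [Link(m.group(1), m.group(2) or m.group(1)) for m in LINK.finditer(text)]
    for link in links:
        link.url = definitions.get(link.key.lower())   # stays None if undefined
    return links

text = "See [my text to a link][linkkey].\n\n[linkkey]: http://example.com/\n"
print(collect_links(text))
```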

Headers can be defined with an underline, which could force us to use a simple data structure for a generic paragraph and modify its properties as we read the file:

ParagraphStructure:
|  InnerText    : the current paragraph text 
|                 (beginning of line until end of line).
|  HeadingLevel : <null> or 1-4 when we can assess 
|                 that paragraph heading level, if any.
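A small Python sketch of that "decide later" approach, with hypothetical names (`Paragraph`, `read_paragraphs`). Each non-blank line becomes a paragraph with no heading level, and the level is set retroactively when the next line turns out to be an underline. This sketch only assigns levels 1 (`=`) and 2 (`-`), which is what the underline syntax distinguishes; it also does not treat a lone `---` as a horizontal rule.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Paragraph:
    inner_text: str
    heading_level: Optional[int] = None   # None until an underline is seen

UNDERLINE = re.compile(r"^(=+|-+)\s*$")

def read_paragraphs(lines):
    paragraphs = []
    previous_was_text = False
    for line in lines:
        if UNDERLINE.match(line) and previous_was_text:
            # Retroactively promote the previous paragraph to a heading.
            paragraphs[-1].heading_level = 1 if line[0] == "=" else 2
            previous_was_text = False
        elif line.strip():
            paragraphs.append(Paragraph(line.strip()))
            previous_was_text = True
        else:
            previous_was_text = False      # blank line
    return paragraphs

for p in read_paragraphs(["Title", "=====", "", "Section", "---", "Body text"]):
    print(p)
```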

Anyway, just some thoughts.

I'm sure there are many small details to take care of, and I'm pretty sure that regexes could come in handy during the process.
After all, they were meant to process text.
