Edit: I recently learned about a project called CommonMark, which
correctly identifies and deals with the ambiguities in the original
Markdown specification. http://commonmark.org/ It has great C# library
support.
You can find the syntax here.
The source that comes with the download is written in Perl, which I have no intention of honoring. It is riddled with regular expressions, and it relies on MD5 hashes to escape certain characters. Something is just wrong about that!
I'm about to hand-code a parser for Markdown. What is your experience with this?
If you don't have anything meaningful to say about the actual parsing of Markdown, spare me the time. (This might sound harsh, but yes, I'm looking for insight, not a ready-made solution such as a third-party library.)
To help a bit with the answers: regular expressions are meant to identify patterns, NOT to parse an entire grammar. That people consider doing so is FUBAR.
- If you think about Markdown, it's fundamentally based around the concept of paragraphs.
- As such, a reasonable approach might be to split the input into paragraphs.
- There are many kinds of paragraphs, for example, heading, text, list, blockquote, and code.
- The challenge is thus to identify these paragraphs and the context in which they occur.
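The paragraph-first idea in the list above could be sketched roughly as follows. This is only a minimal illustration, not a complete Markdown block parser: the function names `split_blocks` and `classify` are made up for this example, and the classification rules are deliberately crude. Regular expressions are used only to recognize how a line starts, not to parse the whole grammar.

```python
import re

def split_blocks(text):
    """Split raw Markdown into blocks separated by one or more blank lines."""
    parts = re.split(r"\n(?:[ \t]*\n)+", text.strip("\n"))
    return [part for part in parts if part.strip()]

def classify(block):
    """Return a rough block kind based on how the block's lines begin."""
    lines = block.split("\n")
    first = lines[0]
    if re.match(r"#{1,6}\s", first):
        return "heading"
    if len(lines) == 2 and re.fullmatch(r"=+|-+", lines[1].strip()):
        return "heading"  # underlined heading
    if all(line.startswith(("    ", "\t")) for line in lines):
        return "code"
    if first.startswith(">"):
        return "blockquote"
    if re.match(r"([*+-]|\d+\.)\s", first):
        return "list"
    return "text"

doc = "# Title\n\nSome text\nmore text\n\n- one\n- two\n\n> quote\n\n    code here\n\nHeading\n======="
kinds = [classify(block) for block in split_blocks(doc)]
```

Context (for example a list nested inside a blockquote) is not handled here; a real parser would have to re-run the same classification on the content of container blocks.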
I'll be back with a solution once I find it worthy of being shared.
The only Markdown implementation I know of that uses an actual parser is John MacFarlane's peg-markdown. Its parser is based on a Parsing Expression Grammar parser generator called peg.
EDIT: Mauricio Fernandez recently released his Simple Markup Markdown parser, which he wrote as part of his OcsiBlog weblog engine. Because the parser is written in OCaml, it is extremely simple and short (268 SLOC for the parser, 43 SLOC for the HTML emitter), yet blazingly fast (20% faster than discount, which is written in hand-optimized C, and six hundred times faster than BlueCloth, which is written in Ruby), despite the fact that it isn't even optimized for performance yet. Because it is intended only for Mauricio's own use on his weblog, there are a few deviations from the official Markdown specification, but Mauricio has created a branch which reverts most of those changes.
I released a new parser-based Markdown Java implementation last week, called pegdown.
pegdown uses a PEG parser to first build an abstract syntax tree, which is subsequently written out to HTML. As such it is quite clean and much easier to read, maintain and extend than a regex-based approach.
The PEG grammar is based on John MacFarlane's C implementation "peg-markdown".
Maybe something of interest to you...
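To give a feel for the "build a tree first, then emit HTML" approach, here is a small hand-written sketch in Python. It is not pegdown's grammar or API. It only handles `*emphasis*` and `**strong**` in a single line, using PEG-style ordered choice (try strong, then emphasis, then fall back to plain text). All names (`Node`, `parse_inline`, `to_html`) are invented for this example.

```python
from dataclasses import dataclass, field
from html import escape

@dataclass
class Node:
    kind: str                                  # "text", "em", or "strong"
    text: str = ""
    children: list = field(default_factory=list)

def parse_delimited(s, pos, delim, kind):
    """Try `delim content+ delim` at pos; return (Node, new_pos) or None."""
    if not s.startswith(delim, pos):
        return None
    children, i = [], pos + len(delim)
    while i < len(s) and not s.startswith(delim, i):
        node, i = parse_one(s, i)
        children.append(node)
    if children and s.startswith(delim, i):
        return Node(kind, children=children), i + len(delim)
    return None  # no closing delimiter: the caller falls back to text

def parse_one(s, pos):
    """Ordered choice, PEG style: strong / em / plain text."""
    for delim, kind in (("**", "strong"), ("*", "em")):
        result = parse_delimited(s, pos, delim, kind)
        if result:
            return result
    if s[pos] == "*":                          # unmatched asterisk: literal
        return Node("text", "*"), pos + 1
    end = pos
    while end < len(s) and s[end] != "*":
        end += 1
    return Node("text", s[pos:end]), end

def parse_inline(s):
    """Parse a whole line into a list of nodes (the tiny syntax tree)."""
    nodes, pos = [], 0
    while pos < len(s):
        node, pos = parse_one(s, pos)
        nodes.append(node)
    return nodes

def to_html(nodes):
    """Walk the tree and emit HTML; text is escaped only at this stage."""
    out = []
    for n in nodes:
        if n.kind == "text":
            out.append(escape(n.text))
        else:
            out.append(f"<{n.kind}>{to_html(n.children)}</{n.kind}>")
    return "".join(out)

html = to_html(parse_inline("a *b* and **c *d* e** < 2 * 3"))
```

Because the output is generated from the tree rather than by rewriting the input string in place, a literal `*` or `<` never needs to be hidden from later passes.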
If I were to try to parse Markdown (and its extension Markdown Extra), I think I would use a state machine and parse it one character at a time, linking together internal structures that represent bits of text as I go along. Then, once everything is parsed, I would generate the output from all the objects strung together.
Basically, I'd build a mini-DOM-like tree as I read the input file.
To generate an output, I would just traverse the tree and output HTML or anything else (PS, LaTeX, RTF, ...).
Things that can increase complexity:
- The fact that you can mix HTML and Markdown, although the rule could be easy to implement: just ignore anything that's between two balanced tags and output it verbatim.
- URLs and notes can have their reference at the bottom of the text. A data structure for hyperlinks could simply record the link text and its reference key, to be resolved once the definition is read.
- Headers can be defined with an underline, which could force us to use a simple data structure for a generic paragraph and modify its properties as we read the file.
Anyway, just some thoughts.
I'm sure that there are many small details to take care of, and I'm pretty sure that regexes could come in handy during the process.
After all, they were meant to process text.
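A rough sketch of those two complications, under some simplifying assumptions: it works line by line rather than one character at a time, and the `Paragraph` and `Link` classes and the `parse` function are hypothetical names made up for illustration. A paragraph is promoted to a heading retroactively when an underline follows it, and link URLs stay empty until the reference definitions have been collected.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Paragraph:
    text: str
    level: int = 0               # 0 = plain paragraph, 1 or 2 = heading level

@dataclass
class Link:
    text: str
    key: str
    url: Optional[str] = None    # unknown until the definition line is read

LINK_DEF = re.compile(r"\[([^\]]+)\]:\s*(\S+)")        # [key]: url
LINK_USE = re.compile(r"\[([^\]]+)\]\[([^\]]+)\]")     # [text][key]

def parse(text):
    blocks, links, defs = [], [], {}
    current = None                          # paragraph being accumulated
    for line in text.split("\n"):
        stripped = line.strip()
        definition = LINK_DEF.fullmatch(stripped)
        if definition:                      # remember the URL for later
            defs[definition.group(1).lower()] = definition.group(2)
            current = None
        elif not stripped:                  # blank line closes the paragraph
            current = None
        elif current is not None and re.fullmatch(r"=+|-+", stripped):
            # Underline found: turn the previous paragraph into a heading.
            current.level = 1 if stripped[0] == "=" else 2
            current = None
        elif current is not None:
            current.text += " " + stripped
        else:
            current = Paragraph(stripped)
            blocks.append(current)
    # Resolve references only after the whole input has been read.
    for block in blocks:
        for link_text, key in LINK_USE.findall(block.text):
            links.append(Link(link_text, key, defs.get(key.lower())))
    return blocks, links

doc = "Title\n=====\n\nSee [the spec][cm] for details.\n\n[cm]: http://commonmark.org/"
blocks, links = parse(doc)
```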
I'd probably read the syntax specification enough times to know it, and get a feel for how to parse it.
Reading the existing parser code is of course brilliant, both to see what seems to be the main source of complexity, and to see whether any special clever tricks are being used. The use of MD5 checksumming seems a bit weird, but I haven't studied the code enough to understand why it's being done. A comment in a routine called _EscapeSpecialChars() explains the intent. Replacing a single character by a full MD5 does seem extravagant, but perhaps it really makes sense.
Of course, it would be clever to consider writing a "true" grammar for a tool such as Flex, to get out of the regex bog.
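For what it's worth, the general idea behind hash-based escaping can be sketched like this. This is my own illustration of the technique in Python, not the actual Perl code: escaped characters are swapped for tokens that no later regex pass will mistake for markup, and the literal characters are put back at the very end. The function names and the single `emphasize` pass are assumptions made for the example.

```python
import hashlib
import re

SPECIAL = "\\`*_{}[]()#+-.!"
# Map each special character to a token unlikely to occur in real text.
TOKENS = {c: hashlib.md5(c.encode()).hexdigest() for c in SPECIAL}

def protect(text):
    """Swap backslash-escaped special characters for their hash tokens."""
    return re.sub(r"\\(.)", lambda m: TOKENS.get(m.group(1), m.group(0)), text)

def restore(text):
    """Put the literal characters back once all rewriting is done."""
    for char, token in TOKENS.items():
        text = text.replace(token, char)
    return text

def emphasize(text):
    """A regex pass that would otherwise misread escaped asterisks."""
    return re.sub(r"\*(.+?)\*", r"<em>\1</em>", text)

def render(text):
    return restore(emphasize(protect(text)))

result = render(r"2 \* 3 \* 4 is *not* emphasis")
```

The trick is only needed because every pass rewrites the same string; a parser that builds a tree can simply store the escaped character as literal text.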
If Perl isn't your thing, there are Markdown implementations in at least 10 other languages. They probably don't all have 100% compatibility, but tend to be pretty close.
MarkdownPapers is another Java implementation whose parser is defined in a JavaCC grammar.
If you are using a programming language that has more than three other
users, you should be able to find a library to parse it for you. A
quick Google search reveals libraries for CL, Haskell, Python,
JavaScript, Ruby, and so on. It is highly unlikely that you will need
to reinvent this wheel.
If you really have to write it from scratch, I recommend writing a
proper parser. With this technique, you won't have to escape things
with MD5 hashes. (I agree that if you have to do something like this,
it's time to reconsider your design.)
There are libraries available in a number of languages, including PHP, Ruby, Java, C#, and JavaScript. I'd suggest looking at some of these for ideas.
The best way to implement it depends on which language you wish to use; there will be idiomatic and non-idiomatic ways to do it.
Regexes work in Perl because Perl and regex are best friends.
Markdown is a JAWL (just another wiki language).
There are plenty of open source wikis out there whose parser code you can examine. Most use regexes.
Check out the ScrewTurn wiki; it has an interesting multi-pass formatter pipeline, a very nice technique. See /core/Formatter.cs and /core/FormatterPipeline.cs.
Best is to use or join an existing project; these sorts of things are always much harder than they appear.
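As a rough idea of what a multi-pass formatter pipeline looks like (a generic sketch in Python, not ScrewTurn's actual code; the pass names are invented), each pass is a small function from text to text, and the pipeline runs them in a fixed order:

```python
import re
from html import escape

def escape_html(text):
    """First pass: neutralize raw HTML characters."""
    return escape(text, quote=False)

def headings(text):
    """Turn lines starting with 1 to 6 '#' characters into <h1>..<h6>."""
    def repl(match):
        level = len(match.group(1))
        return f"<h{level}>{match.group(2)}</h{level}>"
    return re.sub(r"^(#{1,6})[ \t]+(.*)$", repl, text, flags=re.M)

def bold(text):
    return re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", text)

def italic(text):
    return re.sub(r"\*(.+?)\*", r"<em>\1</em>", text)

# Order matters: bold must run before italic, escaping before everything.
PIPELINE = [escape_html, headings, bold, italic]

def run(text, passes=PIPELINE):
    for formatter in passes:
        text = formatter(text)
    return text

output = run("# A & B\n**bold** and *it*")
```

The appeal is that each pass stays simple and can be tested alone; the cost is that passes can interfere with each other, which is where ordering rules and escaping tricks come in.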
Here you can find a JavaScript implementation of Markdown. It also relies heavily on regular expressions, as this is simply the fastest and easiest way to process the text.
But it spares the MD5 part.
I cannot help directly with the coding of the parsing, but maybe this link can help you one way or another.