@acmedinotech/docproc
docproc
An extensible document processor, suitable for human-friendly markup. Take it for a drive with your Markdown document of choice:
docproc path/to/your/file
Architecture Overview
First, let's talk document structure. Human-readable docs are linear, and they're typically organized in groups (blocks). The blocks themselves contain inline data or sub-blocks.
## html blocks at different levels
<html>
<div><b>bold</b></div>
</html>
## markdown
> blockquote **bold**
normal paragraph
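To make the block/inline nesting concrete, here is a minimal TypeScript sketch of one way to model it. The names `BlockNode`, `InlineNode`, and `countNodes` are made up for illustration; they are not docproc's actual classes.

```typescript
// Hypothetical types (not docproc's API): inline data is text or a
// formatting wrapper; a block holds inline data or sub-blocks.
type InlineNode =
  | { node: "text"; value: string }
  | { node: "inline"; tag: string; children: InlineNode[] };

type BlockNode = {
  node: "block";
  tag: string;
  children: Array<BlockNode | InlineNode>;
};

// The Markdown sample above: a blockquote with bold text, then a paragraph.
const doc: BlockNode[] = [
  {
    node: "block",
    tag: "blockquote",
    children: [
      { node: "text", value: "blockquote " },
      { node: "inline", tag: "b", children: [{ node: "text", value: "bold" }] },
    ],
  },
  {
    node: "block",
    tag: "p",
    children: [{ node: "text", value: "normal paragraph" }],
  },
];

// Walk the tree to show that blocks and inline nodes nest uniformly.
function countNodes(n: BlockNode | InlineNode): number {
  if (n.node === "text") return 1;
  let total = 1;
  for (const child of n.children) total += countNodes(child);
  return total;
}

let nodeTotal = 0;
for (const block of doc) nodeTotal += countNodes(block);
```

The document stays a flat, linear list of blocks, while each block carries its own tree of inline content.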
The basic approach of all solid document processors is a lexer-parser pattern: break the doc down into its smallest parts, then sequentially put them back together (in our case, as blocks with inline text).
docproc isn't any different there. What docproc aims to do is provide a pattern for configuring lexeme detection and block/inline handling. Once you get a sense for how these pieces fit together, writing your own processor should be easy.
docproc makes no assumptions about what you're trying to process, but it does come with a Markdown (CommonMark) plugin and a DinoMark plugin, which enhances CommonMark with more dynamic processing capabilities.
How it Works (High Level)
Let's use the following snippet of Markdown as our reference:
> **blockquote**

paragraph _**bold italic**_
To start, we need to specify the following lexemes:
- `>`
- (space)
- `**`
- `_`
- `\n`
Anything that isn't explicitly identified is grouped together and emitted as its own lexeme.
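The lexing step described above can be sketched in a few lines of TypeScript. This is an illustration only: `tokenize` and `LEXEMES` are hypothetical names, and docproc's own lexer configuration may look quite different.

```typescript
// Hypothetical helper (not docproc's API): split input on a fixed lexeme set.
const LEXEMES = [">", " ", "**", "_", "\n"];

function tokenize(input: string, lexemes: string[] = LEXEMES): string[] {
  // Try longer lexemes first so a longer lexeme wins over a shorter prefix.
  const ordered = [...lexemes].sort((a, b) => b.length - a.length);
  const tokens: string[] = [];
  let pending = ""; // run of characters that match no lexeme
  let i = 0;
  while (i < input.length) {
    const hit = ordered.find((lex) => input.startsWith(lex, i));
    if (hit === undefined) {
      // Not a known lexeme: keep grouping characters together.
      pending += input[i];
      i += 1;
      continue;
    }
    if (pending !== "") {
      tokens.push(pending); // emit the grouped run as its own lexeme
      pending = "";
    }
    tokens.push(hit);
    i += hit.length;
  }
  if (pending !== "") tokens.push(pending);
  return tokens;
}

const tokens = tokenize("> **blockquote**\n\nparagraph _**bold italic**_");
// tokens: ">", " ", "**", "blockquote", "**", "\n", "\n",
//         "paragraph", " ", "_", "**", "bold", " ", "italic", "**", "_"
```

Words such as `blockquote` and `paragraph` are never declared as lexemes; they fall out of the "anything else" grouping.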
We'll also need to build two block handlers:
- `blockquoteHandler` will only accept lines beginning with `>`. If there are 2 consecutive newlines, the blockquote handler is done.
- `paragraphHandler` accepts anything. Like blockquote, it also terminates after 2 consecutive newlines.
Each instance of a block has its own handler instance.
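As a rough sketch of what such handlers might look like, the TypeScript below pairs a start check with a factory, so that every block gets a fresh handler instance. The `BlockHandler` interface and `defineBlock` helper are assumptions made for this example, not docproc's real handler API, and the blockquote check is simplified to look only at the first token instead of every line.

```typescript
// Hypothetical interface (not docproc's API).
interface BlockHandler {
  readonly handlerName: string;
  readonly tokens: string[];
  push(token: string): void; // consume one token
  isDone(): boolean; // true once 2 consecutive newlines were seen
}

type BlockHandlerFactory = {
  canStart(token: string): boolean;
  create(): BlockHandler;
};

function defineBlock(
  handlerName: string,
  canStart: (token: string) => boolean
): BlockHandlerFactory {
  return {
    canStart,
    // Each call returns a new instance with its own state.
    create(): BlockHandler {
      const tokens: string[] = [];
      let newlines = 0;
      return {
        handlerName,
        tokens,
        push(token: string): void {
          tokens.push(token);
          newlines = token === "\n" ? newlines + 1 : 0;
        },
        isDone: () => newlines >= 2,
      };
    },
  };
}

// Simplification: only the first token of the block is checked for ">".
const blockquoteHandler = defineBlock("blockquoteHandler", (t) => t === ">");
const paragraphHandler = defineBlock("paragraphHandler", () => true);

const quote = blockquoteHandler.create();
for (const t of [">", " ", "**", "blockquote", "**", "\n", "\n"]) quote.push(t);
const anotherQuote = blockquoteHandler.create(); // independent state
```

After the loop, `quote.isDone()` is true while `anotherQuote` is still empty, which is the point of giving each block its own instance.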
Finally, we'll need to build two inline handlers:
- `boldHandler` starts and stops on `**` and allows embedded formatting
- `italicHandler` starts and stops on `_` and allows embedded formatting
Follow the Tokens
Let's trace how each token changes the state of the parser, starting at the block level:
- `>`
  - `blockquoteHandler` can accept and is set as current handler
- (space), `**`, `blockquote`, `**`
  - all accepted by `blockquoteHandler`
- `\n`, `\n`
  - blockquote done, no longer current handler
- `paragraph`
  - `paragraphHandler` can accept and is set as current handler
- `_`, `**`, `bold`, (space), `italic`, `**`, `_`
  - all accepted by `paragraphHandler`
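The block-level trace above can be reproduced with a small dispatch loop. The sketch below is one possible reading of it in TypeScript. `splitIntoBlocks`, `BlockDef`, and `ParsedBlock` are hypothetical names, and skipping stray newlines between blocks is an assumption of this example.

```typescript
// Hypothetical definitions (not docproc's API).
type BlockDef = { handlerName: string; canStart: (token: string) => boolean };
type ParsedBlock = { handlerName: string; tokens: string[] };

// Order matters: the catch-all paragraph handler goes last.
const blockDefs: BlockDef[] = [
  { handlerName: "blockquoteHandler", canStart: (t) => t === ">" },
  { handlerName: "paragraphHandler", canStart: () => true },
];

function splitIntoBlocks(tokens: string[]): ParsedBlock[] {
  const blocks: ParsedBlock[] = [];
  let current: ParsedBlock | null = null;
  let newlines = 0;
  for (const token of tokens) {
    if (current === null) {
      if (token === "\n") continue; // assumption: ignore newlines between blocks
      const def = blockDefs.find((d) => d.canStart(token));
      if (def === undefined) continue; // no handler wants this token
      // A handler can accept: it becomes the current handler.
      current = { handlerName: def.handlerName, tokens: [] };
      blocks.push(current);
      newlines = 0;
    }
    current.tokens.push(token);
    newlines = token === "\n" ? newlines + 1 : 0;
    if (newlines === 2) current = null; // block done, no longer current handler
  }
  return blocks;
}

const blocks = splitIntoBlocks([
  ">", " ", "**", "blockquote", "**", "\n", "\n",
  "paragraph", " ", "_", "**", "bold", " ", "italic", "**", "_",
]);
// blocks[0] belongs to blockquoteHandler, blocks[1] to paragraphHandler
```

At this level the inline delimiters are just passed along; the block handler only decides which block each token belongs to.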
Pretty simple so far. Now let's look within the block and see what happens with the inline tokens. I'll use the paragraph handler:
- `_`
  - matches an inline handler. It'll take all tokens until another `_`, but since it allows embedding other formatting, it'll first defer the tokens to specific handlers if they exist
  - stack: `[italicHandler]`
- `**`
  - matches an inline handler, which nests and defers
  - stack: `[italicHandler, boldHandler]`
- `bold`, (space), `italic`
  - goes into `boldHandler`
- `**`
  - `boldHandler` is popped
  - stack: `[italicHandler]`
- `_`
  - `italicHandler` is popped
  - stack: `[]`
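The push/pop behaviour in this trace maps naturally onto an explicit stack. The TypeScript sketch below records the stack after every push and pop so it can be compared with the list above. All names (`renderInline`, `StackFrame`, `inlineDefs`) and the choice of `<strong>`/`<em>` tags are assumptions of this example, and it does no HTML escaping.

```typescript
// Hypothetical inline handler table: delimiter -> handler name and HTML tag.
const inlineDefs = new Map<string, { handlerName: string; tag: string }>([
  ["**", { handlerName: "boldHandler", tag: "strong" }],
  ["_", { handlerName: "italicHandler", tag: "em" }],
]);

type StackFrame = { delimiter: string; handlerName: string; tag: string; parts: string[] };

function renderInline(tokens: string[], trace: string[] = []): string {
  // The bottom frame collects finished output; handler frames sit on top of it.
  const root: StackFrame = { delimiter: "", handlerName: "root", tag: "", parts: [] };
  const stack: StackFrame[] = [root];
  const top = (): StackFrame => stack[stack.length - 1] as StackFrame;
  const snapshot = (): void => {
    trace.push("[" + stack.slice(1).map((f) => f.handlerName).join(", ") + "]");
  };

  for (const token of tokens) {
    const def = inlineDefs.get(token);
    if (def !== undefined && top().delimiter === token) {
      // Closing delimiter: pop the handler and hand its HTML to the parent.
      const done = stack.pop() as StackFrame;
      top().parts.push(`<${done.tag}>${done.parts.join("")}</${done.tag}>`);
      snapshot();
    } else if (def !== undefined) {
      // Opening delimiter: nest a new handler on the stack.
      stack.push({ delimiter: token, ...def, parts: [] });
      snapshot();
    } else {
      top().parts.push(token); // plain token goes to the innermost handler
    }
  }
  // Assumption: unclosed delimiters are emitted back as literal text.
  while (stack.length > 1) {
    const open = stack.pop() as StackFrame;
    top().parts.push(open.delimiter + open.parts.join(""));
  }
  return root.parts.join("");
}

const stackTrace: string[] = [];
const inlineHtml = renderInline(["_", "**", "bold", " ", "italic", "**", "_"], stackTrace);
// inlineHtml: "<em><strong>bold italic</strong></em>"
// stackTrace: "[italicHandler]", "[italicHandler, boldHandler]", "[italicHandler]", "[]"
```

Because each handler only ever talks to the frame directly below it, nesting depth needs no special handling.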
When you turn the document into a string, you get all the pieces back, assembled from fragments of HTML returned from the different handlers.
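A minimal way to picture that final step, assuming each handler can return its own HTML fragment: the document simply concatenates what its blocks return. The `HtmlFragment` interface and the helper functions below are invented for this sketch and say nothing about how docproc names or structures them.

```typescript
// Hypothetical shape (not docproc's API): anything that can render itself.
interface HtmlFragment {
  toHtml(): string;
}

// Leaf fragment: plain text (no HTML escaping in this sketch).
const textFragment = (value: string): HtmlFragment => ({ toHtml: () => value });

// Wrapper fragment: renders its children, then wraps them in a tag.
const wrapFragment = (tag: string, children: HtmlFragment[]): HtmlFragment => ({
  toHtml: () => `<${tag}>${children.map((c) => c.toHtml()).join("")}</${tag}>`,
});

// The document is the ordered list of block fragments, one per line.
const documentToString = (blockFragments: HtmlFragment[]): string =>
  blockFragments.map((b) => b.toHtml()).join("\n");

const html = documentToString([
  wrapFragment("blockquote", [wrapFragment("strong", [textFragment("blockquote")])]),
  wrapFragment("p", [
    textFragment("paragraph "),
    wrapFragment("em", [wrapFragment("strong", [textFragment("bold italic")])]),
  ]),
]);
// html:
// <blockquote><strong>blockquote</strong></blockquote>
// <p>paragraph <em><strong>bold italic</strong></em></p>
```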
That's basically it! You can see it all put together in readme.example.ts
Take a deeper dive: