@acmedinotech/docproc documentation


docproc

An extensible document processor, suitable for human-friendly markup. Take it for a drive with your Markdown document of choice:

docproc path/to/your/file

Architecture Overview

First, let's talk document structure. Human-readable docs are linear, and they're typically organized in groups (blocks). The blocks themselves contain inline data or sub-blocks.

## html blocks at different levels

<html>
    <div><b>bold</b></div>
</html>

## markdown

> blockquote **bold**

normal paragraph

The basic approach of all solid document processors is a lexer-parser pattern: break the doc down into its smallest parts, then sequentially put them back together (in our case, as blocks with inline text).

docproc isn't any different there. What docproc aims to do is create a pattern for configuring lexeme detection and block/inline handling. Once you get a sense of how these pieces fit together, writing your own processor should be easy.

High Level Architecture

docproc makes no assumption about what you're trying to process, but it does come with a Markdown (CommonMark) plugin and DinoMark plugin, which enhances CommonMark with more dynamic processing capabilities.

How it Works (High Level)

Let's use the following snippet of Markdown as our reference:

> **blockquote**

paragraph _**bold italic**_

To start, we need to specify the following lexemes:

  • >
  • (space)
  • **
  • _
  • \n

Any run of characters that isn't explicitly identified is grouped together and emitted as its own lexeme.
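To make the lexeme step concrete, here is a minimal sketch of a tokenizer that follows the rule above. The names `tokenize` and `LEXEMES` are made up for illustration and are not docproc's actual API; the longest-first matching is also an assumption on my part.

```typescript
// Hypothetical sketch; `tokenize` and `LEXEMES` are illustrative names, not docproc's API.
const LEXEMES: string[] = [">", " ", "**", "_", "\n"];

function tokenize(input: string, lexemes: string[] = LEXEMES): string[] {
  // Try longer lexemes first so a multi-character lexeme is never split.
  const ordered = lexemes.slice().sort((a, b) => b.length - a.length);
  const tokens: string[] = [];
  let pending = ""; // characters that match no known lexeme
  let i = 0;
  while (i < input.length) {
    let hit = "";
    for (const lex of ordered) {
      if (input.slice(i, i + lex.length) === lex) {
        hit = lex;
        break;
      }
    }
    if (hit === "") {
      pending += input[i];
      i += 1;
      continue;
    }
    if (pending !== "") {
      tokens.push(pending); // flush the unidentified run as one lexeme
      pending = "";
    }
    tokens.push(hit);
    i += hit.length;
  }
  if (pending !== "") tokens.push(pending);
  return tokens;
}

const referenceTokens = tokenize("> **blockquote**\n\nparagraph _**bold italic**_");
console.log(referenceTokens);
```

Run against the reference snippet, the words `blockquote`, `paragraph`, `bold` and `italic` come out as single tokens because none of their characters match a configured lexeme.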

We'll also need to build two block handlers:

  1. blockquoteHandler will only accept lines beginning with >. If there are 2 consecutive newlines, the blockquote handler is done.
  2. paragraphHandler accepts anything. Like blockquote, it also terminates after 2 consecutive newlines.

Each instance of a block has its own handler instance.

Finally, we'll need to build two inline handlers:

  1. boldHandler starts and stops on ** and allows embedded formatting
  2. italicHandler starts and stops on _ and allows embedded formatting

Follow the Tokens

Let's trace how each token changes the state of the parser, starting at the block level:

  • >
    • blockquoteHandler can accept and is set as current handler
  • (space), **, blockquote, **
    • all accepted by blockquoteHandler
  • \n, \n
    • blockquote done, no longer current handler
  • paragraph
    • paragraphHandler can accept and is set as current handler
  • _, **, bold, (space), italic, **, _
    • all accepted by paragraphHandler
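The block-level trace above can be sketched roughly like this. Everything here (`BlockHandler`, `makeBlock`, `blockFactories`, `splitBlocks`) is a hypothetical name for illustration, not docproc's real interface. It also simplifies one thing: a handler is chosen from the first token of a block only, rather than checking that every line of a blockquote begins with >.

```typescript
// Hypothetical sketch; none of these names come from docproc itself.
interface BlockHandler {
  readonly kind: string;
  readonly tokens: string[];
  done: boolean;
  push(token: string): void;
}

// Both handlers share one rule: two consecutive newlines end the block.
function makeBlock(kind: string): BlockHandler {
  let newlines = 0;
  const block: BlockHandler = {
    kind,
    tokens: [],
    done: false,
    push(token: string): void {
      newlines = token === "\n" ? newlines + 1 : 0;
      if (newlines === 2) {
        block.tokens.pop(); // drop the first newline of the terminator
        block.done = true;
        return;
      }
      block.tokens.push(token);
    },
  };
  return block;
}

// Order matters: the catch-all paragraph handler goes last.
const blockFactories = [
  { canStart: (t: string) => t === ">", create: () => makeBlock("blockquote") },
  { canStart: (_t: string) => true, create: () => makeBlock("paragraph") },
];

function splitBlocks(tokens: string[]): BlockHandler[] {
  const blocks: BlockHandler[] = [];
  let current: BlockHandler | undefined;
  for (const token of tokens) {
    if (current === undefined || current.done) {
      if (token === "\n") continue; // skip blank lines between blocks
      let next = makeBlock("paragraph"); // fallback, replaced by first match
      for (const factory of blockFactories) {
        if (factory.canStart(token)) {
          next = factory.create(); // each block gets its own handler instance
          break;
        }
      }
      current = next;
      blocks.push(current);
    }
    current.push(token);
  }
  return blocks;
}

const tracedBlocks = splitBlocks([
  ">", " ", "**", "blockquote", "**", "\n", "\n",
  "paragraph", " ", "_", "**", "bold", " ", "italic", "**", "_",
]);
for (const b of tracedBlocks) {
  console.log(b.kind, JSON.stringify(b.tokens.join("")));
}
```

Note that the paragraph block never sees two newlines in this input, so it is still open when the tokens run out; a fuller implementation would close any open handler at end of input.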

Pretty simple so far. Now let's look within the block and see what happens with the inline tokens. I'll use the paragraph handler:

  • _
    • matches an inline handler. It'll take all tokens until another _, but since it allows embedded formatting, it first defers tokens to more specific handlers if they exist
    • stack: [italicHandler]
  • **
    • matches an inline handler, which nests and defers
    • stack: [italicHandler, boldHandler]
  • bold, (space), italic
    • go into boldHandler
  • **
    • boldHandler is popped
    • stack: [italicHandler]
  • _
    • italicHandler is popped
    • stack: []
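Here is one way the inline stack above might look in code. This is a sketch under assumptions: `InlineFrame`, `INLINE_TAGS` and `renderInline` are invented names, the choice of `<strong>` and `<em>` as output tags is mine, and HTML escaping of text is left out to keep the example short.

```typescript
// Hypothetical sketch; names and output tags are assumptions, not docproc's API.
interface InlineFrame {
  marker: string;  // token that both opens and closes this handler
  tag: string;     // HTML tag the handler renders
  parts: string[]; // rendered children collected so far
}

const INLINE_TAGS: { [marker: string]: string } = { "**": "strong", "_": "em" };

function renderInline(tokens: string[]): string {
  const root: string[] = [];
  const stack: InlineFrame[] = [];
  // Output goes to the innermost open handler, or to the root if none is open.
  const sink = (): string[] =>
    stack.length > 0 ? stack[stack.length - 1].parts : root;

  for (const token of tokens) {
    const top = stack.length > 0 ? stack[stack.length - 1] : undefined;
    if (top !== undefined && token === top.marker) {
      stack.pop(); // closing marker: the handler is popped
      sink().push(`<${top.tag}>${top.parts.join("")}</${top.tag}>`);
    } else if (Object.prototype.hasOwnProperty.call(INLINE_TAGS, token)) {
      stack.push({ marker: token, tag: INLINE_TAGS[token], parts: [] }); // nest
    } else {
      sink().push(token); // plain text goes to the current handler
    }
  }

  // Markers that were never closed are emitted as literal text.
  let open = stack.pop();
  while (open !== undefined) {
    sink().push(open.marker + open.parts.join(""));
    open = stack.pop();
  }
  return root.join("");
}

const inlineHtml = renderInline(
  ["paragraph", " ", "_", "**", "bold", " ", "italic", "**", "_"],
);
console.log(inlineHtml);
```

Keeping the handlers on a stack means the closing `**` only has to look at the top frame, which is what makes the nesting order in the trace fall out naturally.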

When you turn the document into a string, you get all the pieces back, assembled from fragments of HTML returned from the different handlers.
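The "turn it into a string" step could be sketched as below, assuming each handler exposes a `toString()` that returns its HTML fragment. `Renderable`, `HtmlFragment` and `ProcessedDoc` are hypothetical names, and the output is simplified (for instance, CommonMark would wrap the blockquote's content in a paragraph).

```typescript
// Hypothetical sketch; class names and output shape are illustrative only.
interface Renderable {
  toString(): string;
}

// A handler's result: a tag wrapping child fragments or plain strings.
class HtmlFragment implements Renderable {
  constructor(private tag: string, private children: Renderable[]) {}

  toString(): string {
    const inner = this.children.map((child) => child.toString()).join("");
    return `<${this.tag}>${inner}</${this.tag}>`;
  }
}

// The document keeps blocks in source order and joins their fragments.
class ProcessedDoc implements Renderable {
  constructor(private blocks: Renderable[]) {}

  toString(): string {
    return this.blocks.map((block) => block.toString()).join("\n");
  }
}

const processed = new ProcessedDoc([
  new HtmlFragment("blockquote", [new HtmlFragment("strong", ["blockquote"])]),
  new HtmlFragment("p", [
    "paragraph ",
    new HtmlFragment("em", [new HtmlFragment("strong", ["bold italic"])]),
  ]),
]);
console.log(processed.toString());
```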

That's basically it! You can see it all put together in readme.example.ts.

Take a deeper dive:
