Suggestions for bi-directional text parsing
I'm looking at the feasibility of implementing a bi-directional text parsing framework that would allow formatted text to be processed using a combination of common paradigms such as Markdown, BBCode, DokuWiki, and so on. Practically speaking, this means that each implementation must be able to translate to and from a common format. That could be HTML, but more realistically it would be an intermediate (more easily parsable) format like XML or YAML.
This will probably use a tokenizer to break the document into its relevant components. Does this sound like the best approach, and can you foresee any significant roadblocks?
Lastly, is anyone aware of any existing implementations (or attempts)?
Note that this is focused on PHP, but other solutions are welcome.
Comments (1)
Have a look at the source of an HTML parser such as Nokogiri, Hpricot, or BeautifulSoup. They will give you some food for thought on constructing a structured text parser.
There's probably no need to translate to an intermediate format, since your tokenised object tree is all you need to build every output format.
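To make the object-tree idea concrete, here is a minimal sketch, written in Python for brevity (the same structure should port to PHP). Everything in it is an assumption made for illustration: the `Node` class, the `SYNTAX` delimiter table, and the `tokenize`, `parse`, and `render` helpers are hypothetical names, and it only handles bold and italic with no escaping. Each syntax supplies a set of delimiters, the parser builds one shared tree, and any syntax can be rendered from that tree.

```python
import re
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Node:
    kind: str  # "root", "bold", or "italic"
    children: List[Union["Node", str]] = field(default_factory=list)


# Hypothetical delimiter table: node kind -> (open, close) for each syntax.
SYNTAX = {
    "bbcode": {"bold": ("[b]", "[/b]"), "italic": ("[i]", "[/i]")},
    "markdown": {"bold": ("**", "**"), "italic": ("_", "_")},
    "html": {"bold": ("<strong>", "</strong>"), "italic": ("<em>", "</em>")},
}


def tokenize(text, syntax):
    """Split text into delimiter tokens and plain-text tokens."""
    delims = {d for pair in SYNTAX[syntax].values() for d in pair}
    # Longest delimiters first so a longer one is never shadowed by a shorter one.
    ordered = sorted(delims, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(d) for d in ordered) + ")"
    return [tok for tok in re.split(pattern, text) if tok]


def parse(text, syntax):
    """Build a syntax-independent tree from text written in `syntax`."""
    table = SYNTAX[syntax]
    root = Node("root")
    stack = [root]
    for tok in tokenize(text, syntax):
        top = stack[-1]
        # Check for a close first, so symmetric delimiters such as ** close
        # the node that is currently open instead of opening a new one.
        if top.kind != "root" and tok == table[top.kind][1]:
            stack.pop()
            continue
        kind = next((k for k, (opener, _) in table.items() if tok == opener), None)
        if kind is None:
            top.children.append(tok)  # plain text (or a stray close delimiter)
        else:
            node = Node(kind)
            top.children.append(node)
            stack.append(node)
    return root


def render(node, syntax):
    """Serialise the tree into any syntax listed in SYNTAX."""
    if isinstance(node, str):
        return node
    inner = "".join(render(child, syntax) for child in node.children)
    if node.kind == "root":
        return inner
    opener, closer = SYNTAX[syntax][node.kind]
    return opener + inner + closer
```

With this sketch, `render(parse("[b]bold [i]both[/i][/b] plain", "bbcode"), "markdown")` gives `**bold _both_** plain`, and parsing that Markdown string and rendering it as `"bbcode"` gives back the original input. Supporting another format then means adding one delimiter table rather than another pair of converters, although real Markdown or wiki syntax needs a far richer grammar than a delimiter table.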
If you have specific implementation questions, you should post them too.