Writing an HTML parser in Erlang
I am very new to Erlang and as part of my learning exercise, I would like to write an HTML parser in Erlang.
I want to extract certain values from a web page, perhaps using a pattern to describe what data I want to extract.
Can anybody offer me some high-level advice on how they would approach this problem in Erlang?
I think I need to turn the document into a stack of tokens, perhaps using a finite state machine to track where I am with regard to nesting and where I am within an element.
Comments (2)
I would suggest having a look at the one included in Mochiweb:
http://github.com/mochi/mochiweb/blob/master/src/mochiweb_html.erl
The parse/1 function is probably the entry point you're interested in.
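As a rough illustration of what you might do with a parsed document, here is a minimal sketch that pulls the text out of every element with a given tag name. It assumes the tree is made of {Tag, Attrs, Children} tuples with binaries for tag names and text, which is the shape I believe mochiweb_html:parse/1 returns; check that against the Mochiweb source before relying on it. The module name html_tree_demo and the function texts_of/2 are made up for this example, and the tree in demo/0 is written by hand so the sketch runs without Mochiweb installed.

```erlang
-module(html_tree_demo).
-export([texts_of/2, demo/0]).

%% Collect the text found directly inside every element named Tag.
%% Assumption: nodes are {Tag, Attrs, Children} and text nodes are binaries.
texts_of(Tag, {Tag, _Attrs, Children}) ->
    Direct = [C || C <- Children, is_binary(C)],
    Direct ++ lists:flatmap(fun(C) -> texts_of(Tag, C) end, Children);
texts_of(Tag, {_Other, _Attrs, Children}) when is_list(Children) ->
    lists:flatmap(fun(C) -> texts_of(Tag, C) end, Children);
texts_of(_Tag, _TextOrOther) ->
    %% Text nodes, comments and anything else contribute nothing here.
    [].

%% Hand-written tree standing in for the result of a real parse.
demo() ->
    Tree = {<<"html">>, [],
            [{<<"body">>, [],
              [{<<"p">>, [{<<"class">>, <<"intro">>}], [<<"Hello">>]},
               {<<"p">>, [], [<<"World">>]}]}]},
    texts_of(<<"p">>, Tree).
```

Calling html_tree_demo:demo() returns [<<"Hello">>, <<"World">>]. With Mochiweb on the code path you would pass the result of mochiweb_html:parse/1 instead of the hand-written tree.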
This is a big job if you plan to be complete about it. You are best off using the one that Roberto suggests, but if you are determined to write your own as a project to get familiar with Erlang, here are some suggestions...
You should first decide whether you are going to hand-code your parser or use leex and yecc to generate your parser from a grammar. Hand coding might be a better learning experience if you want to learn how to write idiomatic Erlang. Writing a parser is an excellent way to introduce yourself to Erlang; functional programming languages excel at implementing parsers.
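To give a feel for the hand-coded route, here is a minimal sketch of a tokenizer built on binary pattern matching. It is only a starting point: it knows nothing about attributes, comments, entities, or malformed markup, and the module name html_tok_demo and the {open, Name} / {close, Name} / {text, Text} token shapes are my own choices for illustration.

```erlang
-module(html_tok_demo).
-export([tokenize/1]).

%% Split a binary into {open, Name}, {close, Name} and {text, Text} tokens.
%% Deliberately ignores attributes, comments, entities and malformed input.
tokenize(Bin) when is_binary(Bin) ->
    lists:reverse(tokenize(Bin, [])).

tokenize(<<>>, Acc) ->
    Acc;
tokenize(<<"</", Rest/binary>>, Acc) ->
    {Name, Rest1} = take_until($>, Rest),
    tokenize(Rest1, [{close, Name} | Acc]);
tokenize(<<"<", Rest/binary>>, Acc) ->
    {Name, Rest1} = take_until($>, Rest),
    tokenize(Rest1, [{open, Name} | Acc]);
tokenize(Bin, Acc) ->
    %% Bin is non-empty and does not start with "<", so this makes progress.
    {Text, Rest} = take_text(Bin),
    tokenize(Rest, [{text, Text} | Acc]).

%% Everything before the first Char, and everything after it.
take_until(Char, Bin) ->
    case binary:split(Bin, <<Char>>) of
        [Before, After] -> {Before, After};
        [All] -> {All, <<>>}
    end.

%% Everything before the next "<", leaving the "<" in the remainder.
take_text(Bin) ->
    case binary:match(Bin, <<"<">>) of
        {Pos, _Len} -> split_binary(Bin, Pos);
        nomatch -> {Bin, <<>>}
    end.
```

For example, html_tok_demo:tokenize(<<"<p>Hi <b>there</b></p>">>) returns [{open,<<"p">>}, {text,<<"Hi ">>}, {open,<<"b">>}, {text,<<"there">>}, {close,<<"b">>}, {close,<<"p">>}]. Each function clause plays the role of one state transition, which is why a separate finite state machine is often unnecessary at this stage.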
Second, you should decide whether you want to generate a DOM-like structure or use a SAX-like callback model, which in Erlang is naturally expressed as a behaviour. If you choose the latter, you could simply implement the behaviour to create a DOM.
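A minimal sketch of the callback approach might look like the following. The module name html_events, the handle_event/2 callback, and the {open, Name} / {close, Name} / {text, Text} event shapes are all hypothetical, chosen only to show how a behaviour and a DOM-building handler fit together. To keep the example in a single compilable file, the example handler lives in the same module as the driver; normally it would be a separate module declaring -behaviour(html_events).

```erlang
-module(html_events).
-export([run/3, handle_event/2, demo/0]).

%% Any module used as a handler is expected to export handle_event/2.
-callback handle_event(Event :: term(), State :: term()) -> NewState :: term().

%% Feed each event to Handler:handle_event/2, threading the state through.
run(Handler, Events, State0) ->
    lists:foldl(fun(Event, State) -> Handler:handle_event(Event, State) end,
                State0, Events).

%% Example handler: builds {Name, Children} trees using a stack of
%% currently open elements. It assumes well-nested input and will crash
%% on a mismatched close tag.
handle_event({open, Name}, Stack) ->
    [{Name, []} | Stack];
handle_event({text, Text}, [{Name, Kids} | Stack]) ->
    [{Name, [Text | Kids]} | Stack];
handle_event({close, Name}, [{Name, Kids}, {Parent, PKids} | Stack]) ->
    [{Parent, [{Name, lists:reverse(Kids)} | PKids]} | Stack].

demo() ->
    Events = [{open, <<"p">>}, {text, <<"Hi ">>},
              {open, <<"b">>}, {text, <<"there">>}, {close, <<"b">>},
              {close, <<"p">>}],
    %% Start with a synthetic root element to collect the top-level node.
    [{root, [Tree]}] = run(?MODULE, Events, [{root, []}]),
    Tree.
```

html_events:demo() returns {<<"p">>, [<<"Hi ">>, {<<"b">>, [<<"there">>]}]}. Swapping in a different handler module, say one that only counts elements, would leave run/3 untouched, which is the main attraction of the callback style.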
If you look at behaviours, you may also want to look into parametrized modules. This is an experimental feature that can complement behaviours, allowing immutable state to be stored within an "instance of a module". It is not known whether this new feature will be supported by the community. (For some people it just looks too OO.)
Another excellent resource is the xmerl code. Pay close attention to how it determines the character encoding and parses accordingly. HTML (in its various standards) works slightly differently, but it's important that you take the proper character encoding into account when you read the file.
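As a small taste of the encoding problem, here is a sketch that normalises input to UTF-8 using only a byte order mark, via the standard unicode module. This is not how xmerl does it, and it is far from sufficient for real HTML, where you would also need to look at the HTTP Content-Type header and any meta charset declaration. The module name html_enc_demo and the choice to treat BOM-less input as UTF-8 are assumptions made for the example.

```erlang
-module(html_enc_demo).
-export([to_utf8/1]).

%% Convert raw bytes to a UTF-8 binary, guided only by a byte order mark.
%% Assumption: input without a BOM is treated as UTF-8.
to_utf8(Bytes) when is_binary(Bytes) ->
    {Encoding, BomLen} = unicode:bom_to_encoding(Bytes),
    %% Drop the BOM itself (BomLen is 0 when none was found).
    <<_Bom:BomLen/binary, Body/binary>> = Bytes,
    InEncoding = case BomLen of
                     0 -> utf8;
                     _ -> Encoding
                 end,
    case unicode:characters_to_binary(Body, InEncoding, utf8) of
        Utf8 when is_binary(Utf8) -> {ok, Utf8};
        {error, _Converted, _Bad} -> {error, invalid_encoding};
        {incomplete, _Converted, _Rest} -> {error, truncated_input}
    end.
```

For instance, html_enc_demo:to_utf8(<<16#FE, 16#FF, 0, $h, 0, $i>>) recognises the UTF-16 big-endian BOM and returns {ok, <<"hi">>}. Doing this once up front means the tokenizer only ever has to deal with one encoding.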
Also from xmerl, you can see how that library constructs a DOM using Erlang tuples. You might want to do something similar.
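To see that structure for yourself, something like the sketch below may help. It parses a small well-formed fragment with xmerl_scan:string/1 and converts the resulting records (which are just tagged tuples underneath) into plain {Name, Attrs, Children} tuples. The module name xmerl_dom_demo and the simplified tuple shape are my own; only the xmerl records and xmerl_scan:string/1 come from the library. Note that xmerl expects XML, so this only works on input that is well-formed.

```erlang
-module(xmerl_dom_demo).
-include_lib("xmerl/include/xmerl.hrl").
-export([simplify/1, demo/0]).

%% Convert xmerl records into plain {Name, Attrs, Children} tuples.
simplify(#xmlElement{name = Name, attributes = Attrs, content = Content}) ->
    {Name,
     [{K, V} || #xmlAttribute{name = K, value = V} <- Attrs],
     [simplify(C) || C <- Content, keep(C)]};
simplify(#xmlText{value = Text}) ->
    Text.

%% Only elements and text are kept; comments and PIs are dropped.
keep(#xmlElement{}) -> true;
keep(#xmlText{}) -> true;
keep(_) -> false.

demo() ->
    {Root, _Rest} = xmerl_scan:string("<p class=\"intro\">Hi <b>there</b></p>"),
    simplify(Root).
```

Running xmerl_dom_demo:demo() should give a nested tuple with the atom p at the top, its class attribute as a {Name, Value} pair, and the text and the b element as children. Printing Root before simplifying it shows how much extra bookkeeping (positions, namespaces, parents) xmerl carries in its records, which is worth knowing when you design your own node representation.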