Using natural language processing to parse websites
I'm generally interested in data mining by crawling websites, but I've never been able to find much documentation on the process I'd really like to implement. I'm very keen on the idea of writing a base set of rules that define how to parse a page, then training the tool when it makes mistakes.
Let's say I want to parse menus from restaurant websites. I'd like to create a tool that lets me write a set of rules describing roughly where the menu items and prices appear. Then I could run the tool and tell it which menu items it parsed correctly and which ones it got wrong. The tool would then "learn" from these corrections, and the next time I run it, I'd get better results.
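To make the workflow above concrete, here is a minimal sketch of the "rules first, corrections second" idea. Everything in it is an assumption for illustration: the `MENU_LINE` pattern, the function names `extract_menu_items` and `apply_corrections`, and the idea that the page has already been reduced to plain-text lines. A real site would likely need several rules, plus an HTML parser in front of them.

```python
import re

# Hypothetical rule: a menu line looks like "<item name> <separator> $<price>",
# e.g. "Margherita Pizza ..... $12.50". Real pages will need more rules.
MENU_LINE = re.compile(
    r"^\s*(?P<item>[A-Za-z][A-Za-z '&-]*?)\s*[.\-:]*\s*\$(?P<price>\d+(?:\.\d{2})?)\s*$"
)

def extract_menu_items(lines):
    """Return (item, price) pairs for every line that matches the rule."""
    results = []
    for line in lines:
        match = MENU_LINE.match(line)
        if match:
            results.append((match.group("item").strip(), float(match.group("price"))))
    return results

def apply_corrections(results, rejected):
    """Drop pairs the user marked as wrong and keep them as negative examples."""
    accepted = [pair for pair in results if pair not in rejected]
    negatives = [pair for pair in results if pair in rejected]
    return accepted, negatives

lines = [
    "Margherita Pizza ..... $12.50",
    "Soup - $5",
    "Call us at 555-1234",
    "Free parking $0",          # matches the rule, but is not a menu item
]
found = extract_menu_items(lines)
accepted, negatives = apply_corrections(found, {("Free parking", 0.0)})
```

The rejected pairs are the interesting part: they are the labeled mistakes that a learning component could later train on, instead of being thrown away.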
I've looked a bit at the NLTK toolkit, and it's got me wondering whether the best way to solve this problem is with an NLP tool like NLTK. Can anyone point me in the right direction for finding books and (ideally) libraries that can help me get started? Is NLP the way to go? Thanks!
Comments (1)
What exactly do you mean by "parsing a page"? Parsing the sentences in a page? Doing structured information extraction?
This is an active learning problem, which is pretty advanced stuff. You'll need a machine learning toolkit; which one depends on what exactly you want to do: build parse trees or extract salient information. NLTK has some stochastic parser support, I believe.
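For a rough feel of what active learning means here, the sketch below uses uncertainty sampling with a tiny hand-written logistic model: the tool asks the human about the line it is least sure of, then updates itself from the answer. This is only one common active learning strategy, not necessarily the one a given toolkit uses. The features, the learning rate, and all function names (`features`, `predict`, `update`, `most_uncertain`, `active_learning_loop`) are assumptions made up for illustration.

```python
import math
import re

def features(line):
    """Hypothetical hand-picked features for 'is this line a menu item?'."""
    return [
        1.0,                                               # bias term
        1.0 if re.search(r"\$\d", line) else 0.0,          # contains a dollar price
        1.0 if re.search(r"\d{3}-\d{4}", line) else 0.0,   # looks like a phone number
        min(len(line.split()), 10) / 10.0,                 # scaled word count
    ]

def predict(weights, line):
    """Logistic model: estimated probability that the line is a menu item."""
    z = sum(w * x for w, x in zip(weights, features(line)))
    return 1.0 / (1.0 + math.exp(-z))

def update(weights, line, label, lr=0.5):
    """One stochastic gradient step on the log loss for a corrected example."""
    error = label - predict(weights, line)
    return [w + lr * error * x for w, x in zip(weights, features(line))]

def most_uncertain(weights, unlabeled):
    """Uncertainty sampling: pick the line whose probability is nearest 0.5."""
    return min(unlabeled, key=lambda line: abs(predict(weights, line) - 0.5))

def active_learning_loop(unlabeled, oracle, rounds=3):
    """Each round, ask the oracle (the human) about the most uncertain line."""
    weights = [0.0, 0.0, 0.0, 0.0]
    pool = list(unlabeled)
    for _ in range(min(rounds, len(pool))):
        line = most_uncertain(weights, pool)
        pool.remove(line)
        weights = update(weights, line, oracle(line))
    return weights
```

In use, `oracle` would be the person reviewing results, for example a function that shows the line and returns 1 for "menu item" or 0 for "not a menu item". The point of the loop is that the human labels only the lines the model finds ambiguous, rather than every line on every page.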