Template extraction in Python/PHP
Are there existing template extraction libraries in either Python or PHP? Perl has Template::Extract, but I haven't been able to find a similar implementation in either Python or PHP.
The only thing close in Python that I could find is TemplateMaker (http://code.google.com/p/templatemaker/), but that's not really a template extraction library.
Comments (3)
After digging around some more, I found a solution to exactly what I was looking for. filippo posted a list of Python solutions for screen scraping in the post "Options for HTML scraping?", among which is a package called scrapemark (http://arshaw.com/scrapemark/).
Hope this helps anyone else who is looking for the same solution.
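For anyone unsure what template-driven extraction in the Template::Extract sense looks like, here is a minimal sketch using only Python's standard library. It is not scrapemark's API: the {{ name }} placeholder syntax and the function names template_to_regex and extract are made up for illustration, and the approach (compiling the template into a regular expression) is just one simple way to do it.

```python
import re


def template_to_regex(template):
    """Compile a template with {{ name }} placeholders into a regex.

    Hypothetical helper for illustration; each placeholder name may
    appear only once in the template.
    """
    # With one capture group, re.split alternates literal text (even
    # indexes) and placeholder names (odd indexes).
    parts = re.split(r"\{\{\s*(\w+)\s*\}\}", template)
    pattern = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pattern += re.escape(part)        # literal text, matched verbatim
        else:
            pattern += "(?P<%s>.*?)" % part   # non-greedy named capture
    return re.compile(pattern, re.DOTALL)


def extract(template, document):
    """Return a dict of placeholder values, or None if nothing matches."""
    match = template_to_regex(template).fullmatch(document)
    return match.groupdict() if match else None


template = "<h1>{{ title }}</h1><p>by {{ author }}</p>"
document = "<h1>Template extraction</h1><p>by Alice</p>"
result = extract(template, document)
print(result)
```

A regex built this way is brittle against markup changes, which is one reason to prefer a dedicated package for real pages.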
TemplateMaker does seem to do what you need, at least according to its documentation. Instead of receiving a template as input, it infers ("learns") it from a few documents. It then has an extract method to pull the data out of other documents that were created with the same template. The example shows:
So, to achieve the task you require, I think you should:
Come to think of it, it's even more useful than Perl's Template::Extract, as it doesn't expect you to provide it a clean template; it learns it on its own from sample text.
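To make the learn-then-extract idea concrete, here is a rough sketch using only Python's standard library. It is not TemplateMaker's code or API: the functions merge, learn and extract, and the use of difflib.SequenceMatcher to find the shared text, are assumptions chosen for illustration. Text common to all samples is kept as literal template text, and everything that varies becomes a "hole".

```python
import re
from difflib import SequenceMatcher

HOLE = "\0"  # marker for a variable region; assumed never to occur in samples


def merge(template, sample):
    """Keep the text shared by template and sample; turn the rest into holes."""
    matcher = SequenceMatcher(None, template, sample, autojunk=False)
    out = []
    pos_a = pos_b = 0
    for a, b, size in matcher.get_matching_blocks():
        # Unmatched text on either side before this block becomes one hole.
        if a > pos_a or b > pos_b:
            out.append(HOLE)
        out.append(template[a:a + size])
        pos_a, pos_b = a + size, b + size
    return "".join(out)


def learn(samples):
    """Infer a template from a list of sample documents."""
    template = samples[0]
    for sample in samples[1:]:
        template = merge(template, sample)
    return template


def extract(template, document):
    """Return the text filling each hole as a tuple, or None if no match."""
    pattern = "(.*?)".join(re.escape(lit) for lit in template.split(HOLE))
    match = re.fullmatch(pattern, document, re.DOTALL)
    return match.groups() if match else None


template = learn([
    "<b>this and that</b>",
    "<b>alex and sue</b>",
    "<b>fine and dandy</b>",
])
print(template.replace(HOLE, "!"))              # <b>! and !</b>
print(extract(template, "<b>red and blue</b>"))  # ('red', 'blue')
```

One known weakness of this sketch is that samples which happen to share characters inside the variable parts will leave stray literals in the learned template, so more (and more varied) samples give a cleaner result.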
Here is an interesting discussion by Adrian, the author of TemplateMaker: http://www.holovaty.com/writing/templatemaker/
It seems to be a lot like what I would call a wrapper induction library.
If you're looking for something else that is more configurable (less for scraping), take a look at lxml.html and BeautifulSoup, also for Python.