Template extraction in Python/PHP

Posted on 2024-08-19 14:35:15

Are there existing template extraction libraries for either Python or PHP? Perl has Template::Extract, but I haven't been able to find a similar implementation in either Python or PHP.

The closest thing I could find in Python is TemplateMaker (http://code.google.com/p/templatemaker/), but that's not really a template extraction library.

Comments (3)

兔小萌 2024-08-26 14:35:15

After digging around some more, I found a solution that does exactly what I was looking for. filippo posted a list of Python solutions for screen scraping in the post "Options for HTML scraping?", among which is a package called scrapemark (http://arshaw.com/scrapemark/).

Hope this helps anyone else who is looking for the same solution.

删除→记忆 2024-08-26 14:35:15

TemplateMaker does seem to do what you need, at least according to its documentation. Instead of receiving a template as input, it infers ("learns") it from a few documents. It then has an extract method that pulls the data out of other documents created with the same template.

The example shows:

# Now that we have a template, let's extract some data.
>>> t.extract('<b>red and green</b>')
('red', 'green')
>>> t.extract('<b>django and stephane</b>')
('django', 'stephane')

# The extract() method is very literal. It doesn't magically trim
# whitespace, nor does it have any knowledge of markup languages such as
# HTML.
>>> t.extract('<b>  spacy  and <u>underlined</u></b>')
('  spacy ', '<u>underlined</u>')

# The extract() method will raise the NoMatch exception if the data
# doesn't match the template. In this example, the data doesn't have the
# leading and trailing "<b>" tags.
>>> t.extract('this and that')
Traceback (most recent call last):
...

So, to achieve the task you require, I think you should:

  • Give it a few documents rendered from your template - it will have no trouble inferring the template from them.
  • Use the inferred template to extract data from new documents.
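
The two steps above can be sketched with the standard library alone. This is only an illustration of the learn-then-extract idea, not TemplateMaker's actual algorithm or API: the names `learn` and `extract`, the `HOLE` marker, and the `min_len` cutoff are all made up for this example, and it learns from just two samples using `difflib`.

```python
import re
from difflib import SequenceMatcher

HOLE = None  # hypothetical marker for a variable region in the learned template


def learn(a, b, min_len=2):
    """Infer a template from two samples: shared runs stay literal, the rest become holes."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    parts = []
    pos_a = pos_b = 0
    for i, j, n in matcher.get_matching_blocks():
        if n < min_len:
            continue  # ignore accidental short matches (and the final zero-length block)
        if i > pos_a or j > pos_b:
            parts.append(HOLE)  # the samples differ between two shared runs
        parts.append(a[i:i + n])
        pos_a, pos_b = i + n, j + n
    if pos_a < len(a) or pos_b < len(b):
        parts.append(HOLE)  # trailing text that differs
    return parts


def extract(parts, text):
    """Return the hole contents of `text`, or raise ValueError if it does not fit."""
    pattern = "".join("(.*?)" if p is HOLE else re.escape(p) for p in parts)
    match = re.fullmatch(pattern, text, re.DOTALL)
    if match is None:
        raise ValueError("text does not match the learned template")
    return match.groups()


template = learn("<b>this and that</b>", "<b>alex and sue</b>")
print(extract(template, "<b>red and green</b>"))  # ('red', 'green')
print(extract(template, "<b>  spacy  and <u>underlined</u></b>"))
```

Like the extract() method quoted above, this sketch is literal: it does not trim whitespace and knows nothing about HTML. Real pages usually need more than two samples and some tolerance for short coincidental matches, which is what `min_len` crudely stands in for here.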

Come to think of it, it's even more useful than Perl's Template::Extract, since it doesn't expect you to provide a clean template: it learns the template on its own from sample text.
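
For contrast, the explicit-template approach (where you do supply the template yourself) can be approximated in a few lines. This is a rough sketch, not a port of Template::Extract: it assumes simple `[% name %]` placeholders only, with no loops or nested structures, and the function names `compile_template` and `extract_fields` are invented for the example.

```python
import re

# Matches a simple placeholder such as [% title %]; directives and loops are not handled.
PLACEHOLDER = re.compile(r"\[%\s*([A-Za-z_]\w*)\s*%\]")


def compile_template(template):
    """Turn literal text plus [% name %] placeholders into a regex with named groups."""
    pieces = []
    pos = 0
    for m in PLACEHOLDER.finditer(template):
        pieces.append(re.escape(template[pos:m.start()]))  # literal text before the placeholder
        pieces.append("(?P<%s>.*?)" % m.group(1))           # the placeholder becomes a lazy group
        pos = m.end()
    pieces.append(re.escape(template[pos:]))                # literal text after the last placeholder
    return re.compile("".join(pieces), re.DOTALL)


def extract_fields(template, text):
    """Return a dict of placeholder values, or None if the text does not fit the template."""
    m = compile_template(template).fullmatch(text)
    return m.groupdict() if m else None


print(extract_fields("<b>[% first %] and [% second %]</b>", "<b>red and green</b>"))
# {'first': 'red', 'second': 'green'}
```

Each placeholder name may appear only once in a template here, because Python's `re` rejects duplicate group names.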

一桥轻雨一伞开 2024-08-26 14:35:15

Here is an interesting discussion from Adrian, the author of TemplateMaker: http://www.holovaty.com/writing/templatemaker/

It seems to be a lot like what I would call a wrapper induction library.

If you're looking for something more configurable (and less scraping-specific), take a look at lxml.html and BeautifulSoup, both also for Python.
