
Template extraction in Python/PHP

Posted 2024-08-19 14:35:15

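Is there an existing template extraction library for Python or PHP? Perl has Template::Extract, but I haven't been able to find a similar implementation in either Python or PHP.

The closest thing I can find in Python is TemplateMaker (http://code.google.com/p/templatemaker/), but that isn't really a template extraction library.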

兔小萌 2024-08-26 14:35:15

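After some more research, I found a solution that is exactly what I was looking for. filippo posted a list of Python solutions for screen scraping in this post: "Options for HTML scraping?" Among them is a package called scrapemark (http://arshaw.com/scrapemark/).

I hope this helps anyone else looking for the same solution.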


删除→记忆 2024-08-26 14:35:15

TemplateMaker does seem to do what you need, at least according to its documentation. Instead of receiving a template as input, it infers ("learns") it from a few documents. It then has an extract method for pulling the data out of other documents that were created with the same template.

The example shows:

# Now that we have a template, let's extract some data.
>>> t.extract('<b>red and green</b>')
('red', 'green')
>>> t.extract('<b>django and stephane</b>')
('django', 'stephane')

# The extract() method is very literal. It doesn't magically trim
# whitespace, nor does it have any knowledge of markup languages such as
# HTML.
>>> t.extract('<b>  spacy  and <u>underlined</u></b>')
('  spacy ', '<u>underlined</u>')

# The extract() method will raise the NoMatch exception if the data
# doesn't match the template. In this example, the data doesn't have the
# leading and trailing "<b>" tags.
>>> t.extract('this and that')
Traceback (most recent call last):
...

So, to achieve the task you require, I think you should:

1. Give it a few documents rendered from your template - it will have no trouble inferring the template from them.
2. Use the inferred template to extract data from new documents.

Come to think of it, it's even more useful than Perl's Template::Extract, as it doesn't expect you to provide a clean template - it learns the template on its own from sample text.
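
For anyone who wants to see the learn-then-extract idea in isolation, here is a minimal standard-library sketch. It is not TemplateMaker's implementation or API: the functions learn_template and extract and the min_literal parameter are hypothetical names made up for this illustration, and the diff-based inference via difflib is a stand-in technique. It also assumes exactly two sample documents that both start and end with shared literal text.

```python
import re
from difflib import SequenceMatcher


def learn_template(first, second, min_literal=2):
    """Infer the literal pieces that two sample documents share, in order.

    min_literal is a hypothetical knob for this sketch: shared runs shorter
    than this are treated as coincidences and become part of a hole.
    """
    matcher = SequenceMatcher(None, first, second, autojunk=False)
    return [
        first[block.a:block.a + block.size]
        for block in matcher.get_matching_blocks()
        if block.size >= min_literal
    ]


def extract(pieces, document):
    """Return the text filling each hole, or None if the document does not fit."""
    # Literal pieces are escaped; every gap between them becomes a lazy group.
    pattern = "(.*?)".join(re.escape(piece) for piece in pieces)
    match = re.fullmatch(pattern, document, flags=re.DOTALL)
    return match.groups() if match else None


# Step 1: infer the template from two rendered samples.
pieces = learn_template("<b>red and green</b>", "<b>django and stephane</b>")
print(pieces)

# Step 2: use the inferred template on new documents.
print(extract(pieces, "<b>this and that</b>"))
print(extract(pieces, "this and that"))
```

Like the extract() method quoted above, this sketch is purely literal: it does not trim whitespace and knows nothing about HTML. Unlike TemplateMaker, it returns None on a mismatch rather than raising an exception.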
