How can I automatically create a template from an HTML page?

Posted on 2024-11-19 05:04:08

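I have a use case where I need to programmatically render a piece of unformatted text in the format of a given web page, in Java. That is, the text should automatically be formatted like the web page, with its styles, paragraphs, bullet points and so on.
As I see it, I first have to analyze the unformatted text to find candidates for paragraphs, bullet points, headings, etc. I intend to use the Lucene analyzers/tokenizers for this task. Are there any alternatives?
The second problem is converting the formatted web page into some kind of template (for example, a Velocity template) with placeholders for the various entities such as titles, bullet points, etc.
Are there any text analysis/templating libraries in Java that could help me do this? Preferably open source.
Are there any other suggestions for doing this kind of task in a better way in Java?

Thanks for your help.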

冷默言语 2024-11-26 05:04:08

There are a lot of hard parts to what you're doing.

The user input

If you don't ask your users to provide any context, you're never going to be able to guess the structure of the text. At the very least, you should ask them to provide a title and a series of paragraphs in your GUI.

Ideally, you could ask them to follow a well-known markup language (Markdown, Textile, etc.) and use an open source parser to extract the structure.
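
To give an idea of what "extracting the structure" means, here is a minimal sketch that uses only the JDK. It is not a Markdown parser: it assumes a made-up subset where a line starting with "# " is a heading, a line starting with "- " is a bullet, and blank lines separate paragraphs. The names `StructureSketch`, `Block` and `parse` are hypothetical, and for real input you would rather rely on an existing open source parser.

```java
import java.util.ArrayList;
import java.util.List;

class StructureSketch {
    /** One structural element; kind is "heading", "bullet" or "paragraph". */
    static final class Block {
        final String kind;
        final String text;

        Block(String kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    /** Splits the input into blocks using the toy rules described above. */
    static List<Block> parse(String input) {
        List<Block> blocks = new ArrayList<>();
        StringBuilder paragraph = new StringBuilder();
        for (String raw : input.split("\n")) {
            String line = raw.trim();
            if (line.startsWith("# ")) {
                flush(paragraph, blocks);
                blocks.add(new Block("heading", line.substring(2).trim()));
            } else if (line.startsWith("- ")) {
                flush(paragraph, blocks);
                blocks.add(new Block("bullet", line.substring(2).trim()));
            } else if (line.isEmpty()) {
                // A blank line ends the current paragraph, if any.
                flush(paragraph, blocks);
            } else {
                // Consecutive text lines are joined into one paragraph.
                if (paragraph.length() > 0) {
                    paragraph.append(' ');
                }
                paragraph.append(line);
            }
        }
        flush(paragraph, blocks);
        return blocks;
    }

    private static void flush(StringBuilder paragraph, List<Block> blocks) {
        if (paragraph.length() > 0) {
            blocks.add(new Block("paragraph", paragraph.toString()));
            paragraph.setLength(0);
        }
    }

    public static void main(String[] args) {
        String input = "# Hello World\n\nFirst line\nsecond line.\n\n- one\n- two\n";
        for (Block b : parse(input)) {
            System.out.println(b.kind + ": " + b.text);
        }
    }
}
```

The point is that the user supplies the structure explicitly, so the program only has to read it and never has to guess it.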

The external page

If arbitrary pages can be used, the only thing you can rely on is the structural markup. So assuming you know the title of the page should be "Hello World", and there is an "h1" element somewhere in the page, you can perhaps assume that this is where the title should go.

But if the page is a soup of div tags, and only CSS distinguishes the rendering of the header from the bulk of the text, you're going to have to guess how the styling is done: that's plainly impossible if you don't know how the page is made.

I don't think Lucene would help with this. As far as I know, Lucene is made to build an index of the words used in a body of text; I don't think it can help you guess which part of the text is meant to be a title, a subtitle, etc.

Generating templates from an external page

Assuming you have "guessed" right, you could generate the content by:

copying and pasting the page
replacing the parts to change with tags of your template language of choice
storing the template somewhere the templating system can access it
configuring your template / view system (the viewResolver for Velocity) to use the right template for the right person

That would of course raise terrible legal questions, since your templates would incorporate work by the original website's author (most probably copyrighted material).
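
Leaving the legal side apart, the replacement step in the list above could be sketched like this. It assumes the title lives in the first "h1" element and swaps its content for a Velocity-style `$title` reference. The class and method names are hypothetical, and a regular expression is only a stand-in here: it breaks easily on real-world HTML, where a proper HTML parser such as jsoup would be a safer choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TemplateFromPageSketch {
    // Matches the first <h1 ...>...</h1>; DOTALL lets the heading span several lines.
    private static final Pattern H1 = Pattern.compile(
            "(<h1\\b[^>]*>)(.*?)(</h1>)",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Replaces the content of the first h1 element with the given placeholder. */
    static String toTemplate(String html, String placeholder) {
        Matcher m = H1.matcher(html);
        if (!m.find()) {
            // No h1 at all: nothing we can safely guess, so return the page unchanged.
            return html;
        }
        // Keep the tags (and their attributes) and swap only the inner content.
        return html.substring(0, m.start(2)) + placeholder + html.substring(m.end(2));
    }

    public static void main(String[] args) {
        String page = "<html><body><h1 class=\"big\">Hello World</h1><p>text</p></body></html>";
        System.out.println(toTemplate(page, "$title"));
    }
}
```

On a div-soup page there is no "h1" to find, so the method returns the page untouched, which is exactly the case where guessing becomes impossible.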

A more realistic solution

I would suggest you constrain your problem to:

using input that has some structural information available (use a GUI to enter it, use a markup language, whatever)
using templates that you provide and whose structure you know (and which you can reuse very easily)

Note that none of those points are related to the template system.
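
As a rough illustration of the constrained version, here is a sketch where both the input structure (a title and a list of paragraphs) and the template are under your control. The `$name` placeholder substitution is hand-rolled with the JDK purely to keep the example self-contained; it is a stand-in for whatever template engine you pick, not Velocity's API, and all names in it are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OwnTemplateSketch {
    // A template you wrote yourself, so you know exactly where each part goes.
    static final String TEMPLATE = "<article>\n<h1>$title</h1>\n$body</article>\n";

    private static final Pattern PLACEHOLDER = Pattern.compile("\\$(\\w+)");

    /** Escapes the characters that would otherwise be read as markup. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Replaces each $name in one pass; unknown names are left as they are. */
    static String fill(String template, Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.get(m.group(1));
            String replacement = value != null ? value : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    static String render(String title, List<String> paragraphs) {
        StringBuilder body = new StringBuilder();
        for (String p : paragraphs) {
            body.append("<p>").append(escape(p)).append("</p>\n");
        }
        Map<String, String> values = new HashMap<>();
        values.put("title", escape(title));
        values.put("body", body.toString());
        return fill(TEMPLATE, values);
    }

    public static void main(String[] args) {
        System.out.print(render("Hello <World>", Arrays.asList("First.", "Second.")));
    }
}
```

Because the structure comes in already separated into title and paragraphs, nothing has to be guessed, and the same input can be rendered with any other template you provide.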

Otherwise, I'm afraid you're heading for an unreasonable amount of work...
