How to automatically create a template from an HTML page?
I have a use case in which I need to programmatically render unformatted text in the format of a given web page, in Java; i.e. the text should automatically be formatted like the web page, with styles, paragraphs, bullet points, etc.
As I see it, I will first have to analyze the unformatted text to find candidates for paragraphs, bullet points, headings, etc. I intend to use Lucene analyzers/tokenizers for this task. Are there any alternatives?
The second problem is to convert the formatted web page into some kind of template (e.g. a Velocity template) with placeholders for various entities such as titles, bullet points, etc.
Is there any text analysis/templating library in Java that can help me do this? Preferably open source.
Are there any other suggestions for doing this sort of task in a better way in Java?
Thanks for your help.
Comments (1)
There are a lot of hard parts to what you're doing.
The user input
If you don't ask your user to provide any context, you're never going to guess the structure of the text. At the very least, you should ask them to provide a title and a series of paragraphs in your GUI.
Ideally, you could ask them to follow a well-known markup language (Markdown, Textile, etc.) and use an open source parser to extract the structure.
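To illustrate what "extracting the structure" from marked-up input could look like, here is a minimal sketch. It is hand-rolled and only understands a tiny Markdown-like subset ("# " headings, "- " or "* " bullets, blank-line-separated paragraphs); the class and method names (StructureSketch, parse, Block) are made up for this example. In practice you would more likely rely on an existing open source Markdown or Textile parser rather than something like this.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: splits Markdown-like input into headings, bullets and paragraphs. */
final class StructureSketch {

    enum Kind { HEADING, BULLET, PARAGRAPH }

    /** One structural unit found in the input. */
    static final class Block {
        final Kind kind;
        final String text;

        Block(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }

        @Override
        public String toString() {
            return kind + ": " + text;
        }
    }

    /** Assumed subset only: "# " headings, "- " / "* " bullets, everything else is paragraph text. */
    static List<Block> parse(String input) {
        List<Block> blocks = new ArrayList<>();
        StringBuilder paragraph = new StringBuilder();
        for (String rawLine : input.split("\\R")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                // A blank line ends the current paragraph, if any.
                flush(paragraph, blocks);
            } else if (line.startsWith("# ")) {
                flush(paragraph, blocks);
                blocks.add(new Block(Kind.HEADING, line.substring(2).trim()));
            } else if (line.startsWith("- ") || line.startsWith("* ")) {
                flush(paragraph, blocks);
                blocks.add(new Block(Kind.BULLET, line.substring(2).trim()));
            } else {
                // Consecutive text lines are joined into one paragraph.
                if (paragraph.length() > 0) {
                    paragraph.append(' ');
                }
                paragraph.append(line);
            }
        }
        flush(paragraph, blocks);
        return blocks;
    }

    private static void flush(StringBuilder paragraph, List<Block> blocks) {
        if (paragraph.length() > 0) {
            blocks.add(new Block(Kind.PARAGRAPH, paragraph.toString()));
            paragraph.setLength(0);
        }
    }

    public static void main(String[] args) {
        String input = "# Hello World\n\nFirst line\nsecond line.\n\n- one\n- two\n";
        for (Block block : parse(input)) {
            System.out.println(block);
        }
    }
}
```

The resulting list of blocks is what you would later feed into whatever template you end up with; the point is that the structure comes from the user's markup, not from guessing.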
The external page
If any page can be used, the only thing you can rely on is the "structural markup". So, assuming you know the title of the page should be "Hello World", and there is an "h1" element somewhere in the page, you can maybe assume that this is where the header should go.
But if the page is a div tag soup, and only CSS is used to differentiate the rendering of the header from the bulk of the text, you're going to have to guess how the styling is done: that's plain impossible if you don't know how the page is made.
I don't think Lucene would help with this (as far as I know, Lucene is made to create an index of the words used in a body of text; I don't think it can help you guess which part of the text is meant to be a title, a subtitle, etc.).
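As a rough illustration of the "h1" assumption above, the sketch below replaces the content of the first h1 element with a placeholder and gives up when the page has no h1 at all (the tag-soup case). It uses a regular expression only to stay self-contained, which is brittle on real-world HTML; a proper HTML parser would be the more robust choice. The names H1TemplateSketch and toTemplate, and the "${title}" placeholder, are assumptions made for this example.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: turns the first h1 of a page into a template placeholder. */
final class H1TemplateSketch {

    // Group 2 captures whatever sits between the opening and closing h1 tags.
    private static final Pattern H1 = Pattern.compile(
            "(<h1\\b[^>]*>)(.*?)(</h1>)",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /**
     * Returns the page with the first h1 content replaced by the placeholder,
     * or an empty Optional when there is no h1 to rely on.
     */
    static Optional<String> toTemplate(String html, String placeholder) {
        Matcher matcher = H1.matcher(html);
        if (!matcher.find()) {
            return Optional.empty();
        }
        // Plain substring splicing, so "$" in the placeholder needs no escaping.
        String template = html.substring(0, matcher.start(2))
                + placeholder
                + html.substring(matcher.end(2));
        return Optional.of(template);
    }

    public static void main(String[] args) {
        String page = "<html><body><h1 class=\"t\">Hello World</h1><p>Body</p></body></html>";
        System.out.println(toTemplate(page, "${title}").orElse("no h1 found"));

        String soup = "<html><body><div class=\"big\">Hello World</div></body></html>";
        System.out.println(toTemplate(soup, "${title}").orElse("no h1 found"));
    }
}
```

Note that the second page in main carries exactly the same visible title, yet nothing in the markup says so: that is the situation where you are left guessing from the CSS.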
Generating templates from an external page
Assuming you have "guessed" right, you could generate the content by
That would of course raise terrible legal questions, since your templates would incorporate work by the original website's author (most probably copyrighted material).
A more realistic solution
I would suggest you constrain your problem to:
Note that none of those points are related to the template system.
Otherwise, I'm afraid you're heading for an unreasonable amount of work...