HTML truncator in Java
Is there any utility (or sample source code) that truncates HTML (for preview) in Java? I want to do the truncation on the server and not on the client.
I'm using HTMLUnit to parse HTML.
UPDATE:
I want to be able to preview the HTML, so the truncator would maintain the HTML structure while stripping out the elements after the desired output length.
I've written another Java version of truncateHTML. The function truncates a string to at most a given number of characters while preserving whole words and HTML tags.
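A minimal sketch of this kind of truncation might look like the following. It is not the function mentioned above: the class name HtmlTruncator, the method truncate, and the short void-tag list are assumptions made for illustration. It counts only text characters against the limit, backs up to a whitespace boundary when the cut lands inside a word, and closes any tags still open. Entities count as several characters and comments are treated as text, so a real parser is the safer choice for untrusted markup.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HtmlTruncator {
    // group 1 = "/" for a closing tag, group 2 = tag name, group 3 = "/" if self-closed
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>");
    // Assumed subset of elements that never take a closing tag
    private static final Set<String> VOID_TAGS =
            Set.of("br", "hr", "img", "input", "meta", "link");

    private HtmlTruncator() {}

    static String truncate(String html, int maxChars) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();   // currently open elements, innermost first
        Matcher m = TAG.matcher(html);
        int pos = 0;
        int remaining = Math.max(0, maxChars);     // text characters still allowed
        boolean done = false;
        while (!done && m.find()) {
            String text = html.substring(pos, m.start());
            pos = m.end();
            if (text.length() > remaining) {       // the limit falls inside this text run
                out.append(cutAtWord(text, remaining));
                done = true;
                break;
            }
            out.append(text);
            remaining -= text.length();
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (!m.group(1).isEmpty()) {           // closing tag: emit only if it was open
                if (open.removeFirstOccurrence(name)) {
                    out.append(m.group());
                }
            } else if (remaining == 0) {
                done = true;                       // budget used up: open no further elements
            } else {
                out.append(m.group());
                if (m.group(3).isEmpty() && !VOID_TAGS.contains(name)) {
                    open.push(name);
                }
            }
        }
        if (!done) {
            out.append(cutAtWord(html.substring(pos), remaining));
        }
        while (!open.isEmpty()) {                  // close whatever is still open
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    // Cuts text to at most limit characters without splitting a word.
    private static String cutAtWord(String text, int limit) {
        if (text.length() <= limit) {
            return text;
        }
        int end = limit;
        while (end > 0 && !Character.isWhitespace(text.charAt(end))) {
            end--;                                 // back up to the previous whitespace
        }
        return text.substring(0, end);
    }
}
```

For example, HtmlTruncator.truncate("<p>Hello <b>big world</b></p>", 12) returns <p>Hello <b>big</b></p>: the word "world" does not fit, so it is dropped whole and both open tags are closed.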
I think you're going to need to write your own XML parser to accomplish this. Pull out the body node, keep adding nodes while the binary length stays below some fixed size, and then rebuild the document. If HTMLUnit doesn't create semantic XHTML, I'd recommend TagSoup.
If you need an XML parser/handler, I'd recommend XOM.
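The node-by-node idea can be sketched with the DOM parser that ships with the JDK rather than a hand-written one. This is only an illustration under several assumptions: the input is already well-formed XHTML without a DOCTYPE, the limit is counted in characters rather than bytes, and the parsed tree is pruned in place instead of being rebuilt. DomTruncator, truncateBody, and serialize are hypothetical names.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class DomTruncator {
    private DomTruncator() {}

    // Parses the XHTML and returns its <body>, pruned to at most maxChars text characters.
    static Element truncateBody(String xhtml, int maxChars) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
        Element body = (Element) doc.getElementsByTagName("body").item(0);
        if (body == null) {
            throw new IllegalArgumentException("no <body> element found");
        }
        prune(body, maxChars);
        return body;
    }

    // Walks the children in document order and returns the unused character budget.
    private static int prune(Node node, int remaining) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling();    // remember before possibly removing child
            if (remaining <= 0) {
                node.removeChild(child);           // everything past the limit is dropped
            } else if (child.getNodeType() == Node.TEXT_NODE) {
                String text = child.getNodeValue();
                if (text.length() > remaining) {
                    child.setNodeValue(text.substring(0, remaining));
                    remaining = 0;
                } else {
                    remaining -= text.length();
                }
            } else {
                remaining = prune(child, remaining);
            }
            child = next;
        }
        return remaining;
    }

    // Serializes a node back to markup without the XML declaration.
    static String serialize(Node node) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter w = new StringWriter();
            t.transform(new DOMSource(node), new StreamResult(w));
            return w.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize node", e);
        }
    }
}
```

Because the tree is edited rather than the raw string, the result is always well-formed: elements that straddle the limit keep their closing tags, and elements entirely past it are removed. For tag soup, the input would first have to be cleaned up into XHTML.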
There is a PHP function that does it here: http://snippets.dzone.com/posts/show/7125
I've made a quick and dirty Java port of the initial version, but there are subsequent improved versions in the comments that could be worth considering (especially one that deals with whole words):
Note: You'll need Apache Commons Lang for StringUtils.join().
I can offer you a Python script I wrote to do this: http://www.ellipsix.net/ext-tmp/summarize.txt. Unfortunately I don't have a Java version, but feel free to translate it yourself and modify it to suit your needs if you want. It's not very complicated, just something I hacked together for my website, but I've been using it for a little more than a year and it generally seems to work pretty well.
If you want something robust, an XML (or SGML) parser is almost certainly a better idea than what I did.
I found this blog: dencat: Truncating HTML in Java
It contains a Java port of the Python Django template function truncate_html_words.
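For comparison, word-based truncation in the spirit of truncate_html_words can be sketched as below. This is neither the blog's port nor Django's code: the class name HtmlWordTruncator, the method truncateWords, the void-tag list, and the " ..." suffix are assumptions for illustration. It counts words outside tags, cuts after the requested number of words, and closes the tags that were open at the cut point.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HtmlWordTruncator {
    // Either a tag (groups 1-3 as: "/" if closing, name, "/" if self-closed)
    // or a word, i.e. a run of characters that are neither whitespace nor "<".
    private static final Pattern TOKEN =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>|[^<\\s]+");
    // Assumed subset of elements that never take a closing tag
    private static final Set<String> VOID_TAGS =
            Set.of("br", "hr", "img", "input", "meta", "link");

    private HtmlWordTruncator() {}

    static String truncateWords(String html, int maxWords) {
        if (maxWords <= 0) {
            return "";
        }
        Deque<String> open = new ArrayDeque<>();   // elements open at the cut point
        Matcher m = TOKEN.matcher(html);
        int words = 0;
        int cut = -1;                              // end offset of the last word we keep
        while (m.find()) {
            if (m.group(2) == null) {              // token is a word
                words++;
                if (words == maxWords) {
                    cut = m.end();
                } else if (words > maxWords) {
                    break;                         // one word too many: truncation is needed
                }
            } else if (cut < 0) {                  // token is a tag before the cut point
                String name = m.group(2).toLowerCase(Locale.ROOT);
                if (!m.group(1).isEmpty()) {
                    open.removeFirstOccurrence(name);
                } else if (m.group(3).isEmpty() && !VOID_TAGS.contains(name)) {
                    open.push(name);
                }
            }
        }
        if (words <= maxWords) {
            return html;                           // short enough, return unchanged
        }
        StringBuilder out = new StringBuilder(html.substring(0, cut)).append(" ...");
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }
}
```

With this sketch, HtmlWordTruncator.truncateWords("<p>The quick <b>brown fox</b> jumps</p>", 3) returns <p>The quick <b>brown ...</b></p>, while input that already has three words or fewer comes back unchanged.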