Parsing HTML content with sibling tags in Java (or) finding content between two tags
Background: I'm writing a Java program that goes through HTML files and replaces all the text content in tags other than <script> or <style> with Lorem Ipsum. I originally did this with a regex that simply replaced everything between a > and a <, which actually worked quite well (blasphemous, I know), but I'm trying to turn this into a tool others may find useful, so I wouldn't dare threaten the sanctity of the universe any further by using regex on HTML.
I'm trying to use HtmlCleaner, a Java library that attracted me because it has no other dependencies. However, I haven't been able to get it to handle HTML like this:
<div>
This text is in the div <span>but this is also in a span.</span>
</div>
The problem is simple. When the TagNodeVisitor reaches the div, if I replace its contents with the right amount of lipsum, the span tag is wiped out. But if I only drill down to TagNodes that have no child tags, I miss the first bit of text, which sits directly in the div.
HtmlCleaner has a ContentNode object, but that object has no replace method. Every workaround I can think of seems far too complicated. Is anyone familiar with a way to handle this, either with HtmlCleaner or with another parsing library you know better?
Comments (2)
You can do pretty much anything you want with jsoup's setters.
Would that suit you?
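As a rough sketch of what that could look like for the div/span case in the question: walk every node and overwrite only the TextNode values, so elements such as the span are never replaced. The class name JsoupLipsum and the helpers filler and replaceText are made up for illustration, and the filler text is arbitrary. To my knowledge jsoup keeps script and style bodies as DataNode rather than TextNode, so they should be skipped automatically, but treat that as an assumption to verify against your jsoup version.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

public class JsoupLipsum {
    private static final String LIPSUM =
            "Lorem ipsum dolor sit amet consectetur adipiscing elit ";

    // Hypothetical helper: filler text of exactly `length` characters.
    static String filler(int length) {
        StringBuilder sb = new StringBuilder();
        while (sb.length() < length) {
            sb.append(LIPSUM);
        }
        return sb.substring(0, length);
    }

    // Parses the HTML, overwrites every non-blank text node, returns the body markup.
    static String replaceText(String html) {
        Document doc = Jsoup.parse(html);
        doc.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    TextNode text = (TextNode) node;
                    if (!text.isBlank()) {
                        // Only the text value changes; the tree structure is untouched.
                        text.text(filler(text.text().length()));
                    }
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node.
            }
        });
        return doc.body().html();
    }
}
```

Calling JsoupLipsum.replaceText on the snippet from the question should leave the span element in place while both pieces of text are replaced.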
HtmlCleaner's ContentNode has a method getContent() that returns a java.lang.StringBuilder. This is mutable and can be changed to whatever value you want.
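The underlying idea (mutate the text nodes in place and leave the elements alone) is not specific to HtmlCleaner. Below is a minimal sketch of it using the DOM parser that ships with the JDK instead, so it also needs no extra dependencies. It is not HtmlCleaner code, and it assumes the input is well-formed XHTML; real-world HTML would first need a tolerant parser. The names TextNodeLipsum, filler, parse and replaceText are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class TextNodeLipsum {
    private static final String LIPSUM =
            "Lorem ipsum dolor sit amet consectetur adipiscing elit ";

    // Hypothetical helper: filler text of exactly `length` characters.
    static String filler(int length) {
        StringBuilder sb = new StringBuilder();
        while (sb.length() < length) {
            sb.append(LIPSUM);
        }
        return sb.substring(0, length);
    }

    // Parses a well-formed XHTML string (assumption: the input is valid XML).
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    // Recursively overwrites text nodes; elements are never replaced,
    // so child tags such as <span> survive.
    static void replaceText(Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String name = node.getNodeName().toLowerCase();
            if (name.equals("script") || name.equals("style")) {
                return; // leave script and style bodies untouched
            }
        }
        if (node.getNodeType() == Node.TEXT_NODE
                && !node.getNodeValue().trim().isEmpty()) {
            node.setNodeValue(filler(node.getNodeValue().length()));
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            replaceText(child);
        }
    }
}
```

For the div from the question, the text directly inside the div is just the first child node, a sibling of the span, so visiting text nodes rather than tags covers both pieces of text.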