Recursively traversing nested HTML to build a list
I am unable to take the HTML below and put it into a list like:
List<String> output = Arrays.asList(
new String[] {
"First Level-Second Level--Third Level",
"a-b--c1",
"a-b--c2",
"a-b--c3",
"a-b2--c1",
"a-b2--c2",
"a-b2--c3"
});
<ul>
<li>First Level</li>
<ul>
<li>Second Level</li>
<ul>
<li>Third Level</li>
</ul>
</ul>
<li>a</li>
<ul>
<li>b</li>
<ul>
<li>c</li>
<li>c</li>
<li>c</li>
</ul>
<li>b2</li>
<ul>
<li>c1</li>
<li>c2</li>
<li>c3</li>
</ul>
</ul>
</ul>
I have loaded it into Jsoup so that I can walk through it as elements. The code I have is below:
public static String recHTML(Element element, String str) {
str += "-" + element.ownText();
if (element.children().size() == 0) return str + "--" + element.ownText();
else {
for (int i = 0; i < element.children().size(); i++) {
Element next = element.child(i);
str += recHTML(next, str);
}
}
}
I am returning the string with the dashes so that I can split it with a regex into an array with different levels of indentation. I am struggling to get my output to match, and no matter what I try I can't get this to work. Any help would be appreciated, thank you.
Comments (1)
Suppose you create a program stub like the one below:
Now let's consider your invalid HTML example and its valid equivalent
Invalid HTML input list
Currently, you're trying to implement something like a depth-first search (DFS). A typical DFS won't do. Why? Because the HTML structure you provided has multiple leaf nodes that do not terminate a chain.
Suppose we call each element of your output a chain. As you traverse the HTML tree in a DFS manner, note that a chain is complete as soon as you find an li element without ul siblings. This should be your end condition. As for the recursion step, think about how to handle ul and li elements. It is not easy to define; currently, my solution only works for your specific input.
Your recHTML implementation for this is bound to be ugly since there's no direct nested chain from the root to the leaf nodes. New edge cases will continuously come up where your code will not work, and you'll keep making edits.
Valid HTML input list
As suggested in my comments, your problem lies in the input. Suppose we somehow transform your input HTML to the above markup. Then the solution becomes a lot cleaner and more straightforward - a simple DFS will work!
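To illustrate why properly nested markup makes this easy, here is a minimal sketch of such a DFS. It is not the code from this answer: the `Item` class is a hypothetical stand-in for a parsed li (its text plus the li items of the ul nested inside it), so the example runs without jsoup, and the class and method names are assumptions. The separator grows by one dash per level, and the third-level items are labelled c1, c2, c3 to match the expected output in the question.

```java
import java.util.ArrayList;
import java.util.List;

final class ValidListDfs {
    // Hypothetical stand-in for a parsed <li>: its own text plus the <li>
    // items of the <ul> nested inside it (no children for a leaf).
    static final class Item {
        final String text;
        final List<Item> children;

        Item(String text, Item... children) {
            this.text = text;
            this.children = List.of(children);
        }
    }

    // Returns one chain per leaf, e.g. "a-b--c1".
    static List<String> chains(List<Item> topLevel) {
        List<String> out = new ArrayList<>();
        for (Item item : topLevel) {
            dfs(item, "", 0, out);
        }
        return out;
    }

    private static void dfs(Item item, String prefix, int depth, List<String> out) {
        // The separator in front of a node is as many dashes as its depth.
        String path = prefix + "-".repeat(depth) + item.text;
        if (item.children.isEmpty()) {
            out.add(path); // leaf reached: the chain is complete
            return;
        }
        for (Item child : item.children) {
            dfs(child, path, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        List<Item> list = List.of(
            new Item("First Level",
                new Item("Second Level", new Item("Third Level"))),
            new Item("a",
                new Item("b", new Item("c1"), new Item("c2"), new Item("c3")),
                new Item("b2", new Item("c1"), new Item("c2"), new Item("c3"))));
        chains(list).forEach(System.out::println);
    }
}
```

With jsoup, the same shape would presumably come from recursing into each li's nested ul; the path is passed down as a parameter and only written out at a leaf, which avoids the repeated `str +=` accumulation in the question's recHTML.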
Fun fact: the Java HTML tidy tool JTidy has corrected your HTML input in a slightly different way than I have:
Personally, I don't think it corrects it the right way; see "Proper way to make HTML nested list?".
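For the original, invalid markup, the end condition discussed earlier can be sketched as follows. This is one possible reading, not the solution referred to in this answer: it assumes a chain continues only when an li is immediately followed by a ul sibling, and ends otherwise. The `Node` class is a hypothetical stand-in for a parsed element (tag name, own text, child elements) so the sketch runs without jsoup.

```java
import java.util.ArrayList;
import java.util.List;

final class SiblingListWalk {
    // Hypothetical stand-in for a parsed element: tag name, own text, children.
    static final class Node {
        final String tag;
        final String text;
        final List<Node> children;

        private Node(String tag, String text, List<Node> children) {
            this.tag = tag;
            this.text = text;
            this.children = children;
        }

        static Node li(String text) {
            return new Node("li", text, List.of());
        }

        static Node ul(Node... children) {
            return new Node("ul", "", List.of(children));
        }
    }

    static List<String> chains(Node rootUl) {
        List<String> out = new ArrayList<>();
        walk(rootUl, "", 0, out);
        return out;
    }

    private static void walk(Node ul, String prefix, int depth, List<String> out) {
        List<Node> kids = ul.children;
        for (int i = 0; i < kids.size(); i++) {
            Node kid = kids.get(i);
            if (!kid.tag.equals("li")) {
                continue; // a nested <ul> is entered from the <li> before it
            }
            String path = prefix + "-".repeat(depth) + kid.text;
            Node next = i + 1 < kids.size() ? kids.get(i + 1) : null;
            if (next != null && next.tag.equals("ul")) {
                walk(next, path, depth + 1, out); // assumption: this <ul> belongs to kid
            } else {
                out.add(path); // no <ul> follows: the chain ends here
            }
        }
    }

    public static void main(String[] args) {
        Node root = Node.ul(
            Node.li("a"),
            Node.ul(
                Node.li("b"),
                Node.ul(Node.li("c1"), Node.li("c2")),
                Node.li("b2")));
        chains(root).forEach(System.out::println);
    }
}
```

The sketch still depends on the assumption that a ul always belongs to the li directly before it. Markup that breaks that assumption, such as a ul with no preceding li, is silently skipped, which is the kind of edge case the valid nesting avoids.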