Generating a table of contents from HTML with Python

Posted on 2024-08-26 11:26:58

I'm trying to generate a table of contents from a block of HTML (not a complete file - just content) based on its <h2> and <h3> tags.

My plan so far was to:

  • Extract a list of headers using beautifulsoup

  • Use a regex on the content to place anchor links before/inside the header tags (so the user can click on the table of contents) -- There might be a method for replacing inside beautifulsoup?

  • Output a nested list of links to the headers in a predefined spot.

It sounds easy when I say it like that, but it's proving to be a bit of a pain in the rear.

Is there something out there that does all this for me in one go so I don't waste the next couple of hours reinventing the wheel?

An example:

<p>This is an introduction</p>

<h2>This is a sub-header</h2>
<p>...</p>

<h3>This is a sub-sub-header</h3>
<p>...</p>

<h2>This is a sub-header</h2>
<p>...</p>
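
For a fragment as regular as the example above, the first two steps of the plan (collect the headers, then give each one an anchor) can be sketched with the standard library's `re` module alone. This is only an illustration under stated assumptions: the function name `tag_headings` and the `toc-N` id scheme are hypothetical, and the pattern only matches `<h2>`/`<h3>` tags without attributes. On messier HTML a regex will break down and a real parser is the safer choice.

```python
import re

# Matches attribute-free <h2>...</h2> and <h3>...</h3> pairs only.
HEADING_RE = re.compile(r"<h([23])>(.*?)</h\1>", re.DOTALL)


def tag_headings(html):
    """Add an id to each <h2>/<h3> and return (new_html, entries)."""
    entries = []  # (level, anchor, title) tuples in document order

    def add_id(match):
        level, title = int(match.group(1)), match.group(2)
        anchor = "toc-%d" % (len(entries) + 1)  # hypothetical id scheme
        entries.append((level, anchor, title))
        return '<h%d id="%s">%s</h%d>' % (level, anchor, title, level)

    return HEADING_RE.sub(add_id, html), entries


content = """<p>This is an introduction</p>
<h2>This is a sub-header</h2>
<p>...</p>
<h3>This is a sub-sub-header</h3>
<p>...</p>
<h2>This is a sub-header</h2>
<p>...</p>"""

tagged, entries = tag_headings(content)
for level, anchor, title in entries:
    # Indent h3 entries one step to show the nesting.
    print("  " * (level - 2) + "%s -> #%s" % (title, anchor))
```

The `entries` list keeps the heading level next to each anchor, so a nested list of links can be built from it in a separate step.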

Comments (4)

兰花执着 2024-09-02 11:26:58

Here is a quickly hacked-together, ugly piece of code:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')  # html holds the content fragment

toc = []
header_id = 1
current_list = toc
previous_tag = None

for header in soup.find_all(['h2', 'h3']):
    # Give each header an id so the table of contents can link to it.
    header['id'] = str(header_id)

    if previous_tag == 'h2' and header.name == 'h3':
        # First h3 after an h2: start a nested list.
        current_list = []
    elif previous_tag == 'h3' and header.name == 'h2':
        # Back at the top level: store the finished nested list.
        toc.append(current_list)
        current_list = toc

    # get_text() also works when the header contains nested tags.
    current_list.append((header_id, header.get_text()))

    header_id += 1
    previous_tag = header.name

# Store a nested list that is still open at the end of the document.
if current_list is not toc:
    toc.append(current_list)


def list_to_html(lst):
    result = ["<ul>"]
    for item in lst:
        if isinstance(item, list):
            result.append(list_to_html(item))
        else:
            result.append('<li><a href="#%s">%s</a></li>' % item)
    result.append("</ul>")
    return "\n".join(result)

# Table of contents
print(list_to_html(toc))

# Modified HTML
print(soup)

琉璃繁缕 2024-09-02 11:26:58

Use lxml.html.
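
As a rough sketch of what that could look like (not necessarily what the answerer had in mind): `lxml.html.fragment_fromstring` parses a content fragment, `iter` walks the headers in document order, and `set` attaches an id to each. The function name `toc_with_lxml` and the `toc-N` id scheme are hypothetical, and the third-party `lxml` package must be installed.

```python
import lxml.html


def toc_with_lxml(fragment):
    """Return (entries, html); entries are (level, anchor, title) tuples."""
    # create_parent wraps the fragment in a <div> so it has a single root.
    root = lxml.html.fragment_fromstring(fragment, create_parent="div")
    entries = []
    for number, heading in enumerate(root.iter("h2", "h3"), start=1):
        anchor = "toc-%d" % number  # hypothetical id scheme
        heading.set("id", anchor)
        # heading.tag is "h2" or "h3"; the digit is the heading level.
        entries.append((int(heading.tag[1]), anchor, heading.text_content()))
    return entries, lxml.html.tostring(root, encoding="unicode")


entries, html = toc_with_lxml(
    "<p>Intro</p><h2>First</h2><p>...</p><h3>Nested</h3><h2>Second</h2>"
)
print(entries)
print(html)
```

Note that the serialized result includes the wrapping `<div>` added by `create_parent`.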

樱娆 2024-09-02 11:26:58

I have come up with an extended version of the solution proposed by Łukasz.

from bs4 import BeautifulSoup
# Assumption: slugify comes from the python-slugify package; the original
# post does not show where it is imported from.
from slugify import slugify


def list_to_html(lst):
    result = ["<ul>"]
    for item in lst:
        if isinstance(item, list):
            result.append(list_to_html(item))
        else:
            result.append('<li><a href="#{}">{}</a></li>'.format(item[0], item[1]))
    result.append("</ul>")
    return "\n".join(result)

# article holds the HTML string; the 'html5lib' parser must be installed.
soup = BeautifulSoup(article, 'html5lib')

toc = []
# Index of the most recent h2..h5 entry within its parent list.
h2_prev = 0
h3_prev = 0
h4_prev = 0
h5_prev = 0

for header in soup.find_all(['h2', 'h3', 'h4', 'h5', 'h6']):
    title = header.get_text()
    # Set the id on the header so the generated link has a target.
    header['id'] = slugify(title)
    # Each entry is a list: its own (id, title) tuple first, then any children.
    data = [(header['id'], title)]

    if header.name == "h2":
        toc.append(data)
        h3_prev = 0
        h4_prev = 0
        h5_prev = 0
        h2_prev = len(toc) - 1
    elif header.name == "h3":
        toc[h2_prev].append(data)
        h3_prev = len(toc[h2_prev]) - 1
    elif header.name == "h4":
        toc[h2_prev][h3_prev].append(data)
        h4_prev = len(toc[h2_prev][h3_prev]) - 1
    elif header.name == "h5":
        toc[h2_prev][h3_prev][h4_prev].append(data)
        h5_prev = len(toc[h2_prev][h3_prev][h4_prev]) - 1
    elif header.name == "h6":
        toc[h2_prev][h3_prev][h4_prev][h5_prev].append(data)

# Note: this assumes heading levels are never skipped
# (for example, no h4 directly under an h2).
toc_html = list_to_html(toc)
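
One possible alternative to tracking a separate index per heading level is a stack of the currently open levels. The sketch below is a different technique from the code above, shown only for comparison. The function name `render_toc` and the `(level, anchor, title)` input format are assumptions made for this illustration. It uses only the standard library and places each sublist inside its parent `<li>`.

```python
from html import escape


def render_toc(entries):
    """Render (level, anchor, title) tuples as nested <ul> lists."""
    out = []
    stack = []  # heading levels whose <ul> is currently open
    for level, anchor, title in entries:
        # Close lists (and their last items) that are deeper than this heading.
        while stack and stack[-1] > level:
            stack.pop()
            out.append("</li></ul>")
        if stack and stack[-1] == level:
            out.append("</li>")  # finish the previous sibling
        else:
            out.append("<ul>")  # first item at a deeper level
            stack.append(level)
        out.append('<li><a href="#%s">%s</a>' % (escape(anchor), escape(title)))
    # Close whatever is still open at the end of the document.
    out.extend("</li></ul>" for _ in stack)
    return "\n".join(out)


print(render_toc([(2, "intro", "Intro"), (3, "details", "Details"), (2, "end", "End")]))
```

Because the stack holds whichever levels are open, skipped levels (an h4 directly under an h2, say) do not raise an error. A document that starts with a deeper heading than the ones that follow simply produces more than one top-level list.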

方觉久 2024-09-02 11:26:58

See also: How do I generate a table of contents for HTML text in Python?

That said, I think you are on the right track, and reinventing the wheel will be fun.
