How do I convert the dropdown selection menu elements from an HTML page/URL into a pandas DataFrame?

Posted on 2025-01-13 02:50:06

While building a dataset to match and extract IDs and SubIDs together with their names, I fetched a page with the requests module and found the following HTML in it:

    <div class="feature">
            <h5>Network</h5>
            <div>
                <div class="row">
                    <ul class="tree network-tree">
                    
                        
    <li class="classification class-C  ">
        <span>
        <input type="checkbox" >
        <a href="/network/nt06410+N01032+N01031" target="_blank">nt06410</a> Calcium signaling

        </span>
        <ul>
    <li class="entry class-D network ">
        <span>
        <input type="checkbox" >
        <a href="/entry/N01032" data-entry="N01032" target="_blank">N01032</a>
        
        Mutation-inactivated PRKN to mGluR1 signaling pathway

        </span>
    </li>

    <li class="entry class-D network ">
        <span>
        <input type="checkbox" >
        <a href="/entry/N01031" data-entry="N01031" target="_blank">N01031</a>
        
        Mutation-caused aberrant SNCA to VGCC-Ca2+ -apoptotic pathway

        </span>
    </li>

        </ul>
    </li>

What I want is to get this particular dropdown selection menu, which highlights particular linkages, into a pandas DataFrame like this:

| ID | Name | Subname | SubID | Network |
| --- | --- | --- | --- | --- |
| nt06410 | Calcium signaling | Mutation-inactivated PRKN to mGluR1 signaling pathway | N01032 | nt06410+N01032+N01031 |

My code so far:

data = soup.find_all("ul", {"class": "tree network-tree"})

# get all list elements
lis = data[0].find_all('li')

# add a helper lambda, just for readability
find_ul = lambda x: x.find_all('ul')
uls = [find_ul(elem) for elem in lis if find_ul(elem) != []]

# use a nested list comprehension to iterate over the <ul> tags
# and extract text from each <li> into sublists
text = [[li.text.encode('utf-8') for li in ul[0].find_all('li')] for ul in uls]

print(text[0][1])

Comments (1)

趁微风不噪 2025-01-20 02:50:06

You can use a nested for loop and append items to a list. Values from the outer loop (e.g. the ID) are repeated for each item found in the inner loop (e.g. each SubID). At the end, convert the list of lists to a DataFrame with pandas.

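import pandas as pd

# `soup` is assumed to be the BeautifulSoup object already built from the page HTML
results = []

# outer loop: one classification <li> per network (ID, name, network string)
for network in soup.select('.row .classification'):
    a = network.select_one('a')
    _id = a.text
    _network = a['href'].split('network/')[-1]
    name = a.next_sibling.strip()  # the label is the text node right after the link

    # inner loop: one entry <li> per pathway under this network
    for pathway in network.select('.network'):
        b = pathway.select_one('a')
        subname = b.next_sibling.strip()
        subid = b.text
        # repeat the outer values on every inner row
        results.append([_id, name, subname, subid, _network])

df = pd.DataFrame(results, columns=['ID', 'Name', 'Subname', 'SubID', 'Network'])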

Tested with a related link: https://www.kegg.jp/pathway/hsa05022
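
To check the nested loop without fetching the page, it can be run against the fragment from the question. The following is a minimal sketch, not the answerer's exact script: `SAMPLE_HTML` is a tidied copy of that fragment, the function name `network_tree_to_frame` is made up for illustration, and the built-in `html.parser` backend is an assumption (the thread does not say which parser was used).

```python
from bs4 import BeautifulSoup
import pandas as pd

# Tidied copy of the fragment from the question (assumption: the real page
# keeps the same class names and puts each label right after its link).
SAMPLE_HTML = """
<div class="row">
  <ul class="tree network-tree">
    <li class="classification class-C">
      <span>
        <input type="checkbox">
        <a href="/network/nt06410+N01032+N01031" target="_blank">nt06410</a> Calcium signaling
      </span>
      <ul>
        <li class="entry class-D network">
          <span>
            <input type="checkbox">
            <a href="/entry/N01032" data-entry="N01032" target="_blank">N01032</a>
            Mutation-inactivated PRKN to mGluR1 signaling pathway
          </span>
        </li>
        <li class="entry class-D network">
          <span>
            <input type="checkbox">
            <a href="/entry/N01031" data-entry="N01031" target="_blank">N01031</a>
            Mutation-caused aberrant SNCA to VGCC-Ca2+ -apoptotic pathway
          </span>
        </li>
      </ul>
    </li>
  </ul>
</div>
"""


def network_tree_to_frame(html):
    """Return one DataFrame row per (network, pathway) pair in the tree."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for network in soup.select(".row .classification"):
        link = network.select_one("a")
        network_id = link.text
        network_key = link["href"].split("network/")[-1]
        name = link.next_sibling.strip()  # text node following the link
        for pathway in network.select(".network"):
            sub_link = pathway.select_one("a")
            rows.append([
                network_id,
                name,
                sub_link.next_sibling.strip(),
                sub_link.text,
                network_key,
            ])
    return pd.DataFrame(
        rows, columns=["ID", "Name", "Subname", "SubID", "Network"]
    )


df = network_tree_to_frame(SAMPLE_HTML)
print(df.to_string(index=False))
```

Each pathway under a network yields its own row, so this fragment produces two rows that share the same ID, Name and Network values.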

N.B. KEGG does offer free JSON downloads of hierarchies.
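
If a JSON download is used instead of scraping, the hierarchy can be flattened into rows in a similar way. The sketch below is only illustrative: the `name` and `children` keys and the sample data are assumptions that mirror the tree in the question, not an actual KEGG file, so check the key names against the real download.

```python
import json

# Hypothetical sample shaped like a nested hierarchy (assumed keys:
# "name" for the label, "children" for the nested entries).
SAMPLE_JSON = """
{
  "name": "Network",
  "children": [
    {
      "name": "nt06410 Calcium signaling",
      "children": [
        {"name": "N01032 Mutation-inactivated PRKN to mGluR1 signaling pathway"},
        {"name": "N01031 Mutation-caused aberrant SNCA to VGCC-Ca2+ -apoptotic pathway"}
      ]
    }
  ]
}
"""


def flatten(node, path=()):
    """Yield one tuple of names per leaf, ordered from root to leaf."""
    path = path + (node["name"],)
    children = node.get("children")
    if not children:
        yield path
        return
    for child in children:
        yield from flatten(child, path)


rows = list(flatten(json.loads(SAMPLE_JSON)))
for row in rows:
    print(row)
```

The resulting list of tuples can be passed to `pd.DataFrame` in the same way as the `results` list above.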
