How do I convert dropdown selection menu elements from HTML/URL into a Pandas DataFrame?
While building a dataset to match and extract IDs and SubIDs along with their names, I get the following HTML after fetching the page with the requests module:
<div class="feature">
<h5>Network</h5>
<div>
<div class="row">
<ul class="tree network-tree">
<li class="classification class-C ">
<span>
<input type="checkbox" >
<a href="/network/nt06410+N01032+N01031" target="_blank">nt06410</a> Calcium signaling
</span>
<ul>
<li class="entry class-D network ">
<span>
<input type="checkbox" >
<a href="/entry/N01032" data-entry="N01032" target="_blank">N01032</a>
Mutation-inactivated PRKN to mGluR1 signaling pathway
</span>
</li>
<li class="entry class-D network ">
<span>
<input type="checkbox" >
<a href="/entry/N01031" data-entry="N01031" target="_blank">N01031</a>
Mutation-caused aberrant SNCA to VGCC-Ca2+ -apoptotic pathway
</span>
</li>
</ul>
</li>
I want to turn this dropdown selection menu, which highlights particular linkages, into a pandas DataFrame like this:
ID | Name | Subname | SubID | Network |
---|---|---|---|---|
nt06410 | Calcium signaling | Mutation-inactivated PRKN to mGluR1 signaling pathway | N01032 | nt06410+N01032+N01031 |
My code so far:
# soup is the BeautifulSoup object built from the requests response
data = soup.find_all("ul", {"class": "tree network-tree"})
# get all list elements
lis = data[0].find_all('li')
# add a helper lambda, just for readability
find_ul = lambda x: x.find_all('ul')
# keep only the <li> elements that contain a nested <ul>
uls = [find_ul(elem) for elem in lis if find_ul(elem)]
# use a nested list comprehension to iterate over the <ul> tags
# and extract the stripped text from each <li> into sublists
text = [[li.text.strip() for li in ul[0].find_all('li')] for ul in uls]
print(text[0][1])
Comments (1)
You can use a nested for loop and append one row per item to a list. Values from the outer loop (e.g. the ID) are repeated for every item found in the inner loop (e.g. each SubID).
At the end, convert the list of lists to a DataFrame with pandas.
Tested with a related link: https://www.kegg.jp/pathway/hsa05022
N.B. KEGG also offers free JSON downloads of its hierarchies.
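A minimal sketch of the nested-loop approach described above, not a definitive implementation. It assumes the page has the same structure as the snippet in the question. The `html` string is a trimmed copy of that snippet with the closing tags added. The helper `label_after_link` and the list `rows` are hypothetical names used only for illustration. The live KEGG page may differ, so the selectors might need adjusting.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Trimmed copy of the HTML from the question, closed off so it parses cleanly.
html = """
<ul class="tree network-tree">
  <li class="classification class-C ">
    <span>
      <input type="checkbox" >
      <a href="/network/nt06410+N01032+N01031" target="_blank">nt06410</a> Calcium signaling
    </span>
    <ul>
      <li class="entry class-D network ">
        <span>
          <input type="checkbox" >
          <a href="/entry/N01032" data-entry="N01032" target="_blank">N01032</a>
          Mutation-inactivated PRKN to mGluR1 signaling pathway
        </span>
      </li>
      <li class="entry class-D network ">
        <span>
          <input type="checkbox" >
          <a href="/entry/N01031" data-entry="N01031" target="_blank">N01031</a>
          Mutation-caused aberrant SNCA to VGCC-Ca2+ -apoptotic pathway
        </span>
      </li>
    </ul>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")


def label_after_link(span, link):
    """Return the span's text with the leading link text removed (hypothetical helper)."""
    text = span.get_text(" ", strip=True)
    return text[len(link.get_text(strip=True)):].strip()


rows = []
tree = soup.find("ul", class_="network-tree")

# Outer loop: one top-level <li class="classification"> per network ID.
for parent in tree.find_all("li", class_="classification", recursive=False):
    span = parent.find("span")
    link = span.find("a")
    network_id = link.get_text(strip=True)
    name = label_after_link(span, link)
    # The last path segment of the href holds the "ID+SubID+SubID" string.
    network = link["href"].rsplit("/", 1)[-1]

    # Inner loop: one row per nested <li class="entry">, repeating the outer values.
    for child in parent.find_all("li", class_="entry"):
        sub_span = child.find("span")
        sub_link = sub_span.find("a")
        sub_id = sub_link.get("data-entry", sub_link.get_text(strip=True))
        sub_name = label_after_link(sub_span, sub_link)
        rows.append([network_id, name, sub_name, sub_id, network])

# Convert the list of lists to a DataFrame at the end.
df = pd.DataFrame(rows, columns=["ID", "Name", "Subname", "SubID", "Network"])
print(df)
```

Each entry under a classification becomes its own row, so the question's snippet yields two rows that share the same ID, Name, and Network values.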