How to find href links that start with a certain keyword using Beautiful Soup?
The task I am doing right now is very monotonous. For it, I have to visit this website (e.g. this page), where each case has a hyperlink attached in the Status column. I am trying to find a way to grab only the href values that start with the keyword case-details, since those are the Status-column links for each particular case and they lead to the details of that case.
My code:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import re
url = "https://nclt.gov.in/order-judgement-date-wise-search?bench=Y2hlbm5haQ==&start_date=MDEvMDEvMjAyMQ==&end_date=MDEvMDEvMjAyMg==&page=40"
response = requests.get(url).text
soup = BeautifulSoup(response, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
This gives the following output (line numbers added for clarity):
....
44 /order-judge-wise
45 order-judgement-date-wise
46 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjQ3MjAxOQ==
47 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjQ4MjAyMA==
48 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjQ4MjAyMQ==
49 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUwMjAyMQ==
50 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUxMjAyMA==
51 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUyMjAyMA==
52 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUzMjAyMA==
53 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUzMjAyMQ==
54 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjU1MjAyMQ==
55 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjU3MjAyMQ==
56 order-judgement-date-wise-search?bench=Y2hlbm5haQ==&start_date=MDEvMDEvMjAyMQ==&end_date=MDEvMDEvMjAyMg==&page=39
57 order-judgement-date-wise-search?bench=Y2hlbm5haQ==&start_date=MDEvMDEvMjAyMQ==&end_date=MDEvMDEvMjAyMg==&page=1
....
I want to grab the href links that start with "case-details" and put them into a list, which I will later use to scrape the details of each case and write them to an Excel file.
So far I have tried a loop that looks for these links:
for link in soup.find_all('a'):
    if "case" in link.get_text():
        print(link['href'])
But so far I have had no success. I also want to know how to collect the results into a list.
Expected output:
url_list1 = ["case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTAzMjAyMA==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTA1MjAyMQ==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTA2MjAyMQ==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTA3MjAyMQ==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTA4MjAxNw==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTA4MjAyMQ==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTA5MjAyMA==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTA5MjAyMQ==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTEwMjAxOQ==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTEwMjAyMQ=="]
To select only the <a> elements whose href starts with case-details, you could use CSS selectors. Be aware that the matched links are relative, so you have to prepend a base URL, e.g. with a list comprehension.
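The answer's original code did not survive on this page, so what follows is only a minimal sketch of the approach it describes, not the answerer's exact code. It assumes the CSS attribute selector a[href^="case-details"] (the ^= operator matches a prefix), uses a small inline HTML string as a hypothetical stand-in for the downloaded page, and assumes https://nclt.gov.in/ as the base URL because that is the host in the question's URL.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical stand-in for `requests.get(url).text`, so the sketch runs offline.
html = """
<a href="/order-judge-wise">Judge wise</a>
<a href="order-judgement-date-wise">Date wise</a>
<a href="case-details?bench=Y2hlbm5haQ==&amp;filing_no=MzMwNTExODAwMjQ3MjAxOQ==">Disposed</a>
<a href="case-details?bench=Y2hlbm5haQ==&amp;filing_no=MzMwNTExODAwMjQ4MjAyMA==">Pending</a>
<a>anchor without href</a>
"""

# Assumed base URL, taken from the host of the URL in the question.
base_url = "https://nclt.gov.in/"

soup = BeautifulSoup(html, "html.parser")

# a[href^="case-details"] keeps only <a> tags whose href starts with that prefix;
# the list comprehension turns each relative href into an absolute URL.
url_list1 = [
    urljoin(base_url, a["href"])
    for a in soup.select('a[href^="case-details"]')
]

print(url_list1)
```

The loop in the question finds nothing because link.get_text() returns the visible text of the link, not its href attribute. An equivalent filter without CSS selectors would be to test link.get('href', '').startswith('case-details') inside the loop and append the matches to a list.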