How do I find href links that start with a certain keyword using Beautiful Soup?

Published 2025-01-16 21:50:00 · 3,076 characters · 2 views · 0 comments

The task I am doing right now is very repetitive. In this task I have to go to this website (example page). You can see that each case in the Status column has a hyperlink attached to it. I am trying to find a way to grab the hrefs that start with the keyword case-details, as they are the Status-column links for each particular case, and those hyperlinks contain the details of the cases.

My code:

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import re
url = "https://nclt.gov.in/order-judgement-date-wise-search?bench=Y2hlbm5haQ==&start_date=MDEvMDEvMjAyMQ==&end_date=MDEvMDEvMjAyMg==&page=40"
response = requests.get(url).text
soup = BeautifulSoup(response, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))

Which gives the following output (line numbers added for clarity):

....
44 /order-judge-wise
45 order-judgement-date-wise
46 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjQ3MjAxOQ==
47 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjQ4MjAyMA==
48 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjQ4MjAyMQ==
49 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUwMjAyMQ==
50 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUxMjAyMA==
51 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUyMjAyMA==
52 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUzMjAyMA==
53 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUzMjAyMQ==
54 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjU1MjAyMQ==
55 case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjU3MjAyMQ==
56 order-judgement-date-wise-search?bench=Y2hlbm5haQ==&start_date=MDEvMDEvMjAyMQ==&end_date=MDEvMDEvMjAyMg==&page=39 
57 order-judgement-date-wise-search?bench=Y2hlbm5haQ==&start_date=MDEvMDEvMjAyMQ==&end_date=MDEvMDEvMjAyMg==&page=1 
....

I want to grab the href links that start with "case-details" and put them into a list, which I later use to scrape the details of each case and write them to an Excel file.

So far I have tried to make a loop that looks for these links:

for link in soup.find_all('a'):
    if "case" in link.get_text():
        print(link['href'])

But so far, no success. I would also like to know how to collect the matches into a list.
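The loop above compares "case" against the link's visible text (link.get_text()) rather than against its href, which is why nothing matched. One possible fix, sketched here on an inline HTML snippet rather than the live page, is to pass a function as the href filter to find_all:

```python
from bs4 import BeautifulSoup

html = """
<a href="/order-judge-wise">Judge wise</a>
<a href="case-details?bench=Y2hlbm5haQ==&amp;filing_no=MzM=">Pending</a>
<a>anchor without an href</a>
"""
soup = BeautifulSoup(html, "html.parser")

# The filter function receives each tag's href value (None when the
# attribute is missing); a tag is kept when the function returns truthy.
links = soup.find_all("a", href=lambda h: h and h.startswith("case-details"))

# Collect the matching hrefs into a list.
url_list = [a["href"] for a in links]
print(url_list)
```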

Expected output:

url_list1 = ["case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTAzMjAyMA==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTA1MjAyMQ==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTA2MjAyMQ==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTA3MjAyMQ==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTA4MjAxNw==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTA4MjAyMQ==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTA5MjAyMA==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTA5MjAyMQ==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTEwMjAxOQ==",
"case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMTEwMjAyMQ=="]


Comments (1)

空名 2025-01-23 21:50:00


To select only those <a> tags whose href starts with case-details, you can use a CSS selector:

soup.select('a[href^="case-details"]')

Be aware that you have to prepend a base URL, e.g. with a list comprehension:

['https://nclt.gov.in/'+a['href'] for a in soup.select('a[href^="case-details"]')]
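String concatenation works here because these hrefs are relative and the prefix ends with a slash. A more general option, shown on an inline snippet rather than the live page, is urllib.parse.urljoin, which also copes with hrefs that are already absolute:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

html = '<a href="case-details?bench=Y2hlbm5haQ==&amp;filing_no=MzM=">status</a>'
soup = BeautifulSoup(html, "html.parser")

# urljoin resolves each (possibly relative) href against the base URL.
urls = [urljoin("https://nclt.gov.in/", a["href"])
        for a in soup.select('a[href^="case-details"]')]
print(urls)
```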

Example

import requests
from bs4 import BeautifulSoup

url = "https://nclt.gov.in/order-judgement-date-wise-search?bench=Y2hlbm5haQ==&start_date=MDEvMDEvMjAyMQ==&end_date=MDEvMDEvMjAyMg==&page=40"
response = requests.get(url).text
soup = BeautifulSoup(response, 'html.parser')

urls = ['https://nclt.gov.in/'+a['href'] for a in soup.select('a[href^="case-details"]')]

Output

['https://nclt.gov.in/case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjQ3MjAxOQ==',
 'https://nclt.gov.in/case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjQ4MjAyMA==',
 'https://nclt.gov.in/case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjQ4MjAyMQ==',
 'https://nclt.gov.in/case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUwMjAyMQ==',
 'https://nclt.gov.in/case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUxMjAyMA==',
 'https://nclt.gov.in/case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUyMjAyMA==',
 'https://nclt.gov.in/case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUzMjAyMA==',
 'https://nclt.gov.in/case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjUzMjAyMQ==',
 'https://nclt.gov.in/case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjU1MjAyMQ==',
 'https://nclt.gov.in/case-details?bench=Y2hlbm5haQ==&filing_no=MzMwNTExODAwMjU3MjAyMQ==']
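Since the question mentions writing the case details to an Excel file later, one way to continue is to collect the URLs into a pandas DataFrame and export it. This is only a sketch: the column name and file names are made up, and DataFrame.to_excel requires the openpyxl package, so the CSV variant is the line that actually runs (Excel opens CSV files as well):

```python
import pandas as pd

# Hypothetical sample of collected case-details URLs.
urls = [
    "https://nclt.gov.in/case-details?bench=Y2hlbm5haQ==&filing_no=MzM=",
    "https://nclt.gov.in/case-details?bench=Y2hlbm5haQ==&filing_no=NDQ=",
]

df = pd.DataFrame({"case_url": urls})
# df.to_excel("cases.xlsx", index=False)  # requires the openpyxl package
df.to_csv("cases.csv", index=False)       # CSV opens in Excel too
```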