Removing duplicate links from a scraper I'm making

Posted on 2025-01-27 06:28:32

#!/usr/bin/python3

import requests
from bs4 import BeautifulSoup
import re


url = input("Please enter a URL to scrape: ")
r = requests.get(url)
html = r.text
print(html)

soup = BeautifulSoup(html, "html.parser")

for link in soup.find_all('a', attrs={'href': re.compile("^https://")}):
    print(link.get('href'))


Down at the bottom, where it prints the links... I know the fix will go in there, but I can't think of a way to remove the duplicate entries. Can someone help me with that, please?


Comments (2)

花心好男孩 2025-02-03 06:28:32


Use a set to remove duplicates. You call add() to add an item, and if the item is already present it won't be added again.

Try this:

#!/usr/bin/python3

import requests
from bs4 import BeautifulSoup
import re

url = input("Please enter a URL to scrape: ")
r = requests.get(url)
html = r.text
print(html)

soup = BeautifulSoup(html, "html.parser")
urls = set()
for link in soup.find_all('a', attrs={'href': re.compile(r"^https://")}):
    urls.add(link.get('href'))
print(urls) # urls contains unique set of URLs

Note that some URLs might start with http://, so you may want to use the regexp ^https?:// to catch both http and https URLs.
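For example, a minimal variation of the loop above (assuming the same soup and urls objects):

# The '?' makes the 's' optional, so both http:// and https:// links match.
for link in soup.find_all('a', attrs={'href': re.compile(r"^https?://")}):
    urls.add(link.get('href'))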

You can also use set-comprehension syntax to rewrite the assignment and the for loop like this:

urls = {
    link.get("href")
    for link in soup.find_all("a", attrs={"href": re.compile(r"^https://")})
}
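One follow-up note: a set is unordered, so the links may come out in any order. A minimal sketch for stable output (assuming the urls set built above):

# sorted() yields the unique URLs in alphabetical order.
for u in sorted(urls):
    print(u)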
并安 2025-02-03 06:28:32


Instead of printing each link immediately, you need to collect the links somewhere so you can compare them and drop the duplicates.

Try this:

find_all gives you a list with all the results, and making it a set removes the duplicates.

data = set(link.get('href') for link in soup.find_all('a', attrs={'href': re.compile("^https://")}))

for elem in data:
    print(elem)
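One caveat (not part of the original answer): a set discards the order in which the links appeared on the page. If order matters, dict.fromkeys deduplicates while preserving insertion order; a minimal sketch assuming the same soup object:

# dict keys are unique and keep insertion order (Python 3.7+).
data = dict.fromkeys(
    link.get('href')
    for link in soup.find_all('a', attrs={'href': re.compile(r"^https://")})
)

for elem in data:
    print(elem)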