Removing duplicate links from a scraper I'm making

Posted on 2025-01-27 06:28:32

#!/usr/bin/python3

import requests
from bs4 import BeautifulSoup
import re


url = input("Please enter a URL to scrape: ")
r = requests.get(url)
html = r.text
print(html)

soup = BeautifulSoup(html, "html.parser")

for link in soup.find_all('a', attrs={'href': re.compile("^https://")}):
    print(link.get('href'))


Down at the bottom, where it prints the links... I know the fix will go in there, but I can't think of a way to remove the duplicate entries. Can someone help me with that, please?


Comments (2)

花心好男孩 2025-02-03 06:28:32


Use a set to remove duplicates. You call add() to add an item, and if the item is already present it won't be added again.

Try this:

#!/usr/bin/python3

import requests
from bs4 import BeautifulSoup
import re

url = input("Please enter a URL to scrape: ")
r = requests.get(url)
html = r.text
print(html)

soup = BeautifulSoup(html, "html.parser")
urls = set()
for link in soup.find_all('a', attrs={'href': re.compile(r"^https://")}):
    urls.add(link.get('href'))
print(urls) # urls contains unique set of URLs

Note that some URLs might start with http://, so you may want to use the regexp ^https?:// to catch both http and https URLs.
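For example, a minimal variation of the loop above (assuming the same soup and urls objects):

# The '?' makes the 's' optional, so both http:// and https:// links match.
for link in soup.find_all('a', attrs={'href': re.compile(r"^https?://")}):
    urls.add(link.get('href'))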

You can also use set-comprehension syntax to rewrite the assignment and the for loop like this:

urls = {
    link.get("href")
    for link in soup.find_all("a", attrs={"href": re.compile(r"^https://")})
}
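One follow-up note: a set is unordered, so the links may come out in any order. A minimal sketch for stable output (assuming the urls set built above):

# sorted() yields the unique URLs in alphabetical order.
for u in sorted(urls):
    print(u)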
并安 2025-02-03 06:28:32


Instead of printing each link immediately, you need to collect the links somewhere so you can compare them and drop the duplicates.

Try this:

find_all gives you a list with all the results, and making it a set removes the duplicates.

data = set(link.get('href') for link in soup.find_all('a', attrs={'href': re.compile("^https://")}))

for elem in data:
    print(elem)
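One caveat (not part of the original answer): a set discards the order in which the links appeared on the page. If order matters, dict.fromkeys deduplicates while preserving insertion order; a minimal sketch assuming the same soup object:

# dict keys are unique and keep insertion order (Python 3.7+).
data = dict.fromkeys(
    link.get('href')
    for link in soup.find_all('a', attrs={'href': re.compile(r"^https://")})
)

for elem in data:
    print(elem)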