Removing duplicate links from a scraper I'm making
#!/usr/bin/python3
import requests
from bs4 import BeautifulSoup
import re

url = input("Please enter a URL to scrape: ")
r = requests.get(url)
html = r.text
print(html)

soup = BeautifulSoup(html, "html.parser")
# prints every https link on the page, including duplicates
for link in soup.find_all('a', attrs={'href': re.compile("^https://")}):
    print(link.get('href'))
Down at the bottom, where it prints the links... I know the duplicates will end up in there, but I can't think of a way to remove the duplicate entries. Can someone help me with that, please?
2 Answers
Use a set to remove duplicates. You call add() to add an item, and if the item is already present it won't be added again. Try this:
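The answer's original code block did not survive extraction; here is a minimal sketch of the approach it describes, assuming the same setup as the question:

#!/usr/bin/python3
import requests
from bs4 import BeautifulSoup
import re

url = input("Please enter a URL to scrape: ")
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

links = set()
for link in soup.find_all('a', attrs={'href': re.compile("^https://")}):
    links.add(link.get('href'))  # add() is a no-op if the href is already in the set

for link in links:
    print(link)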
Note that some URLs might start with http://, so you may want to use the regexp ^https?:// to catch both HTTP and HTTPS URLs. You can also use set comprehension syntax to rewrite the assignment and the for statement, like this:
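Again, the original snippet was lost; reusing soup and re from the sketch above, the comprehension form (with the broader ^https?:// pattern) might look like this:

# a set comprehension builds the deduplicated set in one statement
links = {link.get('href')
         for link in soup.find_all('a', attrs={'href': re.compile("^https?://")})}
for link in links:
    print(link)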
Instead of printing each link, you need to capture the results somehow so you can compare them.
Try this:
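This answer's code block was also lost in extraction; a minimal sketch consistent with its description (the variable names are mine):

#!/usr/bin/python3
import requests
from bs4 import BeautifulSoup
import re

url = input("Please enter a URL to scrape: ")
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

# collect every matching href into a list first
hrefs = [link.get('href')
         for link in soup.find_all('a', attrs={'href': re.compile("^https://")})]
for href in set(hrefs):  # set() collapses the duplicates
    print(href)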
You get a list with all the results from find_all, then make it a set.