Clean up web-scraped data and combine it together?
The website URL is https://www.justia.com/lawyers/criminal-law/maine
I want to scrape only each lawyer's name and where their office is.
import requests
from bs4 import BeautifulSoup

url = "https://www.justia.com/lawyers/criminal-law/maine"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# grab each lawyer's profile link and print its first text node
Lawyer_name = soup.find_all("a", "url main-profile-link")
for i in Lawyer_name:
    print(i.find(text=True))

# grab each address block and print all of its text nodes
address = soup.find_all("span", "-address -hide-landscape-tablet")
for x in address:
    print(x.find_all(text=True))
The name prints out just fine, but the address prints with extra whitespace that I want to remove:
['\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t88 Hammond Street', '\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tBangor,\t\t\t\t\tME 04401\t\t\t\t\t\t ']
So the output I'm attempting to get for each lawyer is like this (the first one, for example):
Hunter J Tzovarras
88 Hammond Street
Bangor, ME 04401
Two issues I'm trying to figure out:
- How can I clean up the address so it is easier to read?
- How can I save the matching lawyer name with the address so they don't get mixed up?
2 Answers
Use x.get_text() instead of x.find_all.
Full working code:
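A minimal sketch, assuming the question's URL and the same class selectors; zip() pairs each name with its address (the two lists come back in page order), and split()/join() strip the stray tabs and newlines:

import requests
from bs4 import BeautifulSoup

url = "https://www.justia.com/lawyers/criminal-law/maine"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

lawyer_names = soup.find_all("a", "url main-profile-link")
addresses = soup.find_all("span", "-address -hide-landscape-tablet")

for name, addr in zip(lawyer_names, addresses):
    # get_text(strip=True) drops the surrounding whitespace in one call
    print(name.get_text(strip=True))
    # get_text() returns one string; splitting on newlines and
    # re-joining each line's words collapses the runs of tabs
    for line in addr.get_text().split("\n"):
        cleaned = " ".join(line.split())
        if cleaned:
            print(cleaned)
    print()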
Output:
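Hunter J Tzovarras
88 Hammond Street
Bangor, ME 04401

(first entry shown, per the question's example; the loop prints one block like this per lawyer)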
对于您的第二个查询,您可以将它们保存到这样的字典中 -
输出字典 -
For your second question, you can save them into a dictionary like this:
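Another sketch, reusing the lawyer_names and addresses lists from the snippet above; each name becomes a key and its cleaned, one-line address the value:

lawyers = {}
for name, addr in zip(lawyer_names, addresses):
    # collapse all tabs/newlines in the address into single spaces
    lawyers[name.get_text(strip=True)] = " ".join(addr.get_text().split())

print(lawyers)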
Output dictionary:
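{'Hunter J Tzovarras': '88 Hammond Street Bangor, ME 04401', ...}

(first entry taken from the question's example; the remaining lawyers follow the same pattern)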