Getting information from child tags

Posted on 2025-02-13 04:54:22 | Word count: 1926 | Views: 6 | Comments: 0

I'm trying to retrieve information from a site by web scraping. The information I need is in child tags, but I can't extract it:

<div class="ergov3-txtannonce">
 <div class="ergov3-h3"><span>
 House
 3
 pièces,                                                                                                         
74 m²
 </span>
 <cite>
 New York (11111)
 </cite>
 </div>
</div>,
 <div class="ergov3-txtannonce">
 <div class="ergov3-h3"><span>
 Appartement
 3
 pièces,                                                                                                         
64 m²
 </span>
 <cite>
 Los Angeles (22222)
 </cite>
 </div>
 <div class="ergov3-txtannonce">
 <div class="ergov3-h3"><span>
 House
 4
 pièces,                                                                                                         
81 m²
 </span>
 <cite>
 Chicago (33333)
 </cite>
 </div>

I'm trying to get the ad and the city. I tried:

# BeautifulSoup
from bs4 import BeautifulSoup
import requests

# to get: House 3 pièces, 74 m²
ad = [ad.get_text() for ad in soup.find_all("span", class_='ergov3-txtannonce')]

# to get cities
cities = [city.get_text() for city in soup.find_all("cite", class_='ergov3-txtannonce')]

My output:

[]
[]

Expected output:

["House 3 pièces, 74 m²", "Appartement 3 pièces, 64 m²", "House 4 pièces, 81 m²"]
["New York (11111)", "Los Angeles (22222)", "Chicago (33333)"]

Comments (1)

天生の放荡 2025-02-20 04:54:23

Assuming your soup contains the provided HTML, select the elements that hold your information and iterate over the ResultSet to scrape it. Avoid multiple lists; try to scrape all the information in one pass and store it in a more structured way:

...
data = []

for e in soup.select('.ergov3-txtannonce'):
    data.append({
        'title':e.span.get_text(strip=True),
        'city':e.cite.get_text(strip=True)
    })
...
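
For context on the empty lists in the question: `find_all("span", class_='ergov3-txtannonce')` looks for `<span>` tags that themselves carry that class, whereas in the posted HTML the class sits on the enclosing `<div>`. If two separate lists are still wanted, a minimal sketch with CSS descendant selectors could look like this. The one-line HTML is a simplified stand-in for illustration, not the real page:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page; the real markup may differ.
html = """
<div class="ergov3-txtannonce">
 <div class="ergov3-h3"><span> House 3 pièces, 74 m² </span>
 <cite> New York (11111) </cite></div>
</div>
<div class="ergov3-txtannonce">
 <div class="ergov3-h3"><span> Appartement 3 pièces, 64 m² </span>
 <cite> Los Angeles (22222) </cite></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# The class is on the outer <div>, not on <span>/<cite>, so this matches nothing.
no_match = soup.find_all("span", class_="ergov3-txtannonce")

# Descendant selectors: any <span>/<cite> inside an element with that class.
ads = [s.get_text(strip=True) for s in soup.select(".ergov3-txtannonce span")]
cities = [c.get_text(strip=True) for c in soup.select(".ergov3-txtannonce cite")]

print(no_match)
print(ads)
print(cities)
```

Two parallel lists can drift out of step if one block lacks a `<cite>`, which is one reason the per-block dictionary approach is usually easier to keep consistent.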

Note: If the elements are not present in your soup, the website's content may be provided dynamically by JavaScript. That would be best asked as a new question with exactly that focus.
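
One way to check this before going further is to test whether the selector matches anything in the HTML the server actually returned. The sketch below is only an illustration: the helper name `has_listings` and both sample strings are hypothetical, and in practice the input would be the `.text` of a `requests` response.

```python
from bs4 import BeautifulSoup

def has_listings(html_text, selector=".ergov3-txtannonce"):
    """Return True if at least one element matches the selector."""
    soup = BeautifulSoup(html_text, "html.parser")
    return soup.select_one(selector) is not None

# Hypothetical responses: one with the markup in the raw HTML,
# one where a script would fill an empty container in the browser.
static_page = '<div class="ergov3-txtannonce"><span>House</span></div>'
js_page = '<div id="app"></div><script src="bundle.js"></script>'

print(has_listings(static_page))
print(has_listings(js_page))
```

If the check fails for the downloaded page while the elements are visible in the browser, the content is probably rendered client-side.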

Example
from bs4 import BeautifulSoup

html='''
<div class="ergov3-txtannonce">
 <div class="ergov3-h3"><span>
 House 3 pièces, 74 m²
 </span>
 <cite>
 New York (11111)
 </cite>
 </div>
</div>,
 <div class="ergov3-txtannonce">
 <div class="ergov3-h3"><span>
 Appartement 3 pièces, 64 m²
 </span>
 <cite>
 Los Angeles (22222)
 </cite>
 </div>
 <div class="ergov3-txtannonce">
 <div class="ergov3-h3"><span>
 House 4 pièces, 81 m²
 </span>
 <cite>
 Chicago (33333)
 </cite>
 </div>
'''
# Name the parser explicitly to avoid bs4's GuessedAtParserWarning
soup = BeautifulSoup(html, 'html.parser')

data = []

for e in soup.select('.ergov3-txtannonce'):
    data.append({
        'title':e.span.get_text(strip=True),
        'city':e.cite.get_text(strip=True)
    })

data
Output
[{'title': 'House 3 pièces, 74 m²', 'city': 'New York (11111)'},
 {'title': 'Appartement 3 pièces, 64 m²', 'city': 'Los Angeles (22222)'},
 {'title': 'House 4 pièces, 81 m²', 'city': 'Chicago (33333)'}]
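
In the HTML posted in the question, the text inside each `<span>` is spread over several lines, and `get_text(strip=True)` only trims the ends of a string. A sketch of one way to collapse the inner whitespace as well, and to skip blocks that lack a `<span>` or `<cite>`, follows. The helper name `clean` and the second, empty block are assumptions added for illustration:

```python
from bs4 import BeautifulSoup

html = """
<div class="ergov3-txtannonce">
 <div class="ergov3-h3"><span>
 House
 3
 pièces,
74 m²
 </span>
 <cite>
 New York (11111)
 </cite>
 </div>
</div>
<div class="ergov3-txtannonce"><p>no span or cite here</p></div>
"""

def clean(tag):
    """Collapse every run of whitespace in the tag's text to one space."""
    return " ".join(tag.get_text().split())

soup = BeautifulSoup(html, "html.parser")

data = []
for e in soup.select(".ergov3-txtannonce"):
    # e.span / e.cite are None when the tag is missing; skip those blocks.
    if e.span is None or e.cite is None:
        continue
    data.append({"title": clean(e.span), "city": clean(e.cite)})

print(data)
```

Without the `None` check, a block missing either tag would raise an `AttributeError` on `.get_text()`.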