Getting information from child tags

Posted on 2025-02-13 04:54:22

I'm trying to retrieve information from a site by web scraping. The information I need is in child tags, but I'm not able to get it:

<div class="ergov3-txtannonce">
 <div class="ergov3-h3"><span>
 House
 3
 pièces,                                                                                                         
74 m²
 </span>
 <cite>
 New York (11111)
 </cite>
 </div>
</div>,
 <div class="ergov3-txtannonce">
 <div class="ergov3-h3"><span>
 Appartement
 3
 pièces,                                                                                                         
64 m²
 </span>
 <cite>
 Los Angeles (22222)
 </cite>
 </div>
 <div class="ergov3-txtannonce">
 <div class="ergov3-h3"><span>
 House
 4
 pièces,                                                                                                         
81 m²
 </span>
 <cite>
 Chicago (33333)
 </cite>
 </div>

I'm trying to get the ad and the city. I tried:

#BeautifulSoup
from bs4 import BeautifulSoup
import requests

#to get: House 3 pièces, 74 m²
ad = [ad.get_text() for ad in soup.find_all("span", class_='ergov3-txtannonce')]  

#to get cities       
cities = [city.get_text() for city in soup.find_all("cite", class_='ergov3-txtannonce')]

My output:

[]
[]

Expected output:

["House 3 pièces, 74 m²", "Appartement 3 pièces, 64 m²", "House 4 pièces, 81 m²"]                                                                                                       
["New York (11111)", "Los Angeles (22222)", "Chicago (33333)"]


天生の放荡 2025-02-20 04:54:23

Assuming your soup contains the provided HTML, select the elements that hold your information and iterate over the ResultSet to scrape it. Avoid multiple lists; try to scrape all the information in one go and save it in a more structured way:

...
data = []

for e in soup.select('.ergov3-txtannonce'):
    data.append({
        'title':e.span.get_text(strip=True),
        'city':e.cite.get_text(strip=True)
    })
...
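
For context on why the attempt in the question returned empty lists: `find_all("span", class_='ergov3-txtannonce')` looks for a `<span>` that itself carries that class, but in the posted HTML the class sits on the enclosing `<div>`. Below is a minimal sketch of a descendant selector that keeps the two-list shape the question asked for. It assumes a trimmed, well-formed copy of the question's markup, written here only for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed, well-formed copy of the question's markup (an assumption for illustration).
html = '''
<div class="ergov3-txtannonce">
 <div class="ergov3-h3"><span>House 3 pièces, 74 m²</span>
 <cite>New York (11111)</cite></div>
</div>
<div class="ergov3-txtannonce">
 <div class="ergov3-h3"><span>Appartement 3 pièces, 64 m²</span>
 <cite>Los Angeles (22222)</cite></div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# The class is on the <div>, not on <span>/<cite>, so this matches nothing.
wrong = soup.find_all('span', class_='ergov3-txtannonce')

# In CSS, "A B" means "any B somewhere inside an A".
ads = [s.get_text(strip=True) for s in soup.select('.ergov3-txtannonce span')]
cities = [c.get_text(strip=True) for c in soup.select('.ergov3-txtannonce cite')]

print(wrong)
print(ads)
print(cities)
```

Two parallel lists can drift out of step if one listing lacks a `<cite>`, which is one reason the per-listing dictionary approach above is usually the safer choice.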

Note: If the elements are not present in your soup, the website's content may be provided dynamically by JavaScript. That would be a good reason to ask a new question with exactly that focus.
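
One way to act on this note is to check for the container elements before extracting anything. This is a rough sketch, assuming the class name from the question; the helper `has_listings` and both sample strings are hypothetical, introduced only for illustration.

```python
from bs4 import BeautifulSoup

def has_listings(html, selector='.ergov3-txtannonce'):
    """Return True if the parsed HTML contains at least one element matching selector."""
    soup = BeautifulSoup(html, 'html.parser')
    return soup.select_one(selector) is not None

# Hypothetical sample inputs: one page with the listing in the raw HTML,
# one page that only ships an empty shell plus a script tag.
static_page = '<div class="ergov3-txtannonce"><span>House</span></div>'
js_shell = '<div id="app"></div><script src="bundle.js"></script>'

print(has_listings(static_page))
print(has_listings(js_shell))
```

If the check fails on the text you actually downloaded, the listings are probably not in the raw response, and no selector will find them there.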

Example

from bs4 import BeautifulSoup

html='''
<div class="ergov3-txtannonce">
 <div class="ergov3-h3"><span>
 House 3 pièces, 74 m²
 </span>
 <cite>
 New York (11111)
 </cite>
 </div>
</div>,
 <div class="ergov3-txtannonce">
 <div class="ergov3-h3"><span>
 Appartement 3 pièces, 64 m²
 </span>
 <cite>
 Los Angeles (22222)
 </cite>
 </div>
 <div class="ergov3-txtannonce">
 <div class="ergov3-h3"><span>
 House 4 pièces, 81 m²
 </span>
 <cite>
 Chicago (33333)
 </cite>
 </div>
'''
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, 'html.parser')

data = []

for e in soup.select('.ergov3-txtannonce'):
    data.append({
        'title':e.span.get_text(strip=True),
        'city':e.cite.get_text(strip=True)
    })

data

Output

[{'title': 'House 3 pièces, 74 m²', 'city': 'New York (11111)'},
 {'title': 'Appartement 3 pièces, 64 m²', 'city': 'Los Angeles (22222)'},
 {'title': 'House 4 pièces, 81 m²', 'city': 'Chicago (33333)'}]
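
The example above puts each title on a single line. In the HTML as posted in the question, the title spans several lines, and `get_text(strip=True)` only trims the ends of that single text node, so the inner line breaks would survive. Here is a small sketch of one way to collapse them with plain `str.split` and `str.join`; the helper name `clean` is hypothetical.

```python
from bs4 import BeautifulSoup

# First listing from the question, with the title spread over several lines.
html = '''
<div class="ergov3-txtannonce">
 <div class="ergov3-h3"><span>
 House
 3
 pièces,        
74 m²
 </span>
 <cite>
 New York (11111)
 </cite>
 </div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

def clean(tag):
    # split() with no arguments breaks on any run of whitespace, including newlines
    return ' '.join(tag.get_text().split())

data = [
    {'title': clean(e.span), 'city': clean(e.cite)}
    for e in soup.select('.ergov3-txtannonce')
]
print(data)
```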
