BeautifulSoup assistance
I am trying to scrape the following website (https://www.english-heritage.org.uk/visit/blue-plaques/#?pageBP=1&sizeBP=12&borBP=0&keyBP=&catBP=0) and ultimately am interested in storing some of the data inside each 'li class="search-result-item"' to perform further analytics.
Example of one "search-result-item". I want to capture the <h3>, <span class="plaque-role"> and <span class="plaque-location"> in a Python dictionary:
<li class="search-result-item"><a href="/visit/blue-plaques/helen-gwynne-vaughan/"><img class="search-result-image max-width" src="/siteassets/home/visit/blue-plaques/find-a-plaque/blue-plaques-f-j/helen-gwynne-vaughan-plaque.jpg?w=732&h=465&mode=crop&scale=both&cache=always&quality=60&anchor=&WebsiteVersion=20220516171525" alt="" title=""><div class="search-result-info"><h3>GWYNNE-VAUGHAN, Dame Helen (1879-1967)</h3><span class="plaque-role">Botanist and Military Officer</span><span class="plaque-location">Flat 93, Bedford Court Mansions, Fitzrovia, London, WC1B 3AE, London Borough of Camden</span></div></a></li>
So far I am trying to isolate all the "search-result-item" elements, but my current code prints absolutely nothing. If someone can help me sort that problem out and point me in the right direction for storing each data element in a Python dictionary, I would be very grateful.
from bs4 import BeautifulSoup
import requests
url = 'https://www.english-heritage.org.uk/visit/blue-plaques/#?pageBP=1&sizeBP=12&borBP=0&keyBP=&catBP=0'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
#print(soup.prettify())
# find_all returns a list of tags, so get_text() has to be called on each tag
print([item.get_text() for item in soup.find_all(class_='search-result-item')])
Comments (2)
The content is generated dynamically by JavaScript, so you won't find the elements or information you are looking for with BeautifulSoup. Instead, use their API.
Example:
Output:
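The answer's own example and output were not preserved here, so the following is only a minimal sketch of the general approach, not the original code. It assumes the site exposes a JSON endpoint that accepts the same query parameters as the page URL in the question. API_URL is a placeholder (the real address would have to be read from the browser's Network tab), and the key names "title", "role" and "location" are hypothetical, as are the helper names fetch_json and pick_fields.

```python
import requests

# Placeholder only: replace with the endpoint seen in the browser's Network tab.
API_URL = "https://example.com/api/plaque-search"  # hypothetical, NOT the site's real endpoint

# Query parameters copied from the fragment of the page URL in the question.
PARAMS = {"pageBP": 1, "sizeBP": 12, "borBP": 0, "keyBP": "", "catBP": 0}


def fetch_json(url, params):
    """Request the endpoint and decode its JSON body."""
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()


def pick_fields(items, fields):
    """Keep only the wanted keys from each result: one dictionary per plaque."""
    return [{name: item.get(name) for name in fields} for item in items]


# Offline demonstration with an illustrative payload; the key names are assumptions
# and the real response may be shaped differently.
sample_items = [
    {
        "title": "GWYNNE-VAUGHAN, Dame Helen (1879-1967)",
        "role": "Botanist and Military Officer",
        "location": "Flat 93, Bedford Court Mansions, Fitzrovia, London, "
                    "WC1B 3AE, London Borough of Camden",
        "path": "/visit/blue-plaques/helen-gwynne-vaughan/",
    }
]
records = pick_fields(sample_items, ["title", "role", "location"])
print(records)
```

Against the real endpoint, the call would look like pick_fields(fetch_json(API_URL, PARAMS), [...]), adjusted to wherever the list of results sits inside the response.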
You're not getting anything because the search results are generated by JavaScript. Use the API endpoint they fetch the data from.
For example:
Output:
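To illustrate the point about JavaScript, here is a small sketch (not the answerer's original example, which was not preserved). It parses a trimmed copy of the list item from the question, with the image tag left out, and shows that the BeautifulSoup lookups themselves work once the rendered markup is available. The empty result in the question comes from the downloaded HTML not containing those items. The helper name parse_plaque and the dictionary keys are illustrative choices.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the rendered list item shown in the question (image tag omitted).
rendered_html = (
    '<ul><li class="search-result-item">'
    '<a href="/visit/blue-plaques/helen-gwynne-vaughan/">'
    '<div class="search-result-info">'
    '<h3>GWYNNE-VAUGHAN, Dame Helen (1879-1967)</h3>'
    '<span class="plaque-role">Botanist and Military Officer</span>'
    '<span class="plaque-location">Flat 93, Bedford Court Mansions, Fitzrovia, '
    'London, WC1B 3AE, London Borough of Camden</span>'
    '</div></a></li></ul>'
)

# Stand-in for what requests downloads: the list is empty until JavaScript fills it.
static_html = '<ul class="search-results"></ul>'


def parse_plaque(item):
    """Turn one search-result-item tag into a dictionary."""
    return {
        "name": item.h3.get_text(strip=True),
        "role": item.find("span", class_="plaque-role").get_text(strip=True),
        "location": item.find("span", class_="plaque-location").get_text(strip=True),
    }


def parse_all(html):
    """Collect one dictionary per search-result-item found in the markup."""
    soup = BeautifulSoup(html, "html.parser")
    return [parse_plaque(li) for li in soup.find_all("li", class_="search-result-item")]


plaques = parse_all(rendered_html)   # one dictionary
nothing = parse_all(static_html)     # empty list, as in the question
print(plaques)
print(nothing)
```

The same parse_plaque helper would apply to markup obtained from a browser-driven tool, but reading the data from the API endpoint avoids parsing HTML altogether.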