How do I scrape the text of a span that is a sibling of another span?

Posted on 2025-01-31 18:44:55 | 955 characters | 1 view | 0 comments

Hello, I'm trying to learn how to web scrape, so I started by trying to scrape my school's menu.

I've run into a problem where I can't get the menu items under a span class; instead I get the word that sits inside the span itself, "show".

Here is the code I am working with:

from bs4 import BeautifulSoup
from selenium import webdriver

# Path to the ChromeDriver binary (changed for this post)
driver = webdriver.Chrome(executable_path='chromedriver.exe')
driver.get('https://housing.ucdavis.edu/dining/menus/dining-commons/tercero/')
results = []
content = driver.page_source
soups = BeautifulSoup(content, 'html.parser')
# This selects only the status spans, whose own text is "show"
element = soups.find_all('span', class_='collapsible-heading-status')
for span in element:
    print(span.text)

I tried span.span.text, but that didn't return anything. Can someone give me some pointers on how to extract the info under the collapsible-heading-status class?


梦里兽 2025-02-07 18:44:55

Yummy waffles. As mentioned, they are gone, but to reach your goal, one approach is to select the names via CSS selectors using the adjacent sibling combinator:

for e in soup.select('.collapsible-heading-status + span'):
    print(e.text)

Or with find_next_sibling():

for e in soup.find_all('span', class_='collapsible-heading-status'):
    print(e.find_next_sibling('span').text)
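
To try both approaches without opening a browser, here is a minimal sketch that runs them against a small inline HTML string. The markup in SAMPLE_HTML and the helper names names_by_selector and names_by_sibling are assumptions made for illustration; they are not taken from the real page, whose structure may differ. The second helper also skips a status span that has no following span, since find_next_sibling() returns None in that case.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an assumption about the page structure, not copied from it.
SAMPLE_HTML = """
<ul>
  <li><span class="collapsible-heading-status">show</span><span>Housemade Granola</span></li>
  <li><span class="collapsible-heading-status">show</span><span>Vanilla Chia Seed Pudding</span></li>
  <li><span class="collapsible-heading-status">show</span></li>
</ul>
"""


def names_by_selector(soup):
    """Collect item names with the CSS adjacent sibling combinator (+)."""
    return [e.get_text(strip=True)
            for e in soup.select('.collapsible-heading-status + span')]


def names_by_sibling(soup):
    """Collect item names with find_next_sibling(), skipping missing siblings."""
    names = []
    for status in soup.find_all('span', class_='collapsible-heading-status'):
        sibling = status.find_next_sibling('span')
        if sibling is not None:  # the last <li> above has no name span
            names.append(sibling.get_text(strip=True))
    return names


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
print(names_by_selector(soup))
print(names_by_sibling(soup))
```

Both calls should print the same two names for this sample. The selector version is shorter, while the explicit loop makes it easier to handle entries where the name span is absent.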

Example

To get all the information for each item in a structured way, you could use:

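from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://housing.ucdavis.edu/dining/menus/dining-commons/tercero/")

# Parse the rendered page, then close the browser
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

data = []

# Each .nutrition block comes after the meal (h4), title (h5) and item name (span)
for e in soup.select('.nutrition'):
    d = {
        'meal': e.find_previous('h4').text,
        'title': e.find_previous('h5').text,
        'name': e.find_previous('span').text,
        'description': e.p.text
    }
    # Each h6 label is followed by an element holding its value
    d.update({n.text: n.find_next().text.strip(': ') for n in e.select('h6')})
    data.append(d)
data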

Output

[{'meal': 'Breakfast',
  'title': 'Fresh Inspirations',
  'name': 'Vanilla Chia Seed Pudding with Blueberrries',
  'description': 'Vanilla chia seed pudding with blueberries, shredded coconut, and toasted almonds',
  'Serving Size': '1 serving',
  'Calories': '392.93',
  'Fat (g)': '36.34',
  'Carbohydrates (g)': '17.91',
  'Protein (g)': '4.59',
  'Allergens': 'Tree Nuts/Coconut',
  'Ingredients': 'Coconut milk, chia seeds, beet sugar, imitation vanilla (water, vanillin, caramel color, propylene glycol, ethyl vanillin, potassium sorbate), blueberries, shredded sweetened coconut (desiccated coconut processed with sugar, water, propylene glycol, salt, sodium metabisulfite), blanched slivered almonds'},
 {'meal': 'Breakfast',
  'title': 'Fresh Inspirations',
  'name': 'Housemade Granola',
  'description': 'Crunchy and sweet granola made with mixed nuts and old fashioned rolled oats',
  'Serving Size': '1/2 cup',
  'Calories': '360.18',
  'Fat (g)': '17.33',
  'Carbohydrates (g)': '47.13',
  'Protein (g)': '8.03',
  'Allergens': 'Gluten/Wheat/Dairy/Peanuts/Tree Nuts',
  'Ingredients': 'Old fashioned rolled oats (per manufacturer, may contain wheat/gluten), sunflower seeds, seedless raisins, unsalted butter, pure clover honey, peanut-free mixed nuts (cashews, almonds, sunflower oil and/or cottonseed oil, pecans, hazelnuts, dried Brazil nuts, salt), light brown beet sugar, molasses'},...]
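
The same backward-looking lookup can be checked offline. The sketch below applies find_previous() to a small inline document; SAMPLE_HTML and the function name parse_menu are hypothetical, and the markup is only a guess at how a heading, a name span, and a .nutrition block might be nested on the page. Because the sample markup is assumed, the value for each h6 label is read with an explicit find_next('span') rather than the argument-free find_next() used above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a guess at the nesting, not copied from the real page.
SAMPLE_HTML = """
<h4>Breakfast</h4>
<h5>Fresh Inspirations</h5>
<ul>
  <li>
    <span class="collapsible-heading-status">show</span><span>Housemade Granola</span>
    <ul class="nutrition">
      <li><p>Crunchy and sweet granola</p></li>
      <li><h6>Serving Size</h6><span>: 1/2 cup</span></li>
      <li><h6>Calories</h6><span>: 360.18</span></li>
    </ul>
  </li>
</ul>
"""


def parse_menu(html):
    """Return one dict per .nutrition block found in the given HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    items = []
    for block in soup.select('.nutrition'):
        item = {
            # find_previous() walks backwards through the document from the block
            'meal': block.find_previous('h4').get_text(strip=True),
            'title': block.find_previous('h5').get_text(strip=True),
            'name': block.find_previous('span').get_text(strip=True),
            'description': block.p.get_text(strip=True),
        }
        # Pair each h6 label with the text of the span that follows it
        for label in block.select('h6'):
            value = label.find_next('span').get_text().strip(': ')
            item[label.get_text(strip=True)] = value
        items.append(item)
    return items


print(parse_menu(SAMPLE_HTML))
```

On the real page, any of these lookups can return None if an element is missing, so a production scraper would likely want to check each result before reading its text.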
