Web scraping UL & LI tags

Published 2025-02-12 04:08:15 · 1379 characters · 1 view · 0 comments


I am trying to scrape the ul & li tags on Capterra product pages. The information I want to extract and store in separate variables is the "Located in <country>" text, the URL address, and the product features.

Currently, I only know how to print the text for everything in the ul & li, not a specific item.

Code:

from bs4 import BeautifulSoup as bs  # the code below calls bs(...), so this import is required
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://www.capterra.com/p/81310/AMCS/")

companyProfile = bs(driver.page_source, 'html.parser')

url = companyProfile.find("ul", class_="nb-type-md nb-list-undecorated undefined").text

features = companyProfile.find("div", class_="nb-col-count-1 sm:nb-col-count-2 md:nb-col-count-3 nb-col-gap-xl nb-my-0 nb-mx-auto").text 

print(url)
print(features)

driver.close()

Output:

AMCSLocated in United StatesFounded in 2004http://www.amcsgroup.com/
Billing & InvoicingBrokerage ManagementBuy / Sell TicketingContainer ManagementCustomer AccountsCustomer DatabaseDispatch ManagementElectronics RecyclingEquipment TrackingFingerprint ScanningID ScanningIntegrated CamerasInventory ManagementInventory TrackingLogistics Management

How do I get only the url and the country, and how do I get the features neatly?


I was able to get the URL and the location by:

url = driver.find_element(By.XPATH, "//*[starts-with(., 'http')]").text

location = driver.find_element(By.XPATH, "//*[starts-with(., 'Located in')]").text

Still looking for a solution for the features.
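For the features, the same BeautifulSoup approach can produce a clean Python list instead of one run-on string, by collecting the text of each child element separately. Below is a minimal, self-contained sketch: the div class name is taken from the question's code, but the inline HTML is a hypothetical stand-in for the live Capterra page, which assumes each feature sits in its own child element (here, a span).

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the features <div> on the live page;
# the class string is copied from the question's code.
html = """
<div class="nb-col-count-1 sm:nb-col-count-2 md:nb-col-count-3 nb-col-gap-xl nb-my-0 nb-mx-auto">
  <span>Billing &amp; Invoicing</span>
  <span>Brokerage Management</span>
  <span>Container Management</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
features_div = soup.find(
    "div",
    class_="nb-col-count-1 sm:nb-col-count-2 md:nb-col-count-3 nb-col-gap-xl nb-my-0 nb-mx-auto",
)

# One list entry per feature, instead of .text mashing them together.
features = [span.get_text(strip=True) for span in features_div.find_all("span")]
print(features)
```

On the real page, `driver.page_source` would replace the inline HTML; the key change from the question's code is iterating over the div's children rather than calling `.text` on the whole div.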

Comments (1)

挖鼻大婶 2025-02-19 04:08:15


The following code will pull each text node value separately from the ul > li tags:

soup = BeautifulSoup(driver.page_source, 'html.parser')

# Match the profile <ul> by its class, then read the span inside each
# of its first four <li> children. The selectors are scoped to the
# matched <ul>, so the class string does not need to be repeated.
for ul in soup.find_all('ul', class_="nb-type-md nb-list-undecorated undefined"):
    name = ul.select_one('li:nth-child(1) > span').get_text()
    location = ul.select_one('li:nth-child(2) > span').get_text()
    year = ul.select_one('li:nth-child(3) > span').get_text()
    link = ul.select_one('li:nth-child(4) > span').get_text()

    print(name)
    print(location)
    print(year)
    print(link)

Output:

AMCS
Located in United States 
Founded in 2004
http://www.amcsgroup.com/

Update:

li=[x.get_text() for x in soup.select('[class="nb-type-md nb-list-undecorated undefined"] li span')]
print(li) 

Output:

['AMCS', 'Located in United States', 'Founded in 2004', 'http://www.amcsgroup.com/']
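Since the list comprehension returns the four fields in document order, they can also be unpacked into named variables in one step. A self-contained sketch, using a hypothetical stand-in HTML fragment in place of the live page:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the profile <ul> on the live page.
html = """
<ul class="nb-type-md nb-list-undecorated undefined">
  <li><span>AMCS</span></li>
  <li><span>Located in United States</span></li>
  <li><span>Founded in 2004</span></li>
  <li><span>http://www.amcsgroup.com/</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
values = [x.get_text(strip=True)
          for x in soup.select('[class="nb-type-md nb-list-undecorated undefined"] li span')]

# Tuple unpacking assigns each field to its own variable.
name, location, year, link = values
print(location)
print(link)
```

This keeps each piece of information in the separate variables the question asked for, rather than in one anonymous list.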