Web scraping UL & LI tags

Published 2025-02-12 04:08:15 · 1379 characters · 1 view · 0 comments


I am trying to scrape the ul & li tags on Capterra product pages. The information I want to extract and store in separate variables is the "Located in <country>" text, the URL address, and the product features.

Currently, I only know how to print the text for everything in the ul & li, not a specific item.

Code:

from bs4 import BeautifulSoup as bs  # the code below calls bs(...), so this import is required
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://www.capterra.com/p/81310/AMCS/")

companyProfile = bs(driver.page_source, 'html.parser')

url = companyProfile.find("ul", class_="nb-type-md nb-list-undecorated undefined").text

features = companyProfile.find("div", class_="nb-col-count-1 sm:nb-col-count-2 md:nb-col-count-3 nb-col-gap-xl nb-my-0 nb-mx-auto").text 

print(url)
print(features)

driver.close()

Output:

AMCSLocated in United StatesFounded in 2004http://www.amcsgroup.com/
Billing & InvoicingBrokerage ManagementBuy / Sell TicketingContainer ManagementCustomer AccountsCustomer DatabaseDispatch ManagementElectronics RecyclingEquipment TrackingFingerprint ScanningID ScanningIntegrated CamerasInventory ManagementInventory TrackingLogistics Management

How do I get only the url and the country, and how do I get the features neatly?


I was able to get the URL and the location by:

url = driver.find_element(By.XPATH, "//*[starts-with(., 'http')]").text

location = driver.find_element(By.XPATH, "//*[starts-with(., 'Located in')]").text

Still looking for a solution for the features.
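For the features, the same BeautifulSoup approach can produce a clean Python list instead of one run-on string, by collecting the text of each child element separately. Below is a minimal, self-contained sketch: the div class name is taken from the question's code, but the inline HTML is a hypothetical stand-in for the live Capterra page, which assumes each feature sits in its own child element (here, a span).

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the features <div> on the live page;
# the class string is copied from the question's code.
html = """
<div class="nb-col-count-1 sm:nb-col-count-2 md:nb-col-count-3 nb-col-gap-xl nb-my-0 nb-mx-auto">
  <span>Billing &amp; Invoicing</span>
  <span>Brokerage Management</span>
  <span>Container Management</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
features_div = soup.find(
    "div",
    class_="nb-col-count-1 sm:nb-col-count-2 md:nb-col-count-3 nb-col-gap-xl nb-my-0 nb-mx-auto",
)

# One list entry per feature, instead of .text mashing them together.
features = [span.get_text(strip=True) for span in features_div.find_all("span")]
print(features)
```

On the real page, `driver.page_source` would replace the inline HTML; the key change from the question's code is iterating over the div's children rather than calling `.text` on the whole div.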

Comments (1)

挖鼻大婶 2025-02-19 04:08:15


The following code will pull each text node value separately from the ul > li tags:

soup = BeautifulSoup(driver.page_source, 'html.parser')

# Match the profile <ul> by its class, then read the span inside each
# of its first four <li> children. The selectors are scoped to the
# matched <ul>, so the class string does not need to be repeated.
for ul in soup.find_all('ul', class_="nb-type-md nb-list-undecorated undefined"):
    name = ul.select_one('li:nth-child(1) > span').get_text()
    location = ul.select_one('li:nth-child(2) > span').get_text()
    year = ul.select_one('li:nth-child(3) > span').get_text()
    link = ul.select_one('li:nth-child(4) > span').get_text()

    print(name)
    print(location)
    print(year)
    print(link)

Output:

AMCS
Located in United States 
Founded in 2004
http://www.amcsgroup.com/

Update:

li=[x.get_text() for x in soup.select('[class="nb-type-md nb-list-undecorated undefined"] li span')]
print(li) 

Output:

['AMCS', 'Located in United States', 'Founded in 2004', 'http://www.amcsgroup.com/']
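Since the list comprehension returns the four fields in document order, they can also be unpacked into named variables in one step. A self-contained sketch, using a hypothetical stand-in HTML fragment in place of the live page:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the profile <ul> on the live page.
html = """
<ul class="nb-type-md nb-list-undecorated undefined">
  <li><span>AMCS</span></li>
  <li><span>Located in United States</span></li>
  <li><span>Founded in 2004</span></li>
  <li><span>http://www.amcsgroup.com/</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
values = [x.get_text(strip=True)
          for x in soup.select('[class="nb-type-md nb-list-undecorated undefined"] li span')]

# Tuple unpacking assigns each field to its own variable.
name, location, year, link = values
print(location)
print(link)
```

This keeps each piece of information in the separate variables the question asked for, rather than in one anonymous list.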