Unable to retrieve href attribute using Python and Selenium

Posted on 2025-01-20 13:21:00

I'm very new to this and have spent hours trying various methods I've read here. Apologies if I'm making some silly mistake

I want to create a database of my LEGO sets. Pulling images and info from brickset.com

I'm using:

anchors = driver.find_elements_by_xpath('//*[@id="ui-tabs-2"]/ul/li[1]/a')
anchors = [a.get_attribute('href') for a in anchors]

print (anchors) returns:

anchors = driver.find_elements_by_xpath('//*[@id="ui-tabs-2"]/ul/li[1]/a')

What I'm trying to target:

<div id="ui-tabs-2" class="ui-tabs-panel ui-widget-content ui-corner-bottom" aria-live="polite" aria-labelledby="ui-id-4" role="tabpanel" aria-expanded="true" aria-hidden="false" style="display: block;">
<ul class="moreimages">
<li>
<a href="https://images.brickset.com/sets/AdditionalImages/21054-1/21054_alt10.jpg" class="highslide plain " onclick="return hs.expand(this)">
<img src="https://images.brickset.com/sets/AdditionalImages/21054-1/tn_21054_alt10_jpg.jpg" title="" onerror="this.src='/assets/images/spacer2.png'" loading="lazy">
</a><div class="highslide-caption">

I'm losing my mind trying to figure this out.

Update
Still not getting the href attributes. To add more detail, I'm trying to get the images under the "images" tab on this URL:
https://brickset.com/sets/21330-1/Home-Alone
Here is the problematic code:

anchors = driver.find_elements(By.XPATH, '//*[@id="ui-tabs-2"]/ul/li/a')
links = [anchors.get_attribute('href') for a in anchors]
print('Found ' + str(len(anchors)) + ' links to images')

I've also tried:

#anchors = driver.find_elements_by_css_selector("a[href*='21330']")

This only returned one href, even though there should be about a dozen.

Thank you all for the assistance!


Comments (3)

拍不死你 2025-01-27 13:21:00

You shouldn't be using the same name for multiple variables.

As per the first line of code:

anchors = driver.find_elements_by_xpath('//*[@id="ui-tabs-2"]/ul/li[1]/a')

anchors is the list of WebElements. Ideally to create another list with the href attributes you should use another name, e.g. hrefs

Effectively your code block will be:

anchors = driver.find_elements_by_xpath('//*[@id="ui-tabs-2"]/ul/li[1]/a')
hrefs = [a.get_attribute('href') for a in anchors]
print(hrefs)

Using list comprehension in a single line:

print([a.get_attribute('href') for a in driver.find_elements_by_xpath('//*[@id="ui-tabs-2"]/ul/li[1]/a')])

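A note on single-line forms: passing a bare generator expression to print shows the generator object rather than the links, whereas square brackets build the list first. The sketch below illustrates the difference without a browser. FakeAnchor and the example.com URLs are hypothetical stand-ins, not part of Selenium or Brickset.

```python
class FakeAnchor:
    """Hypothetical stand-in for a Selenium WebElement; only get_attribute is modelled."""

    def __init__(self, href):
        self._href = href

    def get_attribute(self, name):
        # Return the stored link for "href", and None for anything else.
        return self._href if name == "href" else None


# Placeholder URLs, used only for illustration.
anchors = [
    FakeAnchor("https://example.com/a.jpg"),
    FakeAnchor("https://example.com/b.jpg"),
]

lazy = (a.get_attribute("href") for a in anchors)   # generator expression
eager = [a.get_attribute("href") for a in anchors]  # list comprehension

print(lazy)   # shows a generator object, not the links
print(eager)  # shows the list of links
```
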
故事还在继续 2025-01-27 13:21:00

First, driver.find_elements_by_xpath is deprecated; use driver.find_elements(By.XPATH, 'locator') instead (By comes from selenium.webdriver.common.by).

Now, if you'd like to get all hrefs of the links on the page:

elements = driver.find_elements(By.XPATH, '//*[@id="ui-tabs-2"]/ul/li/a')
links = [element.get_attribute('href') for element in elements]

Notice that I'm not using [1] to get a single element, but rather all elements.
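
Two further points may be relevant to the updated snippet in the question. It calls anchors.get_attribute('href') on the list itself rather than on the loop variable a, and a plain Python list has no such method, so that line raises AttributeError. Also, if the tab panel is filled in after the page loads (an assumption, not verified against the site), an explicit wait may be needed before the anchors exist. The sketch below separates the two concerns. collect_hrefs and image_links are hypothetical helper names, the 10-second timeout is arbitrary, and the Images tab may additionally need to be clicked first.

```python
def collect_hrefs(elements):
    """Return the non-empty href values of WebElement-like objects, in order, without duplicates."""
    seen = set()
    hrefs = []
    for element in elements:
        href = element.get_attribute("href")  # called on each element, not on the list
        if href and href not in seen:
            seen.add(href)
            hrefs.append(href)
    return hrefs


def image_links(driver, xpath='//*[@id="ui-tabs-2"]/ul/li/a', timeout=10):
    """Wait until at least one matching anchor is present, then return the hrefs.

    Requires Selenium 4. The imports are local so that collect_hrefs can be
    used and tested without Selenium installed.
    """
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Raises selenium's TimeoutException if nothing matches within `timeout` seconds.
    anchors = WebDriverWait(driver, timeout).until(
        EC.presence_of_all_elements_located((By.XPATH, xpath))
    )
    return collect_hrefs(anchors)
```

With a live driver already on the set page, links = image_links(driver) followed by print('Found ' + str(len(links)) + ' links to images') would replace the three lines from the question.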

时间海 2025-01-27 13:21:00

You might want to try this.

NOTE: I'm not using selenium here.

import time

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:99.0) Gecko/20100101 Firefox/99.0",
}

sample_urls = [
    "https://brickset.com/sets/21330-1/Home-Alone",
    "https://brickset.com/sets/21101-1/Hayabusa"
]

with requests.Session() as s:
    for sample_url in sample_urls:
        ajax_setID = [
            a["href"] for a in
            # href=True skips <a> tags without an href, which would otherwise raise KeyError below
            BeautifulSoup(s.get(sample_url, headers=headers).text, "lxml").find_all("a", href=True)
            if "mainImage" in a["href"]
        ][0]
        image_url = f"https://brickset.com{ajax_setID}&_{int(time.time() * 1000)}"
        headers.update(
            {
                "Referer": sample_url,
                "X-Requested-With": "XMLHttpRequest",
            }
        )
        source_image = (
            BeautifulSoup(
                s.get(image_url, headers=headers).text, "lxml"
            ).find("img")["src"]
        )
        print(f"{sample_url.split('/', -1)[-1]} -> {source_image}")

This should output:

Home-Alone -> https://images.brickset.com/sets/images/21330-1.jpg?202109060933
Hayabusa -> https://images.brickset.com/sets/images/21101-1.jpg?201201150457
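
In the same no-Selenium spirit, the additional-image links from the markup quoted in the question can be pulled out of static HTML with the standard library alone. This is only a sketch. Whether the "moreimages" list is present in the initial HTML response, or is loaded later by a separate request, is not confirmed here. MoreImagesParser is a hypothetical helper, and the sample string is a trimmed copy of the snippet from the question plus one made-up link outside the list.

```python
from html.parser import HTMLParser


class MoreImagesParser(HTMLParser):
    """Collect href values of <a> tags nested inside <ul class="moreimages">."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # <ul> nesting depth inside the target list (0 means outside)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ul":
            # Enter the target list, or track a nested <ul> once already inside it.
            if self.depth or "moreimages" in (attrs.get("class") or "").split():
                self.depth += 1
        elif tag == "a" and self.depth and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "ul" and self.depth:
            self.depth -= 1


SAMPLE = """
<ul class="moreimages">
<li>
<a href="https://images.brickset.com/sets/AdditionalImages/21054-1/21054_alt10.jpg" class="highslide plain ">
<img src="https://images.brickset.com/sets/AdditionalImages/21054-1/tn_21054_alt10_jpg.jpg" loading="lazy">
</a>
</li>
</ul>
<a href="/elsewhere">hypothetical link outside the list</a>
"""

parser = MoreImagesParser()
parser.feed(SAMPLE)
print(parser.hrefs)
```
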