Can't scrape information from a static webpage using the requests module

Posted on 2025-02-07 10:55:41

I'm trying to fetch a product's title and its description from a webpage using the requests module. Both appear to be static, since they are present in the page source. However, I failed to grab them with the following attempt. The script currently throws an AttributeError.

import requests
from bs4 import BeautifulSoup

link = 'https://www.nordstrom.com/s/anine-bing-womens-plaid-shirt/6638030'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36',
}

with requests.Session() as s:
    s.headers.update(headers)
    res = s.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    product_title = soup.select_one("h1[itemProp='name']").text
    product_desc = soup.select_one("#product-page-selling-statement").text
    print(product_title,product_desc)

How can I scrape the title and description from the above page using the requests module?

Comments (2)

好多鱼好多余 2025-02-14 10:55:41

The page is dynamic. Go after the data from the API source instead:

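import requests
import pandas as pd

# JSON endpoint behind the product page (style ID 6638030)
api = 'https://www.nordstrom.com/api/ng-looks/styleId/6638030?customerId=f36cf526cfe94a72bfb710e5e155f9ba&limit=7'
jsonData = requests.get(api).json()

# 'products' is a dict of records; flatten each record into one DataFrame row
df = pd.json_normalize(jsonData['products'].values())

# Show the first product
print(df.iloc[0])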

Output:

id                                                       6638030-400
name                                  ANINE BING Women's Plaid Shirt
styleId                                                      6638030
styleNumber                                                         
colorCode                                                        400
colorName                                                       BLUE
brandLabelName                                            ANINE BING
hasFlatShot                                                     True
imageUrl           https://n.nordstrommedia.com/id/sr3/6d000f40-8...
price                                                        $149.00
pathAlias          anine-bing-womens-plaid-shirt/6638030?origin=c...
originalPrice                                                $149.00
productTypeLvl1                                                   12
productTypeLvl2                                                  216
isUmap                                                         False
Name: 0, dtype: object

雪落纷纷 2025-02-14 10:55:41

When testing requests like these, you should output the response to see what you're getting back. It's best to use something like Postman (I think VSCode has a similar feature now) to set up URLs, headers, methods, and parameters, and to see the full response including its headers. Once everything works, just convert it to Python code. Postman even has some 'export to code' functions for common languages.
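If you'd rather stay in Python, here is a minimal sketch of the same "look at the response first" idea. The helper names `summarize_response` and `text_or_none` are made up for illustration, and the fake response below is a stand-in with invented values, not what Nordstrom actually returns.

```python
from types import SimpleNamespace


def summarize_response(res, body_chars=300):
    """Build a short, printable summary of a response-like object.

    Works with anything exposing .status_code, .headers and .text,
    which includes requests.Response.
    """
    lines = [f"status: {res.status_code}"]
    for name, value in res.headers.items():
        lines.append(f"{name}: {value}")
    lines.append("body starts with:")
    lines.append(res.text[:body_chars])
    return "\n".join(lines)


def text_or_none(node):
    """Return a node's stripped text, or None when the selector matched nothing.

    `node` is whatever soup.select_one(...) returned; it is None on no match,
    which is what makes `.text` raise AttributeError.
    """
    return None if node is None else node.get_text(strip=True)


# Stand-in response with invented values (assumption, for demonstration only)
fake = SimpleNamespace(
    status_code=403,
    headers={"Content-Type": "text/html"},
    text="<html>Access denied</html>",
)
print(summarize_response(fake))
print(text_or_none(None))
```

In the question's script, calling `print(summarize_response(res))` right after `s.get(link)` would show whether the server sent the product page at all, and `text_or_none(soup.select_one("h1[itemProp='name']"))` would give `None` instead of raising.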

Anyways...

I tried your request on Postman and got this response:
[Screenshot: simple response body]

[Screenshot: simple response headers]

Requests made from Python and from a browser are the same thing. If the headers, URL, and parameters are identical, the server should return an identical response. So the next step is to compare your request with the request made by the browser:
[Screenshot: browser request]

So one or more of the headers the browser includes is what gets a good response from the server; User-Agent alone is not enough.
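The comparison itself is easy to script. This is only a sketch: `header_diff` is a hypothetical helper, and the browser headers below are placeholder values, so copy the real ones from your browser's network tab.

```python
def header_diff(script_headers, browser_headers):
    """Compare two header dicts, ignoring case in header names.

    Returns (missing, different):
      missing   - headers the browser sends that the script does not
      different - headers both send, mapped to (script_value, browser_value)
    """
    sent = {name.lower(): value for name, value in script_headers.items()}
    missing, different = {}, {}
    for name, value in browser_headers.items():
        key = name.lower()
        if key not in sent:
            missing[name] = value
        elif sent[key] != value:
            different[name] = (sent[key], value)
    return missing, different


# Placeholder values (assumptions); replace with the headers you actually send/see
script_headers = {"User-Agent": "Mozilla/5.0 (example)"}
browser_headers = {
    "user-agent": "Mozilla/5.0 (example)",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
}

missing, different = header_diff(script_headers, browser_headers)
print(sorted(missing))  # ['Accept', 'Accept-Language']
print(different)        # {}
```

Adding the missing headers to `s.headers.update(...)` one at a time is one way to narrow down which of them the server cares about, though it may turn out to be something else entirely, such as cookies.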

I would try to identify which headers matter, but unfortunately Nordstrom detected some 'unusual activity' and seems to have blocked my IP :(
[Screenshot: blocked]
That is probably due to sending an obviously handmade request. I think it's my IP that's blocked, since I can't access the site from any browser, even after clearing my cache.

So double-check that the same hasn't happened to you while working with your scraper.
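A rough way to check for this from the scraper is sketched below. Everything here is an assumption: `looks_blocked` is a made-up helper, the 403/429 status codes are just common choices for refusals and rate limits, and the marker phrase is taken from the 'unusual activity' message mentioned above, so the real block page may look different.

```python
def looks_blocked(status_code, body, markers=("unusual activity",)):
    """Heuristic guess at whether a response is a block page.

    status_code - HTTP status of the response
    body        - response text
    markers     - lowercase phrases that suggest a block page (assumed, not verified)
    """
    # Refusal / rate-limit statuses are treated as a block outright
    if status_code in (403, 429):
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in markers)


print(looks_blocked(200, "<h1>ANINE BING Women's Plaid Shirt</h1>"))
print(looks_blocked(200, "We noticed some Unusual Activity from your network"))
print(looks_blocked(403, ""))
```

With the question's script you would call `looks_blocked(res.status_code, res.text)` before handing `res.text` to BeautifulSoup, and stop sending requests if it returns `True`.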

Best of luck!
