Scraping a website table using a POST request

Posted on 2025-01-17 11:09:01

My goal is to get the PQRI table (the second of the two tables listed) from this webpage using Python.
As it is an AJAX table, I tried the following:

  • Open the webpage in Chrome.
  • Open Developer Tools -> Network -> Fetch/XHR to get the request URL, request headers, and payload.
  • Use the requests library to make a POST request:
url = "https://apps.usp.org/ajax/USPNF/columnsDB.php"


headers = {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-US,en;q=0.9",
"Connection": "keep-alive",
"Content-Length": "201",
"Content-Type": "application/x-www-form-urlencoded",
"Cookie": "_fbp=fb.1.1646747716384.2068133566; tc_ptid=3U21FqQ3bklFEULP2jijnQ; tc_ptidexpiry=1709819716801; BE_CLA3=p_id%3D8A64RLL6L464RLNNA48664N2RAAAAAAAAH%26bf%3D8d70551f1d08356108a60fc4a2db91d0%26bn%3D1%26bv%3D3.44%26s_expire%3D1648554934915%26s_id%3D8A64RLL6L464RJ2L8J6664N2RAAAAAAAAH; _gid=GA1.2.1041569168.1648468535; _ga_DTGQ04CR27=GS1.1.1648468535.10.0.1648468535.0; USPSESSID=u6i1i80ot1uk49mnauim3o7l37; _ga=GA1.2.1946138806.1646747717; BIGipServerprod_apps.usp.org_http_pool=1271466250.20480.0000",
"Host": "apps.usp.org",
"Origin": "https://apps.usp.org",
"Referer": "https://apps.usp.org/app/USPNF/columnsDB.html",
"sec-ch-ua": "Not A;Brand ;v=99, Chromium;v=99, Google Chrome;v=99",
"sec-ch-ua-mobile" : "?0",
"sec-ch-ua-platform": "Windows",
"Sec-Fetch-Dest": "empty",
"Sec-Fetch-Mode": "cors",
"Sec-Fetch-Site": "same-origin",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36",
"X-Powered-By": "CPAINT v2.1.0 :: http://sf.net/projects/cpaint",
}

payload = {
"cpaint_function": "updatePQRIResults",
"cpaint_argument[]": "Acclaim%20120%20C18",
"cpaint_argument[]": 0,
"cpaint_argument[]": 0,
"cpaint_argument[]": 0,
"cpaint_argument[]": 2.8,
"cpaint_argument[]": 0,
"cpaint_response_type": "OBJECT",
}

response = requests.post(url, data=payload, headers=headers)

I can see the desired output in the developer tools.

But when I make the request I only get the following response:

"<c_start></c_start><c_total></c_total>getPQRIData: No base column '0'\u003cbr\u003e\u000a"

Any idea what I need to change to get the desired output?

Comments (1)

Oo萌小芽oO 2025-01-24 11:09:01

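You can't send that form data as a dictionary/JSON: a Python dict cannot hold the repeated cpaint_argument[] key, so only the last value (0) is sent, which would explain the "No base column '0'" error. Send it as a pre-encoded string and it should work: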

from io import StringIO

import pandas as pd
import requests


# Load the page once so the session picks up the site's cookies.
s = requests.Session()
s.get('https://apps.usp.org/app/USPNF/columnsDB.html')
cookies = s.cookies.get_dict()

# Build the Cookie header from the session cookies.
cookieStr = '; '.join(f'{k}={v}' for k, v in cookies.items())

url = "https://apps.usp.org/ajax/USPNF/columnsDB.php"
# Content-Length is left out: requests computes it from the body.
headers = {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9",
    "Connection": "keep-alive",
    "Content-Type": "application/x-www-form-urlencoded",
    "Cookie": cookieStr,
    "Host": "apps.usp.org",
    "Origin": "https://apps.usp.org",
    "Referer": "https://apps.usp.org/app/USPNF/columnsDB.html",
    "sec-ch-ua": "Not A;Brand ;v=99, Chromium;v=99, Google Chrome;v=99",
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "Windows",
    "Sec-Fetch-Dest": "empty",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Site": "same-origin",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.141 Safari/537.36",
    "X-Powered-By": "CPAINT v2.1.0 :: http://sf.net/projects/cpaint",
}

final_df = pd.DataFrame()
nextPage = True

page = 0
while nextPage:
    # The last cpaint_argument[] is the row offset; it advances by 10 per page.
    i = page*10
    payload = f'cpaint_function=updatePQRIResults&cpaint_argument[]=Acclaim%20120%20C18&cpaint_argument[]=1&cpaint_argument[]=0&cpaint_argument[]=0&cpaint_argument[]=2.8&cpaint_argument[]={i}&cpaint_response_type=OBJECT'

    response = s.post(url, data=payload, headers=headers).text

    # Parse the XML response, then slice off the first 3 rows, the last row, and the first 3 columns.
    df = pd.read_xml(StringIO(response)).iloc[3:-1,3:]

    # Stop when a page holds only a single row whose psr is 0.
    if (df.iloc[0]['psr'] == 0) and (len(df) == 1):
        nextPage = False
        final_df = final_df.drop_duplicates().reset_index(drop=True)

        print('Complete')

    else:
        final_df = pd.concat([final_df, df], axis=0)

        print(f'Page: {page + 1}')
        page += 1


Output:

print(final_df)
       psr    psf                  psn  ...   psvb psvc28 psvc70
0      0.0   0.00      Acclaim 120 C18  ... -0.027  0.086 -0.002
1      1.0   0.24      TSKgel ODS-100Z  ... -0.031 -0.064 -0.161
2      2.0   0.67       Inertsil ODS-3  ... -0.023 -0.474 -0.334
3      3.0   0.74          LaChrom C18  ... -0.006 -0.278 -0.120
4      4.0   0.80       Prodigy ODS(3)  ... -0.012 -0.195 -0.134
..     ...    ...                  ...  ...    ...    ...    ...
753  753.0  29.55        Cosmosil 5PYE  ...  0.092  0.521  1.318
754  754.0  30.44      BioBasic Phenyl  ...  0.217  0.014  0.390
755  755.0  34.56  Microsorb-MV 100 CN  ... -0.029  0.148  0.785
756  756.0  41.62      Inertsil ODS-EP  ...  0.050 -0.620 -0.070
757  757.0  41.84           Flare C18+  ...  0.966 -0.507  1.178

[758 rows x 12 columns]
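
To see why the dictionary version fails, here is a small offline sketch using only the standard library. It shows a dict literal dropping the repeated cpaint_argument[] entries, an already-encoded value being encoded a second time, and one way to build the same string as the f-string above from an ordered list of pairs. The helper name build_payload and its parameters are made up for illustration; they are not part of the site's API or of the original answer.

```python
from urllib.parse import quote, urlencode

# A dict literal silently keeps only the LAST value for a repeated key.
as_dict = {
    "cpaint_function": "updatePQRIResults",
    "cpaint_argument[]": "Acclaim 120 C18",
    "cpaint_argument[]": 1,
    "cpaint_argument[]": 0,
    "cpaint_response_type": "OBJECT",
}
print(len(as_dict))                  # 3 keys survive, not 5
print(as_dict["cpaint_argument[]"])  # 0

# A value that is already percent-encoded gets encoded again by a form encoder.
print(urlencode({"x": "Acclaim%20120%20C18"}))  # x=Acclaim%2520120%2520C18


def build_payload(offset, column="Acclaim 120 C18"):
    """Hypothetical helper: form-encode the fields as an ordered list of pairs."""
    pairs = [
        ("cpaint_function", "updatePQRIResults"),
        ("cpaint_argument[]", column),
        ("cpaint_argument[]", 1),
        ("cpaint_argument[]", 0),
        ("cpaint_argument[]", 0),
        ("cpaint_argument[]", 2.8),
        ("cpaint_argument[]", offset),
        ("cpaint_response_type", "OBJECT"),
    ]
    # quote_via=quote writes spaces as %20 rather than +;
    # safe="[]" leaves the brackets in the key names unescaped.
    return urlencode(pairs, quote_via=quote, safe="[]")


print(build_payload(0))
```

The string returned by build_payload(i) can be passed as data= in place of the f-string in the loop above. Keeping the brackets unescaped mirrors the request shown in the answer; whether the server also accepts percent-encoded brackets was not checked here.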