Extracting JSON when web scraping

Posted on 2025-01-18 10:38:50

I was following a Python guide on web scraping, and there's one line of code that won't work for me. I'd appreciate it if anybody could help me figure out what the issue is, thanks.

from bs4 import BeautifulSoup
import json
import re
import requests

url = 'https://finance.yahoo.com/quote/AAPL/analysis?p=AAPL'
page = requests.get(url)
soup = BeautifulSoup(page.content, "lxml")
script = soup.find('script', text=re.compile('root\.App\.main'))

json_text = re.search(r'^\s*root\.App\.main\s*=\s*({.*?})\s*;\s*$', script.string, flags=re.MULTILINE).group(1)

Error message:

    json_text = re.search(r'^\s*root\.App\.main\s*=\s*({.*?})\s*;\s*$', script.string, flags=re.MULTILINE).group(1)
AttributeError: 'NoneType' object has no attribute 'string'

Link to the guide I was looking at: https://www.mattbutton.com/how-to-scrape-stock-upgrades-and-downgrades-from-yahoo-finance/



Comments (1)

谜兔 2025-01-25 10:38:50

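In my opinion, the main issue is that you should add a user-agent header to your request so that you get the expected HTML. The AttributeError means that soup.find() returned None, i.e. no matching <script> tag was found in the HTML you received.

headers = {'user-agent': 'Mozilla/5.0'}
page = requests.get(url, headers=headers)

Note: First of all, take a closer look at your soup to check whether the expected information is actually there.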
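The check suggested in the note can be made explicit in code. The following is a minimal sketch, not part of the original answer: the helper name `script_text_or_error` is hypothetical, and a `SimpleNamespace` stands in for a real bs4 tag so the snippet runs without a network call. It turns the bare AttributeError into a message that says what the downloaded page actually contained.

```python
from types import SimpleNamespace


def script_text_or_error(script, html):
    """Return script.string, or raise a descriptive error if the lookup failed.

    `script` is whatever soup.find(...) returned (a tag or None);
    `html` is the raw page text, used only to build the diagnostic message.
    """
    if script is None:
        raise LookupError(
            "no <script> matched; page is %d characters long and "
            "contains 'root.App.main': %s"
            % (len(html), "root.App.main" in html)
        )
    if script.string is None:
        raise LookupError("matched <script> has no single text child")
    return script.string


# Stand-in for a bs4 tag (assumption: only the .string attribute is needed).
fake_tag = SimpleNamespace(string="root.App.main = {};")
print(script_text_or_error(fake_tag, "<html></html>"))

try:
    script_text_or_error(None, "<html><body>consent page</body></html>")
except LookupError as exc:
    print("lookup failed:", exc)
```

In the real script this would be called as `script_text_or_error(soup.find(...), page.text)` before running the regular expression.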

Example

import re
import json
import requests
from bs4 import BeautifulSoup

url = 'https://finance.yahoo.com/quote/AAPL/analysis?p=AAPL'
headers = {'user-agent': 'Mozilla/5.0'}

page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'lxml')

# Find the <script> tag whose text contains the root.App.main assignment
script = soup.find('script', string=re.compile(r'root\.App\.main'))

# Capture the object literal assigned to root.App.main and parse it as JSON
json_text = json.loads(re.search(r'^\s*root\.App\.main\s*=\s*({.*?})\s*;\s*$', script.string, flags=re.MULTILINE).group(1))
json_text
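The regular expression above depends on the assignment ending with `;` at the end of a line. As an alternative sketch (an assumption of mine, not something the guide or this answer uses), you could locate the `root.App.main =` assignment and let `json.JSONDecoder.raw_decode` consume exactly one JSON value from that position. The `SAMPLE_SCRIPT` text and its keys are made up for illustration, and `extract_app_main` is a hypothetical helper name.

```python
import json
import re

# Made-up stand-in for script.string; the real payload is far larger and
# its structure may differ.
SAMPLE_SCRIPT = """
(function (root) {
root.App || (root.App = {});
root.App.now = 1642500000000;
root.App.main = {"context": {"stores": {"QuoteSummaryStore": {"symbol": "AAPL"}}}};
}(this));
"""

ASSIGNMENT = re.compile(r"root\.App\.main\s*=\s*")


def extract_app_main(script_text):
    """Parse the JSON object assigned to root.App.main in script_text."""
    match = ASSIGNMENT.search(script_text)
    if match is None:
        raise ValueError("root.App.main assignment not found")
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here, the trailing ';').
    data, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return data


data = extract_app_main(SAMPLE_SCRIPT)
print(data["context"]["stores"]["QuoteSummaryStore"]["symbol"])
```

With a real page you would pass `script.string` instead of `SAMPLE_SCRIPT`. This only works if the assigned value is valid JSON, which is the same assumption `json.loads` makes in the example above.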
