instagram screen-scraping instagram-api

Error when scraping Instagram media by adding ?__a=1 to the end of the URL

Posted on 2025-02-03 10:01:39 | 700 words | 5 views | 0 comments

Sometimes, when I try to scrape Instagram media by adding ?__a=1 to the end of the URL, for example:

https://www.instagram.com/p/CP-Kws6FoRS/?__a=1

the response returned is:

{
    "__ar": 1,
    "error": 1357004,
    "errorSummary": "Sorry, something went wrong",
    "errorDescription": "Please try closing and re-opening your browser window.",
    "payload": null,
    "hsrp": {
        "hblp": {
            "consistency": {
                "rev": 1005622141
            }
        }
    },
    "lid": "7104767527440109183"
}

Why is this response returned, and what should I do to fix it? Also, is there another way to get the video and photo URLs?

枕头说它不想醒 2025-02-10 10:01:39

I solved this by adding &__d=dis to the query string at the end of the URL, like this: https://www.instagram.com/p/cfr6g-whxxxp/?__a=1&__d=dis
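As a minimal sketch of this fix, the two parameters can be appended with the standard library instead of by string concatenation. The helper name `with_json_params` is hypothetical, and the snippet only builds the URL; it cannot tell you whether Instagram still honours `__a=1&__d=dis`.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_json_params(post_url: str) -> str:
    """Return post_url with __a=1 and __d=dis set in its query string."""
    parts = urlsplit(post_url)
    # Keep any parameters already present, then set the two from the answer.
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["__a"] = "1"
    query["__d"] = "dis"
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    print(with_json_params("https://www.instagram.com/p/CP-Kws6FoRS/"))
```

Because existing parameters are kept, passing a URL that already ends in `?__a=1` does not duplicate that parameter.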

伤感在游骋 2025-02-10 10:01:39

I believe I may have found a workaround, using three steps:

Call https://i.instagram.com/api/v1/users/web_profile_info/?username={username} to get the user's info and recent posts. data.user from the response is the same as graphql.user from https://i.instagram.com/{username}/?__a=1.
Extract the media id from <meta property="al:ios:url" content="instagram://media?id={media_id}"> in the HTML response of https://instagram.com/p/{post_shortcode}.
Call https://i.instagram.com/api/v1/media/{media_id}/info with the extracted media id to get the same response as https://instagram.com/p/{post_shortcode}/?__a=1.
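The media-id extraction in the second step can be tried offline on a saved page. The sketch below uses only the standard library's `html.parser` rather than BeautifulSoup; the class and function names (`MediaIdParser`, `extract_media_id`) and the sample HTML are illustrative assumptions, not part of the answer.

```python
from html.parser import HTMLParser
from typing import Optional

PREFIX = "instagram://media?id="


class MediaIdParser(HTMLParser):
    """Collects the media id from the al:ios:url meta tag, if present."""

    def __init__(self) -> None:
        super().__init__()
        self.media_id: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("property") == "al:ios:url":
            content = attributes.get("content") or ""
            if content.startswith(PREFIX):
                # Everything after the prefix is the media id.
                self.media_id = content[len(PREFIX):]


def extract_media_id(html: str) -> Optional[str]:
    """Return the media id found in html, or None when the tag is absent."""
    parser = MediaIdParser()
    parser.feed(html)
    return parser.media_id


if __name__ == "__main__":
    # Made-up page fragment, for illustration only.
    sample = '<meta property="al:ios:url" content="instagram://media?id=1234567890">'
    print(extract_media_id(sample))
```

Returning None when the tag is missing lets the caller skip a post instead of failing on it.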

A couple of important points:

The user-agent used in the script is important. I found that the one Firefox generates when re-sending requests from the dev tools returns the "Sorry, something went wrong" error.
This solution uses cookies from your Firefox profile. You need to sign in to Instagram in Firefox before running this script. You can switch from Firefox to Chrome if you'd like:

cookiejar = browser_cookie3.chrome(domain_name='instagram.com')

Here's the full script. Let me know if this is helpful!

import os
from datetime import datetime, timedelta

import browser_cookie3
import bs4 as bs
import requests

# setup.
username = "<username>"
output_path = "C:\\some\\path"
headers = {
    "User-Agent": "Mozilla/5.0 (Linux; Android 9; GM1903 Build/PKQ1.190110.001; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/75.0.3770.143 Mobile Safari/537.36 Instagram 103.1.0.15.119 Android (28/9; 420dpi; 1080x2260; OnePlus; GM1903; OnePlus7; qcom; sv_SE; 164094539)"
}


def download_post_media(post: dict, media_list: list, number: int):
    # save the number-th item (1-based) of the post under <output_path>/<username>/.
    output_dir = f"{output_path}/{username}"
    os.makedirs(output_dir, exist_ok=True)
    # the five-hour shift is kept from the original script (presumably a time-zone adjustment).
    post_time = datetime.fromtimestamp(int(post["taken_at_timestamp"])) + timedelta(hours=5)
    output_filename = f"{output_dir}/{username}_{post_time.strftime('%Y%m%d%H%M%S')}_{post['shortcode']}_{number}"
    current_media_json = media_list[number - 1]
    if current_media_json['media_type'] == 1:
        # image.
        media_ext = ".jpg"
        media_url = current_media_json["image_versions2"]['candidates'][0]['url']
    elif current_media_json['media_type'] == 2:
        # video.
        media_ext = ".mp4"
        media_url = current_media_json["video_versions"][0]['url']
    else:
        # any other media type has no known URL field here, so skip it.
        return
    output_filename += media_ext
    response = send_request_get_response(media_url)
    with open(output_filename, 'wb') as f:
        f.write(response.content)


def send_request_get_response(url):
    cookiejar = browser_cookie3.firefox(domain_name='instagram.com')
    return requests.get(url, cookies=cookiejar, headers=headers)


# use the /api/v1/users/web_profile_info/ api to get the user's information and its most recent posts.
profile_api_url = f"https://i.instagram.com/api/v1/users/web_profile_info/?username={username}"
profile_api_response = send_request_get_response(profile_api_url)
# data.user is the same as graphql.user from ?__a=1.
timeline_json = profile_api_response.json()["data"]["user"]["edge_owner_to_timeline_media"]
for post in timeline_json["edges"]:
    # get the HTML page of the post.
    post_response = send_request_get_response(f"https://instagram.com/p/{post['node']['shortcode']}")
    html = bs.BeautifulSoup(post_response.text, 'html.parser')
    # find the meta tag containing the link to the post's media.
    meta = html.find(attrs={"property": "al:ios:url"})
    if meta is None:
        # the tag is missing from this page, so skip the post.
        continue
    media_id = meta.attrs['content'].replace("instagram://media?id=", "")
    # use the media id to get the same response as ?__a=1 for the post.
    media_api_url = f"https://i.instagram.com/api/v1/media/{media_id}/info"
    media_api_response = send_request_get_response(media_api_url)
    media_json = media_api_response.json()["items"][0]
    if 'carousel_media_count' in media_json:
        # multiple media post.
        media = list(media_json['carousel_media'])
    else:
        # single media post.
        media = [media_json]
    # items are numbered from 1 in the output filenames.
    for media_number in range(1, len(media) + 1):
        download_post_media(post['node'], media, media_number)

鸠魁 2025-02-10 10:01:39

User-Agent:

Mozilla/5.0 (Linux; Android 9; GM1903 Build/PKQ1.190110.001; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/75.0.3770.143 Mobile Safari/537.36 Instagram 103.1.0.15.119 Android (28/9; 420dpi; 1080x2260; OnePlus; GM1903; OnePlus7; qcom; sv_SE; 164094539)

The endpoint below is an alternative to /?__a=1, but you must send the User-Agent above when calling it:
https://i.instagram.com/api/v1/users/web_profile_info/?username={username}

data.user in its response gives the same result as data.graphql.user did with /?__a=1.
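A rough sketch of how such a request could be prepared with the standard library, assuming nothing beyond the URL and User-Agent quoted above. The function names are hypothetical, the request is only constructed (not sent), and the sample response dict in the demo is made up to show the `data.user` lookup.

```python
from urllib.parse import quote
from urllib.request import Request

# User-Agent string copied from the answer above.
USER_AGENT = (
    "Mozilla/5.0 (Linux; Android 9; GM1903 Build/PKQ1.190110.001; wv) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/75.0.3770.143 "
    "Mobile Safari/537.36 Instagram 103.1.0.15.119 Android (28/9; 420dpi; "
    "1080x2260; OnePlus; GM1903; OnePlus7; qcom; sv_SE; 164094539)"
)


def build_profile_request(username: str) -> Request:
    """Build (but do not send) the web_profile_info request for username."""
    url = (
        "https://i.instagram.com/api/v1/users/web_profile_info/?username="
        + quote(username)
    )
    return Request(url, headers={"User-Agent": USER_AGENT})


def user_from_profile_response(body: dict) -> dict:
    """Return data.user, the counterpart of graphql.user in the old response."""
    return body["data"]["user"]


if __name__ == "__main__":
    request = build_profile_request("some_user")
    print(request.full_url)
    # Made-up response body, for illustration only.
    print(user_from_profile_response({"data": {"user": {"username": "some_user"}}}))
```

Sending the request (for example with `urllib.request.urlopen`) and any cookie handling are left out on purpose, since they depend on your login state.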

生生漫 2025-02-10 10:01:39

Instagram changed the method. The new one is:

GET https://i.instagram.com/api/v1/tags/web_info/?tag_name=${tags}

POST https://i.instagram.com/api/v1/tags/${tags}/sections/
body: 
{
include_persistent: 0
max_id: ${The last request contained this field}
next_media_ids[]: ${The last request contained this field}
next_media_ids[]: ${The last request contained this field}
page: ${The last request contained this field}
surface: grid
tab: recent
}
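The POST body above repeats `next_media_ids[]`, which suggests a form-encoded payload; that encoding is an assumption here, not something the answer states. Under that assumption, the sketch below builds the body for the next page from values saved from the previous call. The `previous` dict and its keys (`max_id`, `next_media_ids`, `page`) are hypothetical stand-ins for wherever you stored those fields.

```python
from urllib.parse import parse_qs, urlencode


def build_sections_body(previous: dict) -> str:
    """Form-encode the paging fields for the next tags/.../sections/ request."""
    fields = [
        ("include_persistent", 0),
        ("max_id", previous["max_id"]),
    ]
    # A repeated key is how a list is expressed in a form-encoded body.
    fields += [("next_media_ids[]", media_id) for media_id in previous["next_media_ids"]]
    fields += [
        ("page", previous["page"]),
        ("surface", "grid"),
        ("tab", "recent"),
    ]
    return urlencode(fields)


if __name__ == "__main__":
    # Made-up paging values, for illustration only.
    body = build_sections_body({"max_id": "abc", "next_media_ids": ["1", "2"], "page": 2})
    print(body)
    print(parse_qs(body)["next_media_ids[]"])
```

A list of pairs is used instead of a dict so that the repeated `next_media_ids[]` key survives encoding.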
