How to get the URL from information extracted from a website

Posted on 2025-01-11 11:57:55 · 920 characters · 1 view · 0 comments

So basically I am stuck on a problem: I don't know how to get the URL out of the data I extracted from a website.

Here is my code:

import requests
from bs4 import BeautifulSoup

req = requests.get('https://api.randomtube.xyz/video.get?chan=2ch.hk&board=b&page=1')

soup = BeautifulSoup(req.content, "html.parser")

print(soup.prettify())

I get a lot of information in the output, but the only thing I need is the URL. I hope someone can help me.

P.S:

It gives me this information:

{"response":{"items":[{"url":"https:\/\/2ch.hk\/b\/src\/262671212\/16440825183970.webm","type":"video\/webm","filesize":"20259","width":1280,"height":720,"name":"1521967932778.webm","board":"b","thread":"262671212"},{"url":"https:\/\/2ch.hk\/b\/src\/261549765\/16424501976450.webm","type":"video\/webm","filesize":"12055","width":1280,"height":720,"name":"1526793203110.webm","board":"b","thread":"261549765"}...

But I only need this part out of everything:
https:\/\/2ch.hk\/b\/src\/261549765\/16424501976450.webm (not exactly this URL, it is just an example)

Comments (3)

笑,眼淚并存 2025-01-18 11:57:55

You can do it this way:

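# req is the response object from the question's code
data = req.json()  # parse the body as JSON; a BeautifulSoup object cannot be indexed like a dict
url_array = []

for item in data['response']['items']:
  url_array.append(item['url'])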

Since the API returns JSON data, it is better to parse it directly as JSON rather than going through BeautifulSoup.
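
As a minimal offline sketch of the loop above, the snippet below parses a sample body cut down from the output shown in the question (two items, only the `url` and `type` fields kept) and collects every URL. The helper name `extract_urls` is hypothetical, and the snippet assumes the response always has the `response` → `items` shape seen in the question.

```python
import json

# Sample body trimmed from the output shown in the question (assumption:
# the real response has the same response -> items -> url structure).
raw = r'''{"response":{"items":[
  {"url":"https:\/\/2ch.hk\/b\/src\/262671212\/16440825183970.webm","type":"video\/webm"},
  {"url":"https:\/\/2ch.hk\/b\/src\/261549765\/16424501976450.webm","type":"video\/webm"}
]}}'''


def extract_urls(payload):
    """Hypothetical helper: return the 'url' field of every item."""
    return [item["url"] for item in payload["response"]["items"]]


urls = extract_urls(json.loads(raw))
for url in urls:
    # The JSON escape "\/" is decoded to a plain "/" by the parser.
    print(url)
```

With a live request, the same helper could be called as `extract_urls(req.json())`.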

橘味果▽酱 2025-01-18 11:57:55

The URL returns JSON data. BeautifulSoup is an HTML/XML parser and is not meant for JSON; to read JSON data, you can follow this example.

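import requests

# .json() parses the response body as JSON and returns a dict
data = requests.get('https://api.randomtube.xyz/video.get?chan=2ch.hk&board=b&page=1').json()

# Take the URL of the first item only
url = data['response']['items'][0]['url']
if url:
    # Swap the extension; drop this line to keep the URL exactly as returned
    url = url.replace('.webm', '.mp4')
    print(url)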

Output:

https://2ch.hk/b/src/263361969/16451225633240.mp4
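
One caveat with indexing `['items'][0]` directly is that it raises an exception if the list is empty or a key is missing. A slightly more defensive variant might look like the sketch below; `first_video_url` is a hypothetical helper name, and the payload shape is assumed from the output in the question.

```python
def first_video_url(payload):
    """Hypothetical helper: URL of the first item, or None if there is none."""
    # .get() with a default avoids KeyError when a level is missing.
    items = payload.get("response", {}).get("items", [])
    if not items:
        return None
    return items[0].get("url")


# An empty result set yields None instead of raising IndexError.
print(first_video_url({"response": {"items": []}}))

# A normal payload yields the first URL.
sample = {"response": {"items": [
    {"url": "https://2ch.hk/b/src/261549765/16424501976450.webm"},
]}}
print(first_video_url(sample))
```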

错々过的事 2025-01-18 11:57:55

The problem is that you are telling BeautifulSoup to parse JSON data as HTML. You can get the URL you need more directly with the following code:

import json
import requests

req = requests.get('https://api.randomtube.xyz/video.get?chan=2ch.hk&board=b&page=1')

# Parse the body as JSON, not HTML
data = json.loads(req.content)
my_url = data['response']['items'][0]['url']  # URL of the first item
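
To illustrate the point about parsing the body as JSON rather than HTML, here is a small offline sketch. It relies on `json.loads` accepting raw bytes (which is what `req.content` holds) and returns `None` when the body is not valid JSON, such as an HTML error page. The helper name `first_url_from_body` and both sample bodies are made up for illustration.

```python
import json


def first_url_from_body(body):
    """Hypothetical helper: parse raw response bytes and return the first URL.

    Returns None when the body is not valid JSON (for example, an HTML page).
    Assumes a valid body has the response -> items -> url shape.
    """
    try:
        data = json.loads(body)  # accepts str, bytes, or bytearray
    except ValueError:
        # json.JSONDecodeError and UnicodeDecodeError are both ValueError subclasses.
        return None
    return data["response"]["items"][0]["url"]


json_body = rb'{"response":{"items":[{"url":"https:\/\/2ch.hk\/b\/src\/262671212\/16440825183970.webm"}]}}'
html_body = b"<html><body>Service unavailable</body></html>"

print(first_url_from_body(json_body))
print(first_url_from_body(html_body))
```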