How to scrape JSON data from an HTML page source?

Posted on 2025-02-10 10:54:03

I'm trying to pull some data from an online music database. In particular, I want to pull this data that you can find with CTRL+F -- "isrc":"GB-FFM-19-0853."

view-source:https://www.audionetwork.com/browse/m/track/purple-beat_1008534

I'm using Python and Selenium and have tried to locate the data via things like tag, xpath and id, but nothing seems to be working.

I haven't seen this x:y format before and some searching makes me think it's in a JSON format.

Is there a way to grab that isrc data via Selenium? I'd need the approach to be generic (i.e. work for pages with different isrc values, as each music track has a different one).

My code so far ...

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from time import sleep
import os

# Access AudioNetwork and search for tracks.

# Raw string, so the backslashes in the Windows path are not read as escapes.
path = r"C:\Program Files (x86)\chromedriver.exe"

# Selenium 4 takes the driver path through a Service object.
driver = webdriver.Chrome(service=Service(path))

driver.get("https://www.audionetwork.com/track/searchkeyword")

search = driver.find_element(By.ID, "js-keyword")
search.send_keys("ANW3175_001_Purple-Beat.wav")
search.send_keys(Keys.RETURN)
sleep(5)

# Open the first matching track page.
music_link = driver.find_element(By.CSS_SELECTOR, "a.track__title")

music_link.click()

I know I need to make better waits / probably other issues with the code, but any ideas on how to grab that ISRC number?

独木成林 2025-02-17 10:54:03

You want to extract the entire script as JSON data (which can be read as a dictionary in Python) and look up the "isrc" key.

The following code uses Selenium to extract the script content from the page, parses it as JSON, and prints the "isrc" value to the terminal.

from selenium import webdriver
from selenium.webdriver.common.by import By
import json

driver = webdriver.Chrome()
driver.get("https://www.audionetwork.com/browse/m/track/purple-beat_1008534")

# The first <script> element in the body holds the page data as JSON.
search = driver.find_element(By.XPATH, "/html/body/script[1]")
content = search.get_attribute('innerHTML')

# Parse the script text into a Python dictionary.
content_as_dict = json.loads(content)

print(content_as_dict['props']['pageProps']['track']['isrc'])

# quit() closes every window and ends the driver session.
driver.quit()
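
The absolute XPath above only works while the JSON script stays the first script in the body. As a rough sketch of a less position-dependent variant, the snippet below scans every script block in a page's HTML and keeps those whose body parses as a JSON object. It uses only the standard library. The names ScriptCollector, extract_json_scripts and SAMPLE_HTML, and the "__NEXT_DATA__" id in the sample, are illustrative assumptions and are not taken from the real page.

```python
import json
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_script:
            self.scripts.append("".join(self._buffer))
            self._in_script = False


def extract_json_scripts(html):
    """Return one dict per script block whose body is a JSON object."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    parsed = []
    for text in collector.scripts:
        try:
            value = json.loads(text)
        except json.JSONDecodeError:
            continue  # ordinary JavaScript (or empty), not JSON
        if isinstance(value, dict):
            parsed.append(value)
    return parsed


# Hypothetical, trimmed-down page used only for illustration.
SAMPLE_HTML = """
<html><body>
<script>window.dataLayer = [];</script>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"track": {"isrc": "GB-FFM-19-08534"}}}}
</script>
</body></html>
"""

data = extract_json_scripts(SAMPLE_HTML)[0]
print(data["props"]["pageProps"]["track"]["isrc"])
```

With Selenium, driver.page_source could be passed in place of the sample string, assuming the JSON is already present in the served HTML.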

夜无邪 2025-02-17 10:54:03

Yes, this is JSON format. It's actually JSON wrapped inside an HTML script tag. It's essentially a "key": "value" pair, so the specific thing you outlined ("isrc":"GB-FFM-19-08534") has a key of isrc with a value of GB-FFM-19-08534.

Python has a library for parsing JSON. I think you might want this: https://www.w3schools.com/python/gloss_python_json_parse.asp. Let me know if that works for you.

If you wanted to find the value of isrc, you could do:

import json

... # your code here

jsonString = json.loads(someValueHere)
isrcValue = jsonString["isrc"]

Replace someValueHere with the JSON string that you're parsing and that should help. I think isrc is nested though, so it might not be quite that simple. I don't think you can just do jsonString["track.isrc"] in Python, but I'm not sure. The path you're looking for is props.pageProps.track.isrc. You may have to assign a variable per layer:

jsonString = json.loads(someValueHere)
propsValue = jsonString["props"]
pagePropsValue = propsValue["pageProps"]
trackValue = pagePropsValue["track"]
isrcValue = trackValue["isrc"]
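
On the dotted-key question: a dictionary lookup such as jsonString["track.isrc"] searches for a single key literally named "track.isrc", so it raises KeyError on nested data, but chained indexing does work. As a minimal sketch, the two helpers below generalize this: get_by_path walks a dotted path, and find_key searches the whole structure when the path is not known in advance. Both function names and the sample JSON string are hypothetical, written only to mirror the path mentioned above.

```python
import json
from functools import reduce


def get_by_path(data, dotted_path):
    """Follow a dotted path such as 'props.pageProps.track.isrc' through nested dicts."""
    return reduce(lambda node, key: node[key], dotted_path.split("."), data)


def find_key(node, target):
    """Depth-first search; return the first value stored under `target`, or None."""
    if isinstance(node, dict):
        if target in node:
            return node[target]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, target)
        if found is not None:
            return found
    return None


# Sample JSON string shaped like the path above (illustrative only).
raw = '{"props": {"pageProps": {"track": {"isrc": "GB-FFM-19-08534"}}}}'
parsed = json.loads(raw)

print(parsed["props"]["pageProps"]["track"]["isrc"])       # chained indexing
print(get_by_path(parsed, "props.pageProps.track.isrc"))   # dotted path
print(find_key(parsed, "isrc"))                            # path not needed
```

The recursive search is convenient when pages nest the value differently, at the cost of returning whichever match it meets first.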
