How can I scrape URLs that don't appear in the downloaded HTML? JavaScript may be the problem
I am trying to scrape some URLs from this homepage (www.globo.com). I can get the headline and other URLs, but some of them aren't in the HTML and can't be scraped with requests and lxml. I don't want to use selenium/bs4/BeautifulSoup because the code will be running on a Heroku server, which would make everything more difficult.
The URLs that I want to scrape come after a div with these two classes: container and false. This is mandatory. Other URLs, whose div doesn't have the class "false", I can scrape easily.
Does anyone know how to scrape the URLs despite this problem? Or can someone recommend another library for this task (not bs4 or selenium)?
import requests
import lxml.html
url = 'https://www.globo.com/'
page = requests.get(url)
doc = lxml.html.fromstring(page.content)
urls = doc.xpath('//div[@class="container false"]//a/@href')
print(urls)
This also doesn't work:
import requests
import lxml.html
url = 'https://www.globo.com/'
page = requests.get(url)
doc = lxml.html.fromstring(page.content)
# tolerant match on both classes - still empty, because the anchors
# are injected by JavaScript and aren't present in the raw HTML
urls = doc.xpath('//div[contains(@class, "container") and contains(@class, "false")]//a/@href')
print(urls)
Thank you
1 Answer
Turns out that the "missing" URLs are actually in the source, but you need to do a bit of digging.
Basically, these are loaded by JS from an embedded JSON. You can target the divs the JSON sits in and extract all the data for a given column.
Here's how to do that:
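Below is a minimal sketch of that idea. The column selector and the JSON layout (an "items" list whose entries carry a "url" key) are assumptions on my part, so inspect the live page source and adjust both to the real structure:

import json
import requests
import lxml.html

url = 'https://www.globo.com/'
page = requests.get(url)
doc = lxml.html.fromstring(page.content)

# Assumption: each column div embeds its data as JSON inside a <script> tag
scripts = doc.xpath('//div[contains(@class, "container")]//script/text()')

urls = []
for script in scripts:
    # Slice out the first {...} object in case the JSON is assigned
    # to a JS variable instead of standing alone
    start, end = script.find('{'), script.rfind('}') + 1
    if start == -1 or end == 0:
        continue
    try:
        config = json.loads(script[start:end])
    except json.JSONDecodeError:
        continue
    # "items"/"url" are hypothetical key names - adapt them to the real JSON
    for item in config.get('items', []):
        try:
            urls.append(item['url'])
        except KeyError:
            continue  # some items have an ID but no URL (widgets)

print(urls)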
This should produce a list of the article URLs that don't show up in the raw HTML.
NOTE: Some items have an ID but no URL; these are usually widgets. Hence the try-except.
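As a side note, the same skip could be written without the exception handler, using item.get('url') and filtering out the Nones; the try-except just makes the widgets-without-URLs case explicit.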