Web scraping a dynamic web page with R

Posted on 2025-02-03 06:15:31

My goal is to get data from this site: https://www.insee.fr/fr/recherche?q=Emploi-Population+active+en+2018&taille=20&debut=0, in particular the id links of the different items.
I know that a plain GET doesn't work because the page is dynamic and has to be processed by JavaScript (the same issue as in "Web Scraping dynamic webpage Python"). So I looked at the requests in my browser's inspector and found a POST query with its URL.

Here is a reproducible example:

library(httr)

body <- list(q="Emploi-Population%20active%20en%202018",
             start="0",
             sortFields=data.frame(field="score",order="desc"),
             filters=data.frame(NULL),
             rows="50",
             facetsQuery=data.frame(NULL))

TMP   <- httr::POST(url = "http://www.insee.fr/fr/solr/consultation?q=Emploi-Population%20active%20en%202018",
              body = body,
              config = config(http_version=1.1),
              encode = "json",verbose())

Note that I have to use http instead of https, because otherwise I get nothing (my proxy is correctly configured and RStudio can connect to the internet).
All I get is a nice 500 error. Any idea what I'm missing?

水染的天色ゝ 2025-02-10 06:15:31

You can change the q param and remove it from your URL. I would use https and remove your config line to avoid the curl fetch error. However, the code below, adapted to return 100 results, still works.

library(httr)
library(jsonlite)  # provides fromJSON()

body <- list(
  q = "Emploi-Population active en 2018",
  start = "0",
  sortFields = data.frame(field = "score", order = "desc"),
  rows = "100"
)

TMP <- httr::POST(
  url = "http://www.insee.fr/fr/solr/consultation",
  body = body,
  config = config(http_version = 1.1),
  encode = "json", verbose()
)

# as = "text" returns the raw response body as a string for fromJSON()
data <- fromJSON(content(TMP, as = "text", encoding = "UTF-8"))

print(data$documents$titre)
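
When a request like this comes back with a 500, it can help to look at the exact JSON that would be sent. The sketch below is an illustration, not part of the code above: `build_body()` is a hypothetical helper, and it assumes that httr's `encode = "json"` serializes the body roughly as `jsonlite::toJSON(body, auto_unbox = TRUE)` does. It also decodes the percent-encoded query, since both answers here send `q` as plain text.

```r
library(jsonlite)

# Hypothetical helper: build the request body from a query string.
build_body <- function(query, start = 0, rows = 20) {
  list(
    # Turn "%20" etc. back into plain characters before sending as JSON
    q = utils::URLdecode(query),
    start = as.character(start),
    sortFields = data.frame(field = "score", order = "desc"),
    rows = as.character(rows)
  )
}

body <- build_body("Emploi-Population%20active%20en%202018", rows = 100)

# Assumption: this approximates what httr does for encode = "json"
payload <- toJSON(body, auto_unbox = TRUE)
cat(payload, "\n")
```

Comparing the printed string with the request body shown in the browser's inspector should make any structural difference easy to spot.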

咆哮 2025-02-10 06:15:31

I found that passing the json as a string worked fine:

library(httr)

json <- paste0('{"q":"Emploi-Population active en 2018 ",',
            '"start":"0","sortFields":[{"field":"score","order":"desc"}],',
            '"filters":[],"rows":"20","facetsQuery":[]}')

url <- paste0('https://www.insee.fr/fr/solr/consultation?q=Emploi-Population',
               '%20active%20en%202018%20')

res <- POST(url, body = json, content_type_json())
output <- content(res)

Now output is a massive list, but here for example are the document titles:

sapply(output$documents, function(x) x$titre)
#>  [1] "Emploi-Population active en 2018"                                                                          
#>  [2] "Emploi – Population active"                                                                                
#>  [3] "Dossier complet"                                                                                           
#>  [4] "Base du dossier complet"                                                                                   
#>  [5] "Emploi-Population active en 2017"                                                                          
#>  [6] "Comparateur de territoire"                                                                                 
#>  [7] "Emploi – Population active"                                                                                
#>  [8] "L'essentiel sur... les entreprises"                                                                        
#>  [9] "Emploi - population active en 2014"                                                                        
#> [10] "Population active"                                                                                         
#> [11] "Emploi salarié et non salarié par activité"                                                                
#> [12] "Évolution de l'emploi"                                                                                     
#> [13] "Logements, individus, activité, mobilités scolaires et professionnelles, migrations résidentielles en 2018"
#> [14] "Emploi selon le sexe et l’âge"                                                                             
#> [15] "Statut d’emploi et type de contrat selon le sexe et l’âge"                                                 
#> [16] "Sous-emploi selon le sexe et l’âge"                                                                        
#> [17] "Emploi salarié par secteur"                                                                                
#> [18] "Fiche - Chômage"                                                                                           
#> [19] "Emploi-Activité en 2018"                                                                                   
#> [20] "Activité professionnelle des individus : lieu de travail localisé à la zone d'emploi en 2017"

Created on 2022-05-31 by the reprex package (v2.0.1)
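
Since the question was about the id of each item, the parsed list can be flattened into a data frame. This is a minimal sketch on mock data: it assumes each element of `output$documents` is a list with a `titre` field (as in the output above) and an `id` field. The `id` name, the mock values, and the helpers `field_or_na()` and `docs_to_df()` are assumptions for illustration, so check the real field names with `names(output$documents[[1]])`.

```r
# Mock of the parsed response; the real one comes from content(res).
output <- list(documents = list(
  list(id = "a1", titre = "Emploi-Population active en 2018"),
  list(id = "b2", titre = "Dossier complet"),
  list(titre = "Population active")  # a document without an id field
))

# Return the named field as a single string, or NA when it is absent
field_or_na <- function(doc, name) {
  value <- doc[[name]]
  if (is.null(value)) NA_character_ else as.character(value)[1]
}

# Hypothetical helper: one row per document, one column per field
docs_to_df <- function(documents) {
  data.frame(
    id = vapply(documents, field_or_na, character(1), name = "id"),
    titre = vapply(documents, field_or_na, character(1), name = "titre"),
    stringsAsFactors = FALSE
  )
}

df <- docs_to_df(output$documents)
print(df)
```

Using `vapply()` with an explicit NA fallback keeps the columns the same length even when some documents lack a field, which a bare `sapply()` would not guarantee.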
