Is there a way to count the characters on a web page opened from R?

Posted 2025-01-31 19:17:14

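I have saved a large number of pages as text (.txt files). They are public profile pages from a social media site, and I want a rough measure of how much content is on each profile. When I save these text files as .html and open them in a browser, I can see the rendered profile. But the amount of text in the file is completely unrelated to how developed the viewable profile is: I have learned that the size of the HTML file is not a good proxy for what is displayed, because a lot of the text is never rendered in the browser window.

The typical parsing functions for extracting text from .html files in R seem to drop a lot of the content; I don't think these profile pages are very well structured.

I can open these files from R in an application like Chrome. But is there a way, programmatically from R, to cut and paste the text from Chrome into another file, as a way of measuring the text that actually appears on these profiles? I would like to build something automated in R and then loop over the files.

I will put a Dropbox link to example files (input and output) here -> . The file "test2_simple_pagecode.txt" contains the page source of an example profile. You can change its extension to .html, drop it into a browser, and see the page. What I want to do is open that file in a browser window, then cut and paste the text of the whole page into a separate file, like the example in "test2_simple_cutpaste.txt". That way the new file contains only the words actually seen on the profile.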

新人笑 2025-02-07 19:17:14

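This page relies heavily on JavaScript for rendering. I suggest looking into RSelenium to process it: RSelenium can execute the JavaScript, and you can then use the rvest package to extract the information of interest.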
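As a rough illustration of the RSelenium route, the sketch below renders one saved page in a real browser and counts the characters of the visible text, which is roughly what copying the whole page by hand would give. The helper names `to_file_url` and `count_rendered_chars`, the `profiles` folder, and the choice of Firefox are assumptions for illustration, not part of the original question; the browser session itself is left in comments because it needs a working driver setup.

```r
# Sketch only: the usage part needs the RSelenium package and a browser driver.

# Build a file:// URL for a local file (also handles Windows drive letters).
to_file_url <- function(path) {
  p <- normalizePath(path, winslash = "/", mustWork = TRUE)
  if (!startsWith(p, "/")) p <- paste0("/", p)
  paste0("file://", p)
}

# Hypothetical helper: render one saved page and count its visible characters.
count_rendered_chars <- function(remDr, path) {
  # Browsers display .txt files as plain text, so use a temporary .html copy.
  html_copy <- tempfile(fileext = ".html")
  file.copy(path, html_copy)
  on.exit(unlink(html_copy))

  remDr$navigate(to_file_url(html_copy))
  body <- remDr$findElement(using = "css selector", value = "body")
  nchar(body$getElementText()[[1]])
}

# Usage (not run here; assumes Firefox and its driver can be started):
# rD <- RSelenium::rsDriver(browser = "firefox", verbose = FALSE)
# remDr <- rD$client
# files <- list.files("profiles", pattern = "\\.txt$", full.names = TRUE)
# counts <- vapply(files, function(f) count_rendered_chars(remDr, f), integer(1))
# remDr$close()
# rD$server$stop()
```

If the page fills in its content asynchronously, a short `Sys.sleep()` after `navigate()` may be needed before reading the text. To work with rvest instead, the rendered source from `remDr$getPageSource()[[1]]` could be passed to `rvest::read_html()`.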

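Here is a very quick and very dirty way to extract the information stored in the person's profile, though a lot of extraneous information is stored there as well.

The profile information appears to be stored as JSON data inside a comment in the HTML code. The example below extracts that comment, replaces the escaped Unicode character (\u002d), and parses the JSON data.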

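library(stringr)

# Read the saved page source and collapse it into a single string.
lines <- readLines("test2_simple_pagecode.txt", warn = FALSE)
alllines <- paste(lines, collapse = " ")

# Extract the HTML comment that holds the profile's JSON payload.
output <- stringr::str_extract(alllines, "<!--\\{\"content\"\\:\\{\"Notes\".+?-->")
nchar(output)

# Replace the literal escape sequence \u002d (a hyphen) with a space, then
# strip the leading "<!--" and trailing "-->" before parsing the JSON.
output2 <- gsub("\\\\u002d", " ", output)
jsonlite::parse_json(substr(output2, 5, nchar(output2) - 3))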
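To turn the parsed JSON into the rough size measure the question asks for, one option is to sum the lengths of every string value in the nested list. The sketch below does this on a made-up one-line page; the toy `page` string, the looser comment pattern, and the helper name `count_json_chars` are assumptions for illustration, and the real profile data will contain extraneous fields that inflate the count.

```r
library(jsonlite)

# Hypothetical helper: count characters of all string values in the JSON
# stored inside the first HTML comment that starts with "{".
count_json_chars <- function(page_source) {
  m <- regmatches(page_source,
                  regexpr("<!--\\{.+?-->", page_source, perl = TRUE))
  if (length(m) == 0) return(0L)

  # Drop the leading "<!--" and trailing "-->" before parsing.
  parsed <- jsonlite::parse_json(substr(m, 5, nchar(m) - 3))

  # Keep only character leaves of the nested list; keys are not counted.
  strings <- rapply(parsed, function(x) x,
                    classes = "character", how = "unlist")
  sum(nchar(strings))
}

# Toy stand-in for a real page source (structure is an assumption).
page <- '<html><!--{"content":{"Notes":"hello","Name":"Ann"}}--></html>'
count_json_chars(page)  # 8: "hello" (5) + "Ann" (3)
```

For many files, the same helper could be applied to each collapsed page source with `vapply()` over `list.files()`.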
