I have saved a lot of web pages as text (.txt files). These are public profile pages from a social media site. I want a rough measure of how much content is on each profile page. When I save these text files as .html and open them in a browser, I can see the profile as presented. But the text file itself is a poor indication of how developed the content on the profile page is. If I do character counts on the raw file, the result is completely uncorrelated with how developed the viewable profile is (so I learned that HTML files as such are not good proxies for what shows up when you view the page, since a lot of the text never gets rendered in the browser window).
The typical parsing functions in R for extracting text from .html files seem to drop a lot of the content - I think these profile pages are not very well structured.
I can open these files in an application like Chrome from R. But is there a way (programmatically, from R) to copy the text rendered in Chrome and paste it into another file, as a way of measuring the text that appears in these profiles? I would like to create something automated in R and loop it over all the files.
I'll place a Dropbox link to example files (input and output) here -> https://www.dropbox.com/sh/4fqxwbj74tnfaxq/AACtexD7OVYYrMoTDrudbacba?dl=0. The file "test2_simple_pagecode.txt" has the page source code of a sample profile. One could change its extension to .html, open it in a browser, and view the page. What I want to do is open that file in a browser window, then copy the text of the entire page and paste it into a separate file, like the example in "test2_simple_cutpaste.txt". This way, the new file only has the words that are actually seen in the profile.
Comments (1)
This page relies heavily on JavaScript to render its content. I suggest looking into RSelenium to process the page. RSelenium will be able to run the JavaScript, and you would then be able to use the "rvest" package to extract the information of interest.
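As a rough illustration of the extraction step only, here is a minimal sketch that takes page source that has already been rendered (however it was obtained) and measures the text a reader would see. It assumes the xml2 and rvest packages are installed; the helper names `visible_text` and `visible_char_count` are hypothetical, and dropping `script` and `style` nodes is a simplification of what a browser really displays.

```r
# Sketch only: assumes the xml2 and rvest packages are installed.
library(xml2)
library(rvest)

# Hypothetical helper: return the text of <body> without script/style content.
visible_text <- function(html) {
  doc <- read_html(html)
  # Drop nodes whose contents are never shown as page text.
  xml_remove(xml_find_all(doc, "//script | //style"))
  # html_text2() approximates how the text is laid out in a browser.
  html_text2(html_element(doc, "body"))
}

# Hypothetical helper: count the non-whitespace characters of the visible text.
visible_char_count <- function(html) {
  nchar(gsub("\\s+", "", visible_text(html)))
}

# Small made-up page used only to exercise the helpers.
sample_html <- paste0(
  "<html><head><style>p{color:red}</style></head><body>",
  "<p>Hello</p><script>var x = 1;</script><p>World</p>",
  "</body></html>"
)
txt <- visible_text(sample_html)
n <- visible_char_count(sample_html)
```

To loop over many saved profiles, the same helper could be applied to each file, for example with `sapply()` over the result of `list.files()`, since `read_html()` also accepts a file path.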
Here is a very quick and very dirty way to extract the information stored in the person's profile, although a lot of extraneous information is stored there as well.
It appears that the information in the profile is stored as JSON data inside a comment in the HTML code. The example below extracts that comment, removes the Unicode character, and parses the JSON data.
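The approach described above can be sketched roughly as follows. This is not the original code from the answer; it is a minimal reconstruction that assumes the xml2 and jsonlite packages are installed. The helper name `extract_comment_json` is hypothetical, and so is the guess that the troublesome Unicode sequence is an escaped hyphen (`\u002d`); adjust the `escape` argument to whatever actually appears in the saved pages.

```r
# Sketch only: assumes the xml2 and jsonlite packages are installed.
library(xml2)
library(jsonlite)

# Hypothetical helper: parse every HTML comment that looks like a JSON object.
# `escape` and `replacement` are assumptions about the stray Unicode sequence.
extract_comment_json <- function(html, escape = "\\u002d", replacement = "-") {
  doc <- read_html(html)
  # The XPath comment() test selects comment nodes anywhere in the document.
  comments <- trimws(xml_text(xml_find_all(doc, "//comment()")))
  # Keep only comments whose content starts like a JSON object.
  candidates <- comments[startsWith(comments, "{")]
  # Replace the escaped sequence literally (fixed = TRUE, so no regex is involved).
  cleaned <- gsub(escape, replacement, candidates, fixed = TRUE)
  lapply(cleaned, fromJSON)
}

# Made-up page: the profile data sits in a comment inside a <code> element.
sample_html <- paste0(
  "<html><body><code id=\"x\"><!--",
  "{\"name\":\"A\",\"count\":\\u002d1}",
  "--></code></body></html>"
)
profiles <- extract_comment_json(sample_html)
```

Each element of the returned list is a parsed R list, so fields can be inspected with `str(profiles[[1]])`. For a saved file, the file path can be passed in place of the HTML string, since `read_html()` accepts either.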