I have saved a large number of web pages as text (.txt) files. These are public profile pages from a social media site, and I want a rough measure of how much content is on each profile. When I save one of these text files as .html and open it in a browser, I can see the profile presented. But the raw text file is a poor indication of how developed the profile is: character counts on it are completely uncorrelated with how much content the viewable profile has. (So I learned that HTML files as such are not good proxies for what shows up when you view them, since a lot of their text never gets rendered in the browser window.)
The typical parsing functions in R for extracting text from .html files seem to drop a lot of the content. I think these profile pages are not very well structured.
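As a rough illustration of that kind of static parsing (a sketch, not necessarily the functions that were actually tried), the snippet below compares a raw character count with the text found in the markup of `<body>`. It assumes the xml2 package is installed; `static_text` and the demo HTML string are invented for the example. Text that the page builds with JavaScript, or that sits inside HTML comments, is not picked up this way.

```r
library(xml2)  # HTML parsing; assumed to be installed

# Text found in the static markup of <body>, ignoring <script> and <style>.
# `x` may be a file path or a literal HTML string.
static_text <- function(x) {
  page <- read_html(x)
  xml_remove(xml_find_all(page, "//script | //style"))
  body <- xml_find_first(page, "//body")
  text <- gsub("\\s+", " ", xml_text(body))  # collapse runs of whitespace
  trimws(text)
}

# Invented stand-in for a saved profile page.
demo_html <- paste0(
  "<html><head><script>var x = 1;</script></head>",
  "<body><p>Hello world</p><!--{\"hidden\":\"data\"}--></body></html>"
)

raw_chars  <- nchar(demo_html)               # counts markup, scripts and comments too
text_chars <- nchar(static_text(demo_html))  # counts only text nodes in the markup
cat("raw:", raw_chars, "static text:", text_chars, "\n")
```

For a saved page, `static_text("test2_simple_pagecode.txt")` would be the equivalent call.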
I can open these files in an application like Chrome from R. But is there a way, programmatically from R, to copy the text rendered in Chrome and paste it into another file, as a way of measuring the text that appears in these profiles? I would like to automate this from R and loop over all the files.
I'll place a Dropbox link to example files (input and output) here: https://www.dropbox.com/sh/4fqxwbj74tnfaxq/AACtexD7OVYYrMoTDrudbacba?dl=0. The file "test2_simple_pagecode.txt" contains the page source code of a sample profile. One could change its extension to .html, open it in a browser, and view the page. What I want to do is open that file in a browser window, then copy the text of the entire page and paste it into a separate file, like the example in "test2_simple_cutpaste.txt". This way, the new file has only the words that are actually seen in the profile.
This page relies heavily on JavaScript for rendering. I suggest looking into RSelenium to process the page. RSelenium can execute the JavaScript, and you can then use the "rvest" package to extract the information of interest.
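A minimal sketch of that approach, assuming RSelenium is installed together with a browser and driver it can start (Firefox here). The helper names `file_url` and `rendered_text`, the `profiles` folder, and the two-second wait are all assumptions made for the example, not something taken from the question.

```r
# Build a file:// URL for a local path (Unix or Windows style).
file_url <- function(path) {
  p <- normalizePath(path, winslash = "/", mustWork = FALSE)
  if (!startsWith(p, "/")) p <- paste0("/", p)  # e.g. "C:/x" -> "/C:/x"
  paste0("file://", p)
}

# Text the browser shows for one saved page.
# `remDr` is an RSelenium remoteDriver; `wait` (seconds) is a guessed pause
# that gives the page's JavaScript time to finish.
rendered_text <- function(remDr, txt_path, wait = 2) {
  html_path <- tempfile(fileext = ".html")  # copy so the browser sees .html
  file.copy(txt_path, html_path)
  on.exit(unlink(html_path))
  remDr$navigate(file_url(html_path))
  Sys.sleep(wait)
  body <- remDr$findElement(using = "css selector", value = "body")
  body$getElementText()[[1]]
}

# Example run; skipped in non-interactive sessions or without RSelenium.
if (interactive() && requireNamespace("RSelenium", quietly = TRUE)) {
  rD <- RSelenium::rsDriver(browser = "firefox", verbose = FALSE)
  remDr <- rD$client
  files <- list.files("profiles", pattern = "\\.txt$", full.names = TRUE)
  counts <- vapply(files, function(f) nchar(rendered_text(remDr, f)), integer(1))
  print(counts)
  remDr$close()
  rD$server$stop()
}
```

If the rendered HTML is wanted rather than just the visible text, `remDr$getPageSource()[[1]]` returns it as a string that can be handed to `rvest::read_html()`.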
Here is a very quick and very dirty way to extract the information stored in the person’s profile, though a lot of extraneous information is stored there as well.
It appears that the profile information is stored as JSON data in a comment in the HTML code. The example below extracts that comment, removes the Unicode character, and parses the JSON data.
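A minimal sketch of those three steps, assuming the xml2 and jsonlite packages. Which character has to be removed is not stated here, so the function takes it as the `strip` argument; the `\u002d` sequence, the helper name `comment_json`, and the field names in the demo string are invented for illustration.

```r
library(xml2)      # HTML parsing and XPath
library(jsonlite)  # JSON parsing

# Parse every HTML comment whose body looks like a JSON object.
# `x` may be a file path or a literal HTML string.
# `strip` is an optional literal string deleted from each comment before parsing.
comment_json <- function(x, strip = NULL) {
  page <- read_html(x)
  comments <- xml_text(xml_find_all(page, "//comment()"))
  if (!is.null(strip)) comments <- gsub(strip, "", comments, fixed = TRUE)
  comments <- trimws(comments)
  candidates <- comments[startsWith(comments, "{")]
  parsed <- lapply(candidates, function(txt) {
    # Comments that are not valid JSON are skipped rather than raising an error.
    tryCatch(fromJSON(txt, simplifyVector = FALSE), error = function(e) NULL)
  })
  Filter(Negate(is.null), parsed)
}

# Invented stand-in for a saved profile page.
demo_html <- paste0(
  "<html><body><code>",
  "<!--{\"firstName\":\"Ada\",\"headline\":\"Engineer\\u002d\"}-->",
  "</code><!-- an ordinary comment --></body></html>"
)

profile <- comment_json(demo_html, strip = "\\u002d")[[1]]
str(profile)

# One rough size measure: total characters across all values in the JSON.
content_chars <- nchar(paste(unlist(profile), collapse = " "))
```

On the real file the call would be something like `comment_json("test2_simple_pagecode.txt", strip = ...)`, followed by picking out the fields of interest from the returned list.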