How to "pull" specific data from an HTML file and process it
I am new to programming and I have a question about how to pull specific information from a page on a website, crunch the data to check whether it meets certain parameters, and store the URLs of the pages that meet them.
The problem is this:
- There is a website with several articles.
- I would like to be able to make a list of the URLs of articles on the site that contain fewer than x words.
I don't need help with the coding or anything because I am new to this and this is essentially a self-exercise for me to learn to program.
I just have questions about how to go about this. I know HTML and minimal Ruby, and that's the extent of my knowledge.
I just don't know how to "pull" the data from a webpage. :S What would I use to pull HTML? What do I do with the HTML after I pull it? Convert it to Ruby? If so, how?
Comments (1)
Start with:
Nokogiri is a library for parsing HTML/XML documents in Ruby. Have a look at its website for more information on how to install and use it.