How do you scrape websites like Hype Machine?

Posted 2024-09-12 04:00:35


I'm curious about website scraping (i.e. how it's done, etc.); specifically, I'd like to write a script to perform the task for the site Hype Machine.
I'm actually a Software Engineering undergraduate (4th year); however, we don't really cover any web programming, so my understanding of JavaScript/RESTful APIs/all things web is pretty limited, as we're mainly focused on theory and client-side applications.
Any help or directions greatly appreciated.


Comments (4)

离去的眼神 2024-09-19 04:00:35


The first thing to look for is whether the site already offers some sort of structured data, or if you need to parse through the HTML yourself. Looks like there is an RSS feed of latest songs. If that's what you're looking for, it would be good to start there.

You can use a scripting language to download the feed and parse it. I use Python, but you could pick a different scripting language if you like. There are docs on how to download a URL in Python and how to parse XML in Python.

Another thing to be conscious of when you write a program that downloads a site or RSS feed is how often your scraping script runs. If you have it run constantly so that you'll get the new data the second it becomes available, you'll put a lot of load on the site, and there's a good chance they'll block you. Try not to run your script more often than you need to.
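The workflow this answer describes can be sketched with just the standard library: download a feed, parse the XML, and wait politely between fetches. This is a minimal sketch, not Hype Machine's actual feed format; the feed URL is hypothetical, and parsing is demonstrated on an inline sample so the snippet runs without network access.

```python
# Minimal scraping sketch using only the Python standard library.
# The feed URL and RSS sample below are made up for illustration.
import urllib.request
import xml.etree.ElementTree as ET

def fetch(url: str) -> str:
    """Download a URL and return its body as text."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

def parse_items(rss_text: str) -> list[tuple[str, str]]:
    """Return (title, link) pairs from an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [(item.findtext("title", ""), item.findtext("link", ""))
            for item in root.iter("item")]

# Inline sample standing in for a downloaded feed.
sample = """<rss version="2.0"><channel>
  <item><title>Track A</title><link>http://example.com/a</link></item>
  <item><title>Track B</title><link>http://example.com/b</link></item>
</channel></rss>"""

print(parse_items(sample))

# In a real scraper you would poll sparingly, per the advice above, e.g.:
#   while True:
#       items = parse_items(fetch("http://example.com/feed.rss"))
#       ...do something with items...
#       time.sleep(3600)  # once an hour is usually plenty
```

The polling loop is left as a comment deliberately: how often to run it depends on the site, and the point of the answer is to keep that frequency as low as you can.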

携余温的黄昏 2024-09-19 04:00:35


You may want to check the following books:

"Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL"
http://www.amazon.com/Webbots-Spiders-Screen-Scrapers-Developing/dp/1593271204

"HTTP Programming Recipes for C# Bots"
http://www.amazon.com/HTTP-Programming-Recipes-C-Bots/dp/0977320677

"HTTP Programming Recipes for Java Bots"
http://www.amazon.com/HTTP-Programming-Recipes-Java-Bots/dp/0977320669

嘿看小鸭子会跑 2024-09-19 04:00:35


I believe the most important thing you must analyze is what kind of information you want to extract. If you want to crawl entire websites the way Google does, your best option is probably to look at tools like Nutch from Apache.org (nutch.apache.org) or the Flaptor solution at http://ww.hounder.org. If you need to extract particular areas from unstructured data documents (websites, docs, PDFs), you can probably extend Nutch plugins to fit your particular needs.

On the other hand, if you need to extract particular text or clipped regions of a website, where you set rules using the page's DOM, what you need is probably closer to tools like mozenda.com. With those tools you can set up extraction rules to scrape specific information from a website. Keep in mind that any change to the webpage can break your robot.

Finally, if you are planning to develop a website built on information sources, you could purchase information from companies such as spinn3r.com, which sell particular niches of information ready to be consumed. You will be able to save a lot of money on infrastructure.
Hope it helps!
Sebastian.
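The DOM-rule approach described above (extract elements that match a selector-like rule) can be sketched with Python's built-in `html.parser`. The tag and class names here are made up for illustration; a real rule would target whatever markup the site actually uses.

```python
from html.parser import HTMLParser

class TrackTitleExtractor(HTMLParser):
    """Collects the text of elements matching a simple DOM rule:
    any <h2> tag whose class attribute contains "track"."""
    def __init__(self):
        super().__init__()
        self._in_match = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "")
        if tag == "h2" and "track" in classes.split():
            self._in_match = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_match = False

    def handle_data(self, data):
        if self._in_match and data.strip():
            self.titles.append(data.strip())

# Hypothetical page fragment standing in for a downloaded HTML document.
html_doc = """
<div><h2 class="track">Song One</h2><p>blurb</p>
<h2 class="track">Song Two</h2><h2 class="other">Ignore me</h2></div>
"""
parser = TrackTitleExtractor()
parser.feed(html_doc)
print(parser.titles)  # ['Song One', 'Song Two']
```

This also illustrates the fragility the answer warns about: if the site renames the `track` class or switches `<h2>` to another tag, the rule silently matches nothing.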

踏雪无痕 2024-09-19 04:00:35


Python has the feedparser module, located at feedparser.org, which handles RSS in its various flavours and Atom in its various flavours. No reason to reinvent the wheel.
