Windows information-retrieval

Are there any popular web data scraping tools?

Posted 2024-08-15 15:01:37

I'm working on information extraction and need a tool to scrape data from web pages. Are there any popular ones for Windows?

Comments (1)

诠释孤独 2024-08-22 15:01:37

From: http://en.wikipedia.org/wiki/Web_crawler:

Aspseek is a crawler, indexer and search engine written in C and licensed under the GPL.
arachnode.net is a .NET web crawler written in C# using SQL 2008 and Lucene.
DataparkSearch is a crawler and search engine released under the GNU General Public License.
GNU Wget is a command-line-operated crawler written in C and released under the GPL. It is typically used to mirror Web and FTP sites.
GRUB is an open source distributed search crawler that Wikia Search ( http://wikiasearch.com ) uses to crawl the web.
Heritrix is the Internet Archive's archival-quality crawler, designed for archiving periodic snapshots of a large portion of the Web. It is written in Java.
ht://Dig includes a Web crawler in its indexing engine.
HTTrack uses a Web crawler to create a mirror of a web site for off-line viewing. It is written in C and released under the GPL.
ICDL Crawler is a cross-platform web crawler written in C++, intended to crawl Web sites based on Web-site Parse Templates using only the computer's free CPU resources.
mnoGoSearch is a crawler, indexer and search engine written in C and licensed under the GPL.
Nutch is a crawler written in Java and released under an Apache License. It can be used in conjunction with the Lucene text-indexing package.
Pavuk is a command-line Web mirror tool with an optional X11 GUI crawler, released under the GPL. It has a bunch of advanced features compared to wget and httrack, e.g., regular-expression-based filtering and file creation rules.
YaCy is a free distributed search engine built on the principles of peer-to-peer networks (licensed under the GPL).
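
If none of the tools above fits and you only need to pull links or data out of a handful of pages, the core step of any crawler (fetch a page, collect the links on it) can be sketched with nothing but the Python standard library. This is a minimal illustration, not how any of the listed tools work internally; `LinkExtractor`, `extract_links` and `fetch` are names made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects absolute http(s) URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL and drop "#fragment".
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                # Skip mailto:, javascript: and similar schemes, and duplicates.
                if absolute.startswith(("http://", "https://")) and absolute not in self.links:
                    self.links.append(absolute)


def extract_links(html, base_url):
    """Return the unique absolute links found in an HTML string, in page order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


def fetch(url, timeout=10):
    """Download a page and decode it to text (hypothetical helper, no retries)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `extract_links(fetch(url), url)` on a start page and repeating it for each returned link gives a very basic crawl loop. A real crawler also needs rate limiting, error handling and a visited set, which is what the tools listed above provide.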

And some reading:
Spidering Hacks: 100 Industrial-Strength Tips & Tools:

Written for developers, researchers, technical assistants, librarians, and power users, Spidering Hacks provides expert tips on spidering and scraping methodologies. You'll begin with a crash course in spidering concepts, tools (Perl, LWP, out-of-the-box utilities), and ethics (how to know when you've gone too far: what's acceptable and unacceptable). Next, you'll collect media files and data from databases. Then you'll learn how to interpret and understand the data, repurpose it for use in other applications, and even build authorized interfaces to integrate the data into your own content.
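
On the ethics point, one common convention (my assumption here, not something quoted from the book) is to honor a site's robots.txt before fetching a page. A rough sketch with Python's standard `urllib.robotparser` follows; the robots.txt content, the `example-bot` user agent and the helper names are all invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real crawler would download it from the site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def build_robots_checker(robots_txt):
    """Parse robots.txt text into a RobotFileParser without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, url, user_agent="example-bot"):
    """Return True if the parsed rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)
```

With the sample rules, `is_allowed` accepts a URL such as `http://example.com/index.html` and rejects anything under `/private/`. For a live site, `RobotFileParser` can also load the file itself via `set_url()` and `read()`.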

About the author

感情旳空白
