How can I turn a dynamic site into a static site that can be demoed from a CD?

Asked 2024-07-05 16:52:49

I need to find a way to crawl one of our company's web applications and create a static site from it that can be burned to a CD and used by traveling salespeople to demo the web site. The back-end data store is spread across many, many systems, so simply running the site on a VM on the salesperson's laptop won't work. And they won't have access to the internet while at some clients (no internet, no cell coverage... primitive, I know).

Does anyone have any good recommendations for crawlers that can handle things like link cleanup, Flash, a little AJAX, CSS, etc.? I know the odds are slim, but I figured I'd throw the question out here before I jump into writing my own tool.

Comments (5)

如果没结果 2024-07-12 16:52:49

Just because nobody has copy-pasted a working command ... I am trying ... ten years later. :D

wget --mirror --convert-links --adjust-extension --page-requisites \
--no-parent http://example.org

It worked like a charm for me.
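
If the discs will be browsed on Windows laptops, it may also help to keep the generated filenames Windows-safe and to stop wget from wandering onto other hosts. A hedged variation on the same command (example.org is still just a placeholder for the real application):

wget --mirror --convert-links --adjust-extension --page-requisites \
--no-parent --restrict-file-names=windows \
--domains example.org http://example.org

--restrict-file-names=windows rewrites characters such as ?, :, and * that are legal in URLs but not in Windows (or ISO 9660/Joliet) filenames, which matters once the mirror is burned to a CD.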

画▽骨i 2024-07-12 16:52:49

By using a web crawler, e.g. one of these (a sample HTTrack invocation follows the list):

  • DataparkSearch is a crawler and search engine released under the GNU General Public License.
  • GNU Wget is a command-line operated crawler written in C and released under the GPL. It is typically used to mirror web and FTP sites.
  • HTTrack uses a Web crawler to create a mirror of a web site for off-line viewing. It is written in C and released under the GPL.
  • ICDL Crawler is a cross-platform web crawler written in C++ and intended to crawl websites based on Website Parse Templates, using only the computer's free CPU resources.
  • JSpider is a highly configurable and customizable web spider engine released under the GPL.
  • Larbin by Sebastien Ailleret
  • Webtools4larbin by Andreas Beder
  • Methabot is a speed-optimized web crawler and command line utility written in C and released under a 2-clause BSD License. It features a wide configuration system, a module system and has support for targeted crawling through local filesystem, HTTP or FTP.
  • Jaeksoft WebSearch is a web crawler and indexer built on top of Apache Lucene. It is released under the GPL v3 license.
  • Nutch is a crawler written in Java and released under an Apache License. It can be used in conjunction with the Lucene text indexing package.
  • Pavuk is a command-line web mirror tool with an optional X11 GUI crawler, released under the GPL. It has a bunch of advanced features compared to wget and HTTrack, e.g. regular-expression-based filtering and file creation rules.
  • WebVac is a crawler used by the Stanford WebBase Project.
  • WebSPHINX (Miller and Bharat, 1998) is composed of a Java class library that implements multi-threaded web page retrieval and HTML parsing, and a graphical user interface to set the starting URLs, to extract the downloaded data and to implement a basic text-based search engine.
  • WIRE - Web Information Retrieval Environment [15] is a web crawler written in C++ and released under the GPL, including several policies for scheduling the page downloads and a module for generating reports and statistics on the downloaded pages so it has been used for web characterization.
  • LWP::RobotUA (Langheinrich, 2004) is a Perl class for implementing well-behaved parallel web robots, distributed under Perl 5's license.
  • Web Crawler: an open-source web crawler class for .NET (written in C#).
  • Sherlock Holmes gathers and indexes textual data (text files, web pages, ...), both locally and over the network. Holmes is sponsored and commercially used by the Czech web portal Centrum. It is also used by Onet.pl.
  • YaCy, a free distributed search engine, built on principles of peer-to-peer networks (licensed under GPL).
  • Ruya is an open-source, high-performance, breadth-first, level-based web crawler. It is used to crawl English and Japanese websites in a well-behaved manner. It is released under the GPL and is written entirely in the Python language. A SingleDomainDelayCrawler implementation obeys robots.txt with a crawl delay.
  • Universal Information Crawler: a fast-developing web crawler that crawls, saves, and analyzes the data.
  • Agent Kernel: a Java framework for scheduling, threading, and storage management when crawling.
  • Spider News: information regarding building a spider in Perl.
  • Arachnode.NET is an open-source promiscuous web crawler for downloading, indexing and storing Internet content including e-mail addresses, files, hyperlinks, images, and web pages. Arachnode.net is written in C# using SQL Server 2005 and is released under the GPL.
  • dine is a multithreaded Java HTTP client/crawler that can be programmed in JavaScript released under the LGPL.
  • Crawljax is an Ajax crawler based on a method which dynamically builds a 'state-flow graph' modeling the various navigation paths and states within an Ajax application. Crawljax is written in Java and released under the BSD License.
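
Of these, HTTrack is probably the closest match for the "static mirror for offline viewing" requirement. As a rough sketch (example.org, the output directory, and the filter are placeholders, not taken from the question), a minimal invocation looks like:

httrack "http://example.org/" -O "./demo-mirror" "+*.example.org/*" -v

-O sets the output directory and the +*.example.org/* filter keeps the crawl on the one site; like wget, it captures the rendered HTML/CSS but not the server-side behaviour behind Flash or AJAX calls.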

╰つ倒转 2024-07-12 16:52:49

You're not going to be able to handle things like AJAX requests without burning a webserver to the CD, which I understand you have already said is impossible.

wget will download the site for you (use the -r parameter for "recursive"), but any dynamic content like reports and so on will of course not work properly; you'll just get a static snapshot.
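
One partial workaround for the AJAX pieces is to fetch the XHR endpoints yourself and save each response as a static file in the spot where the mirrored page expects to find it. The URL and path below are purely hypothetical, just to illustrate the idea:

# snapshot one AJAX response into the mirror (hypothetical endpoint and path)
wget -O mirror/api/report-summary.json "http://example.org/api/report-summary"

The demo will only ever show that one canned response, but for a guided sales walkthrough a frozen snapshot is often enough.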

梦途 2024-07-12 16:52:49

wget can recursively follow links and mirror an entire site (curl can fetch pages but doesn't recurse on its own), so that might be a good bet. You won't be able to use the truly interactive parts of the site, like search engines or anything that modifies the data, though.

Is it possible at all to create dummy backend services that can run from the sales folks' laptops, that the app can interface with?
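
If the dummy-service idea is worth exploring, one low-effort sketch is to dump canned JSON responses into a folder and serve them on the laptop with Python's built-in web server; the directory layout and port here are assumptions, not anything from the real app:

# serve canned responses locally so the app has something to talk to
cd /path/to/canned-responses    # e.g. reports/summary.json, customers/list.json
python3 -m http.server 8080

The application would then have to be pointed at http://localhost:8080 instead of the real back-end systems, which may or may not be practical depending on how its service endpoints are configured.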

街角卖回忆 2024-07-12 16:52:49

If you do end up having to run it off of a webserver, you might want to take a look at:

ServerToGo

It lets you run a WAMPP stack off of a CD, complete with MySQL/PHP/Apache support. The databases are copied to the current user's temp directory on launch, and the whole thing runs without the user installing anything!
