How can I turn a dynamic site into a static site that can be demoed from a CD?

Asked 2024-07-05 16:52:49

I need to find a way to crawl one of our company's web applications and create a static site from it that can be burned to a CD and used by traveling salespeople to demo the web site. The back-end data store is spread across many, many systems, so simply running the site on a VM on the salesperson's laptop won't work. And they won't have access to the internet while at some clients (no internet, no cell coverage... primitive, I know).

Does anyone have any good recommendations for crawlers that can handle things like link cleanup, Flash, a little AJAX, CSS, etc.? I know the odds are slim, but I figured I'd throw the question out here before I jump into writing my own tool.

Comments (5)

如果没结果 2024-07-12 16:52:49

Just because nobody has copy-pasted a working command ... I am trying ... ten years later. :D

wget --mirror --convert-links --adjust-extension --page-requisites \
--no-parent http://example.org

It worked like a charm for me.
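
If the discs will be browsed on Windows laptops, it may also help to keep the generated filenames Windows-safe and to stop wget from wandering onto other hosts. A hedged variation on the same command (example.org is still just a placeholder for the real application):

wget --mirror --convert-links --adjust-extension --page-requisites \
--no-parent --restrict-file-names=windows \
--domains example.org http://example.org

--restrict-file-names=windows rewrites characters such as ?, :, and * that are legal in URLs but not in Windows (or ISO 9660/Joliet) filenames, which matters once the mirror is burned to a CD.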

画▽骨i 2024-07-12 16:52:49

By using a web crawler, e.g. one of these (a sample HTTrack invocation follows the list):

  • DataparkSearch is a crawler and search engine released under the GNU General Public License.
  • GNU Wget is a command-line operated crawler written in C and released under the GPL. It is typically used to mirror web and FTP sites.
  • HTTrack uses a Web crawler to create a mirror of a web site for off-line viewing. It is written in C and released under the GPL.
  • ICDL Crawler is a cross-platform web crawler written in C++ and intended to crawl websites based on Website Parse Templates, using only the computer's free CPU resources.
  • JSpider is a highly configurable and customizable web spider engine released under the GPL.
  • Larbin by Sebastien Ailleret
  • Webtools4larbin by Andreas Beder
  • Methabot is a speed-optimized web crawler and command line utility written in C and released under a 2-clause BSD License. It features a wide configuration system, a module system and has support for targeted crawling through local filesystem, HTTP or FTP.
  • Jaeksoft WebSearch is a web crawler and indexer built on top of Apache Lucene. It is released under the GPL v3 license.
  • Nutch is a crawler written in Java and released under an Apache License. It can be used in conjunction with the Lucene text indexing package.
  • Pavuk is a command-line web mirror tool with an optional X11 GUI crawler, released under the GPL. It has a bunch of advanced features compared to wget and HTTrack, e.g. regular-expression-based filtering and file creation rules.
  • WebVac is a crawler used by the Stanford WebBase Project.
  • WebSPHINX (Miller and Bharat, 1998) is composed of a Java class library that implements multi-threaded web page retrieval and HTML parsing, and a graphical user interface to set the starting URLs, to extract the downloaded data and to implement a basic text-based search engine.
  • WIRE - Web Information Retrieval Environment [15] is a web crawler written in C++ and released under the GPL, including several policies for scheduling the page downloads and a module for generating reports and statistics on the downloaded pages so it has been used for web characterization.
  • LWP::RobotUA (Langheinrich, 2004) is a Perl class for implementing well-behaved parallel web robots, distributed under Perl 5's license.
  • Web Crawler: an open-source web crawler class for .NET (written in C#).
  • Sherlock Holmes gathers and indexes textual data (text files, web pages, ...), both locally and over the network. Holmes is sponsored and commercially used by the Czech web portal Centrum. It is also used by Onet.pl.
  • YaCy, a free distributed search engine, built on principles of peer-to-peer networks (licensed under GPL).
  • Ruya is an open-source, high-performance, breadth-first, level-based web crawler. It is used to crawl English and Japanese websites in a well-behaved manner. It is released under the GPL and is written entirely in the Python language. A SingleDomainDelayCrawler implementation obeys robots.txt with a crawl delay.
  • Universal Information Crawler: a fast-developing web crawler that crawls, saves, and analyzes the data.
  • Agent Kernel: a Java framework for scheduling, threading, and storage management when crawling.
  • Spider News: information regarding building a spider in Perl.
  • Arachnode.NET is an open-source promiscuous web crawler for downloading, indexing and storing Internet content including e-mail addresses, files, hyperlinks, images, and web pages. Arachnode.net is written in C# using SQL Server 2005 and is released under the GPL.
  • dine is a multithreaded Java HTTP client/crawler that can be programmed in JavaScript released under the LGPL.
  • Crawljax is an Ajax crawler based on a method which dynamically builds a 'state-flow graph' modeling the various navigation paths and states within an Ajax application. Crawljax is written in Java and released under the BSD License.
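
Of these, HTTrack is probably the closest match for the "static mirror for offline viewing" requirement. As a rough sketch (example.org, the output directory, and the filter are placeholders, not taken from the question), a minimal invocation looks like:

httrack "http://example.org/" -O "./demo-mirror" "+*.example.org/*" -v

-O sets the output directory and the +*.example.org/* filter keeps the crawl on the one site; like wget, it captures the rendered HTML/CSS but not the server-side behaviour behind Flash or AJAX calls.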

╰つ倒转 2024-07-12 16:52:49

You're not going to be able to handle things like AJAX requests without burning a webserver to the CD, which I understand you have already said is impossible.

wget will download the site for you (use the -r parameter for "recursive"), but any dynamic content like reports and so on will of course not work properly; you'll just get a static snapshot.
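
One partial workaround for the AJAX pieces is to fetch the XHR endpoints yourself and save each response as a static file in the spot where the mirrored page expects to find it. The URL and path below are purely hypothetical, just to illustrate the idea:

# snapshot one AJAX response into the mirror (hypothetical endpoint and path)
wget -O mirror/api/report-summary.json "http://example.org/api/report-summary"

The demo will only ever show that one canned response, but for a guided sales walkthrough a frozen snapshot is often enough.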

梦途 2024-07-12 16:52:49

wget can recursively follow links and mirror an entire site (curl can fetch pages but doesn't recurse on its own), so that might be a good bet. You won't be able to use the truly interactive parts of the site, like search engines or anything that modifies the data, though.

Is it possible at all to create dummy backend services that can run from the sales folks' laptops, that the app can interface with?
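
If the dummy-service idea is worth exploring, one low-effort sketch is to dump canned JSON responses into a folder and serve them on the laptop with Python's built-in web server; the directory layout and port here are assumptions, not anything from the real app:

# serve canned responses locally so the app has something to talk to
cd /path/to/canned-responses    # e.g. reports/summary.json, customers/list.json
python3 -m http.server 8080

The application would then have to be pointed at http://localhost:8080 instead of the real back-end systems, which may or may not be practical depending on how its service endpoints are configured.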

街角卖回忆 2024-07-12 16:52:49

If you do end up having to run it off of a webserver, you might want to take a look at:

ServerToGo

It lets you run a WAMPP stack off of a CD, complete with MySQL/PHP/Apache support. The databases are copied to the current user's temp directory on launch, and the whole thing runs without the user installing anything!
