How can I screen scrape an ASP page using js/coffee?

Posted 2024-11-17 23:48:53


I've got a website that I'd like to pull data from and it's really stuck in the stone ages. There's no web service, no API and it's very much an ASP/Session/table-based-layout page. Pretty fugly.

I'd like to just screen scrape it and use js (coffeescript) to automate that. I wonder if this is possible. I could do this with C# and LINQPad but then I'm stuck parsing the tables (and sub-tables and sub-sub-tables) with regex. Plus if I do it with js or coffeescript I'll get much more comfortable with those languages and I'll be able to use jQuery for pulling elements out of the DOM.

I see two possibilities here:

  • use C# and find a library that will do things like jQuery but in C# code
  • use coffeescript (js) and use jQuery to find the elements that I'm looking for in the page

I'd also like to automate the page a bit (get next set of results). This is strictly for personal use -- I'm not pulling results of someone's search to use in my business. I just want to make a crappy search engine do what I want.
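For the "get next set of results" part: on an old ASP site, paging is often just the same URL re-requested with a different query-string parameter, so the automation can be a loop over generated URLs. A minimal sketch (the parameter name `page` and the example URL are made-up placeholders; the real site's parameter may differ):

```javascript
// Hypothetical helper: build the URL for a given results page by setting
// a page-number query parameter on the base search URL.
function buildPageUrl(base, page) {
    var u = new URL(base);
    u.searchParams.set('page', String(page));
    return u.toString();
}

// Generate the first three result-page URLs to feed to a scraper.
for (var p = 1; p <= 3; p++) {
    console.log(buildPageUrl('http://example.com/search.asp?q=test', p));
}
```

Sites that track paging in session state rather than the URL won't yield to this trick and need the form postback replayed instead.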


Comments (3)

dawn曙光 2024-11-24 23:48:54


I wrote a class that allows you to supply a bunch of urls and a code block to scrape pages inside a chrome extension. You can find the github repo here: https://github.com/jkarmel/Executor. It could use some more testing and I need to work on the documentation, but it looks like it might be what you are looking for.

Here is how you would use it to get all the links from a few different pages:

/*
* background.js by Jeremy Karmel. 
*/

var URLS = ['http://www.apple.com/',
            'http://www.google.com/',
            'http://www.facebook.com/',
            'http://www.stanford.edu'];

// This function will be provided to the executor to collect information
var getLinks = function() {
    var links = [];
    var $links = $('a');
    $links.each(function(i, val) { links.push(val.href); });
    var request = {data: links, url: window.location.href};
    chrome.extension.sendRequest(request);
};

var main = function() {
    var specForUsersTopics = {
        urls     : URLS,
        code     : getLinks,

        callback : function(results) {
            for (var url in results) {
                console.log(url + ' has ' + results[url].length + ' links.');
                var links = results[url];
                for (var i = 0; i < links.length; i++) 
                    console.log('   ' + links[i]);
            }
            console.log('all done!!!!');
        }
    };
    var exec = Executor(specForUsersTopics);
    exec.start();
}

main();

So basically the code to collect the links would be supplied to the executor instance, and then you would do whatever you wanted with the results in the callback. It can deal with longish lists of urls (~1000) and it will work on more than one at a time (default == 5). It doesn't handle errors in the code block very well right now, so be sure to test the code you are supplying.

隔纱相望 2024-11-24 23:48:54


I'm liking Curtain A) "use C# and find a library..."

"HTML Agility Pack" might be just what you're looking for:

http://htmlagilitypack.codeplex.com/

老娘不死你永远是小三 2024-11-24 23:48:54


You can do it easily with Node.js, jsdom, and jQuery. See this tutorial (in JavaScript).
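To make the Node.js + jsdom route concrete, here is a minimal sketch. The HTML and the `#results a` selector are placeholders, and jsdom must be installed separately (`npm install jsdom`); the extraction helper itself only uses standard DOM calls, so it works against jsdom's document or a browser's:

```javascript
// Pull href attributes out of any DOM-like document via a CSS selector.
function extractLinks(document, selector) {
    return Array.from(document.querySelectorAll(selector))
        .map(function (a) { return a.getAttribute('href'); });
}

// Usage with jsdom (assumes `npm install jsdom`):
//   const { JSDOM } = require('jsdom');
//   const dom = new JSDOM('<div id="results"><a href="/r.asp?p=2">next</a></div>');
//   extractLinks(dom.window.document, '#results a');  // ['/r.asp?p=2']
```

The same helper works with jQuery loaded on top of jsdom if you prefer jQuery's selector syntax, which is the setup the tutorial-style articles for this approach typically describe.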
