How can I screen scrape an ASP page using js/coffee?

Posted 2024-11-17 23:48:53


I've got a website that I'd like to pull data from and it's really stuck in the stone ages. There's no web service, no API and it's very much an ASP/Session/table-based-layout page. Pretty fugly.

I'd like to just screen scrape it and use js (coffeescript) to automate that. I wonder if this is possible. I could do this with C# and LINQPad but then I'm stuck parsing the tables (and sub-tables and sub-sub-tables) with regex. Plus if I do it with js or coffeescript I'll get much more comfortable with those languages and I'll be able to use jQuery for pulling elements out of the DOM.

I see two possibilities here:

  • use C# and find a library that will do things like jQuery but in C# code
  • use coffeescript (js) and use jQuery to find the elements that I'm looking for in the page

I'd also like to automate the page a bit (get next set of results). This is strictly for personal use -- I'm not pulling results of someone's search to use in my business. I just want to make a crappy search engine do what I want.
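For the "get next set of results" part: on an old ASP site, paging is often just the same URL re-requested with a different query-string parameter, so the automation can be a loop over generated URLs. A minimal sketch (the parameter name `page` and the example URL are made-up placeholders; the real site's parameter may differ):

```javascript
// Hypothetical helper: build the URL for a given results page by setting
// a page-number query parameter on the base search URL.
function buildPageUrl(base, page) {
    var u = new URL(base);
    u.searchParams.set('page', String(page));
    return u.toString();
}

// Generate the first three result-page URLs to feed to a scraper.
for (var p = 1; p <= 3; p++) {
    console.log(buildPageUrl('http://example.com/search.asp?q=test', p));
}
```

Sites that track paging in session state rather than the URL won't yield to this trick and need the form postback replayed instead.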


Comments (3)

dawn曙光 2024-11-24 23:48:54


I wrote a class that allows you to supply a bunch of urls and a code block to scrape pages inside a chrome extension. You can find the github repo here: https://github.com/jkarmel/Executor. It could use some more testing and I need to work on the documentation, but it looks like it might be what you are looking for.

Here is how you would use it to get all the links from a few different pages:

/*
* background.js by Jeremy Karmel. 
*/

var URLS = ['http://www.apple.com/',
            'http://www.google.com/',
            'http://www.facebook.com/',
            'http://www.stanford.edu'];

// This function will be provided to the executor to collect information
var getLinks = function() {
    var links = [];
    var $links = $('a');
    $links.each(function(i, val) { links.push(val.href); });
    var request = {data: links, url: window.location.href};
    chrome.extension.sendRequest(request);
};

var main = function() {
    var specForUsersTopics = {
        urls     : URLS,
        code     : getLinks,

        callback : function(results) {
            for (var url in results) {
                console.log(url + ' has ' + results[url].length + ' links.');
                var links = results[url];
                for (var i = 0; i < links.length; i++) 
                    console.log('   ' + links[i]);
            }
            console.log('all done!!!!');
        }
    };
    var exec = Executor(specForUsersTopics);
    exec.start();
}

main();

So basically the code to collect the links would be supplied to the executor instance, and then you would do whatever you wanted with the results in the callback. It can deal with longish lists of urls (~1000) and it will work on more than one at a time (default == 5). It doesn't handle errors in the code block very well right now, so be sure to test the code you are supplying.

隔纱相望 2024-11-24 23:48:54


I'm liking Curtain A) "use C# and find a library..."

"HTML Agility Pack" might be just what you're looking for:

http://htmlagilitypack.codeplex.com/

老娘不死你永远是小三 2024-11-24 23:48:54


You can do it easily with Node.js, jsdom, and jQuery. See this tutorial (in JavaScript).
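To make the Node.js + jsdom route concrete, here is a minimal sketch. The HTML and the `#results a` selector are placeholders, and jsdom must be installed separately (`npm install jsdom`); the extraction helper itself only uses standard DOM calls, so it works against jsdom's document or a browser's:

```javascript
// Pull href attributes out of any DOM-like document via a CSS selector.
function extractLinks(document, selector) {
    return Array.from(document.querySelectorAll(selector))
        .map(function (a) { return a.getAttribute('href'); });
}

// Usage with jsdom (assumes `npm install jsdom`):
//   const { JSDOM } = require('jsdom');
//   const dom = new JSDOM('<div id="results"><a href="/r.asp?p=2">next</a></div>');
//   extractLinks(dom.window.document, '#results a');  // ['/r.asp?p=2']
```

The same helper works with jQuery loaded on top of jsdom if you prefer jQuery's selector syntax, which is the setup the tutorial-style articles for this approach typically describe.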
