Getting data from an ASP.NET WebForm
I'm fairly new to web development and have never done any screen scraping or web crawling before, but yesterday a friend of mine asked me whether I would be able to grab some data from this website. The site is neither mine nor his, but the data is publicly available, even for download.
The problem with the data is that it is available only as one file per date or company, rather than one file covering multiple dates or companies. Getting it involves a lot of tedious clicking through the calendar, so he thought it would be nice if I could create an app that grabs all the data with one click and outputs it to a single file or something similar.
The website uses ASP.NET Web Forms with __doPostBack to retrieve the data for different dates. Even the links to download the data in XSL aren't the usual "href=…" links; they are, I assume, references to some ASP script…
To be honest, the only thing I tried was PHP cURL, which didn't work. Since this was my first time using cURL, I don't even know whether it failed because this is not possible with cURL or just because I don't know how to work with it.
I am only somewhat proficient in PHP and JavaScript, but not in ASP, though I wouldn't mind learning something new.
So my question is:
Is it at all possible to grab the data from a website like this? And if it is, would you be so kind as to give me some hints on how to approach this kind of problem?
The website, again, is here: http://extranet.net4gas.cz/capacity_ee.aspx
Thanks
Comments (2)
C# has a nice WebClient class to do the job:
Once you have the page HTML in a string, you can use regular expressions to scrape the content you are looking for.
Here is a very basic regular expression to give a hint:
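As a rough illustration of that approach (a Python sketch, not the C# the answer refers to): on a Web Forms page, __doPostBack fills the hidden fields __EVENTTARGET and __EVENTARGUMENT and submits the form, so a scraper can read the page's hidden inputs with a regular expression and post them back itself. The helper names below are hypothetical, and the control name "Calendar1" is only a placeholder; the real target and argument have to be read from the javascript:__doPostBack('…','…') links in the page.

```python
import re
from html import unescape
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Very basic patterns: an <input ...> tag, and name="value" pairs inside it.
INPUT_TAG = re.compile(r"<input\b[^>]*>", re.IGNORECASE)
ATTRIBUTE = re.compile(r'([\w:-]+)\s*=\s*"([^"]*)"')


def hidden_fields(html):
    """Return {name: value} for every <input type="hidden"> in the page."""
    fields = {}
    for tag in INPUT_TAG.findall(html):
        attrs = {key.lower(): value for key, value in ATTRIBUTE.findall(tag)}
        if attrs.get("type", "").lower() == "hidden" and "name" in attrs:
            fields[attrs["name"]] = unescape(attrs.get("value", ""))
    return fields


def postback_body(html, event_target, event_argument=""):
    """Build the form body that a __doPostBack(target, argument) click would send."""
    fields = hidden_fields(html)  # carries __VIEWSTATE and similar fields back
    fields["__EVENTTARGET"] = event_target
    fields["__EVENTARGUMENT"] = event_argument
    return urlencode(fields).encode("ascii")


def fetch(url, body=None):
    """GET the page, or POST it when a form body is given."""
    with urlopen(Request(url, data=body)) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    url = "http://extranet.net4gas.cz/capacity_ee.aspx"
    page = fetch(url)
    # "Calendar1" and "" are placeholders, not values taken from the real page.
    next_page = fetch(url, postback_body(page, "Calendar1", ""))
    print(len(next_page))
```

Each response contains a fresh set of hidden fields, so a loop over many dates would typically rebuild the body from the most recent page rather than reuse the first one.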
Marosko, as you said, the data on the website is open to the public, so you can certainly scrape it. The task now is to cut down the manual clicking through dates and extract the data automatically. I personally don't know much about how cURL would work here, but I am sure it will involve a lot of coding. I would rather suggest automating the entire process with an automation tool, such as a software application. Try Automation Anywhere: I bought it a few months back for some data extraction and it worked very well. It is automated, and you can check out its screen-scraping capabilities. It's my favorite :)
Charles