Hi, I want to create a desktop app (probably in C#) that scrapes or manipulates a form on a third-party web page. Basically, I enter my data in the form in the desktop app, it goes away to the third-party website and, using a script or whatever in the background, enters my data there (including my login) and clicks the submit button for me. I just want to avoid loading up the browser!
Not having done much (any!) work in this area, I was wondering whether a scripting language like Perl, Python, Ruby, etc. would let me do this, or whether I should simply do all the scraping with C# and .NET. Which one is best, in your opinion?
I was thinking of a script because I may need to hook applications on other platforms into the same script (e.g. a Symbian mobile app, where I wouldn't be able to develop it in C# as I would for the desktop version).
It's not a web app, otherwise I might as well just use the original site. I realise it all sounds pointless, but automating this specific form would be a real time saver for me.
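In other words, the goal is to replay the HTTP request that the site's form sends, without a browser in the middle. Assuming the form is a plain POST with no JavaScript needed to submit it, a minimal C# sketch of the idea might look like this (the URL and field names are placeholders for whatever the real page uses):

```csharp
// Minimal sketch: submit the third-party form as a plain HTTP POST, no browser.
// The URL and field names are hypothetical -- copy the real ones from the page's <form> markup.
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

class FormSubmitter
{
    static async Task Main()
    {
        using (var client = new HttpClient())
        {
            var fields = new Dictionary<string, string>
            {
                ["username"] = "me@example.com",   // hypothetical field names
                ["password"] = "secret",
                ["quantity"] = "3"
            };

            // Post the form exactly as the browser would.
            HttpResponseMessage response = await client.PostAsync(
                "https://thirdparty.example.com/form/submit",      // hypothetical URL
                new FormUrlEncodedContent(fields));

            Console.WriteLine("Server replied: " + (int)response.StatusCode);
        }
    }
}
```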
Comments (13)
Do not forget to look at BeautifulSoup; it comes highly recommended.
See, for example, options-for-html-scraping.
If you need to select a programming language for this task, I'd say Python. For a more direct answer to your question, see twill, a simple scripting language for Web browsing.
I use C# for scraping. See the helpful HtmlAgilityPack package.
For parsing pages, I either use XPATH or regular expressions. .NET can also easily handle cookies if you need that.
I've written a small class that wraps all the details of creating a WebRequest, sending it, waiting for a response, saving the cookies, handling network errors and retransmitting, etc. - the end result is that for most situations I can just call "GetRequest\PostRequest" and get an HtmlDocument back.
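The wrapper class itself isn't included in the answer, but a stripped-down sketch of that pattern might look like the following; the class and method names are illustrative rather than the answerer's actual code, and the error handling and retransmission logic mentioned above are omitted:

```csharp
// A stripped-down sketch of the wrapper pattern described above (illustrative
// names, no error handling or retries). One shared CookieContainer keeps the
// login cookie alive across calls; HtmlAgilityPack parses each response.
using System.IO;
using System.Net;
using System.Text;
using HtmlAgilityPack;

public class ScrapingSession
{
    private readonly CookieContainer _cookies = new CookieContainer();

    public HtmlDocument GetRequest(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = _cookies;
        return ReadDocument(request);
    }

    // Expects an already URL-encoded body, e.g. "username=me&password=secret".
    public HtmlDocument PostRequest(string url, string urlEncodedBody)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = _cookies;
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";

        byte[] body = Encoding.UTF8.GetBytes(urlEncodedBody);
        using (Stream stream = request.GetRequestStream())
        {
            stream.Write(body, 0, body.Length);
        }
        return ReadDocument(request);
    }

    private static HtmlDocument ReadDocument(HttpWebRequest request)
    {
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(reader.ReadToEnd());
            return doc;
        }
    }
}
```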
You could try using the .NET HTML Agility Pack:
http://www.codeplex.com/htmlagilitypack
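As a small taste of what using it looks like, here is a sketch that loads a page and pulls values out with XPath; the URL, XPath expressions, and field name are placeholders:

```csharp
// Small example of the Agility Pack in use; URL and XPath expressions are placeholders.
using System;
using HtmlAgilityPack;

class AgilityPackExample
{
    static void Main()
    {
        // HtmlWeb fetches the page and parses it (it tolerates messy HTML).
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://example.com/somepage");

        // Query the document with XPath; SelectNodes returns null when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
        {
            foreach (HtmlNode link in links)
            {
                Console.WriteLine(link.GetAttributeValue("href", string.Empty));
            }
        }

        // Or grab a single value, e.g. a hidden form field (the name is hypothetical).
        HtmlNode token = doc.DocumentNode.SelectSingleNode("//input[@name='token']");
        if (token != null)
        {
            Console.WriteLine(token.GetAttributeValue("value", string.Empty));
        }
    }
}
```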
C# is more than suitable for your screen-scraping needs. .NET's Regex functionality is really nice. However, for such a simple task, you'll be hard-pressed to find a language that doesn't do what you want relatively easily. Considering you're already programming in C#, I'd say stick with that.
The built-in screen-scraping functionality is also top notch.
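For a tiny illustration of the regex route, the sketch below pulls one value out of a fetched page; the markup and field name are made up for the example:

```csharp
// Tiny example of .NET's Regex on scraped HTML; the markup and field name are
// made up. Fine for pulling out one or two known values -- for anything more
// structural an HTML parser is the safer choice.
using System;
using System.Text.RegularExpressions;

class RegexScrape
{
    static void Main()
    {
        string html = "<input type=\"hidden\" name=\"token\" value=\"abc123\" />";

        // Capture the value attribute of the hidden "token" field.
        Match m = Regex.Match(html, "name=\"token\"\\s+value=\"([^\"]+)\"",
                              RegexOptions.IgnoreCase);
        if (m.Success)
        {
            Console.WriteLine(m.Groups[1].Value);   // prints: abc123
        }
    }
}
```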
We use Groovy with NekoHTML. (Also note that you can now run Groovy on Google App Engine.)
Here is some example, runnable code on the Keplar blog:
Better competitive intelligence through scraping with Groovy
IMO Perl's built-in regular expression functionality and ability to manipulate text would make it a pretty good contender for screen scraping.
Ruby is pretty great!
Try its hpricot/mechanize libraries.
Groovy is very good.
Example:
http://froth-and-java.blogspot.com/2007/06/html-screen-scraping-with-groovy.html
Groovy and HtmlUnit are also a very good match:
http://groovy.codehaus.org/Testing+Web+Applications
HtmlUnit will simulate a full browser, with JavaScript support.
PHP is a good contender thanks to its Perl-compatible regex (PCRE) support and its cURL library.
HTML Agility Pack (C#)

Mozilla Parser (Java)

- Solid XPath support
- you have to set environment variables before it will work, which is a pain
- casting between org.dom4j.Node and org.w3c.dom.Node to get different properties is a real pain
- dies on non-standard HTML (0.3 fixes this)
- best solution for XPath
- problems accessing data on Nodes in a NodeList; use a for(int i=1;i<=list_size;i++) to get around that

Beautiful Soup (Python)

- I don't have much experience, but here's what I've found

I prefer Mozilla HTML Parser.
Take a look at HP's Web Language (formerly WEBL).
http://en.wikipedia.org/wiki/Web_Language
Or stick with WebClient in C# and some string manipulations.
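A short sketch of that route, with a hypothetical URL and field names: WebClient.UploadValues posts the form fields and returns the response bytes, and the rest is plain string work.

```csharp
// Sketch of the "just WebClient and strings" route; URL and field names are made up.
using System;
using System.Collections.Specialized;
using System.Net;
using System.Text;

class WebClientPost
{
    static void Main()
    {
        using (var client = new WebClient())
        {
            var form = new NameValueCollection
            {
                { "username", "me@example.com" },   // hypothetical form fields
                { "password", "secret" }
            };

            // UploadValues does the URL-encoding and the POST for you.
            byte[] response = client.UploadValues(
                "https://thirdparty.example.com/login", form);  // hypothetical URL

            string html = Encoding.UTF8.GetString(response);

            // From here it's plain string manipulation.
            Console.WriteLine(html.Contains("Welcome") ? "Submitted OK" : "Check the response");
        }
    }
}
```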
I second the recommendation for Python (or Beautiful Soup). I'm currently in the middle of a small screen-scraping project using Python, and Python 3's automatic handling of things like cookie authentication (through CookieJar and urllib) is greatly simplifying things. Python supports all of the more advanced features you might need (like regexes), and has the benefit of letting you handle projects like this quickly (not too much overhead in dealing with low-level stuff). It's also relatively cross-platform.
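That answer describes Python's CookieJar and urllib; the analogous idea on the C# side (since that's the asker's platform) is a CookieContainer attached to an HttpClientHandler, so cookies set by the login POST are replayed automatically on later requests. A sketch with hypothetical URLs and field names:

```csharp
// C# analogue of the CookieJar idea: a CookieContainer on HttpClientHandler
// stores cookies from the login response and sends them back automatically.
// URLs and field names are hypothetical.
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

class CookieExample
{
    static async Task Main()
    {
        var handler = new HttpClientHandler { CookieContainer = new CookieContainer() };
        using (var client = new HttpClient(handler))
        {
            // Log in; any session cookie in the response lands in the container.
            await client.PostAsync("https://thirdparty.example.com/login",
                new FormUrlEncodedContent(new Dictionary<string, string>
                {
                    ["username"] = "me@example.com",
                    ["password"] = "secret"
                }));

            // Subsequent requests send the stored cookie back automatically.
            string page = await client.GetStringAsync("https://thirdparty.example.com/account");
            Console.WriteLine(page.Length + " bytes of authenticated page");
        }
    }
}
```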