What is the best language for screen scraping?

Posted 2024-07-17 10:00:57

Hi, I want to create a desktop app (probably in C#) that scrapes or manipulates a form on a third-party web page. Basically, I enter my data into a form in the desktop app; it then goes off to the third-party website and, using a script or whatever in the background, enters my data there (including my login) and clicks the submit button for me. I just want to avoid loading up the browser!

Not having done much (any!) work in this area, I was wondering whether a scripting language like Perl, Python, or Ruby would allow me to do this. Or should I simply do all the scraping in C# and .NET? Which one is best, in your opinion?

I was thinking of a script because I may need to hook applications on different platforms into the same script (e.g. Symbian mobile, where I wouldn't be able to develop it in C# as I would the desktop version).

It's not a web app; otherwise I might as well use the original site. I realise it all sounds pointless, but automating this specific form would be a real time saver for me.

Comments (13)

幸福%小乖 2024-07-24 10:00:57

Don't forget to look at BeautifulSoup; it comes highly recommended.

See, for example, options-for-html-scraping.
If you need to pick a programming language for this task, I'd say Python.

For a more direct solution to your question, see twill, a simple scripting language for web browsing.
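To give a feel for the parsing step, here is a minimal sketch that pulls the input fields out of a form. It deliberately uses only the standard library's `html.parser` as a dependency-free stand-in; BeautifulSoup would do the same job with a friendlier API. The markup, the field names, and the `FormFieldParser` and `extract_form_fields` names are all invented for illustration.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Collect the name/value pairs of <input> elements in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:  # unnamed inputs (e.g. a bare submit button) are skipped
            self.fields[name] = attributes.get("value") or ""


# Hypothetical login page markup, invented for this example.
SAMPLE_HTML = """
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="submit" value="Log in">
</form>
"""


def extract_form_fields(html):
    """Return a dict of input names to their default values."""
    parser = FormFieldParser()
    parser.feed(html)
    return parser.fields
```

Knowing the field names (including hidden ones such as a token) is usually the first step before you can fill in and submit the form from your own program.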

情痴 2024-07-24 10:00:57

I use C# for scraping. See the helpful HtmlAgilityPack package.
For parsing pages, I use either XPath or regular expressions. .NET can also easily handle cookies if you need that.

I've written a small class that wraps all the details of creating a WebRequest, sending it, waiting for a response, saving the cookies, handling network errors and retransmitting, etc. The end result is that in most situations I can just call "GetRequest" or "PostRequest" and get an HtmlDocument back.
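The wrapper idea carries over to other languages. Below is a rough Python sketch of such a class, not the C# class described above: the `Fetcher` name, its parameters, and the retry policy are assumptions made for illustration, and it returns the page text rather than a parsed document.

```python
import time
import urllib.error
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


class Fetcher:
    """Cookie-aware HTTP helper with simple retries (illustrative only)."""

    def __init__(self, retries=3, delay=1.0, opener=None):
        self.retries = max(1, retries)
        self.delay = delay
        self.cookies = CookieJar()
        # The default opener stores cookies from responses and replays them.
        # A custom opener can be injected, e.g. for testing.
        self.opener = opener or urllib.request.build_opener(
            urllib.request.HTTPCookieProcessor(self.cookies)
        )

    def _send(self, request):
        last_error = None
        for attempt in range(self.retries):
            try:
                with self.opener.open(request, timeout=10) as response:
                    return response.read().decode("utf-8", errors="replace")
            except urllib.error.HTTPError:
                raise  # the server answered; a blind retry will not help
            except urllib.error.URLError as error:
                # Network-level failure: wait, then retransmit.
                last_error = error
                if attempt < self.retries - 1:
                    time.sleep(self.delay)
        raise last_error

    def get_request(self, url):
        return self._send(urllib.request.Request(url))

    def post_request(self, url, fields):
        # Supplying a body makes urllib send a POST instead of a GET.
        data = urllib.parse.urlencode(fields).encode("ascii")
        return self._send(urllib.request.Request(url, data=data))
```

With something like this in place, the calling code shrinks to `fetcher.get_request(url)` or `fetcher.post_request(url, {...})`, and the returned text can be handed to whichever parser you prefer.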

月亮坠入山谷 2024-07-24 10:00:57

You could try using the .NET HTML Agility Pack:

http://www.codeplex.com/htmlagilitypack

"This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams)."

も让我眼熟你 2024-07-24 10:00:57

C# is more than suitable for your screen scraping needs. .NET's Regex functionality is really nice. However, with such a simple task, you'll be hard pressed to find a language that doesn't do what you want relatively easily. Considering you're already programming in C#, I'd say stick with that.

The built-in screen scraping functionality is also top notch.

长梦不多时 2024-07-24 10:00:57

We use Groovy with NekoHTML. (Also note that you can now run Groovy on Google App Engine.)

Here is some runnable example code on the Keplar blog:

Better competitive intelligence through scraping with Groovy

小傻瓜 2024-07-24 10:00:57

IMO, Perl's built-in regular expression functionality and ability to manipulate text make it a pretty good contender for screen scraping.
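For readers who have not seen regex-based scraping, this is roughly what it looks like. The sketch is in Python rather than Perl, and the markup, the pattern, and the `extract_prices` name are invented for illustration. It only works while the page layout stays exactly as the pattern expects, which is the usual caveat with this approach.

```python
import re

# Hypothetical markup, invented for this example.
SAMPLE_TABLE = """
<table>
  <tr><td class="item">Widget</td><td class="price">$4.50</td></tr>
  <tr><td class="item">Gadget</td><td class="price">$12.00</td></tr>
</table>
"""

# Named groups keep the extraction readable; \s* tolerates whitespace
# between the two cells.
ROW_PATTERN = re.compile(
    r'<td class="item">(?P<item>[^<]+)</td>\s*'
    r'<td class="price">\$(?P<price>[\d.]+)</td>'
)


def extract_prices(html):
    """Map each item name to its price as a float."""
    return {
        match.group("item"): float(match.group("price"))
        for match in ROW_PATTERN.finditer(html)
    }
```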

夜访吸血鬼 2024-07-24 10:00:57

Ruby is pretty great!
Try its hpricot/mechanize.

晨光如昨 2024-07-24 10:00:57

Groovy is very good.

Example:
http://froth-and-java.blogspot.com/2007/06/html-screen-scraping-with-groovy.html

Groovy and HtmlUnit are also a very good match:
http://groovy.codehaus.org/Testing+Web+Applications
HtmlUnit will simulate a full browser with JavaScript support.

薄荷港 2024-07-24 10:00:57

PHP is a good contender due to its good Perl-compatible regex support and cURL library.

贪恋 2024-07-24 10:00:57

HTML Agility Pack (C#)

  1. XPath is borked: the way the HTML is cleaned to make it XML-compliant drops tags, so you have to adjust the expression to get it to work.
  2. Simple to use

Mozilla Parser (Java)

  1. Solid XPath support
  2. You have to set environment variables before it will work, which is a pain
  3. Casting between org.dom4j.Node and org.w3c.dom.Node to get different properties is a real pain
  4. Dies on non-standard HTML (0.3 fixes this)
  5. Best solution for XPath
  6. Problems accessing data on Nodes in a NodeList; use a for(int i=1;i<=list_size;i++) loop to get around that

Beautiful Soup (Python)

I don't have much experience with it, but here's what I've found:

  1. No XPath support
  2. Nice interface for pathing through HTML

I prefer the Mozilla HTML Parser.

紫﹏色ふ单纯 2024-07-24 10:00:57

Take a look at HP's Web Language (formerly WebL).

http://en.wikipedia.org/wiki/Web_Language

神仙妹妹 2024-07-24 10:00:57

Or stick with WebClient in C# and some string manipulation.

旧时浪漫 2024-07-24 10:00:57

I second the recommendation for Python (and Beautiful Soup). I'm currently in the middle of a small screen-scraping project using Python, and Python 3's automatic handling of things like cookie authentication (through CookieJar and urllib) is greatly simplifying things. Python supports all of the more advanced features you might need (like regexes), and it lets you get projects like this done quickly, without too much overhead in dealing with low-level stuff. It's also relatively cross-platform.
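As a rough sketch of the log-in-then-submit flow with CookieJar and urllib: the URLs, field names, and function names used here are hypothetical, and a real site may also require hidden form fields or extra headers.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def make_session():
    """Return (opener, jar); the opener remembers cookies across requests."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_form_post(url, fields):
    """Build a POST request carrying URL-encoded form fields."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def log_in_and_submit(opener, login_url, credentials, form_url, form_fields):
    """Log in, then submit the target form using the same cookie jar."""
    # Logging in first lets the jar capture the session cookie.
    with opener.open(build_form_post(login_url, credentials), timeout=10):
        pass
    # The opener sends that cookie back automatically on this request.
    with opener.open(build_form_post(form_url, form_fields), timeout=10) as response:
        return response.status
```

A call might look like `log_in_and_submit(opener, "https://example.com/login", {"user": "...", "password": "..."}, "https://example.com/form", {...})`, with every one of those values replaced by whatever the real site expects.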
