I have a complex screen-scraping script that I've put together that uses Selenium2, the Selenium web driver and PHP binding script, so at the end of it all, I have a PHP script that drives Selenium, which in turn fetches a URL, parses some Javascript, fills out a form, blah blah blah, and then returns the HTML that is ultimately what I'm after. It all works great on my local computer (as a development and proof-of-concept environment).
So.
For production, I need this script to run automatically three times every day. I am trying to figure out if it would be better for me to set up everything on my server (meaning: figure out how to get Firefox for Linux going, then Java, then Selenium2, etc, etc... not trivial for me; Damn it Jim, I'm a coder, not a sysadmin!), or if I can use a 3rd-party Selenium testing service like Sauce Labs' OnDemand, or any of these other cloud-based Selenium services.
Those 3rd party solutions seem like they're all set up for "unit testing," which is totally not what I'm doing. I don't know about that stuff, or using PHPUnit, or doing tests with builds, or whatever. I just want to run my straightforward PHP script 3x/day and have it talk to Selenium to drive a browser and do my screen scraping.
Is one of those 3rd-party solutions a good idea for what I'm trying to accomplish, or are they overkill/too far away from my (relatively simple) goal?
First, I want to let you know that I use Selenium with Ruby, so I am assuming that running your PHP script will start up the Selenium WebDriver and run your tests... I will just explain how to easily run your script 3 times a day without needing to be a sysadmin master.
Linux has an extremely stable and robust command called cron which is what you will need to use. It allows you to schedule actions to happen daily/hourly/whatever.
The first thing you want to do is to go to the directory with your script. I will refer to your script as script.php.
Next, make sure that the top line of your script is:
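The line the answer refers to appears to have been lost; for a PHP script executed directly by cron, it would be a shebang pointing at the PHP CLI binary (the exact path is an assumption; `which php` will confirm it on your system):

```
#!/usr/bin/php
```

If PHP lives elsewhere on your distribution, `#!/usr/bin/env php` is a more portable alternative.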
In the directory you will execute the following command to make your file accessible by the system:
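The command itself was omitted above; assuming the script is named script.php as stated, it would be:

```shell
# Mark the script executable so cron (and the shell) can run it directly
chmod +x script.php
```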
Now set up your cron job with the following command:
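The missing command is presumably crontab's edit mode, which opens your user's cron table in an editor:

```
crontab -e
```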
Then put in your job:
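Given the minute/hour breakdown described just below, the job line would look like this (the path is a placeholder for wherever script.php lives on your server):

```
00 4,12,20 * * * /path/to/script.php
```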
00 - Means at 00 minutes.
4,12,20 - Are the hours (it is a 24 hour clock.)
The first: * - Every day of the month
The second: * - Every month
The third: * - Every day of the week
So this entry would run at 4am, noon, and 8pm, every day of every week and month.
Obviously change the directory to the script on your system and set the times to whenever you want the scraping to occur.
I hope this helps!
-Appended stuff for the java/firefox-
First off, take this all with a grain of salt since I am using Ruby :)
Okay, to get java/firefox running you will probably want to grab the Selenium standalone server. You can get it from the Selenium project's downloads page.
Then to run the selenium server you just:
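The command was dropped here; with the standalone JAR in your current directory, it is just the following (the version number in the filename will vary, so substitute whatever you downloaded):

```
java -jar selenium-server-standalone-<version>.jar
```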
You can put the standalone server startup in the cron job and then shut it down from your script file.
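As a sketch of that arrangement (the paths, version placeholder, and five-minute head start are all assumptions), the crontab could start the server shortly before each run:

```
# Start the Selenium server 5 minutes before each scrape
55 3,11,19 * * * java -jar /path/to/selenium-server-standalone-<version>.jar >/dev/null 2>&1
# Run the scraping script once the server is up
00 4,12,20 * * * /path/to/script.php
```

Your script would then shut the server down (for example by killing the Java process) once the scrape completes.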