Best way to get website data (content)?
I need to grab data (content) from some websites. Those websites provide listings; I need to grab those listings and filter them according to their content.
Is there any software that can do that? A PHP script?
If not, where can I start to program this functionality?

Comments (3)
Use file_get_contents(), which returns the whole file as a string, then parse the string to extract the content.
Other options are cURL or wget, which fetch the whole file; you can then process it with tools such as AWK, sed, or Perl.
Which one to choose depends on how often you need to scrape the target page. If only occasionally, then PHP will do, but you will need to trigger it from a browser, and remember that regular expressions in PHP can be time-consuming.
If you want to scrape the file on a regular basis, then a Bash script with cURL/wget plus sed and awk can be run from cron, in the background and without intervention.
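
To make the "fetch the whole page as a string, then parse it" idea concrete, here is a minimal sketch. It assumes the listings are plain `<li>` elements and swaps in PHP's DOM extension for the parsing step instead of hand-written string handling. The helper names `extractListItems` and `filterByKeyword` and the URL in the usage comment are made up for illustration, so adjust them to the page you are actually scraping.

```php
<?php
// Hypothetical helper: pull the text of every <li> element out of an HTML string.
function extractListItems(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect libxml warnings instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $items = [];
    foreach ($doc->getElementsByTagName('li') as $node) {
        $items[] = trim($node->textContent);
    }
    return $items;
}

// Hypothetical helper: keep only the items that contain the keyword (case-insensitive).
function filterByKeyword(array $items, string $keyword): array
{
    return array_values(array_filter(
        $items,
        fn(string $item): bool => stripos($item, $keyword) !== false
    ));
}

// Usage (the URL is a placeholder, not a real listing page):
// $html = file_get_contents('https://example.com/listings');
// if ($html !== false) {
//     print_r(filterByKeyword(extractListItems($html), 'php'));
// }
```

The parsing and filtering are kept separate from the fetch, so you can try them on a saved copy of the page before pointing the script at the live site.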

If it's PHP, maybe this helps: http://www.thefutureoftheweb.com/blog/web-scrape-with-php-tutorial
Of course, you'll need to customise the regular expression depending on your requirements.
You can also find loads of other examples: http://www.google.com/search?source=ig&hl=en&rlz=&=&q=php+web+scraper&aq=f&oq=&aqi=
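
As an illustration of what "customising the regular expression" can look like, here is a rough sketch (not taken from the linked tutorial) that pulls link URLs and link texts out of a page with preg_match_all(). The pattern and the `extractLinks` helper are assumptions of mine: the pattern only handles simple `<a>` tags with a double-quoted href and no nested markup, so you would almost certainly need to adapt it to the markup of your target site.

```php
<?php
// Hypothetical pattern: captures the href and the link text of simple <a> tags.
// Assumes double-quoted href attributes and no nested tags inside the link.
const LINK_PATTERN = '~<a\s[^>]*href="([^"]*)"[^>]*>([^<]*)</a>~i';

// Hypothetical helper: return every matched link as ['url' => ..., 'text' => ...].
function extractLinks(string $html): array
{
    $links = [];
    // PREG_SET_ORDER groups the captures per match instead of per capture group.
    if (preg_match_all(LINK_PATTERN, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $links[] = ['url' => $m[1], 'text' => trim($m[2])];
        }
    }
    return $links;
}
```

Changing what gets scraped then mostly means editing `LINK_PATTERN`; the surrounding loop stays the same.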

There's no magic solution, because every page's content is different.
Since you mention PHP, I'll give you some pointers for that language.
You can fetch a web page using curl.
After getting the content, you can parse it using regular expressions.
Depending on what you want to do, you'll have to develop the application yourself.
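
The two steps above (fetch with curl, then parse with a regular expression) could be sketched roughly like this. It assumes the PHP curl extension is installed; `fetchPage` and `extractTitle` are illustrative names of my own, the 10-second timeout is an arbitrary choice, and extracting the `<title>` is only a stand-in for whatever part of the page you actually need.

```php
<?php
// Hypothetical helper: fetch a page with cURL.
// Returns the body, or null on a transport error or a non-2xx HTTP status.
function fetchPage(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_TIMEOUT        => $timeoutSeconds,
    ]);
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($body === false || $status < 200 || $status >= 300) {
        return null;
    }
    return $body;
}

// Hypothetical helper: extract the page title with a regular expression.
// Returns null when the page has no <title> element.
function extractTitle(string $html): ?string
{
    if (preg_match('~<title[^>]*>(.*?)</title>~is', $html, $m)) {
        return trim(html_entity_decode($m[1]));
    }
    return null;
}

// Usage (the URL is a placeholder):
// $html = fetchPage('https://example.com/listings');
// if ($html !== null) {
//     echo extractTitle($html), PHP_EOL;
// }
```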