I've recently discovered RapidMiner, and I'm very excited about its capabilities. However, I'm still unsure whether the program can help me with my specific needs. I want the program to scrape XPath matches from a URL list I've generated with another program. (It has more options than the 'Crawl Web' operator in RapidMiner.)
I've seen the following tutorial from Neil McGuigan: http://vancouverdata.blogspot.com/2011/04/web-scraping-rapidminer-xpath-web.html. But the websites I try to scrape have thousands of pages, and I don't want to store them all on my PC. The web crawler also lacks critical features, so I'm unable to use it for my purposes. Is there a way I can just make it read the URLs and scrape the XPath matches from each of those URLs?
I've also looked at other tools for extracting HTML from pages, but since I'm not a programmer, I've been unable to figure out how they work (or even how to install them). RapidMiner, on the other hand, is easy to install, and the operator descriptions make sense, but I've been unable to connect them in the right order.
I need some input to keep my motivation going. I would like to know what operator I could use instead of 'Process Documents from Files'. I've looked at 'Process Documents from Web', but it doesn't have an input, and it still needs to crawl. Any help is much appreciated.
Looking forward to your replies.
Web scraping with RapidMiner, without saving the HTML pages internally, is a two-step process:

Step 1: Follow the video at http://vancouverdata.blogspot.com/2011/04/rapidminer-web-crawling-rapid-miner-web.html by Neil McGuigan, with the following difference: there will be no option to specify an output directory, because the results will be loaded into an ExampleSet. The ExampleSet will contain the links matching the crawling rules.

Step 2: Follow the video at http://vancouverdata.blogspot.com/2011/04/web-scraping-rapidminer-xpath-web.html, but only from 7:40 onward, with the following difference: the ExampleSet will contain the links and the attributes matching the XPath queries.
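If you ever do want to try this outside RapidMiner, the same two-step idea (read a URL list, fetch each page, keep only the XPath matches, never write the pages to disk) can be sketched in a few lines of Python. This is only a minimal illustration, not RapidMiner's internals: the function names and the sample XPath are my own, and the stdlib parser used here only handles well-formed (XHTML-like) markup; messy real-world HTML would need a forgiving parser such as lxml.html.

```python
# Sketch: scrape XPath matches from a list of URLs without saving pages.
# Assumptions: pages are well-formed XML/XHTML (stdlib ElementTree only
# supports a limited XPath subset and strict markup).
from urllib.request import urlopen
from xml.etree import ElementTree

def extract_matches(markup: str, xpath: str) -> list[str]:
    """Return the text of every node in `markup` matching `xpath`."""
    root = ElementTree.fromstring(markup)
    return [node.text for node in root.findall(xpath)]

def scrape_urls(urls: list[str], xpath: str) -> dict[str, list[str]]:
    """Fetch each URL and keep only its XPath matches, never the page."""
    results = {}
    for url in urls:
        with urlopen(url) as response:  # page stays in memory only
            markup = response.read().decode("utf-8", errors="replace")
        results[url] = extract_matches(markup, xpath)
    return results

# Offline example with the kind of fragment a fetched page might contain:
sample = "<html><body><h1>Title</h1><p>First</p><p>Second</p></body></html>"
print(extract_matches(sample, ".//p"))  # -> ['First', 'Second']
```

The point of `scrape_urls` is exactly what the question asks for: the HTML exists only briefly in memory, and only the matched values are kept.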
I have much the same problem as you, and maybe these posts from the RapidMiner forum will help you a little:
http://rapid-i.com/rapidforum/index.php/topic,2753.0.html
and
http://rapid-i.com/rapidforum/index.php?topic=3851.0.html
See ya ;)