Data extraction?
I am looking for ways to extract various data from various websites. I know there are programs you can buy, but since I am trying to learn, I want to do it myself. Does anyone have suggestions on a general structure, and if so, what language would you write it in? My first thought was Java, but I am more than willing, and grateful, to hear anyone else's opinion.
Comments (2)
What kind of data are you trying to extract from websites? What websites? A little more detail on your idea/project would be helpful.
I recently needed to look into and try a few HTML parsers to get some data I needed in a more consolidated format.
I tried JTidy (http://jtidy.sourceforge.net/) and looked into Web-Harvest (http://web-harvest.sourceforge.net/). JTidy wouldn't quite do what I wanted and Web-Harvest was overkill.
I ultimately settled on using Java + htmlparser (http://htmlparser.sourceforge.net/).
It took very little development time to get what I needed, and htmlparser allows you to form 'filters' that search for specific things in the DOM.
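To give a rough idea of the parse-then-filter structure, here is a minimal sketch. It does not use htmlparser's own API. As a stand-in it uses the HTML parser that ships with the JDK (`javax.swing.text.html`), so it runs without extra libraries. The class name `LinkExtractor`, the method `extractLinks`, and the use of a `Predicate` as the "filter" are all made up for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the href of every <a> tag that passes a filter.
public class LinkExtractor {

    public static List<String> extractLinks(String html, Predicate<String> filter)
            throws IOException {
        List<String> links = new ArrayList<>();

        // The callback is invoked by the parser for each start tag it encounters.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    // Apply the caller-supplied "filter" to decide what to keep.
                    if (href != null && filter.test(href.toString())) {
                        links.add(href.toString());
                    }
                }
            }
        };

        try (Reader reader = new StringReader(html)) {
            // 'true' tells the parser to ignore any charset declared in the document.
            new ParserDelegator().parse(reader, callback, true);
        }
        return links;
    }
}
```

For example, `LinkExtractor.extractLinks(html, href -> href.endsWith(".pdf"))` would keep only links to PDF files. With a dedicated library such as htmlparser, the same idea would be expressed through that library's filter classes instead of a plain `Predicate`.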
Look at Hadoop (grids) and Solr (crawlers and indexers). They support heavy processing and efficient indexing (for efficient searching), respectively.