How to collect data from a website

Posted 2024-12-21 15:56:58

Preface: I have broad, college-level knowledge of a handful of languages (C++, VB, C#, Java, and many web languages), so use whichever you like.

I want to make an Android app that compares numbers, but to do that I need a database. I'm a one-man team, and the numbers get updated every two weeks, so I want to grab them from a wiki that is kept up to date as well.

So my question is: how can I access information from a website using one of the languages above?


Comments (5)

猥︴琐丶欲为 2024-12-28 15:56:58

What I understand the problem to be: some entity generates a data set (i.e., numbers) every other week, and you need to download that data set for processing (e.g., sorting).

Ideally, the website maintaining the wiki would provide a service, such as a RESTful interface, for gathering the data easily. If that were the case, I'd go with any language that makes it easy to work with HTTP requests and responses and to manipulate the data. As a previous poster said, Java would work well.
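If the wiki does expose such a service, the download step can be quite small. Below is a minimal sketch in Python (standard library only) rather than Java. The endpoint URL and the JSON shape, a list of objects with `name` and `value` keys, are assumptions made for illustration; neither comes from the question.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical endpoint; replace it with whatever the wiki actually exposes.
API_URL = "https://example.org/api/numbers.json"


def fetch_text(url, timeout=10.0):
    """Download a URL and return the response body decoded as UTF-8."""
    request = Request(url, headers={"User-Agent": "numbers-fetcher/0.1"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_numbers(payload):
    """Turn the assumed JSON list of {"name", "value"} objects into a dict."""
    return {item["name"]: float(item["value"]) for item in json.loads(payload)}
```

Calling `parse_numbers(fetch_text(API_URL))` would then give a plain dictionary to sort or store. Keeping the download and the parsing in separate functions means the parsing can be checked against a saved response without touching the network.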

If you are stuck with the wiki page, you have a couple of options. You can parse the HTML your browser receives (Perl comes to mind as a decent language for that). Or you can use tools built for that purpose such as the aforementioned Jsoup.
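To give an idea of what parsing the HTML yourself involves, here is a rough sketch using Python's built-in `html.parser` module instead of Perl or Jsoup. It assumes the numbers sit in an ordinary HTML table, which is a guess about the wiki page; nested tables are not handled.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left unclosed in this row


def extract_rows(html):
    """Return the table rows found in an HTML string as lists of strings."""
    parser = TableCellParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

The cells come back as strings, so converting them with `float()` is left to the caller. A dedicated library is usually more forgiving of messy markup than a hand-written parser like this one.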

Your question also mentions some implementation details such as needing a database. Evidently, there isn't enough contextual information for me to know whether that's optimal, so I won't address this aspect of the problem.

弥枳 2024-12-28 15:56:58

http://jsoup.org/ is a great Java tool for accessing content on HTML pages.

清眉祭 2024-12-28 15:56:58

Consider https://scraperwiki.com/. It's a site where users can contribute scrapers, and it's free as long as you let your scraper be public. The results of your scraper are exposed as CSV and JSON.

If you don't know what a "scraper" is, search for "screen scraping". It's a long and frustrating tradition among coders, who have dealt with the same problem you have since the beginning of networked computing.
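Once a scraper's results are available as CSV, reading them and spotting what changed since the last update takes only a few lines. This is a sketch in Python using the standard `csv` module; the `name` and `value` column headers are hypothetical and would need to match whatever the scraper actually outputs.

```python
import csv
import io


def load_numbers(csv_text):
    """Read CSV text with assumed 'name' and 'value' columns into a dict."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["name"]: float(row["value"]) for row in reader}


def changed_entries(old, new):
    """Return {name: (old_value, new_value)} for entries that differ.

    A name missing from one side is reported with None on that side.
    """
    changes = {}
    for name in sorted(old.keys() | new.keys()):
        before, after = old.get(name), new.get(name)
        if before != after:
            changes[name] = (before, after)
    return changes
```

Comparing the previous download with the current one this way means the app's database only needs to be touched for the entries that actually moved.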

人疚 2024-12-28 15:56:58

You could check out http://web-harvest.sourceforge.net/

无可置疑 2024-12-28 15:56:58

For Python, BeautifulSoup is one of the most tolerant HTML parsers out there. The documentation also lists similar libraries in Ruby and Java, so you'll probably find something relevant there.
