Getting data from a chart displayed on a website

Posted 2024-11-05 18:47:51

I was asked to draw a graph like this one

[image: the graph to reproduce]

using LaTeX (more precisely, TikZ and/or PGF). This would not be a problem if I had the data, but I don't. All I have is the website where the graphs can be displayed, and I don't know how to get the data from there.

I spent the day today trying to get this data, including writing to Google and using software that traces the line and infers the points of a graph, such as Datathief and DigitizeIt, but I was unsuccessful. I think the latter did not work because the lines in the graph are too thin and come in more than one shade of blue. Of course, I tried to improve the picture quality using Paint and Gimp, but I still couldn't make it work.

I also tried eps2pgf, a Java program that converts EPS figures into PGF code, but even that did not work for the graphs I saved using Image Capture (Mac) and Print Screen (Windows). To be honest, this would have been my last option anyway, since it is a "brute force" approach that spits out ugly code you can't really improve on.

After all that I decided to start learning Python, because my supervisor, the person who asked me to draw this picture with TikZ, said that there is Python code that can get data from websites like this. Now I am not even sure Python will do the job (though I am happy for the excuse to learn it), and of course it takes time to learn a new language and do something like that, so I want to know whether there really is a way to get the data from that website, preferably using Python, but if not, by any other method.

Comments (1)

绿阴红影里的.如风往事 2024-11-12 18:47:51

Well, it'd be great if Google provided an API for this data! That said, you can still scrape some data out of the site. Here's how to go about it...

Install Firebug

I prefer Firebug for Firefox, but Chrome's developer tools should also work.

Investigate
First things first, let's visit the url in question and use Firebug to try to see what's going on. Activate Firebug with F12 or go to Tools->Firebug->Open Firebug. Click on the Net tab first and reload the page. This shows all the requests made, and will give you some insight into how the site works. Usually Flash plugins load data externally, as opposed to having it embedded in the actual plugin, and if you look at the requests you'll see a request labeled POST service. If you hover over it, Firebug shows the full URL and you'll see the page made a request to http://www.google.com/transparencyreport/traffic/service. You can click on the request and look at the headers sent, the post data, the response, and the cookies used to perform the request.

[screenshot: request detail in Firebug]

If you look at the response, you'll see what appears to be malformed JSON. From what I can tell, it contains the list of normalized traffic data points. You could actually cut and paste the response out of Firebug, but since this IS a Python question, let's work a bit harder.

Getting the data into Python

To make the POST request successfully, we'll need to do (nearly) everything the browser does. We can cheat a bit and just copy the request headers and post data out of Firebug to spoof a real request.

Headers & post data

Use triple quotes to paste multi-line strings into the shell. Copy the request headers and paste them in.
[screenshot: request headers in Firebug]

>>> headers = """ <paste headers> """

Next, convert it to a dict for httplib2. I'm going to use a list comprehension (it splits the string on newlines, splits each line on the first : and strips surrounding whitespace, which gives a list of two-element lists that dict can convert into a dictionary), but you could do this however you want. You could also create the dict manually; I just find this faster.

>>> headers = dict([[s.strip() for s in line.split(':', 1)]
                               for line in headers.strip().split('\n')])
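
To see what that comprehension produces, here is a small illustrative run on a made-up header snippet; the header names and values below are just placeholders, not what the site actually sends.

>>> sample = "Host: www.google.com\nUser-Agent: Mozilla/5.0\nAccept: */*"
>>> dict([[s.strip() for s in line.split(':', 1)]
          for line in sample.strip().split('\n')])
{'Host': 'www.google.com', 'User-Agent': 'Mozilla/5.0', 'Accept': '*/*'}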

And copy in the post data.
[screenshot: post data used for the chart we are interested in]

>>> body = """ <paste post data> """

Make the request
I'm going to use httplib2, but there are a few other HTTP clients and some nice tools for scraping the web, like mechanize and scrapy. We'll make the POST request using the URL of the API, the headers we copied, and the post data we copied from Firebug. The request returns a tuple of the response headers and the content.

>>> import httplib2 
>>> h = httplib2.Http()
>>> url = 'http://www.google.com/transparencyreport/traffic/service'
>>> resp, content = h.request(url, 'POST', body=body, headers=headers)
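
If you prefer not to use httplib2, a roughly equivalent sketch with the requests library (assuming it is installed, and reusing the url, headers and body from above) would be:

>>> import requests
>>> r = requests.post(url, data=body, headers=headers)
>>> resp, content = r.headers, r.text  # response headers and body, analogous to the httplib2 tuple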

Massage Data

The original format is really weird and only the top bit seems to contain the data points, so I'll ditch the rest.

>>> cleaned = content.split("'")[0][4:-1] + ']' 
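
(That is: keep everything before the first single quote, drop the first four characters and the trailing one, then append ']' so the list is closed and parses as a JSON array.)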

Now it's valid JSON, so we can deserialize it into native Python data types.

>>> import json
>>> data = json.loads(cleaned)

All of the points I'm interested in are floats, so I'll filter based on that.

>>> data = [x for x in data if type(x) == float]
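
A slightly more idiomatic way to write the same filter, if you prefer, is isinstance:

>>> data = [x for x in data if isinstance(x, float)]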

Process/Save Data

Now that we have our data, inspect it, do additional processing, etc...

>>> data[:5] 
<<< 
[44.73874282836914,
 45.4061279296875,
 47.5350456237793,
 44.56114196777344,
 46.08817672729492]

...or just save it.

>>> with open('data.json', 'w') as f:
...:     f.write(json.dumps(data))
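
Since the end goal is a TikZ/PGF picture, it may also be handy to dump the points as a plain two-column text file that pgfplots' \addplot table can read; the file name and the use of the list index as the x coordinate are just choices made for this sketch.

>>> with open('data.dat', 'w') as f:
...:     for i, y in enumerate(data):
...:         f.write('%d %f\n' % (i, y))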

We could also plot it out using pyplot from matplotlib (or some other graphing/plotting library).

>>> import matplotlib.pyplot as plt
>>> plt.plot(data)

[plot: pyplot output]
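
If you want to keep the figure around rather than just display it, the usual matplotlib calls apply (the axis label and file name here are just examples):

>>> plt.ylabel('normalized traffic')
>>> plt.savefig('traffic.png')  # or plt.show() to display it interactively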

Conclusion

If you are just interested in a few things, you can adjust the chart to display what you want and then use the request headers/post data from the corresponding request to http://www.google.com/transparencyreport/traffic/service. You might want to inspect the actual response more closely than I did; I just discarded the parts that didn't make sense to me. Hopefully they'll expose a public API for this data.
