I was asked to draw a graph like this one using LaTeX (more precisely, TikZ and/or PGF). This would not be a problem if I had the data, but I don't. All I have is the website where the graphs are displayed, and I don't know how to get the data out of it.
I spent today trying to get this data, including writing to Google and using software that traces a line and infers the points of a graph, such as DataThief and DigitizeIt, but I was unsuccessful. I think the tracing tools did not work because the lines in the graph are too thin and come in more than one shade of blue. I tried to improve the picture quality using Paint and GIMP, but I still couldn't make it work.
I also tried eps2pgf, a Java program that converts EPS figures into PGF code, but even that did not work for the graphs I saved using Image Capture (Mac) and Print Screen (Windows). To be honest, this would be my last resort anyway, since it is a "brute force" approach that spits out ugly code you can't really improve on.
After all that I decided to start learning Python, because my supervisor (the person who asked me to draw this picture using TikZ) said that there is Python code for getting data from websites like this. Now I am not even sure Python will do the job, though I am happy for the excuse to learn it, and of course learning a new language well enough to do something like this takes time. So I want to know whether there is really a way to get the data from that website, preferably using Python, but if not, by any other method.
Well, it'd be great if Google provided an API for this data! That said, you can still scrape some data out of the site. Here's how to go about it...
Install Firebug
I prefer Firebug for Firefox, but Chrome's developer tools should also work.
Investigate
First things first, let's visit the URL in question and use Firebug to try to see what's going on. Activate Firebug with F12 or go to Tools->Firebug->Open Firebug. Click on the Net tab first and reload the page. This shows all the requests made and gives you some insight into how the site works. Flash plugins usually load their data externally rather than embedding it in the plugin itself, and if you look at the requests you'll see one labeled POST service. If you hover over it, Firebug shows the full URL, and you'll see the page made a request to http://www.google.com/transparencyreport/traffic/service. You can click on the request and look at the headers sent, the post data, the response, and the cookies used to perform the request.

If you look at the response, you'll see what appears to be malformed JSON. From what I can tell, it contains the list of normalized traffic data points. You could actually cut and paste the response out of Firebug, but since this IS a Python question, let's work a bit harder.
Getting the data into Python
To make the POST request successfully, we'll need to do (nearly) everything the browser does. We can cheat a bit and just copy the request headers and post data out of Firebug to spoof a real request.
Headers & post data
Use triple quotes to paste multi-line strings into the shell. Copy the request headers and paste them in.
Next, convert it to a dict for httplib2. I'm going to use a list comprehension: it splits the string on newlines, then splits each line on the first : and strips the whitespace, which gives a list of two-element lists that dict can convert into a dictionary. You could do this however you want, or create the dict manually; I just find this faster. Then copy in the post data.
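The code from this step did not survive on this page, so here is a minimal sketch of the conversion. The header values below are placeholders standing in for whatever Firebug shows, and the names `raw_headers`, `headers` and `post_data` are illustrative assumptions, not the real request.

```python
# Placeholder headers: paste the real ones from Firebug's Net tab instead.
raw_headers = """Host: www.google.com
User-Agent: Mozilla/5.0 (placeholder)
Content-Type: text/plain; charset=utf-8
Referer: http://www.google.com/transparencyreport/traffic/"""

# Split on newlines, then on the first ":" only, so values that contain
# colons (such as URLs) stay intact; strip whitespace around both parts.
headers = dict(
    [part.strip() for part in line.split(":", 1)]
    for line in raw_headers.splitlines()
    if ":" in line
)

# Placeholder: paste the POST body from Firebug's Post tab here.
post_data = """<post data copied from Firebug>"""
```

Splitting with a limit of 1 matters here: a plain `split(":")` would cut the Referer URL into several pieces.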
Make the request
I'm going to use httplib2, but there are other HTTP clients and some nice tools for scraping the web, like mechanize and Scrapy. We'll make the POST request using the URL of the API, the headers we copied, and the post data we copied from Firebug. The request returns a tuple of response headers and content.
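A sketch of what that request could look like with httplib2. The `fetch` helper and its optional `http` parameter are assumptions added for illustration (the parameter lets you pass in a stand-in client when trying the code without network access). Whether the endpoint still answers is not something this sketch checks.

```python
URL = "http://www.google.com/transparencyreport/traffic/service"

def fetch(headers, post_data, http=None):
    """POST the copied headers and body; return (response, content)."""
    if http is None:
        import httplib2  # third-party package, installed separately
        http = httplib2.Http()
    # httplib2's Http.request returns a (response headers, content) tuple.
    return http.request(URL, "POST", headers=headers, body=post_data)
```

Called as `response, content = fetch(headers, post_data)`, this returns the response headers and the raw body that the next step cleans up.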
Massage Data
The original format is really weird and only the top bit seems to contain the data points, so I'll ditch the rest.
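One way to keep only the leading part, sketched under a loud assumption: that the data points sit in the first bracketed array of the response. The `sample` string is invented for illustration and is not the real response format, and `extract_first_array` is a hypothetical helper.

```python
def extract_first_array(text):
    """Return the first balanced [...] span in text.

    Naive on purpose: it does not account for brackets inside quoted strings.
    """
    start = text.index("[")
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "[":
            depth += 1
        elif text[i] == "]":
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    raise ValueError("no balanced array found")

# Invented stand-in for the response body, NOT the actual format.
sample = 'prefix[0.5, 0.75, "label", 1.0]trailing, not, json'
trimmed = extract_first_array(sample)
```

Inspect the real response before relying on this. The part worth keeping may be delimited differently.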
Now it's valid JSON, so we can deserialize it into native Python data types.
All of the points I'm interested in are floats, so I'll filter based on that.
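A small sketch of the deserialize-and-filter step, using the standard `json` module. The `trimmed` string is a made-up example of what the cleaned-up response might look like, not real data.

```python
import json

# Made-up example of a trimmed response, NOT real traffic data.
trimmed = '[0.5, 0.75, "label", 1, null, 1.0]'

values = json.loads(trimmed)

# json.loads maps JSON numbers with a fractional part to float and
# whole numbers to int, so this keeps 1.0 but drops 1.
points = [v for v in values if isinstance(v, float)]
```

If the real response encodes some data points as whole numbers, filtering on `float` alone would silently drop them, so check a few values by eye first.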
Process/Save Data
Now that we have our data, inspect it, do additional processing, etc...
...or just save it.
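For the saving step, a minimal sketch that writes the points as a two-column CSV. The `write_points` helper and the column names are assumptions. A plain comma-separated table like this is something pgfplots can read, which may help with the TikZ side of the question.

```python
import csv
import io

def write_points(points, fh):
    """Write (index, value) rows to an open text file or file-like object."""
    writer = csv.writer(fh, lineterminator="\n")
    writer.writerow(["index", "value"])
    for i, value in enumerate(points):
        writer.writerow([i, value])

# Demo with an in-memory buffer and made-up values.
buf = io.StringIO()
write_points([0.5, 0.75], buf)
```

To write a real file, pass an open handle instead, for example `with open("traffic.csv", "w", newline="") as fh: write_points(points, fh)`.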
We could also plot it out using pyplot from matplotlib (or some other graphing/plotting library).
Conclusion
If you are just interested in a few things, you can adjust the chart to display what you want and then use the request headers and post data from the corresponding request to http://www.google.com/transparencyreport/traffic/service. You might want to inspect the actual response more closely than I did; I just discarded the parts that didn't make sense to me. Hopefully they'll expose a public API for this data.