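Using Mathematica to extract information from HTML
Is there an easy way to extract data from a specific HTML table using Mathematica? Import seems pretty powerful, and Mathematica seems able to handle formats like XML quite well.
Here is an example: http://en.wikipedia.org/wiki/Unemployment_by_country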

For general examples of this, there are these how-tos:
For this specific example, just import it:
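One way the import step might look, as a sketch rather than the answer's own code. It assumes the page is still served at the URL from the question and that its layout has not changed; the variable names are illustrative.

```mathematica
(* Assumption: the page from the question is still reachable at this URL. *)
url = "http://en.wikipedia.org/wiki/Unemployment_by_country";

(* The "Data" element returns the page's tables and lists as nested lists. *)
data = Import[url, "Data"];

(* Look at the nesting before writing any extraction pattern. *)
Depth[data]
Short[data, 10]
```

Inspecting the depth and a shortened view first makes it easier to decide which pattern will pick out the table rows.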
Cleaning it up is fairly straightforward with this import. The table has 3 columns, so extract it from the rest of the stuff:
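A minimal sketch of pulling the 3-column rows out of the nested import result. The `data` list here is a hand-made, hypothetical stand-in for what `Import[url, "Data"]` returns, and the row pattern (string, number, string) is an assumption about the table's columns.

```mathematica
(* Hypothetical stand-in for the nested list returned by Import[url, "Data"]. *)
data = {
   "Navigation",
   {{"Country", "Unemployment rate (%)", "Source / date"},
    {"Afghanistan", 35., "2008 est.[1]"},
    {"Albania", 13.3, "2011[2]"}},
   {"Footer", "links"}
   };

(* Keep only sublists shaped like a table row: name, number, source. *)
rows = Cases[data, {_String, _?NumericQ, _String}, Infinity]
```

Because the middle element must be numeric, the header row is skipped as well; it can be added back by hand afterwards.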
You will presumably want to remove the square-bracket references (such as footnote markers):
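The bracket removal could be sketched with a string replacement like the following. The sample rows and the helper name `stripRefs` are hypothetical.

```mathematica
(* Hypothetical sample rows with Wikipedia-style footnote markers. *)
rows = {{"Afghanistan", 35., "2008 est.[1]"}, {"Albania", 13.3, "2011[2][3]"}};

(* Delete every "[...]" group; Shortest keeps each match to one bracket pair. *)
stripRefs[s_String] :=
  StringTrim[StringReplace[s, "[" ~~ Shortest[___] ~~ "]" -> ""]];

(* Apply the cleanup to every string in the table, leaving numbers alone. *)
clean = rows /. s_String :> stripRefs[s]
```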
Note also that you can add the header back if you want it in your table, which you probably do:
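Adding the header back can be as simple as prepending a hand-typed row. A sketch with hypothetical column titles and sample rows:

```mathematica
(* Hypothetical cleaned rows. *)
clean = {{"Afghanistan", 35., "2008 est."}, {"Albania", 13.3, "2011"}};

(* Header typed in by hand; the titles are an assumption. *)
header = {"Country", "Unemployment rate (%)", "Source / date"};

(* Put the header in front of the data rows and display the result. *)
table = Prepend[clean, header];
Grid[table, Frame -> All]
```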
Purists might object to the last step, but when you are scraping data you generally just want to get the job done, and each site is a case-by-case prospect. So some manual inspection and flexibility gets you the fastest overall result.
Edit
If you wanted the flags, you could also get them from CountryData. Some further cleaning up is needed, otherwise a lot of misses will occur. The cleanup involves removing the reference to the "sovereign country" in parentheses, e.g. "Guam ( United States )" -> "Guam". This will still produce some output that CountryData does not recognize: 6 misses out of 190. Remove those misses from the output:
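A rough sketch of the flag lookup, not the answer's own code. The sample names are hypothetical, and how CountryData reports an unknown name differs between Mathematica versions, so the miss pattern below is an assumption.

```mathematica
(* Hypothetical sample names; the last one is deliberately unknown. *)
names = {"Guam ( United States )", "Albania", "Not A Real Country"};

(* Drop the parenthesised sovereign: "Guam ( United States )" -> "Guam". *)
bare = StringTrim[StringReplace[#, "(" ~~ Shortest[___] ~~ ")" -> ""]] & /@ names;

(* Pair each name with its flag; Quiet hides messages for unknown names. *)
pairs = Quiet[{#, CountryData[#, "Flag"]} & /@ bare];

(* Assumption: a miss comes back as unevaluated CountryData[...] or Missing[...]. *)
found = DeleteCases[pairs, {_, _CountryData | _Missing}];

Grid[found, Frame -> All]
```

`Frame -> All` is only one of the Grid options that could be used for styling here.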
Note that this takes a while to render.
You can obviously style the Grid as desired using Grid options, and also resize the images if needed.

While the use of Import is probably a better and more robust way, I found that, at least for this particular problem, my own HTML parser (published in this thread) works fine with a small amount of post-processing. If you take the code from there and execute it, augmenting it with this function:
Then you get, I think, pretty much complete data with this code:
Here is how the result looks:
What I like about this approach (as opposed to, say, Import -> XMLObject) is that, since I convert the web page into a Mathematica expression with minimal syntax (unlike, e.g., XML objects), it is often very easy to establish a set of replacement rules that does the right post-processing in each given case. A final disclaimer: my parser is not robust and certainly contains a number of bugs, so be warned.

Not a direct answer to how to import HTML (which others have explained nicely), but getting data from HTML tables is precisely why I originally made my table paste palette.
If your aim is just to get the data, this is probably going to be easier and faster than trying to parse the page.
Instructions on using the palette
Evaluate the expression that creates the palette, go to Palettes -> Install Palette... and save it permanently for later use (if you wish).
Select a part of the table on the webpage. If you are working with Firefox, hold down CTRL to select any rectangular section of the table (very useful!). Copy it.
If you are using Firefox or Chrome, press the TSV button on the palette to paste the data into the notebook at the current insertion point. I'm not sure whether other browsers also separate items with tabs when copying.
The result will look like this:
As you can see, some post-processing is needed to convert the years to a proper format (string or integer?).
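The kind of post-processing meant here might look like the sketch below. It is not the palette's code: the tab-separated text, the helper `toYear`, and the shape of the year cells are all hypothetical.

```mathematica
(* Hypothetical tab-separated text, as a browser might put it on the clipboard. *)
tsv = "Country\tYear\nAfghanistan\t2008 est.\nAlbania\t2011";

(* Split into lines, then into tab-separated cells. *)
pasted = StringSplit[#, "\t"] & /@ StringSplit[tsv, "\n"];

(* Turn the first 4-digit run in a cell into an integer; leave other cells alone. *)
toYear[s_String] := With[{m = StringCases[s, RegularExpression["\\d{4}"], 1]},
   If[m === {}, s, FromDigits[First[m]]]];

(* Skip the header row and convert the second column. *)
years = {#1, toYear[#2]} & @@@ Rest[pasted]
```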
What follows is the old palette code. I realize it needs cleanup, but it works as it is, and I haven't had time to fix it up yet. Report any issues in the comments below.

Of course, the result will frequently need further processing. How do you want to visualize it?
You can find all Import types using:
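Which command the answer meant is not certain. Two built-in ways to list what Import offers are sketched below: `$ImportFormats` for the supported formats, and the "Elements" request for what a particular source provides. The URL is the one from the question and is assumed to be reachable.

```mathematica
(* Every format name that Import understands. *)
$ImportFormats

(* The elements that can be requested from this particular page. *)
Import["http://en.wikipedia.org/wiki/Unemployment_by_country", "Elements"]
```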

If you want to go the Import[ ... , "XMLObject" ] route, here is an outline of what you can do.
First, get the page:
Next, get the table of interest (in this case the big table also happens to be the first of seven tables on this page):
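These two steps could be sketched as follows, assuming the page is still at the URL from the question. The page has likely changed since the answer was written, so the number of tables and which one comes first may differ.

```mathematica
(* Assumption: the page from the question is still reachable at this URL. *)
url = "http://en.wikipedia.org/wiki/Unemployment_by_country";

(* Import the page as a symbolic XML tree. *)
xml = Import[url, "XMLObject"];

(* Collect every table element, at any depth. *)
tables = Cases[xml, XMLElement["table", _, _], Infinity];
Length[tables]

(* Take the first match; check by eye that it is the table you want. *)
table = First[tables];
```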
Next, get a row from the table. I picked the fourth row, which corresponds to Algeria:
row = Cases[table, XMLElement["tr", ___], Infinity][[4]]
Next, extract the table data elements (td) from this row:
Out of those elements, you can pick, for example, the country flag thumbnail, like so:
Finally, import that image thumbnail (it needed "http:" prepended for some reason):
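The row, cell and thumbnail steps can be tried offline on a small stand-in table. Everything in this sketch is hypothetical: the HTML string, the example.org image address, and the variable names. The protocol-relative `src` (starting with `//`) is a guess at why "http:" had to be prepended.

```mathematica
(* Hypothetical miniature of the page: one header row and one data row. *)
html = "<table><tr><th>Country</th><th>Rate</th></tr><tr><td><img src=\"//example.org/flag.png\"/>Algeria</td><td>10.0</td></tr></table>";
xml = ImportString[html, {"HTML", "XMLObject"}];
table = First[Cases[xml, XMLElement["table", _, _], Infinity]];

(* Second tr here, because this sample has only one header row. *)
row = Cases[table, XMLElement["tr", ___], Infinity][[2]];

(* The td elements of that row. *)
cells = Cases[row, XMLElement["td", ___], Infinity];

(* Pull the src attribute out of the first img element inside the cells. *)
src = First[Cases[cells,
    XMLElement["img", {___, "src" -> s_, ___}, _] :> s, Infinity]];

(* Prepend the scheme so the address becomes a full URL. *)
thumbUrl = "http:" <> src
```

With a real page, `Import[thumbUrl]` would then fetch the image itself.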
This is what the notebook looks like (the thumbnail, plus the other inputs):

For certain values of 'easy', yes. See here: HTML Import documentation for Mathematica 8.
You can import from tables using the "Data" format option, e.g. Import["file.html", "Data"]. That's a start, but your link is a whole DOM tree's worth of tables, divs and other things. It's documented, but thinly, and you'd have to experiment. It does work with URLs, though.
This actually works. With a bit of cleaning you could use the data here:
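To experiment with the "Data" element without a network connection, a small HTML string can be imported the same way. The table below is a hypothetical stand-in for a saved file, and the exact nesting of the result is something to inspect rather than assume.

```mathematica
(* Hypothetical two-row table standing in for a saved HTML file. *)
html = "<html><body><table>
  <tr><th>Country</th><th>Rate</th></tr>
  <tr><td>Algeria</td><td>10.0</td></tr>
</table></body></html>";

(* Same "Data" element as Import["file.html", "Data"], but read from a string. *)
data = ImportString[html, {"HTML", "Data"}]

(* The nesting depth varies from page to page, so search at every level. *)
Cases[data, {"Algeria", _}, Infinity]
```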