Scraping HTML tables from a given URL into CSV
I seek a tool that can be run on the command line like so:
tablescrape 'http://someURL.foo.com' [n]
If n is not specified and there's more than one HTML table on the page, it should summarize them (header row, total number of rows) in a numbered list.
If n is specified or if there's only one table, it should parse the table and spit it to stdout as CSV or TSV.
Potential additional features:
- To be really fancy you could parse a table within a table, but for my purposes -- fetching data from wikipedia pages and the like -- that's overkill.
- An option to asciify any unicode.
- An option to apply an arbitrary regex substitution for fixing weirdnesses in the parsed table.
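The last two options are small enough that a sketch helps; here's roughly what they might boil down to with just the Python standard library (the sample substitution is only an illustration, not something the tool would ship with):

    import re
    import unicodedata

    def asciify(text):
        # Decompose accented characters, then drop whatever still isn't ASCII,
        # e.g. "café" -> "cafe".
        decomposed = unicodedata.normalize("NFKD", text)
        return decomposed.encode("ascii", "ignore").decode("ascii")

    def apply_fixups(text, substitutions):
        # substitutions is a user-supplied list of (pattern, replacement) pairs,
        # e.g. [(r"\[\d+\]", "")] to strip footnote markers like "[1]".
        for pattern, replacement in substitutions:
            text = re.sub(pattern, replacement, text)
        return text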
What would you use to cobble something like this together?
The Perl module HTML::TableExtract might be a good place to start and can even handle the case of nested tables.
This might also be a pretty short Python script with BeautifulSoup.
Would YQL be a good starting point?
Or, ideally, have you written something similar and have a pointer to it?
(I'm surely not the first person to need this.)
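For concreteness, here's roughly the shape the short BeautifulSoup version might take -- a sketch only, with assumptions throughout (requests/beautifulsoup4 as the libraries, CSV-only output, no special handling of nested tables or rowspans):

    #!/usr/bin/env python3
    # Sketch of the tablescrape idea: list the tables on a page, or dump one as CSV.
    import csv
    import sys

    import requests
    from bs4 import BeautifulSoup

    def fetch_tables(url):
        html = requests.get(url, timeout=30).text
        return BeautifulSoup(html, "html.parser").find_all("table")

    def rows_of(table):
        # One list of cell strings per <tr>; <th> and <td> are treated alike.
        return [
            [cell.get_text(" ", strip=True) for cell in tr.find_all(["th", "td"])]
            for tr in table.find_all("tr")
        ]

    def main():
        url = sys.argv[1]
        n = int(sys.argv[2]) if len(sys.argv) > 2 else None
        tables = fetch_tables(url)
        if n is None and len(tables) > 1:
            # Summarize each table: header row plus total number of rows.
            for i, table in enumerate(tables, 1):
                rows = rows_of(table)
                header = " | ".join(rows[0]) if rows else "(empty)"
                print(f"{i}. {header}  [{len(rows)} rows]")
        else:
            # Dump the requested table (or the only one) to stdout as CSV.
            csv.writer(sys.stdout).writerows(rows_of(tables[(n or 1) - 1]))

    if __name__ == "__main__":
        main()

Invoked as in the example at the top (URL, then an optional 1-based table number), that covers the summarize-or-dump behaviour; the asciify and regex options sketched above could be applied to each cell string before writing.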
This is my first attempt:
http://yootles.com/outbox/tablescrape.py
It needs a bit more work, like better asciifying, but it's usable. For example, if you point it at this list of Olympic records, it tells you that there are 8 tables available, and it's clear that the 2nd and 3rd ones (men's and women's records) are the ones you want. Then if you run it again, asking for the 2nd table, you get a reasonable plaintext data table.
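The invocations behind that would presumably look something like the following -- a guess based on the usage spelled out in the question, with the Wikipedia URL left as a placeholder rather than the exact page:

    ./tablescrape.py 'https://en.wikipedia.org/wiki/...'       # summarizes the tables it finds
    ./tablescrape.py 'https://en.wikipedia.org/wiki/...' 2     # dumps the 2nd table as delimited text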
Using TestPlan I produced a rough script. Given the complexity of web tables, it'll likely need to be tailored for each site.
This first script lists the tables on the page:
The second script then extracts the data of one table into a CSV file.
My CSV file looks like below. Note that wikipedia includes extra information in each cell. There are many ways to get rid of it, but not in a generic fashion.
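One example of the sort of non-generic cleanup meant here (my illustration, not part of the TestPlan scripts): stripping bracketed footnote markers such as [1] out of the finished CSV with sed.

    sed -E 's/\[[0-9]+\]//g' table.csv > table-clean.csv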
Using jq and pup, and a tip of the hat to this SO answer.
Both jq and pup are super-useful on their own. It seemed like one of those tools (or else xidel) should be able to extract HTML tables directly to a delimited text file, but I guess it isn't so. Fortunately, pipes, man. They're so good!
Update: Turns out Xidel (0.9.8) can do it. Compliant CSV would be tricky (escaping delimiters and quoting quotes, oh my), but tab-delimited is pretty straightforward, and could be converted by another tool like Miller, or LibreOffice Calc. An advantage of the tab-delimited format is that many other Unix text-processing tools already understand it (cut -d $'\t', sort -t $'\t', awk -F '\t'), and in a pinch you can write a nearly-foolproof parser yourself, e.g., in shell script.
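A hedged sketch of the kind of Xidel invocation meant here -- the -s (silent) and -e (extract) flags are real Xidel options, but the URL is a placeholder and the exact XPath depends on the page, so treat it as untested:

    xidel -s 'https://en.wikipedia.org/wiki/...' \
      -e '(//table)[2]//tr/string-join((th|td), codepoints-to-string(9))'

Each table row comes out as one tab-separated line on stdout.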
That smooshes the names together in the relay events, and there are some pesky leading spaces on the "Nation" column, but it gets you pretty close.