How can I scrape an HTML table to CSV?
The Problem
I use a tool at work that lets me do queries and get back HTML tables of info. I do not have any kind of back-end access to it.
A lot of this info would be much more useful if I could put it into a spreadsheet for sorting, averaging, etc. How can I screen-scrape this data to a CSV file?
My First Idea
Since I know jQuery, I thought I might use it to strip out the table formatting onscreen, insert commas and line breaks, and just copy the whole mess into notepad and save as a CSV. Any better ideas?
The Solution
Yes, folks, it really was as easy as copying and pasting. Don't I feel silly.
Specifically, when I pasted into the spreadsheet, I had to select "Paste Special" and choose the format "text." Otherwise it tried to paste everything into a single cell, even if I highlighted the whole spreadsheet.
Comments (11)
However, this is a manual solution, not an automated one.
Using Python:
For example, imagine you want to scrape forex quotes in CSV form from some site like fxquotes.
Then...
Edit: to get values from a table:
Example from: palewire
This is my Python version using the (currently) latest version of BeautifulSoup which can be obtained using, e.g.,
The script reads HTML from the standard input, and outputs the text found in all tables in proper CSV format.
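The script itself did not survive on this page. As a rough stand-in, here is a minimal sketch of the same idea that uses only the standard library's html.parser instead of BeautifulSoup. The names TableTextParser and html_tables_to_csv are made up for illustration. It assumes cells and rows have explicit closing tags and does not handle nested tables.

```python
import csv
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by row and table."""

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def html_tables_to_csv(html, out):
    """Write the cell text of all tables in `html` to `out` as CSV."""
    parser = TableTextParser()
    parser.feed(html)
    parser.close()
    writer = csv.writer(out)  # quotes cells containing commas or quotes
    for table in parser.tables:
        writer.writerows(table)
```

Calling html_tables_to_csv(sys.stdin.read(), sys.stdout) after an import sys would give the stdin-to-stdout behaviour described above.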
Even easier (because it saves it for you for next time)...
In Excel, Data / Import External Data / New Web Query will take you to a URL prompt. Enter your URL, and it will delimit available tables on the page to import. Voila.
Two ways come to mind (especially for those of us that don't have Excel):
Google Sheets has an importHTML function: =importHTML("http://example.com/page/with/table", "table", index). Copy and paste values shortly after import.
Python's pandas library has read_html and to_csv functions.
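For the read_html and to_csv route, a minimal sketch might look like the following. The helper name first_table_to_csv is hypothetical, and it assumes pandas is installed along with an HTML parser backend such as lxml.

```python
import io

import pandas as pd


def first_table_to_csv(html: str) -> str:
    """Return the first <table> found in `html` as CSV text."""
    # read_html returns one DataFrame per table it finds and raises
    # ValueError when the document contains no tables.
    tables = pd.read_html(io.StringIO(html))
    return tables[0].to_csv(index=False)
```

read_html also accepts a URL directly, so something like pd.read_html("http://example.com/page/with/table") should work when the page is reachable without a login.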
Quick and dirty:
Copy out of browser into Excel, save as CSV.
Better solution (for long term use):
Write a bit of code in the language of your choice that will pull the html contents down, and scrape out the bits that you want. You could probably throw in all of the data operations (sorting, averaging, etc) on top of the data retrieval. That way, you just have to run your code and you get the actual report that you want.
It all depends on how often you will be performing this particular task.
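As an illustration of layering the data operations on top of the retrieval, here is a small sketch that sorts and averages one numeric column once the table is available as CSV text. The function name summarize and the column names in the usage line are invented for the example.

```python
import csv
import io
from statistics import mean


def summarize(csv_text, column):
    """Sort rows by a numeric column and return (sorted_rows, average)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Values arrive as strings, so convert before comparing or averaging.
    rows.sort(key=lambda row: float(row[column]))
    average = mean(float(row[column]) for row in rows)
    return rows, average
```

For instance, summarize("name,price\nb,3.0\na,1.0\n", "price") sorts the rows by price and returns them together with the average price.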
Excel can open an HTTP page.
E.g.:
Click File, Open
Under filename, paste the URL, i.e.: How can I scrape an HTML table to CSV?
Click OK
Excel does its best to convert the HTML to a table.
It's not the most elegant solution, but it does work!
Basic Python implementation using BeautifulSoup, also considering both rowspan and colspan:
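The original code is missing from this page. The sketch below shows one way spanned cells could be expanded, using the standard library's html.parser rather than BeautifulSoup. SpanCellParser and expand_spans are illustrative names. It assumes a single table with explicitly closed cells, and it repeats a spanned cell's text in every slot it covers.

```python
from html.parser import HTMLParser


def _span(value):
    """Parse a rowspan/colspan attribute, treating missing or 0 as 1."""
    return max(1, int(value or 1))


class SpanCellParser(HTMLParser):
    """Collect rows of (text, rowspan, colspan) tuples from one table."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # [text_parts, rowspan, colspan] while in a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self._cell = [[], _span(a.get("rowspan")), _span(a.get("colspan"))]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            parts, rowspan, colspan = self._cell
            self.rows[-1].append(("".join(parts).strip(), rowspan, colspan))
            self._cell = None


def expand_spans(html):
    """Return a grid (list of rows) with spanned cells repeated."""
    parser = SpanCellParser()
    parser.feed(html)
    grid = {}  # (row, col) -> text
    for r, cells in enumerate(parser.rows):
        c = 0
        for text, rowspan, colspan in cells:
            while (r, c) in grid:  # skip slots filled by an earlier rowspan
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]
```

The resulting grid can be passed straight to csv.writer(...).writerows(...) to produce the CSV file.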
Here is a tested example that combines grequest and soup to download large quantities of pages from a structured website:
Have you tried opening it with Excel?
If you save a spreadsheet in Excel as HTML you'll see the format Excel uses.
From a web app I wrote, I spit out this HTML format so the user can export to Excel.
If you're screen scraping and the table you're trying to convert has a given ID, you could always do a regex parse of the html along with some scripting to generate a CSV.
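A rough sketch of that regex approach is shown below. The function name table_by_id_to_csv is hypothetical. It assumes the id attribute is quoted, that the target table contains no nested tables, and that cells have closing tags. Regex parsing of HTML is fragile, so treat this as a quick hack rather than a general parser.

```python
import csv
import html
import io
import re


def table_by_id_to_csv(page, table_id):
    """Regex-extract the table with the given id and return it as CSV text."""
    pattern = (r'<table\b[^>]*\sid\s*=\s*(["\'])' + re.escape(table_id)
               + r'\1[^>]*>(.*?)</table>')
    match = re.search(pattern, page, re.IGNORECASE | re.DOTALL)
    if match is None:
        raise ValueError("no table with id %r" % table_id)
    out = io.StringIO()
    writer = csv.writer(out)
    flags = re.IGNORECASE | re.DOTALL
    for row in re.findall(r"<tr\b[^>]*>(.*?)</tr>", match.group(2), flags):
        cells = re.findall(r"<t[dh]\b[^>]*>(.*?)</t[dh]>", row, flags)
        # Drop tags left inside a cell, then decode entities such as &amp;.
        writer.writerow(
            [html.unescape(re.sub(r"<[^>]+>", "", cell)).strip() for cell in cells]
        )
    return out.getvalue()
```

Writing the returned string to a file with a .csv extension gives something a spreadsheet can open directly.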