Parsing HTML rows into CSV

Posted 2024-07-26 09:08:04

First off, the HTML row looks like this:

<tr class="evenColor"> blahblah TheTextIneed blahblah and ends with </tr>

I would show the real HTML, but I'm sorry to say I don't know how to format it as a code block here. Feels shame.

Using BeautifulSoup (Python) or any other recommended screen-scraping/parsing method, I would like to convert about 1200 .htm files in the same directory into CSV format. This will eventually go into an SQL database. Each directory represents a year, and I plan to do at least 5 years.

Based on some advice, I have been goofing around with glob as the best way to iterate over the files. This is what I have so far, and I am stuck.

import glob
from BeautifulSoup import BeautifulSoup

for filename in glob.glob('/home/phi/data/NHL/pl0708/pl02*.htm'):
#these files go from pl020001.htm to pl021230.htm sequentially
    soup = BeautifulSoup(open(filename["r"]))
    for row in soup.findAll("tr", attrs={ "class" : "evenColor" })

I realize this is ugly, but it's my first attempt at anything like this. It took me months to get to this point after realizing that I don't have to go through thousands of files manually, copying and pasting into Excel. I have also realized that I can kick my computer repeatedly out of frustration and it still works (not recommended). I am getting close; I just need to know what to do next to produce those CSV files. Please help, or my monitor finally gets hammer punched.

Replies (3)

还如梦归 2024-08-02 09:08:04

You need to import the csv module by adding import csv to the top of your file.

Then you'll need to create a CSV writer outside your loop over the rows, like so:

writer = csv.writer(open("%s.csv" % filename, "wb"))

Then you need to actually pull the data out of the HTML row inside your loop, along the lines of:

values = (td.fetchText() for td in row)
writer.writerow(values)
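
Putting those pieces together, here is a minimal sketch of the whole loop, assuming the BeautifulSoup 3 API used in the question (from BeautifulSoup import BeautifulSoup) and Python 2-style CSV writing; the cell text is pulled out with findAll(text=True) rather than the older fetchText alias:

import csv
import glob
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, as imported in the question

for filename in glob.glob('/home/phi/data/NHL/pl0708/pl02*.htm'):
    soup = BeautifulSoup(open(filename, "r"))
    # one output file per input file, e.g. pl020001.htm -> pl020001.htm.csv
    writer = csv.writer(open("%s.csv" % filename, "wb"))
    for row in soup.findAll("tr", attrs={"class": "evenColor"}):
        # join all text inside each <td> cell and strip surrounding whitespace
        values = [''.join(td.findAll(text=True)).strip().encode("utf-8")
                  for td in row.findAll("td")]
        writer.writerow(values)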
命硬 2024-08-02 09:08:04

You don't really explain why you are stuck - what's not working exactly?

The following line may well be your problem:

soup = BeautifulSoup(open(filename["r"]))

It looks to me like this should be:

soup = BeautifulSoup(open(filename, "r"))

The following line:

for row in soup.findAll("tr", attrs={ "class" : "evenColor" })

looks like it will only pick out the even rows (assuming your even rows have the class 'evenColor' and the odd rows have 'oddColor'). If you want all rows with a class of either evenColor or oddColor, you can use a regular expression to match the class value (remember to add import re at the top):

for row in soup.findAll("tr", attrs={ "class" : re.compile(r"evenColor|oddColor") }):
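
For completeness, a rough sketch of how those two fixes look in context, using a single hypothetical input file from the question's directory and simply printing each matching row's text:

import re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3

# hypothetical single file, just to exercise the two fixes above
filename = "/home/phi/data/NHL/pl0708/pl020001.htm"

soup = BeautifulSoup(open(filename, "r"))  # mode is a second argument to open(), not an index
for row in soup.findAll("tr", attrs={"class": re.compile(r"evenColor|oddColor")}):
    # print the row's concatenated text to confirm the selector matches
    print ''.join(row.findAll(text=True)).strip()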
执手闯天涯 2024-08-02 09:08:04

That looks fine, and BeautifulSoup is useful for this (although I personally tend to use lxml). You should be able to take the data you get and make a CSV file out of it using the csv module without any obvious problems...

I think you need to actually tell us what the problem is. "It still doesn't work" is not a problem description.
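
Since lxml is mentioned as an alternative, here is a rough sketch of the same extraction with lxml.html; it assumes the file layout from the question and that each wanted value sits in its own td cell:

import csv
import glob
import lxml.html

for filename in glob.glob('/home/phi/data/NHL/pl0708/pl02*.htm'):
    doc = lxml.html.parse(filename)
    writer = csv.writer(open("%s.csv" % filename, "wb"))
    # XPath selects the rows with class="evenColor", mirroring the BeautifulSoup filter
    for row in doc.xpath('//tr[@class="evenColor"]'):
        # text_content() collapses all text inside each <td> cell
        writer.writerow([td.text_content().strip().encode("utf-8")
                         for td in row.xpath('./td')])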
