Parsing HTML rows into CSV

Posted 2024-07-26 09:08:04

First off, the HTML row looks like this:

<tr class="evenColor"> blahblah TheTextIneed blahblah and ends with </tr>

I would show the real HTML, but I'm sorry to say I don't know how to format it as a code block here. Feels shame.

Using BeautifulSoup (Python) or any other recommended screen-scraping/parsing method, I would like to convert about 1200 .htm files in the same directory into CSV format. This will eventually go into an SQL database. Each directory represents a year, and I plan to do at least 5 years.

Based on some advice, I have been goofing around with glob as the best way to iterate over the files. This is what I have so far, and I am stuck.

import glob
from BeautifulSoup import BeautifulSoup

for filename in glob.glob('/home/phi/data/NHL/pl0708/pl02*.htm'):
#these files go from pl020001.htm to pl021230.htm sequentially
    soup = BeautifulSoup(open(filename["r"]))
    for row in soup.findAll("tr", attrs={ "class" : "evenColor" })

I realize this is ugly, but it's my first attempt at anything like this. It took me months to get to this point after realizing that I don't have to go through thousands of files manually, copying and pasting into Excel. I have also realized that I can kick my computer repeatedly out of frustration and it still works (not recommended). I am getting close; I just need to know what to do next to produce those CSV files. Please help, or my monitor finally gets hammer punched.

Replies (3)

还如梦归 2024-08-02 09:08:04

You need to import the csv module by adding import csv to the top of your file.

Then you'll need to create a CSV writer outside your loop over the rows, like so:

writer = csv.writer(open("%s.csv" % filename, "wb"))

Then you need to actually pull the data out of the HTML row inside your loop, along the lines of:

values = (td.fetchText() for td in row)
writer.writerow(values)
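
Putting those pieces together, here is a minimal sketch of the whole loop, assuming the BeautifulSoup 3 API used in the question (from BeautifulSoup import BeautifulSoup) and Python 2-style CSV writing; the cell text is pulled out with findAll(text=True) rather than the older fetchText alias:

import csv
import glob
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, as imported in the question

for filename in glob.glob('/home/phi/data/NHL/pl0708/pl02*.htm'):
    soup = BeautifulSoup(open(filename, "r"))
    # one output file per input file, e.g. pl020001.htm -> pl020001.htm.csv
    writer = csv.writer(open("%s.csv" % filename, "wb"))
    for row in soup.findAll("tr", attrs={"class": "evenColor"}):
        # join all text inside each <td> cell and strip surrounding whitespace
        values = [''.join(td.findAll(text=True)).strip().encode("utf-8")
                  for td in row.findAll("td")]
        writer.writerow(values)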
命硬 2024-08-02 09:08:04

You don't really explain why you are stuck - what's not working exactly?

The following line may well be your problem:

soup = BeautifulSoup(open(filename["r"]))

It looks to me like this should be:

soup = BeautifulSoup(open(filename, "r"))

The following line:

for row in soup.findAll("tr", attrs={ "class" : "evenColor" })

looks like it will only pick out the even rows (assuming your even rows have the class 'evenColor' and the odd rows have 'oddColor'). If you want all rows with a class of either evenColor or oddColor, you can use a regular expression to match the class value (remember to add import re at the top):

for row in soup.findAll("tr", attrs={ "class" : re.compile(r"evenColor|oddColor") }):
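
For completeness, a rough sketch of how those two fixes look in context, using a single hypothetical input file from the question's directory and simply printing each matching row's text:

import re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3

# hypothetical single file, just to exercise the two fixes above
filename = "/home/phi/data/NHL/pl0708/pl020001.htm"

soup = BeautifulSoup(open(filename, "r"))  # mode is a second argument to open(), not an index
for row in soup.findAll("tr", attrs={"class": re.compile(r"evenColor|oddColor")}):
    # print the row's concatenated text to confirm the selector matches
    print ''.join(row.findAll(text=True)).strip()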
执手闯天涯 2024-08-02 09:08:04

That looks fine, and BeautifulSoup is useful for this (although I personally tend to use lxml). You should be able to take the data you get and make a CSV file out of it using the csv module without any obvious problems...

I think you need to actually tell us what the problem is. "It still doesn't work" is not a problem description.
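
Since lxml is mentioned as an alternative, here is a rough sketch of the same extraction with lxml.html; it assumes the file layout from the question and that each wanted value sits in its own td cell:

import csv
import glob
import lxml.html

for filename in glob.glob('/home/phi/data/NHL/pl0708/pl02*.htm'):
    doc = lxml.html.parse(filename)
    writer = csv.writer(open("%s.csv" % filename, "wb"))
    # XPath selects the rows with class="evenColor", mirroring the BeautifulSoup filter
    for row in doc.xpath('//tr[@class="evenColor"]'):
        # text_content() collapses all text inside each <td> cell
        writer.writerow([td.text_content().strip().encode("utf-8")
                         for td in row.xpath('./td')])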
