Extracting text from an HTML file?

Posted 2024-12-07 13:40:40

I have a web page which contains a bunch of text and I want to extract just the text from the page and write it to a file. I am trying to use BeautifulSoup but am not sure it easily does what I want. Here is the story: I believe that the text I want to extract lies between:

<td colspan="2" class="msg_text_cell" style="text-align: justify; background-color: rgb(212, 225, 245); background-image: none; background-repeat: repeat-x;" rowspan="2" valign="top" width="100%">

and

<p></p><div style="overflow: hidden; width: 550px; height: 48px;">

What I want to do is select just the lines of text between these two, not including the begin and end markup above. Note that the begin HTML above is on a line by itself, but the end markup sometimes occurs right after the last text I want, on the same line rather than on a new one.

I cannot see how to do what I want with BeautifulSoup, but my unfamiliarity with it is probably getting in the way.

Also, the text I want to extract occurs, say, 50 times in the page, so I want each such block of text separated by something like '+++++++++++++++++++++' to make the output easier to read.

Thanks much for your help.

Comments (3)

豆芽 2024-12-14 13:40:40

If you know a bit of Ruby, I can point you to Nokogiri, which is an amazing gem for screen scraping.

单调的奢华 2024-12-14 13:40:40

Simply put, you can loop over the DOM elements that contain the text you want and extract it that way. Using jQuery, something like:

$('td.msg_text_cell').each(function (idx, el) {
    // idx is the index within the set of elements matched by the selector above,
    // i.e. all tds with a class of msg_text_cell
});

You can also do this with native JS, so don't think I'm pushing jQuery; it's just the framework I'm more familiar with.

黑色毁心梦 2024-12-14 13:40:40

You can do this easily with BeautifulSoup:

from bs4 import BeautifulSoup as bs

# Sample markup from the question, with a <p> element added for illustration
html = "<td colspan=\"2\" class=\"msg_text_cell\" style=\"text-align: justify; background-color: rgb(212, 225, 245); background-image: none; background-repeat: repeat-x;\" rowspan=\"2\" valign=\"top\" width=\"100%\"> <p>The text</p><div style=\"overflow: hidden; width: 550px; height: 48px;\">"
# Name the parser explicitly so bs4 does not have to guess one
soup = bs(html, "html.parser")
soup.find('p')

This finds the text inside the <p> tag:

Output: <p>The text</p>

You can now add a loop to process every matching tag rather than just the first one.
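As a rough sketch of such a loop, the snippet below collects the text of every `td` with class `msg_text_cell` and stops at the first `<div>` inside each cell, since the question says the wanted text ends just before one. The function name `extract_messages` and the sample markup are made up for illustration, and the real page may nest things differently.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical sample page: two message cells, each ending with a <div>
SAMPLE_HTML = """
<table><tr>
<td class="msg_text_cell">First message
<p></p><div style="overflow: hidden;">signature one</div></td>
</tr><tr>
<td class="msg_text_cell">Second message<p></p><div style="overflow: hidden;">signature two</div></td>
</tr></table>
"""


def extract_messages(html):
    """Return the text of each msg_text_cell, up to its first <div>."""
    soup = BeautifulSoup(html, "html.parser")
    messages = []
    for cell in soup.find_all("td", class_="msg_text_cell"):
        parts = []
        for node in cell.children:
            # Assumption: the first <div> child marks the end of the wanted text
            if isinstance(node, Tag) and node.name == "div":
                break
            if isinstance(node, Tag):
                parts.append(node.get_text())
            elif isinstance(node, NavigableString):
                parts.append(str(node))
        messages.append("".join(parts).strip())
    return messages


print(extract_messages(SAMPLE_HTML))
```

If the cells on the real page contain nothing but the wanted text, `cell.get_text(strip=True)` alone would likely be enough.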

Then you can save the results to a file.

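import csv

# newline="" lets the csv module control line endings itself
with open("data.csv", "w", newline="") as tW:
    writer = csv.writer(tW, delimiter=",")
    writer.writerow(["Ptag"])  # header row
    # Write the text of every <p> tag, one per row
    for i in soup.find_all("p"):
        p = i.get_text()
        writer.writerow([p])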
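The question asks for plain text with a line of plus signs between entries rather than a CSV, so a plain-text variant might look like the sketch below. It assumes the extracted strings are already in a list. The names `format_messages` and `write_messages`, the output file name, and the separator length are illustrative choices, not something the original answer specifies.

```python
from pathlib import Path

# Separator line between entries; the exact length is an arbitrary choice
SEPARATOR = "+" * 21


def format_messages(messages):
    """Join the messages, putting a separator line between consecutive ones."""
    return f"\n{SEPARATOR}\n".join(messages) + "\n"


def write_messages(messages, path):
    """Write the formatted messages to a UTF-8 text file."""
    Path(path).write_text(format_messages(messages), encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical data standing in for text pulled out of the page
    write_messages(["First message", "Second message"], "messages.txt")
```

Keeping the formatting separate from the file writing makes it easy to check the output in memory before anything touches disk.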