Extracting text from an HTML file?

Posted on 2024-12-07 13:40:40

I have a web page which contains a bunch of text and I want to extract just the text from the page and write it to a file. I am trying to use BeautifulSoup but am not sure it easily does what I want. Here is the story: I believe that the text I want to extract lies between:

<td colspan="2" class="msg_text_cell" style="text-align: justify; background-color: rgb(212, 225, 245); background-image: none; background-repeat: repeat-x;" rowspan="2" valign="top" width="100%">

and

<p></p><div style="overflow: hidden; width: 550px; height: 48px;">

What I want to do is select just the text lines in between, not including the begin and end markup above. Note that the begin HTML above is on a line by itself, but the end markup sometimes occurs right after the last text I want, on the same line rather than on a new one.

I cannot see how to do what I want with BeautifulSoup, but it is probably my unfamiliarity getting in the way.

Also, the text I want to extract occurs, say, 50 times in the page, so I want all such text separated by something like '+++++++++++++++++++++' to make it easier to read.

Thanks much for your help.


豆芽 2024-12-14 13:40:40

If you know some Ruby, I can recommend Nokogiri, an excellent gem for screen scraping.


单调的奢华 2024-12-14 13:40:40

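In short, you can loop over the DOM elements that are expected to contain the text you want and extract it that way. With jQuery, something like:

$('td.msg_text_cell').each(function (idx, el) {
  // idx is the index within the jQuery object matched by the selector above,
  // which selects every td with the class msg_text_cell; el is the element itself
});

You can also use native JS, so don't think I'm pushing jQuery; it's just the framework I'm more familiar with.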
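
For readers working in Python rather than in the browser, the same loop over every matching cell can be sketched with BeautifulSoup. This is only an illustration and not part of the answer above: the sample markup and the helper name `cell_texts` are assumptions, and only the class name `msg_text_cell` comes from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page: two cells carry the class named in the question.
SAMPLE_HTML = """
<table>
  <tr><td class="msg_text_cell">First message</td></tr>
  <tr><td class="other">Not wanted</td></tr>
  <tr><td class="msg_text_cell">Second message</td></tr>
</table>
"""

def cell_texts(html):
    """Return the stripped text of every td with class msg_text_cell (illustrative helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all with class_ matches the same elements as the selector td.msg_text_cell
    cells = soup.find_all("td", class_="msg_text_cell")
    return [td.get_text(strip=True) for td in cells]

# enumerate() plays the role of the idx argument in the jQuery callback
for idx, text in enumerate(cell_texts(SAMPLE_HTML)):
    print(idx, text)
```

Filtering on the class alone ignores the long inline `style` attribute, so the match should not break if the page's styling changes.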


黑色毁心梦 2024-12-14 13:40:40

You can do this easily with BeautifulSoup:

from bs4 import BeautifulSoup as bs

# Sample markup taken from the question, with a <p> added to hold some text
html = "<td colspan=\"2\" class=\"msg_text_cell\" style=\"text-align: justify; background-color: rgb(212, 225, 245); background-image: none; background-repeat: repeat-x;\" rowspan=\"2\" valign=\"top\" width=\"100%\"> <p>The text</p><div style=\"overflow: hidden; width: 550px; height: 48px;\">"
# Name the parser explicitly so bs4 does not warn about guessing one
soup = bs(html, "html.parser")
print(soup.find('p'))

You can now find the text inside the <p> tag:

Output: <p>The text</p>

You can now add a loop to go through every matching tag.

Then you can save the results to a file.

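import csv

# newline="" stops the csv module from writing blank rows on Windows
with open("data.csv", "w", newline="") as tW:
    writer = csv.writer(tW, delimiter=",")
    writer.writerow(["Ptag"])  # header row
    # Write the text of every <p> tag, one per row
    for i in soup.find_all("p"):
        p = i.get_text()
        writer.writerow([p])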
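
Putting the pieces together for the layout described in the question, a minimal sketch might look like the following. It rests on assumptions that are not confirmed by the question: that the closing `<div style="overflow: hidden; ...">` sits inside the `msg_text_cell` cell, that dropping every nested `<div>` leaves only the wanted text, and that a plain text file is the desired output. The helper name `extract_messages`, the sample markup, and the file name `messages.txt` are all hypothetical.

```python
from bs4 import BeautifulSoup

SEPARATOR = "+" * 20  # divider between messages; the length is arbitrary

def extract_messages(html):
    """Return the text of each td.msg_text_cell, leaving out nested <div> blocks."""
    soup = BeautifulSoup(html, "html.parser")
    messages = []
    for cell in soup.find_all("td", class_="msg_text_cell"):
        # Assumption: the unwanted trailing block is a <div> inside the cell.
        # Remove divs one at a time so nested divs are never touched twice.
        div = cell.find("div")
        while div is not None:
            div.decompose()
            div = cell.find("div")
        # Join the remaining text fragments with single spaces
        messages.append(cell.get_text(" ", strip=True))
    return messages

# Hypothetical sample markup modelled on the snippets in the question.
SAMPLE_HTML = (
    '<table><tr><td class="msg_text_cell">First message<p></p>'
    '<div style="overflow: hidden;">footer</div></td></tr>'
    '<tr><td class="msg_text_cell">Second<br>message<p></p>'
    '<div style="overflow: hidden;">footer</div></td></tr></table>'
)

messages = extract_messages(SAMPLE_HTML)
# Write all messages to one file, with the separator on its own line between them
with open("messages.txt", "w", encoding="utf-8") as out:
    out.write(("\n" + SEPARATOR + "\n").join(messages) + "\n")
```

If the trailing `<div>` turns out to be a sibling of the cell rather than a child, the `decompose()` loop simply finds nothing to remove and the cell text is returned unchanged.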
