Extracting text from an HTML file?
I have a web page that contains a bunch of text, and I want to extract just the text from the page and write it to a file. I am trying to use BeautifulSoup but am not sure it easily does what I want. Here is the story: I believe the text I want to extract lies between:
<td colspan="2" class="msg_text_cell" style="text-align: justify; background-color: rgb(212, 225, 245); background-image: none; background-repeat: repeat-x;" rowspan="2" valign="top" width="100%">
and
<p></p><div style="overflow: hidden; width: 550px; height: 48px;">
What I want to do is select just the lines of text in between, not including the begin and end markup above. Note that the begin HTML above is on a line by itself, but the end markup sometimes occurs right after the last text I want, on the same line rather than on a new one.
I cannot see how to do what I want with BeautifulSoup, but it is probably my unfamiliarity getting in the way.
Also, the text I want to extract occurs about 50 times in the page, so I want each piece of text separated by something like '+++++++++++++++++++++' to make the output easier to read.
Thanks much for your help.
Comments (3)
If you know a bit of Ruby, I can point you to Nokogiri, which is an amazing gem for screen scraping.
Simply put, you can loop over the DOM elements that contain the text you want and extract it that way. Using jQuery, something like:
$('td.msg_text_cell').each(function (idx, el) {
    // idx is the index in the set of jQuery objects matched by the selector above, which gets all tds with a class of msg_text_cell ...
});
You can also do this with native JS, so don't think I'm pushing jQuery; it's just the framework I'm more familiar with.
You can do this easily with BeautifulSoup.
You can then find the text inside the tag.
You can then add a loop to modify the variable.
Finally, you can save the result to a file.
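The code that originally accompanied this answer is not shown, so here is a minimal sketch of the steps it describes, not the answerer's own code. It assumes the bs4 package is installed, that every wanted block is a td with class msg_text_cell, and that the closing div (the one whose style contains "overflow: hidden") sits inside that td. The names extract_messages, join_messages, write_messages and SAMPLE_HTML are hypothetical, and the separator length is arbitrary.

```python
from bs4 import BeautifulSoup

# Separator between extracted blocks; the exact length is arbitrary.
SEPARATOR = "+" * 21

# Hypothetical sample standing in for the real page.
SAMPLE_HTML = """
<table><tr>
<td colspan="2" class="msg_text_cell" valign="top" width="100%">
First message, line one.<br>
First message, line two.<p></p><div style="overflow: hidden; width: 550px; height: 48px;">footer</div>
</td>
</tr><tr>
<td colspan="2" class="msg_text_cell" valign="top" width="100%">
Second message.<p></p><div style="overflow: hidden; width: 550px; height: 48px;">footer</div>
</td>
</tr></table>
"""


def extract_messages(html):
    """Return the text of each td.msg_text_cell, without the trailing div."""
    soup = BeautifulSoup(html, "html.parser")
    messages = []
    for cell in soup.find_all("td", class_="msg_text_cell"):
        # Assumption: the end marker is a div inside the cell whose style
        # attribute contains "overflow: hidden". Drop it before reading text.
        marker = cell.find(
            "div", style=lambda value: value and "overflow: hidden" in value
        )
        if marker is not None:
            marker.decompose()
        # Join the remaining text fragments, one per line, trimming whitespace.
        messages.append(cell.get_text("\n", strip=True))
    return messages


def join_messages(messages):
    """Join the blocks with the separator on a line of its own."""
    return ("\n" + SEPARATOR + "\n").join(messages) + "\n"


def write_messages(messages, path):
    """Write the joined blocks to a UTF-8 text file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(join_messages(messages))


if __name__ == "__main__":
    print(join_messages(extract_messages(SAMPLE_HTML)), end="")
```

To run it against a real page, read the saved HTML file into a string, pass it to extract_messages, and hand the result to write_messages with the output path. If the closing div turns out to be outside the td in the real markup, the marker lookup finds nothing and the whole cell text is kept.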