Splitting HTML after N words in Python

Posted 2024-07-10 08:14:48, 712 characters, 7 views, 0 comments

Is there any way to split a long string of HTML after N words? Obviously I could use:

' '.join(foo.split(' ')[:n])

to get the first n words of a plain-text string, but that might split in the middle of an HTML tag, and it won't produce valid HTML because it won't close the tags that have been opened.
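To make the failure concrete, here is a minimal sketch of the naive approach applied to the example text from this question (written on one line). The names `sample_html` and `naive_first_words` are only illustrative.

```python
# The example markup from the question, joined onto one line.
sample_html = ('<p>This is some text with a '
               '<a href="http://www.example.com/" title="Example link">'
               'bit of linked text in it</a>.</p>')


def naive_first_words(text, n):
    # Splits on single spaces only, with no awareness of markup.
    return ' '.join(text.split(' ')[:n])


print(naive_first_words(sample_html, 7))
# → <p>This is some text with a <a
```

The seventh "word" is the start of the `<a` tag, so the result ends inside a tag and neither `<p>` nor `<a>` is closed.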

I need to do this in a Zope/Plone site. If something standard in those products can do it, that would be ideal.

For example, say I have the text:

<p>This is some text with a 
  <a href="http://www.example.com/" title="Example link">
     bit of linked text in it
  </a>.
</p>

If I ask it to split after 5 words, it should return:

<p>This is some text with</p>

And after 7 words:

<p>This is some text with a 
  <a href="http://www.example.com/" title="Example link">
     bit
  </a>
</p>

Comments (4)

徒留西风 2024-07-17 08:14:48

Take a look at the truncate_html_words function in django.utils.text. Even if you aren't using Django, the code there does exactly what you want.

攒一口袋星星 2024-07-17 08:14:48

I've heard that Beautiful Soup is very good at parsing HTML. It will probably be able to help you get correct HTML out.

阳光的暖冬 2024-07-17 08:14:48

I was going to mention the base HTMLParser that's built into Python. Since I'm not sure what end result you're trying to get to, it may or may not get you there. You'll work primarily with its handler methods.
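As a rough illustration of the handler-based approach, here is a minimal sketch using the standard library's `html.parser.HTMLParser` (the Python 3 name of the module). The class `WordTruncator` and the function `split_html_after_words` are hypothetical names invented for this example. The sketch assumes reasonably well-formed input, counts whitespace-separated runs of text as words, and silently drops comments and doctype declarations.

```python
import re
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are not tracked as open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class WordTruncator(HTMLParser):
    """Hypothetical helper: copies HTML through until `limit` words are seen."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.remaining = limit
        self.out = []           # pieces of the truncated output
        self.open_tags = []     # stack of elements that are currently open
        self.done = limit <= 0  # nothing to copy for a non-positive limit

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())  # raw tag, attributes intact
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: copy it, nothing to track.
        if not self.done:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.done or tag not in self.open_tags:
            return
        # Close any elements left open inside this one, then the element itself.
        while self.open_tags:
            open_tag = self.open_tags.pop()
            self.out.append("</%s>" % open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        if self.done:
            return
        words = list(re.finditer(r"\S+", data))
        if len(words) < self.remaining:
            self.out.append(escape(data, quote=False))
            self.remaining -= len(words)
        else:
            # Cut right after the last word that is still allowed.
            cut = words[self.remaining - 1].end()
            self.out.append(escape(data[:cut], quote=False))
            self.done = True


def split_html_after_words(html_text, limit):
    parser = WordTruncator(limit)
    parser.feed(html_text)
    parser.close()
    # Close whatever is still open, innermost element first.
    closing = "".join("</%s>" % tag for tag in reversed(parser.open_tags))
    return "".join(parser.out) + closing


sample = ('<p>This is some text with a \n'
          '  <a href="http://www.example.com/" title="Example link">\n'
          '     bit of linked text in it\n'
          '  </a>.\n'
          '</p>')

print(split_html_after_words(sample, 5))
# → <p>This is some text with</p>
```

With a limit of 7 the same call keeps the opening `<a ...>` tag, cuts after "bit", and appends `</a></p>`. Because character references are decoded and then re-escaped, text inside `<script>` or `<style>` elements would not survive unchanged, so this is a starting point rather than a complete solution.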

爱你是孤单的心事 2024-07-17 08:14:48

You can use a mix of regex, BeautifulSoup or Tidy (I prefer BeautifulSoup).
The idea is simple: strip all the HTML tags first, then find the nth word (n=7 here). Count how many times that word appears among the first n words, because only its last occurrence should be used as the truncation point.

Here is a piece of code. It is a bit messy, but it works:

import re
from bs4 import BeautifulSoup  # BeautifulSoup 4; the original post used the BeautifulSoup 3 import
import tidy  # uTidylib

def remove_html_tags(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)

input_string = ('<p>This is some text with a <a href="http://www.example.com/" '
    'title="Example link">bit of linked text in it</a></p>')

n = 7
s = remove_html_tags(input_string).split(' ')[:n]
last_word = s[-1]

# The nth word may be something like 'a', 'and' or 'is' that occurs several
# times within the first n words, so count its occurrences and truncate
# after the last of them.
# Caveat: find() searches the raw HTML, so a word that also appears inside
# a tag or attribute (or inside a longer word) is matched there as well.
k = s.count(last_word)
end = 0
for _ in range(k):
    end = input_string.find(last_word, end) + len(last_word)

output_string = input_string[:end]

# Both libraries close the tags that the truncation left open.
print("\nBeautifulSoup\n" + str(BeautifulSoup(output_string, "html.parser")))
print("\nTidy\n" + str(tidy.parseString(output_string)))

The output is what you want:

BeautifulSoup
<p>This is some text with a <a href="http://www.example.com/" title="Example link">bit</a></p>

Tidy
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<meta name="generator" content=
"HTML Tidy for Linux/x86 (vers 6 November 2007), see www.w3.org">
<title></title>
</head>
<body>
<p>This is some text with a <a href="http://www.example.com/"
title="Example link">bit</a></p>
</body>
</html>

Hope this helps.

Edit: A better regex

`p = re.compile(r'<[^<]*?>')`
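
To see where the two patterns differ, here is a small sketch. The sample string is an assumption chosen for illustration: it contains a stray `<` in the text before a real tag.

```python
import re

lazy_any = re.compile(r'<.*?>')      # pattern from the original code
no_nested = re.compile(r'<[^<]*?>')  # pattern from the edit

text = 'a < b <b>bold</b>'

print(lazy_any.sub('', text))   # → a bold
print(no_nested.sub('', text))  # → a < b bold
```

The first pattern starts at the stray `<` and swallows everything up to the next `>`, losing part of the text. The second refuses to cross another `<`, so only the real tags are removed. Neither pattern copes with every case (a `>` inside an attribute value, for instance), which is one reason to prefer a real parser.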