Python HTML processing

Posted 2025-01-04 00:54:05, 298 characters, 0 views, 0 comments

I have an HTML file with Russian text. How can I get all the words in the text, without HTML tags, special symbols, etc.?

Example:

<html>...<body>...<div id='text'>Foo bar! Foo, bar.</div></body></html>

I need:

['foo','bar','Foo','bar']

I tried NLTK, but it does not support Russian words.

み零 2025-01-11 00:54:05

Definitely try BeautifulSoup; it supports Unicode.
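A minimal sketch of this suggestion, assuming Python 3 and that the third-party `beautifulsoup4` package is installed. The helper name `extract_words` is hypothetical, and the sample markup follows the question's example with Russian words substituted for illustration.

```python
import re

from bs4 import BeautifulSoup


def extract_words(html):
    """Return the words found in the visible text of an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop elements whose contents are code or styling, not text.
    for element in soup(["script", "style"]):
        element.decompose()
    text = soup.get_text(separator=" ")
    # In Python 3, \w matches Unicode word characters, so Cyrillic words are kept.
    return re.findall(r"\w+", text)


sample = "<html><body><div id='text'>Привет мир! Привет, мир.</div></body></html>"
words = extract_words(sample)
```

Here `words` holds the four Russian words in document order, with the punctuation removed. Note that `\w+` also keeps digits and underscores; a stricter pattern would be needed if only letters are wanted.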

蹲墙角沉默 2025-01-11 00:54:05

I use the lxml library to parse XML/HTML. lxml works with any Unicode data.

两仪 2025-01-11 00:54:05

Use lxml. It can strip tags, elements, and more:

from urllib.request import urlopen  # Python 3; the original used urllib2 (Python 2)

from lxml import etree


URL = 'http://stackoverflow.com/questions/9230675/python-html-processing'

# Fetch the raw bytes and let lxml's HTML parser build the tree.
html = urlopen(URL).read()
tree = etree.fromstring(html, parser=etree.HTMLParser())

tree.xpath('//script')
# [<Element script at 102f831b0>,
#  ...
#  <Element script at 102f83ba8>]

tree.xpath('//style')
# [<Element style at 102f83c58>]

# Remove <script> and <style> elements together with their contents.
# with_tail=False keeps any text that follows a removed element.
tags_to_strip = ['script', 'style']
etree.strip_elements(tree, *tags_to_strip, with_tail=False)

tree.xpath('//style')
# []

tree.xpath('//script')
# []

body = tree.xpath('//body')
body = body[0]

# Join every text node under <body>, then split on whitespace.
text = ' '.join(body.itertext())
tokens = text.split()
# ['Stack',
#  'Exchange',
#  'log',
#  'in',
#  ...
#  'Stack',
#  'Overflow',
#  'works',
#  'best',
#  'with',
#  'JavaScript',
#  'enabled']

With Russian text you get tokens that look like this (the escapes are the UTF-8 bytes of «эффекты…», «Марк», and «Майер»):

# [u'\xd1\x8d\xd1\x84\xd1\x84\xd0\xb5\xd0\xba\xd1\x82\xd1\x8b\xe2\x80\xa6',
#  u'\xd0\x9c\xd0\xb0\xd1\x80\xd0\xba',
#  ...
#  u'\xd0\x9c\xd0\xb0\xd0\xb9\xd0\xb5\xd1\x80']

Error handling is left as an exercise.
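Note that `text.split()` only splits on whitespace, so punctuation stays attached to the tokens. A minimal sketch of one way to keep only the words, assuming Python 3 and text that has already been decoded to `str`. The name `tokenize_words` and the sample string are illustrative and not part of the original answer.

```python
import re

# [^\W\d_] matches any Unicode letter: a word character that is
# neither a digit nor an underscore.
WORD_RE = re.compile(r"[^\W\d_]+")


def tokenize_words(text):
    """Return runs of letters from text, dropping punctuation and digits."""
    return WORD_RE.findall(text)


tokens = tokenize_words("Foo bar! Foo, bar. эффекты… Марк Майер, 2012")
# tokens == ['Foo', 'bar', 'Foo', 'bar', 'эффекты', 'Марк', 'Майер']
```

Because Python 3 regular expressions are Unicode-aware by default, the same pattern handles Latin and Cyrillic text without extra flags.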

雅心素梦 2025-01-11 00:54:05

Use a regex to remove the tags. NLTK is about language analysis (nouns vs. verbs) and word meaning (semantics), not string removal and pattern matching, although I can see how someone may be confused.

Here is a removal function using a regex:

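import re

# Non-greedy match for anything between '<' and '>'; compiled once at import.
TAG_RE = re.compile(r'<.*?>')

def remove_html_tags(data):
    """Return data with everything that looks like an HTML tag removed."""
    return TAG_RE.sub('', data)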
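A regular expression like the one above removes the tags but keeps the contents of `<script>` and `<style>` elements, and it leaves entities such as `&amp;` undecoded. As an alternative sketch that needs no third-party packages, Python 3's standard `html.parser` module can collect only the visible text. The names `TextExtractor` and `html_to_text` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is outside <script> and <style>.
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><script>var x = 1;</script>"
    "<div id='text'>Привет &amp; мир!</div></body></html>"
)
text = html_to_text(sample)
```

The resulting string can then be split into words with `re.findall(r"\w+", text)`, which handles Cyrillic letters in Python 3.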
