Extract data from a website list without the superfluous tags
Working code: a Google dictionary lookup via Python and Beautiful Soup -> simply execute and enter a word.
I've quite simply extracted the first definition from a specific list item. However, to get the plain data, I've had to split the data at the line break and then strip it to remove the superfluous list tag.
My question is: is there a way to extract the data contained within a specific list without doing the above string manipulation - perhaps a function in Beautiful Soup that I have yet to see?
This is the relevant section of code:
# Retrieve HTML and parse with BeautifulSoup.
doc = userAgentSwitcher().open(queryURL).read()
soup = BeautifulSoup(doc)
# Extract the first list item -> and encode it.
definition = soup('li', limit=2)[0].encode('utf-8')
# Format the return as word:definition removing superfluous data.
print word + " : " + definition.split("<br />")[0].strip("<li>")
Comments (1)
I think you are looking for findAll(text=True); this will extract the text from the tags. It will return a list of all the text content, broken at the tag boundaries.
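The suggestion above can be sketched as follows. This is a minimal, self-contained example using the modern bs4 package, where the `text=True` argument is named `string=True` (the older spelling still works as an alias); the HTML snippet is a made-up stand-in for the fetched dictionary page, not Google's actual markup:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the fetched dictionary page.
html = ("<ul>"
        "<li>a domesticated carnivorous mammal<br />related forms</li>"
        "<li>second item</li>"
        "</ul>")

soup = BeautifulSoup(html, "html.parser")
li = soup.find("li")

# find_all(string=True) returns the bare text nodes inside the tag,
# split at tag boundaries -- no markup left to strip off by hand.
texts = li.find_all(string=True)
print(texts)  # two text nodes, split at the <br /> tag

# The first text node is the definition itself, with no <li> or <br />
# to split or strip away:
definition = texts[0]
print(definition)
```

This replaces both the `split("<br />")` and the `strip("<li>")` from the question: the parser, rather than string manipulation, separates the text from the markup.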