Python PyQuery beautifulsoup

How can I modify this HTML page with Python?

Posted 2022-09-06 21:49:47, 2174 characters, 29 views, 0 comments

HTML of the original page:

<html xmlns="http://www.w3.org/1999/xhtml">
 <head> 
  <meta charset="utf-8" /> 
  <meta content="pdf2htmlEX" name="generator" /> 
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible" /> 
  <title></title> 
 </head> 
 <body>
  <div class="t m0 x0 h3 y45e ff2 fs1 fc1 sc0 ls0 ws102">
   abcd
   <span class="ws0">adc</span>
  </div>
  <div class="t m0 x0 h3 y45f ff2 fs1 fc1 sc0 ls0 wse">
   ab
  </div>
  <div class="t m0 xd5 hb y4be ff2 fs3 fc1 sc0 ls7 wse3">
   SUP
   <span class="_ _93"> </span>
   OUT
   <span class="_ _a1"> </span>
   OUT
  </div>
  <div class="t m0 xff h3 y4c1 ff2 fs1 fc1 sc0 ls5c ws10b">
   (V
   <span class="_ _54"> </span>
   V
   <span class="_ _a0">b<span class="_ _92">aa</span></span>
   V
  </div>
 </body>
</html>

I need a Python program that turns that HTML page into the following:

<html xmlns="http://www.w3.org/1999/xhtml">
 <head> 
  <meta charset="utf-8" /> 
  <meta content="pdf2htmlEX" name="generator" /> 
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible" /> 
  <title></title> 
 </head> 
 <body>
  <div class="t m0 x0 h3 y45e ff2 fs1 fc1 sc0 ls0 ws102">
   4
   <span class="ws0">3</span>
  </div>
  <div class="t m0 x0 h3 y45f ff2 fs1 fc1 sc0 ls0 wse">
   2
  </div>
  <div class="t m0 xd5 hb y4be ff2 fs3 fc1 sc0 ls7 wse3">
   3
   <span class="_ _93">1</span>
   3
   <span class="_ _a1">1</span>
   3
  </div>
  <div class="t m0 xff h3 y4c1 ff2 fs1 fc1 sc0 ls5c ws10b">
   2
   <span class="_ _54">1</span>
   2
   <span class="_ _a0">1<span class="_ _92">2</span></span>
   1
  </div>
 </body>
</html>

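Comparing the two pages, the goal is to replace every piece of text inside every tag with the length of that text (its number of characters), while leaving the original DOM structure and the tag attributes completely unchanged, and finally to save the result as a new page.

I cannot get this to work with BeautifulSoup no matter what I try. Is the requirement just too unusual? Any help would be appreciated. (The page above is only an example; the real pages are nested much more deeply, so hard-coding is pointless.)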

Comments (5)

计㈡愣 2022-09-13 21:49:47

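import re

with open('1.html', 'r', encoding='utf-8') as r:
    txt = r.read()

print(txt)  # the original HTML text


def replace(match):
    # t is the raw text between two tags; s is the same text without surrounding whitespace
    t, s = match.group(1), match.group(1).strip()
    # Swap the text for its length; whitespace-only gaps are left as they are
    return '>%s<' % (t.replace(s, str(len(s)), 1) if s else t)


txt1 = re.sub(r'>([\s\S]*?)<', replace, txt)

print(txt1)  # the converted HTML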

笑着哭最痛 2022-09-13 21:49:47

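import re

def f(m):
    s = m.group(1)
    stripped = s.strip()
    if not stripped:
        # Whitespace-only gap between two tags: leave it untouched
        return m.group(0)
    # Replace only the stripped text with its length, keeping the indentation
    return '>{}<'.format(s.replace(stripped, str(len(stripped)), 1))

p = re.compile(r'>(.*?)<', re.S)

with open('1.html', encoding='utf-8') as fh:  # assumed file name
    html = fh.read()
print(p.sub(f, html))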

寄与心 2022-09-13 21:49:47

Find an HTML parser, locate the text nodes in the parsed structure, and replace each one with the length of its text.
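The suggestion above can be sketched with the standard library's `html.parser` (not BeautifulSoup, which the question mentions). Every tag is copied through verbatim via `get_starttag_text()`, so attributes stay untouched, and only text runs are swapped for their length. The names `TextLengthRewriter` and `rewrite_html` are made up for this illustration, and the rule that whitespace forming an element's only content (such as `<span> </span>`) counts as one piece of text is an assumption inferred from the expected output in the question.

```python
from html.parser import HTMLParser


class TextLengthRewriter(HTMLParser):
    """Copy tags through verbatim; swap each text run for its length."""

    RAW_TEXT = ("script", "style")  # their contents are code, not page text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # pieces of the rewritten page
        self.prev = None     # kind of the previous parser event
        self.in_raw = False  # True while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # original tag text, attributes untouched
        self.prev = "start"
        if tag in self.RAW_TEXT:
            self.in_raw = True

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <meta ... />
        self.prev = "other"

    def handle_endtag(self, tag):
        # Assumption: whitespace that is an element's only content counts as text.
        if self.prev == "blank_after_start":
            self.out[-1] = str(len(self.out[-1]))
        self.out.append("</%s>" % tag)
        self.prev = "other"
        if tag in self.RAW_TEXT:
            self.in_raw = False

    def handle_data(self, data):
        core = data.strip()
        if self.in_raw:
            self.out.append(data)
            self.prev = "other"
        elif core:
            # Keep the surrounding indentation, replace only the visible text.
            self.out.append(data.replace(core, str(len(core)), 1))
            self.prev = "other"
        else:
            self.out.append(data)
            self.prev = "blank_after_start" if self.prev == "start" else "other"

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)
        self.prev = "other"

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)
        self.prev = "other"


def rewrite_html(html):
    parser = TextLengthRewriter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

To save the result as a new page, the returned string can be written out with something like `open('2.html', 'w', encoding='utf-8').write(rewrite_html(txt))`, where both the file name and `txt` are placeholders. End tags are re-emitted in lower case, which matches the example page but may differ from unusual input.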

清旖 2022-09-13 21:49:47

Parse it recursively and rebuild the page, using lxml: http://lxml.de/
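A rough sketch of the recursive approach follows. It uses the standard library's `xml.etree.ElementTree` rather than lxml so that it runs without extra installs; `lxml.etree` follows the same model, in which an element's leading text lives in `.text` and the text after a child lives in that child's `.tail`. It assumes the page is well-formed XHTML like the example (a real page with `&nbsp;` or unclosed tags would need a more forgiving parser), and the helper names `count_text`, `rewrite` and `rewrite_page` are invented for illustration. Treating whitespace that is an element's only content as text of length 1 is an assumption taken from the expected output in the question.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
ET.register_namespace("", XHTML_NS)  # serialise as <div>, not <ns0:div>


def count_text(value, sole_content=False):
    """Return value with its stripped text replaced by that text's length."""
    if not value:
        return value  # None or "": nothing to count
    core = value.strip()
    if core:
        # Keep the surrounding indentation, swap only the visible text.
        return value.replace(core, str(len(core)), 1)
    # Assumption: whitespace that is an element's only content counts as text.
    return str(len(value)) if sole_content else value


def rewrite(elem):
    """Recursively rewrite elem.text and the tail text after each child."""
    local_name = elem.tag.rsplit("}", 1)[-1]
    if local_name not in ("script", "style"):  # leave code and CSS alone
        elem.text = count_text(elem.text, sole_content=len(elem) == 0)
    for child in elem:
        rewrite(child)
        child.tail = count_text(child.tail)


def rewrite_page(xhtml):
    root = ET.fromstring(xhtml)
    rewrite(root)
    return ET.tostring(root, encoding="unicode")
```

Note that serialising through ElementTree drops any DOCTYPE and may write empty elements in a different form (for example `<title />`), so this route rebuilds the markup rather than copying it byte for byte.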

请叫√我孤独 2022-09-13 21:49:47

I'd suggest JavaScript; it could hardly be simpler.

Open the browser's developer tools (F12) and paste the code below into the console.

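function walk(node, fn) {
    if (node) do {
        fn(node);
        walk(node.firstChild, fn);
    } while (node = node.nextSibling);
}

walk(document.body, function (node) {
    // Only text nodes (nodeType 3) carry a nodeValue; for elements it is null.
    // Whitespace-only text nodes are skipped so the layout stays intact.
    if (node.nodeType === 3 && node.nodeValue.trim() !== "") {
        console.log(node.nodeValue);
        node.nodeValue = String(node.nodeValue.trim().length);
    }
});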
