Node.toprettyxml() adds newlines to DOCTYPE in Python

Posted on 2024-12-23 20:00:46

When using prettify my DOCTYPE is broken into three lines. How can I keep it on one line?

The "broken" output:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE smil
  PUBLIC '-//W3C//DTD SMIL 2.0//EN'
  'http://www.w3.org/2001/SMIL20/SMIL20.dtd'>
<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
  <head>
    <meta base="rtmp://cp23636.edgefcs.net/ondemand"/>
  </head>
  <body>
    <switch>
      <video src="mp4:soundcheck/1/clay_aiken/02_sc_ca_sorry_256.mp4" system-bitrate="336000"/>
      <video src="mp4:soundcheck/1/clay_aiken/02_sc_ca_sorry_512.mp4" system-bitrate="592000"/>
      <video src="mp4:soundcheck/1/clay_aiken/02_sc_ca_sorry_768.mp4" system-bitrate="848000"/>
      <video src="mp4:soundcheck/1/clay_aiken/02_sc_ca_sorry_1128.mp4" system-bitrate="1208000"/>
    </switch>
  </body>
</smil>

The script:

import csv
import sys
import os.path

from xml.etree import ElementTree
from xml.etree.ElementTree import Element, SubElement, Comment, tostring

from xml.dom import minidom

def prettify(doctype, elem):
    """Return a pretty-printed XML string for the Element.
    """
    rough_string = doctype + ElementTree.tostring(elem, 'utf-8')
    reparsed = minidom.parseString(rough_string)
    return reparsed.toprettyxml(indent="  ", encoding = 'utf-8')

doctype = '<!DOCTYPE smil PUBLIC "-//W3C//DTD SMIL 2.0//EN" "http://www.w3.org/2001/SMIL20/SMIL20.dtd">'

video_data = ((256, 336000),
              (512, 592000),
              (768, 848000),
              (1128, 1208000))


with open(sys.argv[1], 'rU') as f:
    reader = csv.DictReader(f)
    for row in reader:
        root = Element('smil')
        root.set('xmlns', 'http://www.w3.org/2001/SMIL20/Language')
        head = SubElement(root, 'head')
        meta = SubElement(head, 'meta base="rtmp://cp23636.edgefcs.net/ondemand"')
        body = SubElement(root, 'body')

        switch_tag = ElementTree.SubElement(body, 'switch')

        for suffix, bitrate in video_data:
            attrs = {'src': ("mp4:soundcheck/{year}/{id}/{file_root_name}_{suffix}.mp4"
                             .format(suffix=str(suffix), **row)),
                     'system-bitrate': str(bitrate),
                     }
            ElementTree.SubElement(switch_tag, 'video', attrs)

        file_root_name = row["file_root_name"]
        year = row["year"]
        id = row["id"]
        path = year+'-'+id

        file_name = row['file_root_name']+'.smil'
        full_path = os.path.join(path, file_name)
        output = open(full_path, 'w')
        output.write(prettify(doctype, root))

Comments (3)

孤星 2024-12-30 20:00:46

Having looked over your current script and the other questions you've asked on this subject, I think you could make your life a lot simpler by building your SMIL files using string manipulation.

Almost all the XML in your files is static. The only data you need to worry about processing correctly is the attribute values for the video tag. For that, there is a handy function in the standard library that does exactly what you want: xml.sax.saxutils.quoteattr.
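
As a quick illustration of what quoteattr does, here is a minimal sketch. The helper name video_tag and the file name containing an ampersand are hypothetical, chosen only to show the escaping; they are not part of the original script.

```python
from xml.sax.saxutils import quoteattr


def video_tag(src, bitrate):
    """Build one <video/> element with safely quoted attribute values.

    video_tag is a hypothetical helper used only for this illustration.
    """
    return "<video src=%s system-bitrate=%s/>" % (
        quoteattr(src), quoteattr(str(bitrate)))


# The '&' in this made-up file name is escaped, and the surrounding
# quotes are added by quoteattr itself.
print(video_tag("mp4:soundcheck/1/a&b_256.mp4", 336000))
# prints: <video src="mp4:soundcheck/1/a&amp;b_256.mp4" system-bitrate="336000"/>
```

Because quoteattr supplies the quote characters, the template must not add its own quotes around the placeholders.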

So, with those points in mind, here is a script that should be a lot easier to work with:

import sys, os, csv
from xml.sax.saxutils import quoteattr

smil_header = '''\
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE smil PUBLIC "-//W3C//DTD SMIL 2.0//EN" "http://www.w3.org/2001/SMIL20/SMIL20.dtd">
<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
  <head>
    <meta base="rtmp://cp23636.edgefcs.net/ondemand"/>
  </head>
  <body>
    <switch>
'''
smil_video = '''\
      <video src=%s system-bitrate=%s/>
'''
smil_footer = '''\
    </switch>
  </body>
</smil>
'''

src_format = 'mp4:soundcheck/%(year)s/%(id)s/%(file_root_name)s_%(suffix)s.mp4'

video_data = (
    ('256', '336000'), ('512', '592000'),
    ('768', '848000'), ('1128', '1208000'),
    )

root = os.getcwd()
if len(sys.argv) > 2:
    root = sys.argv[2]

with open(sys.argv[1], newline='') as stream:

    for row in csv.DictReader(stream):
        smil = [smil_header]
        for suffix, bitrate in video_data:
            row['suffix'] = suffix
            # Quote after substitution so the CSV values themselves get escaped.
            smil.append(smil_video % (
                quoteattr(src_format % row), quoteattr(bitrate)
                ))
        smil.append(smil_footer)

        directory = os.path.join(root, '%(year)s-%(id)s' % row)
        # Create the output directory if it does not exist yet.
        os.makedirs(directory, exist_ok=True)
        path = os.path.join(directory, '%(file_root_name)s.smil' % row)
        print(':: writing file:', path)
        # Use a separate name so the CSV stream is not shadowed.
        with open(path, 'w', encoding='utf-8') as output:
            output.write(''.join(smil))

開玄 2024-12-30 20:00:46

I think you have at least three options:

  1. Just accept the newlines. They may be undesired and ugly but they are perfectly legal.

  2. Add a kludge that replaces the bad DOCTYPE with a nicer one. Perhaps something like this:

    import re
    
    pretty_xml = prettify(doctype, elem)
    m = re.search(r"<!DOCTYPE.*?>", pretty_xml, re.DOTALL)
    ugly_doctype = m.group()
    fixed_xml = pretty_xml.replace(ugly_doctype, doctype)
    
  3. Use a more feature-rich XML package. lxml springs to mind; it is mostly compatible with ElementTree. By using lxml's tostring function, you won't need the prettify function and the DOCTYPE comes out as you want it. Example:

    from lxml import etree 
    
    doctype = '<!DOCTYPE smil PUBLIC "-//W3C//DTD SMIL 2.0//EN" "http://www.w3.org/2001/SMIL20/SMIL20.dtd">'
    
    XML = '<smil xmlns="http://www.w3.org/2001/SMIL20/Language"><head><meta base="rtmp://cp23636.edgefcs.net/ondemand"/></head><body><switch><video src="mp4:soundcheck/1/clay_aiken/02_sc_ca_sorry_256.mp4" system-bitrate="336000"/><video src="mp4:soundcheck/1/clay_aiken/02_sc_ca_sorry_512.mp4" system-bitrate="592000"/><video src="mp4:soundcheck/1/clay_aiken/02_sc_ca_sorry_768.mp4" system-bitrate="848000"/><video src="mp4:soundcheck/1/clay_aiken/02_sc_ca_sorry_1128.mp4" system-bitrate="1208000"/></switch></body></smil>'
    
    elem = etree.fromstring(XML)
    # tostring() returns bytes when an encoding is given, so decode before printing.
    print(etree.tostring(elem, doctype=doctype, pretty_print=True,
                         xml_declaration=True, encoding="utf-8").decode("utf-8"))
    

    Output:

    <?xml version='1.0' encoding='utf-8'?>
    <!DOCTYPE smil PUBLIC "-//W3C//DTD SMIL 2.0//EN" "http://www.w3.org/2001/SMIL20/SMIL20.dtd">
    <smil xmlns="http://www.w3.org/2001/SMIL20/Language">
      <head>
        <meta base="rtmp://cp23636.edgefcs.net/ondemand"/>
      </head>
      <body>
        <switch>
          <video src="mp4:soundcheck/1/clay_aiken/02_sc_ca_sorry_256.mp4" system-bitrate="336000"/>
          <video src="mp4:soundcheck/1/clay_aiken/02_sc_ca_sorry_512.mp4" system-bitrate="592000"/>
          <video src="mp4:soundcheck/1/clay_aiken/02_sc_ca_sorry_768.mp4" system-bitrate="848000"/>
          <video src="mp4:soundcheck/1/clay_aiken/02_sc_ca_sorry_1128.mp4" system-bitrate="1208000"/>
        </switch>
      </body>
    </smil>
    
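The post-processing idea from option 2 can also be sketched as a variation that collapses the whitespace inside the DOCTYPE instead of substituting the original string. This is only a sketch under stated assumptions: it targets Python 3, the function name prettify_one_line_doctype is hypothetical, and it assumes the DOCTYPE has no internal subset and no '>' inside its quoted identifiers.

```python
import re
from xml.dom import minidom

# Matches a DOCTYPE declaration without an internal subset ("[ ... ]").
DOCTYPE_RE = re.compile(r"<!DOCTYPE[^>\[]*>")


def prettify_one_line_doctype(xml_text, indent="  "):
    """Pretty-print xml_text with minidom, then join the DOCTYPE onto one line.

    Hypothetical helper: a sketch of the post-processing approach,
    not part of minidom itself.
    """
    pretty = minidom.parseString(xml_text).toprettyxml(indent=indent)
    # Collapse every run of whitespace inside the declaration to one space.
    return DOCTYPE_RE.sub(lambda m: " ".join(m.group(0).split()),
                          pretty, count=1)


sample = ('<!DOCTYPE smil PUBLIC "-//W3C//DTD SMIL 2.0//EN" '
          '"http://www.w3.org/2001/SMIL20/SMIL20.dtd">'
          '<smil><head/></smil>')
print(prettify_one_line_doctype(sample))
```

Note that minidom writes the identifiers with single quotes, so the resulting line is equivalent to the original DOCTYPE string but not byte-identical to it.
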
咽泪装欢 2024-12-30 20:00:46

I don't believe it's possible to remove the newlines generated by Node.toprettyxml for the DOCTYPE, at least in a Pythonic way.

The offending newlines are inserted by the writexml method of the DocumentType class, which starts at line 1284 of the minidom module. The newline string comes originally from the Node.toprettyxml method and is passed via the writexml method of the Document class. The same newline string is also passed to the writexml method of various other subclasses of Node, so changing the newline string in the call to Node.toprettyxml changes the newline string used throughout the output XML.

There are various hacky ways around this: modify your local copy of the minidom module, 'monkey-patch' the writexml method of the DocumentType class, or post-process the XML string to remove the unwanted newlines. However, none of these approaches appeals to me.
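
For completeness, the monkey-patch route mentioned above might look roughly like the sketch below. It is not a recommendation and rests on assumptions: the replacement function name is hypothetical, it targets Python 3, it changes minidom for the whole process, and it assumes the system identifier contains no double quote.

```python
from xml.dom import minidom


def _writexml_single_line(self, writer, indent="", addindent="", newl=""):
    """Write the DOCTYPE declaration on one line.

    Hypothetical replacement for DocumentType.writexml: newl is only
    emitted after the closing '>', never inside the declaration.
    """
    writer.write("<!DOCTYPE " + self.name)
    if self.publicId:
        writer.write(' PUBLIC "%s" "%s"' % (self.publicId, self.systemId))
    elif self.systemId:
        writer.write(' SYSTEM "%s"' % self.systemId)
    if self.internalSubset is not None:
        writer.write(" [" + self.internalSubset + "]")
    writer.write(">" + newl)


# Global patch: every minidom document serialized afterwards is affected.
minidom.DocumentType.writexml = _writexml_single_line

doc = minidom.parseString(
    '<!DOCTYPE smil PUBLIC "-//W3C//DTD SMIL 2.0//EN" '
    '"http://www.w3.org/2001/SMIL20/SMIL20.dtd"><smil><head/></smil>')
print(doc.toprettyxml(indent="  "))
```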

To me, the best approach seems to be to leave things as they are. Is it really a serious problem having the DOCTYPE split over multiple lines?
