How to tokenize a sentence in XML and create new child nodes?
I have XML which looks like this:
<para id="0">
    <se lang="hi">काकेशिया में तब लड़ाई</se>
    <se lang="ru">потом боевые действия на Кавказе</se>
</para>
<para id="1">
    ...
</para>
<para id="2">
    ...
</para>
I want to tokenize the Devanagari text using the iNLTK library and get a file that looks like this:
<para id="0">
    <se lang="hi">
        <w>काकेशिया</w>
        <w>में</w>
        <w>तब</w>
        <w>लड़ाई</w>
    </se>
    <se lang="ru">потом боевые действия на Кавказе</se>
</para>
<para id="1">
    ...
</para>
<para id="2">
    ...
</para>
I understand how to tokenize the sentence:
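from inltk.inltk import tokenize

# body is an xml.dom.minidom node that contains the <para> elements
paras = body.getElementsByTagName('para')
for para in paras:
    # In the sample above the Hindi <se> is the first one, so index 0 (not 1)
    devanagari = para.getElementsByTagName('se')[0].childNodes[0].nodeValue
    print(tokenize(devanagari, 'hi'))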
What I don't know is how to create a <w>...</w> child node for each word and write the result back into the XML.
How can I do that using xml.etree.ElementTree?
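For reference, here is a minimal sketch of one way this could be done with xml.etree.ElementTree. It makes several assumptions that are not in the question: the <para> elements are wrapped in a root element called <body>, the helper names wrap_words and split_words are made up for illustration, and split_words is a plain whitespace splitter standing in for iNLTK's tokenize so that the example runs without downloading a model. iNLTK's tokenize may return subword pieces rather than whole words, so check its output before swapping it in.

```python
import xml.etree.ElementTree as ET

# Assumption: the <para> elements live under a single root element named <body>.
SAMPLE = """<body>
<para id="0">
<se lang="hi">काकेशिया में तब लड़ाई</se>
<se lang="ru">потом боевые действия на Кавказе</se>
</para>
</body>"""


def split_words(text, lang):
    # Hypothetical stand-in tokenizer: splits on whitespace only.
    # It takes (text, lang) so that iNLTK's tokenize could be passed instead.
    return text.split()


def wrap_words(root, lang="hi", tokenizer=split_words):
    """Replace the text of every <se lang=...> with one <w> child per token."""
    # findall returns a list, so adding children inside the loop is safe.
    for se in root.findall(f".//se[@lang='{lang}']"):
        if not se.text:
            continue
        words = tokenizer(se.text, lang)
        se.text = None  # drop the original sentence text
        for word in words:
            w = ET.SubElement(se, "w")  # append <w> as the last child of <se>
            w.text = word
    return root


root = ET.fromstring(SAMPLE)
wrap_words(root)
result = ET.tostring(root, encoding="unicode")
print(result)
```

To work on a file instead of a string, the tree could be loaded with ET.parse(path), its root passed to wrap_words, and the result saved with tree.write(out_path, encoding="utf-8", xml_declaration=True). The output is not pretty-printed; on Python 3.9 or later, ET.indent(root) can add indentation before writing.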