Removing strings in brackets from unicode lines - python

Posted on 2024-11-17 11:20:52

I've got some problems with my regex for removing the strings bounded by brackets.

Here's my code:

import sys, re
import codecs

reload(sys)
sys.setdefaultencoding('utf-8')

reader = codecs.open("input",'r','utf-8')
p = re.compile('s/[\[\(].+?[\]\)]//g', re.DOTALL)
# I've also tried several other regexes, but they didn't work
# p = re.compile('\{\{*?.*?\}\}', re.DOTALL)
# p = re.compile('\{\{*.*?\}\}', re.DOTALL)

for row in reader:
    if ("(" in row) and (")" not in row):
        continue
    if row.count("(") != row.count(")"):
        continue
    else:
        row2 = p.sub('', row)
        print row2

The input text file looks something like this:

가시 돋친(신랄한)평 spinosity
가장 완전한 (같은 종류의 것 중에서)   unabridged
(알코올이)표준강도(50%) 이하의 underproof
(암초 awash
치명적인(fatal) capital
열을) 전도하다    transmit

The required output should look like this:

가시 돋친평  spinosity
가장 완전한  unabridged
표준강도 이하의    underproof
치명적인    capital


Comments (1)

冰雪梦之恋 2024-11-24 11:20:52

Would this work for you?

# -*- coding: utf-8 -*-
import sys, re
import codecs

#reload(sys)
#sys.setdefaultencoding('utf-8')

# Prepare the examples to work on
writer = codecs.open("input.txt",'w','utf-8')
examples = [u'가시 돋친(신랄한)평 spinosity',
            u'가장 완전한 (같은 종류의 것 중에서)',
            u'알코올이)표준강도(50%) 이하의 underproof',
            u'(암초 awash',
            u'치명적인(fatal) capital']
for exampl in examples:
    writer.write(exampl+"\n")
writer.write(exampl)
writer.close()

reader = codecs.open("input.txt",'r','utf-8')

# The order of the patterns matters: if the lone brackets are removed first,
# the first pattern will no longer find anything to match
patterns_to_remove = [r"\(.{1,}\)",r"[\(\)]"]

# A single combined pattern would work just as well; the loop is just a bit clearer
#pat = r"(\(.{1,}\))|([\(\)])"    
#for row in reader:
#    row = re.sub(pat,'',row)#,re.U)
#    print row

reader.seek(0)
for row in reader:
    for pat in patterns_to_remove:
        row = re.sub(pat,'',row)#,re.U)
    print row
reader.close()
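
For comparison, here is a minimal sketch aimed at the output the question asks for. It rests on a few assumptions: it is written for Python 3 (so no `reload(sys)` or `setdefaultencoding` is needed), the sample rows are inlined instead of read from a file, and `strip_bracketed` is a hypothetical helper name, not something from the original post. Two points it tries to illustrate: the `s/.../g` wrapper in the question is sed/Perl substitution syntax, which Python's `re.compile` treats as literal characters to match, and a non-greedy `.+?` removes each bracketed group separately, whereas a greedy `.{1,}` would swallow everything between the first `(` and the last `)` on a line.

```python
import re

# One bracketed group: an opening "(" or "[", at least one character
# (non-greedy), then a closing ")" or "]".
BRACKETED = re.compile(r"[\[(].+?[\])]")


def strip_bracketed(line):
    """Return the line with bracketed groups removed.

    Returns None when the round parentheses are unbalanced, so the caller
    can drop the line, as the question's loop does with `continue`.
    """
    if line.count("(") != line.count(")"):
        return None
    return BRACKETED.sub("", line)


# Sample rows copied from the question.
rows = [
    "가시 돋친(신랄한)평 spinosity",
    "가장 완전한 (같은 종류의 것 중에서)   unabridged",
    "(알코올이)표준강도(50%) 이하의 underproof",
    "(암초 awash",
    "치명적인(fatal) capital",
    "열을) 전도하다    transmit",
]

cleaned = [c for c in map(strip_bracketed, rows) if c is not None]
for c in cleaned:
    print(c)
```

This keeps four of the six rows and leaves the surrounding whitespace untouched, so the spacing may not match the required output exactly. It is a simplification: nested brackets are not handled, the balance check only counts round parentheses, and a mixed pair such as `[abc)` would also be removed.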