Extracting information from large structured text files
I need to read some large files (from 50k to 100k lines), structured in groups separated by empty lines. Each group starts with the same pattern, "No.999999999 dd/mm/yyyy ZZZ". Here's some sample data.
No.813829461 16/09/1987 270
Tit.SUZANO PAPEL E CELULOSE S.A. (BR/BA)
C.N.P.J./C.I.C./N INPI : 16404287000155
Procurador: MARCELLO DO NASCIMENTO

No.815326777 28/12/1989 351
Tit.SIGLA SISTEMA GLOBO DE GRAVACOES AUDIO VISUAIS LTDA (BR/RJ)
C.N.P.J./C.I.C./NºINPI : 34162651000108
Apres.: Nominativa ; Nat.: De Produto
Marca: TRIO TROPICAL
Clas.Prod/Serv: 09.40
*DEFERIDO CONFORME RESOLUÇÃO 123 DE 06/01/2006, PUBLICADA NA RPI 1829, DE 24/01/2006.
Procurador: WALDEMAR RODRIGUES PEDRA

No.900148764 11/01/2007 LD3
Tit.TIARA BOLSAS E CALÇADOS LTDA
Procurador: Marcia Ferreira Gomes
*Escritório: Marcas Marcantes e Patentes Ltda
*Exigência Formal não respondida Satisfatoriamente, Pedido de Registro de Marca considerado inexistente, de acordo com Art. 157 da LPI
*Protocolo da Petição de cumprimento de Exigência Formal: 810080140197
I wrote some code that parses it accordingly. Is there anything I can improve, for readability or performance? Here's what I have so far:
import re, pprint

class Despacho(object):
    """
    Class to parse each line, applying the regexp and storing the results
    for future use
    """
    regexp = {
        re.compile(r'No.([\d]{9}) ([\d]{2}/[\d]{2}/[\d]{4}) (.*)'): lambda self: self._processo,
        re.compile(r'Tit.(.*)'): lambda self: self._titular,
        re.compile(r'Procurador: (.*)'): lambda self: self._procurador,
        re.compile(r'C.N.P.J./C.I.C./N INPI :(.*)'): lambda self: self._documento,
        re.compile(r'Apres.: (.*) ; Nat.: (.*)'): lambda self: self._apresentacao,
        re.compile(r'Marca: (.*)'): lambda self: self._marca,
        re.compile(r'Clas.Prod/Serv: (.*)'): lambda self: self._classe,
        re.compile(r'\*(.*)'): lambda self: self._complemento,
    }

    def __init__(self):
        """
        'complemento' is the only field that can be multiple in a single registry
        """
        self.complemento = []

    def _processo(self, matches):
        self.processo, self.data, self.despacho = matches.groups()

    def _titular(self, matches):
        self.titular = matches.group(1)

    def _procurador(self, matches):
        self.procurador = matches.group(1)

    def _documento(self, matches):
        self.documento = matches.group(1)

    def _apresentacao(self, matches):
        self.apresentacao, self.natureza = matches.groups()

    def _marca(self, matches):
        self.marca = matches.group(1)

    def _classe(self, matches):
        self.classe = matches.group(1)

    def _complemento(self, matches):
        self.complemento.append(matches.group(1))

    def read(self, line):
        for pattern in Despacho.regexp:
            m = pattern.match(line)
            if m:
                Despacho.regexp[pattern](self)(m)

def process(rpi):
    """
    read data and process each group
    """
    rpi = (line for line in rpi)
    group = False

    for line in rpi:
        if line.startswith('No.'):
            group = True
            d = Despacho()

        if not line.strip() and group: # empty line - end of block
            yield d
            group = False

        d.read(line)

arquivo = open('rm1972.txt') # file to process
for desp in process(arquivo):
    pprint.pprint(desp.__dict__)
    print('--------------')
That is pretty good. Below are some suggestions; let me know if you like them:
It would be easier to help if you had a specific concern. Performance will depend greatly on the efficiency of the particular regex engine you are using. 100K lines in a single file doesn't sound that big, but again it all depends on your environment.
I use Expresso in my .NET development to test expressions for accuracy and performance.
A Google search turned up Kodos, a GUI Python regex authoring tool.
It looks good overall, but why do you have the line:
    rpi = (line for line in rpi)
You can already iterate over the file object without this intermediate step.
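To illustrate that point, here is a minimal sketch of a grouping generator that loops over the file object directly. It is not the asker's code: the name `iter_groups` is hypothetical, it yields plain lists of lines rather than `Despacho` objects, and it also emits the last group when the file does not end with a blank line, which is an assumption about the desired behaviour.

```python
import io


def iter_groups(lines):
    """Yield each blank-line-separated group as a list of stripped lines."""
    group = []
    for line in lines:          # a file object or any iterable of strings
        line = line.strip()
        if line:
            group.append(line)
        elif group:             # a blank line closes the current group
            yield group
            group = []
    if group:                   # last group when there is no trailing blank line
        yield group


# io.StringIO stands in for open('rm1972.txt') so the sketch runs without the file.
sample = io.StringIO(
    "No.813829461 16/09/1987 270\n"
    "Tit.SUZANO PAPEL E CELULOSE S.A. (BR/BA)\n"
    "\n"
    "No.900148764 11/01/2007 LD3\n"
    "Tit.TIARA BOLSAS E CALÇADOS LTDA\n"
)
groups = list(iter_groups(sample))
```

With a real file, `with open('rm1972.txt') as arquivo:` around the loop would also make sure the file gets closed.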
I wouldn't use regex here. If you know that your lines will be starting with fixed strings, why not check those strings and write a logic around it?
Consider the above code as just the pseudocode.
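The answer's original snippet is not preserved here. As a rough sketch of what a prefix-based approach might look like, the code below uses only `str.startswith` and slicing. The `PREFIXES` table and the `parse_group` function are hypothetical names; the prefixes themselves are taken from the sample data in the question, and only a subset of the fields is covered.

```python
# Hypothetical prefix table: (line prefix, field name). The prefixes come from
# the sample data; the table itself is an illustration, not the answer's code.
PREFIXES = [
    ("Tit.", "titular"),
    ("Procurador:", "procurador"),
    ("Marca:", "marca"),
    ("Clas.Prod/Serv:", "classe"),
]


def parse_group(lines):
    """Turn the lines of one group into a dict using plain string checks."""
    record = {"complemento": []}
    for line in lines:
        if line.startswith("No."):
            # "No.999999999 dd/mm/yyyy ZZZ" -> three whitespace-separated parts
            processo, data, despacho = line[3:].split(None, 2)
            record.update(processo=processo, data=data, despacho=despacho)
        elif line.startswith("*"):
            # '*' lines may repeat within a group, so they accumulate in a list
            record["complemento"].append(line[1:].strip())
        else:
            for prefix, field in PREFIXES:
                if line.startswith(prefix):
                    record[field] = line[len(prefix):].strip()
                    break
    return record


record = parse_group([
    "No.900148764 11/01/2007 LD3",
    "Tit.TIARA BOLSAS E CALÇADOS LTDA",
    "Procurador: Marcia Ferreira Gomes",
    "*Escritório: Marcas Marcantes e Patentes Ltda",
])
```

Unlike the regex version, this sketch does not validate the header format: a "No." line with fewer than three parts would raise a `ValueError` on unpacking.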
Another version with only one combined regular expression:
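The code that accompanied this answer is missing from this copy. One possible shape for a single combined expression, offered only as a sketch, is an alternation of the known prefixes captured in a named group, followed by the rest of the line. The names `LINE`, `FIELDS`, and `parse_line` are hypothetical, and only some of the question's fields are handled.

```python
import re

# One pattern for every line type: the prefix goes in 'key', the rest in 'value'.
LINE = re.compile(
    r"(?P<key>No\.|Tit\.|Procurador:|Marca:|Clas\.Prod/Serv:|\*)\s*(?P<value>.*)"
)

# Hypothetical mapping from matched prefix to field name.
FIELDS = {
    "Tit.": "titular",
    "Procurador:": "procurador",
    "Marca:": "marca",
    "Clas.Prod/Serv:": "classe",
}


def parse_line(record, line):
    """Update 'record' in place from one line; unknown lines are ignored."""
    m = LINE.match(line)
    if m is None:
        return
    key, value = m.group("key", "value")
    value = value.strip()
    if key == "No.":
        # header line: process number, date, and dispatch code
        record["processo"], record["data"], record["despacho"] = value.split(None, 2)
    elif key == "*":
        record.setdefault("complemento", []).append(value)
    else:
        record[FIELDS[key]] = value


record = {}
for line in [
    "No.815326777 28/12/1989 351",
    "Marca: TRIO TROPICAL",
    "Clas.Prod/Serv: 09.40",
    "*DEFERIDO CONFORME RESOLUÇÃO 123 DE 06/01/2006, PUBLICADA NA RPI 1829, DE 24/01/2006.",
]:
    parse_line(record, line)
```

Each line is matched once instead of being tried against every pattern in turn, which is the main appeal of combining them; whether that is measurably faster on 100k lines would need to be timed.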