解析二维文本

发布于 2024-08-30 16:55:29 字数 2200 浏览 6 评论 0原文

我需要解析文本文件，其中相关信息通常以非线性方式分布在多行中。举个例子：

1234
 1         IN THE SUPERIOR COURT OF THE STATE OF SOME STATE           
 2              IN AND FOR THE COUNTY OF SOME COUNTY                
 3                      UNLIMITED JURISDICTION                        
 4                            --o0o--                                 
 5                                                                    
 6   JOHN SMITH AND JILL SMITH,         )                             
                                        )                             
 7                  Plaintiffs,         )                             
                                        )                             
 8        vs.                           )     No. 12345
                                        )                             
 9   ACME CO, et al.,                   )                             
                                        )                             
10                  Defendants.         )                             
     ___________________________________)

我需要提取原告和被告的身份。

这些笔录的格式多种多样，所以我不能总是指望那些漂亮的括号，或者原告和被告信息被整齐地框起来，例如：

 1        SUPREME COURT OF THE STATE OF SOME OTHER STATE
                      COUNTY OF COUNTYVILLE
 2                  First Judicial District
                     Important Litigation
 3  --------------------------------------------------X
    THIS DOCUMENT APPLIES TO:
 4
    JOHN SMITH,
 5                            Plaintiff,          Index No.
                                                  2000-123
 6
                                            DEPOSITION
 7                  - against -             UNDER ORAL
                                            EXAMINATION
 8                                              OF
                                            JOHN SMITH,
 9                                           Volume I

10  ACME CO,
    et al,
11                            Defendants.

12  --------------------------------------------------X

这两个常量是：

“原告”将出现在原告的姓名，但不是必然在同一条线上。
原告和被告姓名将为大写。

有什么想法吗？

原文

I need to parse text files where relevant information is often spread across multiple lines in a nonlinear way. An example:

1234
 1         IN THE SUPERIOR COURT OF THE STATE OF SOME STATE           
 2              IN AND FOR THE COUNTY OF SOME COUNTY                
 3                      UNLIMITED JURISDICTION                        
 4                            --o0o--                                 
 5                                                                    
 6   JOHN SMITH AND JILL SMITH,         )                             
                                        )                             
 7                  Plaintiffs,         )                             
                                        )                             
 8        vs.                           )     No. 12345
                                        )                             
 9   ACME CO, et al.,                   )                             
                                        )                             
10                  Defendants.         )                             
     ___________________________________)

I need to pull out Plaintiff and Defendant identities.

These transcripts have a very wide variety of formattings, so I can't always count on those nice parentheses being there, or the plaintiff and defendant information being neatly boxed off, e.g.:

 1        SUPREME COURT OF THE STATE OF SOME OTHER STATE
                      COUNTY OF COUNTYVILLE
 2                  First Judicial District
                     Important Litigation
 3  --------------------------------------------------X
    THIS DOCUMENT APPLIES TO:
 4
    JOHN SMITH,
 5                            Plaintiff,          Index No.
                                                  2000-123
 6
                                            DEPOSITION
 7                  - against -             UNDER ORAL
                                            EXAMINATION
 8                                              OF
                                            JOHN SMITH,
 9                                           Volume I

10  ACME CO,
    et al,
11                            Defendants.

12  --------------------------------------------------X

The two constants are:

"Plaintiff" will occur after the
name of the plaintiff(s), but not
necessarily on the same line.
Plaintiffs and defendants' names
will be in upper case.

Any ideas?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

傲世九天 2024-09-06 16:55:29

我喜欢马丁的回答。
这可能是使用 Python 的更通用方法：

import re

# load file into memory 
# (if large files, provide some limit to how much of the file gets loaded)
with open('paren.txt','r') as f:
  paren = f.read() # example doc with parens

# match all sequences of one or more alphanumeric (or underscore) characters 
# when followed by the word `Plaintiff`; this is intentionally general
list_of_matches = re.findall(r'(\w+)(?=.*Plaintiff)', paren, 
    re.DOTALL|re.MULTILINE)

# join the list separating by whitespace
str_of_matches = ' '.join(list_of_matches)

# split string by digits (line numbers)
tokens = re.split(r'\d',str_of_matches)

# plaintiffs will be in 2nd-to-last group
plaintiff = tokens[-2].strip()

测试：

with open('paren.txt','r') as f:
  paren = f.read() # example doc with parens
list_of_matches = re.findall(r'(\w+)(?=.*Plaintiff)',paren,
  re.DOTALL|re.MULTILINE)
str_of_matches = ' '.join(list_of_matches)>>> tokens = re.split(r'\d', str_of_matches)
tokens = re.split(r'\d', str_of_matches)
plaintiff = tokens[-2].strip()
plaintiff
# prints 'JOHN SMITH and JILL SMITH'

with open('no_paren.txt','r') as f:
  no_paren = f.read() # example doc with no parens
list_of_matches = re.findall(r'(\w+)(?=.*Plaintiff)',no_paren,
  re.DOTALL|re.MULTILINE)
str_of_matches = ' '.join(list_of_matches)
tokens = re.split(r'\d', str_of_matches)
plaintiff = tokens[-2].strip()
plaintiff
# prints 'JOHN SMITH'

I like Martin's answer.
Here's perhaps a more general approach using Python:

import re

# load file into memory 
# (if large files, provide some limit to how much of the file gets loaded)
with open('paren.txt','r') as f:
  paren = f.read() # example doc with parens

# match all sequences of one or more alphanumeric (or underscore) characters 
# when followed by the word `Plaintiff`; this is intentionally general
list_of_matches = re.findall(r'(\w+)(?=.*Plaintiff)', paren, 
    re.DOTALL|re.MULTILINE)

# join the list separating by whitespace
str_of_matches = ' '.join(list_of_matches)

# split string by digits (line numbers)
tokens = re.split(r'\d',str_of_matches)

# plaintiffs will be in 2nd-to-last group
plaintiff = tokens[-2].strip()

Tests:

with open('paren.txt','r') as f:
  paren = f.read() # example doc with parens
list_of_matches = re.findall(r'(\w+)(?=.*Plaintiff)',paren,
  re.DOTALL|re.MULTILINE)
str_of_matches = ' '.join(list_of_matches)>>> tokens = re.split(r'\d', str_of_matches)
tokens = re.split(r'\d', str_of_matches)
plaintiff = tokens[-2].strip()
plaintiff
# prints 'JOHN SMITH and JILL SMITH'

with open('no_paren.txt','r') as f:
  no_paren = f.read() # example doc with no parens
list_of_matches = re.findall(r'(\w+)(?=.*Plaintiff)',no_paren,
  re.DOTALL|re.MULTILINE)
str_of_matches = ' '.join(list_of_matches)
tokens = re.split(r'\d', str_of_matches)
plaintiff = tokens[-2].strip()
plaintiff
# prints 'JOHN SMITH'

回复收藏 0 原文

~没有更多了~