解析二维文本
我需要解析文本文件,其中相关信息通常以非线性方式分布在多行中。举个例子:
1234
1 IN THE SUPERIOR COURT OF THE STATE OF SOME STATE
2 IN AND FOR THE COUNTY OF SOME COUNTY
3 UNLIMITED JURISDICTION
4 --o0o--
5
6 JOHN SMITH AND JILL SMITH, )
)
7 Plaintiffs, )
)
8 vs. ) No. 12345
)
9 ACME CO, et al., )
)
10 Defendants. )
___________________________________)
我需要提取原告和被告的身份。
这些笔录的格式多种多样,所以我不能总是指望那些漂亮的括号,或者原告和被告信息被整齐地框起来,例如:
1 SUPREME COURT OF THE STATE OF SOME OTHER STATE
COUNTY OF COUNTYVILLE
2 First Judicial District
Important Litigation
3 --------------------------------------------------X
THIS DOCUMENT APPLIES TO:
4
JOHN SMITH,
5 Plaintiff, Index No.
2000-123
6
DEPOSITION
7 - against - UNDER ORAL
EXAMINATION
8 OF
JOHN SMITH,
9 Volume I
10 ACME CO,
et al,
11 Defendants.
12 --------------------------------------------------X
这两个常量是:
- “原告”将出现在 原告的姓名,但不是 必然在同一条线上。
- 原告和被告姓名 将为大写。
有什么想法吗?
I need to parse text files where relevant information is often spread across multiple lines in a nonlinear way. An example:
1234
1 IN THE SUPERIOR COURT OF THE STATE OF SOME STATE
2 IN AND FOR THE COUNTY OF SOME COUNTY
3 UNLIMITED JURISDICTION
4 --o0o--
5
6 JOHN SMITH AND JILL SMITH, )
)
7 Plaintiffs, )
)
8 vs. ) No. 12345
)
9 ACME CO, et al., )
)
10 Defendants. )
___________________________________)
I need to pull out Plaintiff and Defendant identities.
These transcripts have a very wide variety of formattings, so I can't always count on those nice parentheses being there, or the plaintiff and defendant information being neatly boxed off, e.g.:
1 SUPREME COURT OF THE STATE OF SOME OTHER STATE
COUNTY OF COUNTYVILLE
2 First Judicial District
Important Litigation
3 --------------------------------------------------X
THIS DOCUMENT APPLIES TO:
4
JOHN SMITH,
5 Plaintiff, Index No.
2000-123
6
DEPOSITION
7 - against - UNDER ORAL
EXAMINATION
8 OF
JOHN SMITH,
9 Volume I
10 ACME CO,
et al,
11 Defendants.
12 --------------------------------------------------X
The two constants are:
- "Plaintiff" will occur after the
name of the plaintiff(s), but not
necessarily on the same line. - Plaintiffs and defendants' names
will be in upper case.
Any ideas?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
我喜欢马丁的回答。
这可能是使用 Python 的更通用方法:
测试:
I like Martin's answer.
Here's perhaps a more general approach using Python:
Tests: