How to correctly extract data using regular expressions

Posted on 2024-09-10 19:30:01


I'm facing regular expressions for the first time and I need to extract some data from this report (a txt file with formatting info):

\n10: Vikelis M, Rapoport AM. Role of
antiepileptic drugs as preventive
agents for \nmigraine. CNS Drugs. 2010
Jan 1;24(1):21-33.
doi:\n10.2165/11310970-000000000-00000.
Review. PubMed PMID:
20030417.\n\n\n21: Johannessen Landmark C, Larsson PG, Rytter E,
Johannessen SI. Antiepileptic\ndrugs
in epilepsy and other disorders--a
population-based study of
prescriptions.\nEpilepsy Res. 2009
Nov;87(1):31-9. Epub 2009 Aug 13.
PubMed PMID: 19679449.\n\n\n

As you can see, every record in the txt file begins with a number like "xx:" and always ends with "PubMed PMID: dddddddd.". But using a regex like this:

regex = re.compile(r"^\d+: .+ PMID: \d{8}.$")
regex.findall(inputfile)

gives me a list containing one big string, so I'm misunderstanding something. How can I extract data from these records?


巷子口的你 2024-09-17 19:30:01

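Use .+? for a non-greedy match instead of the greedy .+. You also need re.DOTALL so that . matches the line-ending characters it has to cross, and re.MULTILINE so that ^ and $ match at the start and end of each line rather than only at the start and end of the whole string. The two options are combined with the bitwise OR operator | and passed as the second argument to re.compile.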
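The advice above can be sketched roughly as follows. This is only an illustration under a few assumptions: the sample text is taken from the question and is assumed to contain real newline characters, the names text, record_re and records are made up for the example, and the final period is escaped as \. so that it matches only a literal dot.

```python
import re

# Sample text from the question, assumed to contain real newlines.
text = (
    "\n10: Vikelis M, Rapoport AM. Role of antiepileptic drugs as preventive "
    "agents for \nmigraine. CNS Drugs. 2010 Jan 1;24(1):21-33. "
    "doi:\n10.2165/11310970-000000000-00000. Review. PubMed PMID: 20030417."
    "\n\n\n21: Johannessen Landmark C, Larsson PG, Rytter E, Johannessen SI. "
    "Antiepileptic\ndrugs in epilepsy and other disorders--a population-based "
    "study of prescriptions.\nEpilepsy Res. 2009 Nov;87(1):31-9. "
    "Epub 2009 Aug 13. PubMed PMID: 19679449.\n\n\n"
)

# .+? is non-greedy, so each match stops at the first "PMID: dddddddd." it meets.
# re.DOTALL lets . cross the line breaks inside a record.
# re.MULTILINE lets ^ and $ anchor at every line, not only at the string ends.
record_re = re.compile(r"^\d+: .+? PMID: \d{8}\.$", re.DOTALL | re.MULTILINE)

records = record_re.findall(text)

for record in records:
    # Collapse the embedded line breaks so each record prints on one line.
    print(" ".join(record.split()))
```

With the greedy .+ and re.DOTALL, the first match would run on to the last PMID in the file, which is one way to end up with a single large string; the non-greedy form stops at the end of each record instead.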


你穿错了嫁妆 2024-09-17 19:30:01


If the records are as consistent as presented in your example, you don't need to use regular expressions. A simple partition of the text file into lists of tokens will do the trick. For instance:

txt = '\n10: Vikelis M, Rapoport AM. Role of antiepileptic drugs as preventive agents for \nmigraine. CNS Drugs. 2010 Jan 1;24(1):21-33. doi:\n10.2165/11310970-000000000-00000. Review. PubMed PMID: 20030417.\n\n\n21: Johannessen Landmark C, Larsson PG, Rytter E, Johannessen SI. Antiepileptic\ndrugs in epilepsy and other disorders--a population-based study of prescriptions.\nEpilepsy Res. 2009 Nov;87(1):31-9. Epub 2009 Aug 13. PubMed PMID: 19679449.\n\n\n'

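# Split on periods and drop the embedded newlines from each token.
lines = [token.replace('\n', '') for token in txt.split('.')]
for line in lines:
    print(line)  # print() call, so this also runs on Python 3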

This prints each element of your references on its own line:

10: Vikelis M, Rapoport AM
 Role of antiepileptic drugs as preventive agents for migraine
 CNS Drugs
 2010 Jan 1;24(1):21-33
 doi:10
2165/11310970-000000000-00000
 Review
 PubMed PMID: 20030417
21: Johannessen Landmark C, Larsson PG, Rytter E, Johannessen SI
 Antiepilepticdrugs in epilepsy and other disorders--a population-based study of prescriptions
Epilepsy Res
 2009 Nov;87(1):31-9
 Epub 2009 Aug 13
 PubMed PMID: 19679449

Again, if you can trust that the first line of a record holds the authors, the second the title, the third the journal, and so on, you may be able to do this very quickly. If the information is a bit more "contextual", then you can start using regular expressions at this point.
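A minimal sketch of that positional idea, without regular expressions, might look like this. It rests on assumptions drawn only from the two sample records: records are separated by exactly three newlines, fields are separated by a period followed by a space, and the authors, title and journal always come first in that order. The function name parse_record and the dictionary keys are hypothetical.

```python
txt = (
    "\n10: Vikelis M, Rapoport AM. Role of antiepileptic drugs as preventive "
    "agents for \nmigraine. CNS Drugs. 2010 Jan 1;24(1):21-33. "
    "doi:\n10.2165/11310970-000000000-00000. Review. PubMed PMID: 20030417."
    "\n\n\n21: Johannessen Landmark C, Larsson PG, Rytter E, Johannessen SI. "
    "Antiepileptic\ndrugs in epilepsy and other disorders--a population-based "
    "study of prescriptions.\nEpilepsy Res. 2009 Nov;87(1):31-9. "
    "Epub 2009 Aug 13. PubMed PMID: 19679449.\n\n\n"
)


def parse_record(record):
    """Turn one raw record into a dict, relying on field position only."""
    # Collapse line breaks and repeated spaces into single spaces.
    flat = " ".join(record.split())
    # The record number is everything before the first ": ".
    number, _, rest = flat.partition(": ")
    # Splitting on ". " (period plus space) keeps the DOI in one piece.
    fields = rest.split(". ")
    return {
        "number": number,
        "authors": fields[0],
        "title": fields[1],
        "journal": fields[2],
        # The last field looks like "PubMed PMID: 20030417."
        "pmid": fields[-1].rstrip(".").split(": ")[1],
    }


# Assumption: exactly three newlines separate consecutive records.
parsed = [parse_record(r) for r in txt.strip().split("\n\n\n")]

for entry in parsed:
    print(entry["number"], entry["journal"], entry["pmid"])
```

A title that itself contains a period followed by a space would shift the positions, which is the kind of "contextual" case where regular expressions become the better tool.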

Good luck.
