How to correctly extract data using regular expressions

Posted on 2024-09-10 19:30:01

I'm facing regular expressions for the first time, and I need to extract some data from this report (a txt file with formatting info):

\n10: Vikelis M, Rapoport AM. Role of
antiepileptic drugs as preventive
agents for \nmigraine. CNS Drugs. 2010
Jan 1;24(1):21-33.
doi:\n10.2165/11310970-000000000-00000.
Review. PubMed PMID:
20030417.\n\n\n21: Johannessen Landmark C, Larsson PG, Rytter E,
Johannessen SI. Antiepileptic\ndrugs
in epilepsy and other disorders--a
population-based study of
prescriptions.\nEpilepsy Res. 2009
Nov;87(1):31-9. Epub 2009 Aug 13.
PubMed PMID: 19679449.\n\n\n

As you can see, every record in the txt file begins with a number like "xx:" and always ends with "PubMed PMID: dddddddd.". But using a regex like this:

regex = re.compile(r"^\d+: .+ PMID: \d{8}.$")
regex.findall(inputfile)

gives me a list with one big string, so I'm misunderstanding something. How can I extract data from these records?

Comments (2)

巷子口的你 2024-09-17 19:30:01

Use .+? for non-greedy matching instead of .+, which is greedy. You also want re.DOTALL so that . matches the line-end characters it needs to match, and re.MULTILINE so that ^ and $ match at the start and end of each line, not just of the whole string. Join the two options with the bitwise-OR operator | and pass them as the second argument to re.compile.
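
A minimal sketch of that advice, assuming the "\n" sequences in the report are real newline characters and that every PMID has exactly eight digits, as in the sample. The variable names are illustrative, and the final period is escaped here so that it matches only a literal ".":

```python
import re

# Sample text from the question; "\n" are real newlines in this string.
txt = (
    "\n10: Vikelis M, Rapoport AM. Role of antiepileptic drugs as preventive "
    "agents for \nmigraine. CNS Drugs. 2010 Jan 1;24(1):21-33. "
    "doi:\n10.2165/11310970-000000000-00000. Review. PubMed PMID: 20030417."
    "\n\n\n21: Johannessen Landmark C, Larsson PG, Rytter E, Johannessen SI. "
    "Antiepileptic\ndrugs in epilepsy and other disorders--a population-based "
    "study of prescriptions.\nEpilepsy Res. 2009 Nov;87(1):31-9. "
    "Epub 2009 Aug 13. PubMed PMID: 19679449.\n\n\n"
)

# .+? is non-greedy, so each match stops at the first "PMID: dddddddd." that
# ends a line. DOTALL lets "." cross newlines inside a record; MULTILINE lets
# "^" and "$" anchor at every line boundary.
pattern = re.compile(r"^\d+: .+? PMID: \d{8}\.$", re.DOTALL | re.MULTILINE)

records = pattern.findall(txt)
for record in records:
    # Flatten embedded newlines only for display.
    print(record.replace("\n", " "))
```

With the two-record sample, findall returns one string per record instead of a single large match.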

你穿错了嫁妆 2024-09-17 19:30:01

If the records are as consistent as presented in your example, you don't need to use regular expressions. A simple partition of the text file into lists of tokens will do the trick. For instance:

txt = '\n10: Vikelis M, Rapoport AM. Role of antiepileptic drugs as preventive agents for \nmigraine. CNS Drugs. 2010 Jan 1;24(1):21-33. doi:\n10.2165/11310970-000000000-00000. Review. PubMed PMID: 20030417.\n\n\n21: Johannessen Landmark C, Larsson PG, Rytter E, Johannessen SI. Antiepileptic\ndrugs in epilepsy and other disorders--a population-based study of prescriptions.\nEpilepsy Res. 2009 Nov;87(1):31-9. Epub 2009 Aug 13. PubMed PMID: 19679449.\n\n\n'

# Split on periods and drop the embedded newlines from each token.
lines = [token.replace('\n', '') for token in txt.split('.')]
for line in lines:
    print(line)

This prints each element of your references on its own line:

10: Vikelis M, Rapoport AM
 Role of antiepileptic drugs as preventive agents for migraine
 CNS Drugs
 2010 Jan 1;24(1):21-33
 doi:10
2165/11310970-000000000-00000
 Review
 PubMed PMID: 20030417
21: Johannessen Landmark C, Larsson PG, Rytter E, Johannessen SI
 Antiepilepticdrugs in epilepsy and other disorders--a population-based study of prescriptions
Epilepsy Res
 2009 Nov;87(1):31-9
 Epub 2009 Aug 13
 PubMed PMID: 19679449

Again, if you can trust that the first line of a record has the author, the second the title, the third the journal, etc., you may be able to do this very quickly. If the information is a bit more "contextual", then you can START using regexp at this point.
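
As the printed output shows, splitting on every "." also cuts the DOI at its internal period and joins "Antiepileptic\ndrugs" without a space. One possible variation, sketched under the assumption that records are separated by blank lines and that fields end with a period followed by whitespace: split into records first, collapse the whitespace, then split into fields, and use a small regex only for the PMID. The function name parse_records and the dictionary keys are made up for this illustration.

```python
import re

# Same sample text as in the answer above.
txt = (
    "\n10: Vikelis M, Rapoport AM. Role of antiepileptic drugs as preventive "
    "agents for \nmigraine. CNS Drugs. 2010 Jan 1;24(1):21-33. "
    "doi:\n10.2165/11310970-000000000-00000. Review. PubMed PMID: 20030417."
    "\n\n\n21: Johannessen Landmark C, Larsson PG, Rytter E, Johannessen SI. "
    "Antiepileptic\ndrugs in epilepsy and other disorders--a population-based "
    "study of prescriptions.\nEpilepsy Res. 2009 Nov;87(1):31-9. "
    "Epub 2009 Aug 13. PubMed PMID: 19679449.\n\n\n"
)


def parse_records(text):
    """Split the report into records, then each record into fields."""
    parsed = []
    # Assumption: records are separated by runs of two or more newlines.
    for raw in re.split(r"\n{2,}", text.strip()):
        if not raw:
            continue
        # Collapse newlines and repeated spaces into single spaces.
        flat = " ".join(raw.split())
        # Split on a period followed by whitespace, so the period inside
        # the DOI does not start a new field.
        fields = re.split(r"\.\s+", flat.rstrip("."))
        pmid = re.search(r"PMID: (\d+)", flat)
        parsed.append({"fields": fields,
                       "pmid": pmid.group(1) if pmid else None})
    return parsed


parsed = parse_records(txt)
for record in parsed:
    print(record["pmid"], record["fields"][0])
```

The field positions still differ between records (the first has a DOI and "Review", the second an "Epub" date), and an abbreviation such as "J. Smith" inside a field would be split as well, so positional access is only safe for the leading fields.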

Good luck.
