How to use regular expressions for complex PDF extraction

Posted on 2025-01-15 11:13:27

I have a PDF file that contains lottery ticket winners, and I want to extract all winning tickets grouped by prize.

PDF file

I tried this:

import re
import pdfplumber

prize_re = re.compile(r"^\d[a-z]")
cons_prize_re = re.compile(r"^Cons")
ticket1_line_re = re.compile(r"^\d[)]")
ticket2_line_re = re.compile(r"^\d{4}")
ticket3_line_re = re.compile(r"[A-Z] \d{6}")

with pdfplumber.open("./test11.pdf") as pdf:
    for i in range(len(pdf.pages)):
        page_text = pdf.pages[i].extract_text()

        for line in page_text.split("\n"):
            if prize_re.match(line) or cons_prize_re.match(line) or ticket1_line_re.match(line) or ticket2_line_re.match(line) or ticket3_line_re.search(line):
                print(line)

This is what I got. I don't know how to assign each ticket to its prize. Also, the Cons prize ticket numbers look a little strange and I don't know why (AN 867952AO 867952AP should be AN 867952 AO 867952 AP ...):

1st Prize Rs :7000000/- 1) AU 867952 (MANANTHAVADY)
Cons Prize-Rs :8000/- AN 867952AO 867952AP 867952 AR 867952AS 867952
AT 867952 AV 867952 AW 867952AX 867952AY 867952
AZ 867952
2nd Prize Rs :500000/- 1) AZ 499603 (ADOOR)
3rd Prize Rs :100000/- 1) AN 215264 (KOTTAYAM)
2) AO 852774 (PATTAMBI)
3) AP 953655 (KOTTAYAM)
4) AR 638904 (PAYYANUR)
5) AS 496774 (VAIKKOM)
6) AT 878990 (WAYANADU)
7) AU 703702 (PUNALUR)
8) AV 418446 (WAYANADU)
9) AW 994685 (KOZHIKKODE)
10) AX 317550 (PATTAMBI)
11) AY 854780 (CHITTUR)
12) AZ 899905 (KARUNAGAPALLY
...

Instead, I want to get:

 [
    {
        "1st Prize Rs :7000000",
        "tickets": [
            "AU 867952"
        ]
     },
    {
        "Cons Prize-Rs :8000",
        "tickets": [
            "AN 867952",
            "AO 867952",
            "AP 867952",
            "AR 867952",
            ...
        ]
     },
     ...
 ]

How can I achieve this?


Comments (1)

贱人配狗天长地久 2025-01-22 11:13:27

You could first match each complete prize section across all pages, using capture groups.

Then you can post-process the third capture group to get the separate tickets, and build the desired data structure in a loop.

For the first step, you can use a pattern that matches the start of every prize section and captures everything up to the start of the next prize section.

^(\w+ Prize[-\s]Rs\s*):(\d+)/-(?:\s*\d+\))?\s*(.*(?:\n(?!\w+ Prize\b).*)*)

Regex demo
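
To see what the three capture groups hold without needing the PDF, here is a minimal sketch that runs the same pattern on a few lines copied from the extracted text in the question. The names `sample`, `section_re` and `sections` are only illustrative.

```python
import re

# Lines copied from the extracted text shown in the question.
sample = """1st Prize Rs :7000000/- 1) AU 867952 (MANANTHAVADY)
Cons Prize-Rs :8000/- AN 867952AO 867952AP 867952 AR 867952AS 867952
AT 867952 AV 867952 AW 867952AX 867952AY 867952
AZ 867952
2nd Prize Rs :500000/- 1) AZ 499603 (ADOOR)
3rd Prize Rs :100000/- 1) AN 215264 (KOTTAYAM)
2) AO 852774 (PATTAMBI)
3) AP 953655 (KOTTAYAM)"""

section_re = re.compile(
    r"^(\w+ Prize[-\s]Rs\s*):(\d+)/-(?:\s*\d+\))?\s*(.*(?:\n(?!\w+ Prize\b).*)*)",
    re.MULTILINE,
)

# group(1): prize label, group(2): amount, group(3): raw text of the section.
# The negative lookahead (?!\w+ Prize\b) stops group(3) at the next prize header.
sections = [(m.group(1), m.group(2), m.group(3)) for m in section_re.finditer(sample)]

for label, amount, body in sections:
    print(repr(label), amount, repr(body))
```

Note that group 1 keeps the trailing space before the colon, which is why the keys in the output further down end with a space.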

For the post-processing, you can use a pattern for the ticket formats that matches either 2 uppercase characters, a space and 6 digits, or a run of 4 or more digits with whitespace boundaries on both sides.

(?:[A-Z]{2} \d{6}(?!\d)|(?<!\S)\d{4,}(?!\S))

Regex demo
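
As a quick check, the sketch below applies the ticket pattern to the glued Cons line from the question. Because the first alternative only requires that no further digit follows the 6 digits, the missing spaces in `AN 867952AO 867952AP` do not matter. The `digits_only` string is a made-up example of the digits-only format, not text from the PDF.

```python
import re

ticket_re = re.compile(r"(?:[A-Z]{2} \d{6}(?!\d)|(?<!\S)\d{4,}(?!\S))")

# Line copied from the question, with tickets glued together by the extraction.
glued = "AN 867952AO 867952AP 867952 AR 867952AS 867952"
glued_tickets = ticket_re.findall(glued)

# Numbered entries: the "1)" prefix and the place name are skipped.
numbered = "1) AU 867952 (MANANTHAVADY)\n2) AO 852774 (PATTAMBI)"
numbered_tickets = ticket_re.findall(numbered)

# Hypothetical digits-only tickets; "12" is too short to match \d{4,}.
digits_only = "0132 0479 1234 12"
digit_tickets = ticket_re.findall(digits_only)

print(glued_tickets)
print(numbered_tickets)
print(digit_tickets)
```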

Example code using the pdf file from the question:

import re
import pdfplumber
import json

pattern = r"^(\w+ Prize[-\s]Rs\s*):(\d+)/-(?:\s*\d+\))?\s*(.*(?:\n(?!\w+ Prize\b).*)*)"
ticket_pattern = r"(?:[A-Z]{2} \d{6}(?!\d)|(?<!\S)\d{4,}(?!\S))"

with pdfplumber.open("./test11.pdf") as pdf:
    # Join the text of all pages so a prize section can continue across a page break.
    # The `or ""` guards against a page for which extract_text() yields no text.
    all_text = "".join("\n" + (page.extract_text() or "") for page in pdf.pages)

coll = []
for match in re.finditer(pattern, all_text, re.MULTILINE):
    # group(1): prize label, group(2): amount, group(3): raw text of the section
    dct = {match.group(1): match.group(2)}
    dct["tickets"] = re.findall(ticket_pattern, match.group(3))
    coll.append(dct)

print(json.dumps(coll, indent=4))

Output

[
    {
        "1st Prize Rs ": "120000000",
        "tickets": [
            "XG 218582"
        ]
    },
    {
        "Cons Prize-Rs ": "500000",
        "tickets": [
            "XA 218582",
            "XB 218582",
            "XC 218582",
            "XD 218582",
            "XE 218582"
        ]
    },
    {
        "2nd Prize Rs ": "5000000",
        "tickets": [
            "XA 788417",
            "XB 161796",
            "XC 319503",
            "XD 713832",
            "XE 667708",
            "XG 137764"
        ]
    },
    ....
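
The output above uses the prize label itself as a dictionary key, so every entry has a different key. If a fixed layout is easier to consume, one possible variant is sketched below. The function name `parse_prizes` and the keys `prize` and `amount` are assumptions for illustration, not part of the code above; the function takes plain text so it can be tried without the PDF.

```python
import json
import re

SECTION_RE = re.compile(
    r"^(\w+ Prize[-\s]Rs\s*):(\d+)/-(?:\s*\d+\))?\s*(.*(?:\n(?!\w+ Prize\b).*)*)",
    re.MULTILINE,
)
TICKET_RE = re.compile(r"(?:[A-Z]{2} \d{6}(?!\d)|(?<!\S)\d{4,}(?!\S))")


def parse_prizes(text):
    """Return one dict per prize section with fixed keys (hypothetical layout)."""
    results = []
    for m in SECTION_RE.finditer(text):
        results.append({
            "prize": m.group(1).strip(),   # label without the trailing space
            "amount": int(m.group(2)),     # amount as a number, not a string
            "tickets": TICKET_RE.findall(m.group(3)),
        })
    return results


# Lines copied from the extracted text shown in the question.
sample = """1st Prize Rs :7000000/- 1) AU 867952 (MANANTHAVADY)
Cons Prize-Rs :8000/- AN 867952AO 867952AP 867952 AR 867952AS 867952
AT 867952 AV 867952 AW 867952AX 867952AY 867952
AZ 867952
2nd Prize Rs :500000/- 1) AZ 499603 (ADOOR)"""

prizes = parse_prizes(sample)
print(json.dumps(prizes, indent=4))
```

With pdfplumber, the same function could be called on the joined text of all pages in place of `sample`.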