Extracting information from multiple resumes in PDF format

Posted on 2025-01-10 17:55:32

I have a data set with a column of Google Drive links to resumes. There are 5000 rows, so 5000 links. I am trying to extract information such as years of experience and salary from these resumes into 2 separate columns. So far I've seen many examples mentioned here on SO.

For example, the code below can only read the data from one file. How do I replicate this for multiple rows?

Please help me with this, or else I will have to go through all 5000 resumes manually and fill in the data. I'm hoping to find a solution to this painful problem.

import re
import PyPDF2

# read the text of the first page of a single PDF
with open('sample.pdf', 'rb') as pdf_file:
    read_pdf = PyPDF2.PdfReader(pdf_file)  # PdfFileReader in older PyPDF2 releases
    number_of_pages = len(read_pdf.pages)
    page = read_pdf.pages[0]
    page_content = page.extract_text()
print(page_content)

# to extract salary, experience using regular expressions
prog = re.compile(r"\s*(Name|name|nick).*")
result = prog.match("Name: Bob Exampleson")

if result:
    print(result.group(0))

result = prog.match("University: MIT")

if result:
    print(result.group(0))


Comments (1)

安穩 2025-01-17 17:55:32


Use a loop. Basically you put your main code into a function (easier to read) and create a list of filenames. Then you iterate over this list, using the values from the list as argument for your function:

Note: I didn't check your scraping code; I'm just showing how to loop. There are also more efficient ways to do this, but I'm assuming you're somewhat of a Python beginner, so let's keep it simple to start with.

# add your imports to the top
import re
import PyPDF2


# put the main code in a function for readability;
# it has to be defined before the loop that calls it
def get_data(filename):
    with open(filename, 'rb') as pdf_file:
        read_pdf = PyPDF2.PdfReader(pdf_file)  # PdfFileReader in older PyPDF2 releases
        number_of_pages = len(read_pdf.pages)
        page = read_pdf.pages[0]
        page_content = page.extract_text()
    print(page_content)

    prog = re.compile(r"\s*(Name|name|nick).*")
    result = prog.match("Name: Bob Exampleson")

    if result:
        print(result.group(0))

    result = prog.match("University: MIT")

    if result:
        print(result.group(0))


# create a list of your filenames
files_list = ['a.pdf', 'b.pdf', 'c.pdf']
for filename in files_list:  # iterate over the list
    get_data(filename)

So now your next question might be: how do I create this list with 5000 filenames? This depends on what the files are called and where they are stored. If they are numbered sequentially, you could do something like:

files_list = []  # empty list
num_files = 5000  # total number of files
for i in range(1, num_files+1):
    files_list.append(f'myfile-{i}.pdf')

This will create a list with 'myfile-1.pdf', 'myfile-2.pdf', etc.
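If the files are not numbered sequentially, another option is to let Python list whatever is in the folder. Here is a minimal sketch, assuming the resumes have already been downloaded into one local folder. The folder name `resumes` and the helper name `list_pdfs` are made up for illustration.

```python
from pathlib import Path


def list_pdfs(folder):
    """Return the paths of all .pdf files in `folder`, sorted by name."""
    return sorted(str(p) for p in Path(folder).glob('*.pdf'))


# 'resumes' is a hypothetical folder name; point this at wherever your PDFs are stored
files_list = list_pdfs('resumes')
```

The resulting `files_list` can be passed to the same `for filename in files_list:` loop as before.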

Hopefully this is enough to get you started.

You can also use return in your function to create a new list with all of the output which you can use later on, instead of printing the output as you go:

output = []

def doSomething(i):
    return i * 2

for i in range(1, 100):
    output.append(doSomething(i))

# output is now a list with values like:
# [2, 4, 6, 8, 10, 12, ...] 
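
Applied to the resumes, the same return-and-collect idea might look like the sketch below. It works on plain strings so it can be tried without any PDFs. The two regular expressions and the names `extract_fields`, `experience_years` and `salary` are assumptions for illustration; real resumes vary a lot, so the patterns will almost certainly need adjusting.

```python
import re

# Assumed patterns: adjust them to how your resumes actually phrase things
EXPERIENCE_RE = re.compile(r'(\d+)\+?\s*years?', re.IGNORECASE)
SALARY_RE = re.compile(r'salary\D*(\d[\d,]*)', re.IGNORECASE)


def extract_fields(text):
    """Return a dict with the first experience and salary match (None if absent)."""
    experience = EXPERIENCE_RE.search(text)
    salary = SALARY_RE.search(text)
    return {
        'experience_years': int(experience.group(1)) if experience else None,
        'salary': int(salary.group(1).replace(',', '')) if salary else None,
    }


# Stand-ins for the text extracted from each PDF
sample_texts = [
    'Name: Bob Exampleson\n5 years of experience\nExpected salary: 50,000',
    'University: MIT',
]

# One dict per resume, ready to be written back as two columns
rows = [extract_fields(text) for text in sample_texts]
```

Note that this uses `search` rather than `match`, because `match` only finds a pattern at the very start of the string. Inside `get_data` you could end with `return extract_fields(page_content)` and append each result to a list in the loop.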