Extracting text into columns using regex
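I want to extract data (S no, Item Code, Price and Size) from the attached PDF document into columns.
The re.compile pattern works for S no, Item Code and Price, but as soon as I add Size it returns only limited output. I can't figure out why. Can you please help?
(Attached picture of the PDF page)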
import pandas as pd
import re
import PyPDF2

file = open("Petchem.pdf", "rb")
pdfReader = PyPDF2.PdfFileReader(file)
my_dict = {"S no": [], "Item Code": [], "Price": [], "Size": []}
for page in range(1, 25):
    pageObj = pdfReader.getPage(page)
    data = pageObj.extractText()
    # Pattern as posted: S no, Item Code, "EA", Price, then the Size part
    size = re.compile(r'((\d{2,4}?)(\d{10})EA\s(\d?\d?,?\d?\d?\d.\d\d)[\s\w\d,:/.()-])')
    for number in size.findall(data):
        S_No = my_dict["S no"].append(number[1])
        Item_Code = my_dict["Item Code"].append(number[2])
        Price = my_dict["Price"].append(number[3])
        Size = my_dict["Size"].append(number[4])
        print(number[1])
a_file = open("Column_Breakup.csv", "w")
datadf = pd.DataFrame(my_dict)
datadf.to_csv("Column_Breakup.csv")
a_file.close()
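One possible reading of the pattern above: the trailing `[\s\w\d,:/.()-]` is a character class with no quantifier and no capture group of its own, so it matches a single character and `findall` never returns it separately. With the outer parentheses, each tuple from `findall` has four items (indices 0 to 3), so `number[4]` is out of range. The sketch below is only an illustration under assumptions: the `sample` text is hypothetical (the real PDF layout is not shown in the post), and it assumes each record sits on its own line with Size running to the end of that line. The helper name `extract_rows` is made up for this example.

```python
import re

# Hypothetical sample of extracted text; the real output of extractText()
# may look different, so the pattern would need adjusting to match it.
sample = (
    "101234567890EA 1,250.00 Size: 10 x 20 mm\n"
    "111234567891EA 75.50 1/2 inch (NPT)\n"
)

record = re.compile(
    r"(\d{2,4}?)"                   # S no (lazy, so the next 10 digits go to Item Code)
    r"(\d{10})EA\s"                 # Item Code, followed by the literal unit "EA"
    r"(\d{1,3}(?:,\d{3})*\.\d{2})"  # Price, with the decimal point escaped
    r"\s*([\s\w,:/.()-]+?)\s*$",    # Size: one or more allowed characters, captured
    re.MULTILINE,                   # assumption: one record per line
)


def extract_rows(text):
    """Return a list of (s_no, item_code, price, size) tuples found in text."""
    return record.findall(text)


rows = extract_rows(sample)

# Same column layout as in the question.
my_dict = {"S no": [], "Item Code": [], "Price": [], "Size": []}
for s_no, item_code, price, size in rows:
    my_dict["S no"].append(s_no)
    my_dict["Item Code"].append(item_code)
    my_dict["Price"].append(price)
    my_dict["Size"].append(size)

print(my_dict)
```

Because the pattern has no outer group, each tuple holds exactly the four fields, in order. The resulting `my_dict` could then be passed to `pd.DataFrame(my_dict)` and written with `to_csv` as in the original code.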