Extracting images from a PDF using iTextSharp
I am trying to extract all the images from a PDF using iTextSharp, but I can't get past one hurdle.
The error occurs on the line System.Drawing.Image ImgPDF = System.Drawing.Image.FromStream(MS);
which throws "Parameter is not valid".
I think it works when the image is a bitmap, but not for any other format.
Here is my code (sorry for the length):
    private void Form1_Load(object sender, EventArgs e)
    {
        FileStream fs = File.OpenRead(@"reader.pdf");
        byte[] data = new byte[fs.Length];
        fs.Read(data, 0, (int)fs.Length);
        List<System.Drawing.Image> ImgList = new List<System.Drawing.Image>();
        iTextSharp.text.pdf.RandomAccessFileOrArray RAFObj = null;
        iTextSharp.text.pdf.PdfReader PDFReaderObj = null;
        iTextSharp.text.pdf.PdfObject PDFObj = null;
        iTextSharp.text.pdf.PdfStream PDFStremObj = null;
        try
        {
            RAFObj = new iTextSharp.text.pdf.RandomAccessFileOrArray(data);
            PDFReaderObj = new iTextSharp.text.pdf.PdfReader(RAFObj, null);
            for (int i = 0; i <= PDFReaderObj.XrefSize - 1; i++)
            {
                PDFObj = PDFReaderObj.GetPdfObject(i);
                if ((PDFObj != null) && PDFObj.IsStream())
                {
                    PDFStremObj = (iTextSharp.text.pdf.PdfStream)PDFObj;
                    iTextSharp.text.pdf.PdfObject subtype = PDFStremObj.Get(iTextSharp.text.pdf.PdfName.SUBTYPE);
                    if ((subtype != null) && subtype.ToString() == iTextSharp.text.pdf.PdfName.IMAGE.ToString())
                    {
                        byte[] bytes = iTextSharp.text.pdf.PdfReader.GetStreamBytesRaw((iTextSharp.text.pdf.PRStream)PDFStremObj);
                        if ((bytes != null))
                        {
                            try
                            {
                                System.IO.MemoryStream MS = new System.IO.MemoryStream(bytes);
                                MS.Position = 0;
                                System.Drawing.Image ImgPDF = System.Drawing.Image.FromStream(MS);
                                ImgList.Add(ImgPDF);
                            }
                            catch (Exception)
                            {
                            }
                        }
                    }
                }
            }
            PDFReaderObj.Close();
        }
        catch (Exception ex)
        {
            throw new Exception(ex.Message);
        }
    } //Form1_Load
Comments (8)
Resolved...
I also got the same "Parameter is not valid" exception, and after a lot of
work, with the help of the link provided by der_chirurg
(http://kuujinbo.info/iTextSharp/CCITTFaxDecodeExtract.aspx), I resolved it.
The code follows:
You need to check the stream's /Filter to see what image format a given image uses. It may be a standard image format.
Otherwise, you'll need to get the raw bytes (as you are) and build an image using the image stream's width, height, bits per component, number of color components (could be CMYK, indexed, RGB, or Something Weird), and a few other entries, as defined in section 8.9 of the ISO PDF specification (available for free).
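As a rough illustration of that last step, here is a minimal sketch covering only the simplest case: samples that are already decoded, 8 bits per component, three RGB components per pixel, and no row padding. RawImageBuilder and FromRgb8 are hypothetical names, not part of iTextSharp, and the sketch assumes you have read the width and height from the image dictionary yourself. CMYK, indexed color, and other bit depths each need their own conversion.

```csharp
using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.Runtime.InteropServices;

// Hypothetical helper, not part of iTextSharp.
public static class RawImageBuilder
{
    // Builds a 24-bit bitmap from decoded 8-bit RGB samples:
    // width * height * 3 bytes, rows stored top to bottom, no padding.
    public static Bitmap FromRgb8(byte[] samples, int width, int height)
    {
        if (samples == null) throw new ArgumentNullException(nameof(samples));
        if (width <= 0 || height <= 0 || samples.Length < width * height * 3)
            throw new ArgumentException("Sample buffer does not match the dimensions.");

        var bitmap = new Bitmap(width, height, PixelFormat.Format24bppRgb);
        BitmapData locked = bitmap.LockBits(
            new Rectangle(0, 0, width, height),
            ImageLockMode.WriteOnly,
            PixelFormat.Format24bppRgb);
        try
        {
            var row = new byte[width * 3];
            for (int y = 0; y < height; y++)
            {
                for (int x = 0; x < width; x++)
                {
                    int src = (y * width + x) * 3;
                    // GDI+ stores 24bpp pixels in B, G, R order.
                    row[x * 3] = samples[src + 2];
                    row[x * 3 + 1] = samples[src + 1];
                    row[x * 3 + 2] = samples[src];
                }
                // Stride may include padding bytes, so copy row by row.
                Marshal.Copy(row, 0, IntPtr.Add(locked.Scan0, y * locked.Stride), row.Length);
            }
        }
        finally
        {
            bitmap.UnlockBits(locked);
        }
        return bitmap;
    }
}
```

For example, RawImageBuilder.FromRgb8(new byte[] { 255, 0, 0, 0, 0, 255 }, 2, 1) gives a two-pixel image with a red pixel followed by a blue one.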
So in some cases your code will work, but in others, it'll fail with the exception you mentioned.
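To make the "some cases" concrete, here is a small sketch that classifies a stream by its /Filter name. The filter names are the standard ones from the PDF specification; ImageStreamKind and ImageFilters are hypothetical names introduced for illustration. It assumes you pass in a single filter name as a string (with or without the leading slash); a /Filter entry can also be an array of filters, which this sketch does not handle.

```csharp
using System;

// Hypothetical classification of PDF image streams by filter.
public enum ImageStreamKind
{
    Jpeg,       // DCTDecode: the raw stream bytes are JPEG data
    Jpeg2000,   // JPXDecode: JPEG 2000 data
    Jbig2,      // JBIG2Decode: bi-level JBIG2 data
    CcittFax,   // CCITTFaxDecode: Group 3 or Group 4 fax data
    RawSamples  // no filter or general-purpose compression (e.g. FlateDecode)
}

public static class ImageFilters
{
    // Maps one /Filter name to the kind of data stored in the stream.
    public static ImageStreamKind Classify(string filterName)
    {
        string name = (filterName ?? string.Empty).TrimStart('/');
        switch (name)
        {
            case "DCTDecode": return ImageStreamKind.Jpeg;
            case "JPXDecode": return ImageStreamKind.Jpeg2000;
            case "JBIG2Decode": return ImageStreamKind.Jbig2;
            case "CCITTFaxDecode": return ImageStreamKind.CcittFax;
            // FlateDecode, LZWDecode, RunLengthDecode, or no filter at all:
            // the stream holds pixel samples, not an image file.
            default: return ImageStreamKind.RawSamples;
        }
    }

    // Only JPEG data is something Image.FromStream can typically read
    // directly from the raw stream bytes.
    public static bool IsDirectlyLoadable(ImageStreamKind kind)
    {
        return kind == ImageStreamKind.Jpeg;
    }
}
```

With this, only streams where IsDirectlyLoadable returns true are good candidates for passing the raw bytes straight to Image.FromStream; the rest need decoding or a container format first.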
PS: When you have an exception, PLEASE include the stack trace every single time. Pretty please with sugar on top?
Works for me like this, using these two methods:
In newer versions of iTextSharp, the first parameter of ImageRenderInfo.CreateForXObject is no longer a Matrix but a GraphicsState. @der_chirurg's approach should work. I tested it myself with the information from the following link and it worked beautifully: http://www.thevalvepage.com/swmonkey/2014/11/26/extract-images-from-pdf-files-using-itextsharp/
To extract all images on all pages, it is not necessary to implement the different filters yourself. iTextSharp has an image renderer that saves all images in their original image type.
Just follow the example found here: http://kuujinbo.info/iTextSharp/CCITTFaxDecodeExtract.aspx. You don't need to implement the HttpHandler...
I added a library on GitHub that extracts the images in a PDF and compresses them.
It could be useful when you start playing with the very powerful iTextSharp library.
Here is the link: https://github.com/rock-walker/PdfCompression
This works for me and I think it's a simple solution:
Write a custom RenderListener and implement its RenderImage method, something like this
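A minimal sketch of such a listener, assuming iTextSharp 5.x and its iTextSharp.text.pdf.parser namespace. ImageCollector, PdfImages, and ExtractAll are hypothetical names chosen for illustration. Note that GetDrawingImage can still fail for formats GDI+ cannot read (JPEG 2000 or JBIG2, for example), so the sketch skips those images rather than aborting.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical listener: collects every image the content parser reports.
public class ImageCollector : IRenderListener
{
    public List<Image> Images { get; } = new List<Image>();

    // Text callbacks are required by the interface but unused here.
    public void BeginTextBlock() { }
    public void EndTextBlock() { }
    public void RenderText(TextRenderInfo renderInfo) { }

    public void RenderImage(ImageRenderInfo renderInfo)
    {
        try
        {
            PdfImageObject image = renderInfo.GetImage();
            if (image == null) return;
            Image drawing = image.GetDrawingImage();
            if (drawing != null) Images.Add(drawing);
        }
        catch (Exception)
        {
            // Skip images that iTextSharp or GDI+ cannot decode.
        }
    }
}

public static class PdfImages
{
    // Runs the listener over every page of the PDF at the given path.
    public static List<Image> ExtractAll(string path)
    {
        var reader = new PdfReader(path);
        try
        {
            var parser = new PdfReaderContentParser(reader);
            var collector = new ImageCollector();
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                parser.ProcessContent(page, collector);
            }
            return collector.Images;
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Calling PdfImages.ExtractAll(@"reader.pdf") would then return the images that could be decoded, in the order the parser encounters them.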
I have used this library in the past without any problems.
http://www.winnovative-software.com/PdfImgExtractor.aspx