How To Load Data From PDF File in SSIS Package Using Script Component
How To Load Data From PDF File in SSIS Package Using Script Component
component
I have come across the case where I have to read PDF document, parse it and write content
to database using SSIS. PDF document is not usual source for any data so there is no
component available to read PDF file from SSIS. I have used iTextSharp library in Script
Component to read the PDF file from SSIS package. I also found very good link where I
found good example on how to use iTextsharp library.
Using iTextsharp library it is very easy to process PDF document. In my case it was simple
case of reading PDF document page by page and parses the content using custom logic. You
can do much more than just reading PDF document using this library.
To read data from PDF file and write to database, I have used Script Component as data
source, which reads the PDF document and return result and output record of Script
Component. The simply used OLE DB destination to write it to database. You might be
aware that any third party dlls can be used in SSIS Script component by referencing it. But
to add it as reference and use it in script component you have to register dll in GAC. It can
be referenced without adding it but dll might not be available on the production
environment at the same location. So better add them to GAC. In my case I was not
allowed to add dll to production server directly, so I used .Net reflection to load the dll
dynamically and added dll location as configuration in SSIS package.
Below is my code
//Below two variable are used to pass ITextSharp dll location and PDF file
path.
//Which ultimately supplied from dts config file.
String configITextPath = Variables.itextdllfilepath;
String configSourceDataFile = Variables.filepath;
Type extractorType =
assembly.GetType("iTextSharp.text.pdf.parser.PdfTextExtractor");
Type extractionStrategyType =
assembly.GetType("iTextSharp.text.pdf.parser.LocationTextExtractionStrategy")
;
//code to process PDF file content and create Output rows for script
component
}
I know better way to do this is to add dll in GAC and refer from there. But in my case I have
to quickly found the solution to process few PDF document suing SSIS package, so I have
used reflection to load it.