Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
0% found this document useful (0 votes)
154 views

How To Load Data From PDF File in SSIS Package Using Script Component

The document discusses loading data from PDF files into an SSIS package using a script component. It describes using the iTextSharp library to parse PDF documents page by page and extract text. The script component acts as the data source, loading the PDF using iTextSharp reflection to avoid adding the DLL to the GAC. Output rows are generated from the parsed PDF text and written to a database using an OLE DB destination.

Uploaded by

Mohamed Zaki
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
154 views

How To Load Data From PDF File in SSIS Package Using Script Component

The document discusses loading data from PDF files into an SSIS package using a script component. It describes using the iTextSharp library to parse PDF documents page by page and extract text. The script component acts as the data source, loading the PDF using iTextSharp reflection to avoid adding the DLL to the GAC. Output rows are generated from the parsed PDF text and written to a database using an OLE DB destination.

Uploaded by

Mohamed Zaki
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 2

How to load data from PDF file in SSIS package using script

component
I have come across the case where I have to read PDF document, parse it and write content
to database using SSIS. PDF document is not usual source for any data so there is no
component available to read PDF file from SSIS. I have used iTextSharp library in Script
Component to read the PDF file from SSIS package. I also found very good link where I
found good example on how to use iTextsharp library.

Using iTextsharp library it is very easy to process PDF document. In my case it was simple
case of reading PDF document page by page and parses the content using custom logic. You
can do much more than just reading PDF document using this library.

To read data from PDF file and write to database, I have used Script Component as data
source, which reads the PDF document and return result and output record of Script
Component. The simply used OLE DB destination to write it to database. You might be
aware that any third party dlls can be used in SSIS Script component by referencing it. But
to add it as reference and use it in script component you have to register dll in GAC. It can
be referenced without adding it but dll might not be available on the production
environment at the same location. So better add them to GAC. In my case I was not
allowed to add dll to production server directly, so I used .Net reflection to load the dll
dynamically and added dll location as configuration in SSIS package.

Below is my code

//Below two variable are used to pass ITextSharp dll location and PDF file
path.
//Which ultimately supplied from dts config file.
String configITextPath = Variables.itextdllfilepath;
String configSourceDataFile = Variables.filepath;

//Load ITextSharp library


Assembly assembly = Assembly.LoadFrom(configITextPath);

//Get PdfReader object and create its instance


Type readerType = assembly.GetType("iTextSharp.text.pdf.PdfReader");
object reader = Activator.CreateInstance(readerType, new object[] {
configSourceDataFile });

Type extractorType =
assembly.GetType("iTextSharp.text.pdf.parser.PdfTextExtractor");

Type extractionStrategyType =
assembly.GetType("iTextSharp.text.pdf.parser.LocationTextExtractionStrategy")
;

object extractionStrategy = Activator.CreateInstance(extractionStrategyType);

//Get number of pages in PDF document


Int32 pages =
(Int32)reader.GetType().GetProperty("NumberOfPages").GetValue(reader, null);
//Read PDF document page by page and process it and generate Output rows for
script component.
String pageText;
List<PepRecord> pepRecords = new List<PepRecord>();
for (int page = configStartPageNumber; page <= pages; page++)
{
reader = Activator.CreateInstance(readerType, new object[] {
configSourceDataFile });
extractionStrategy = Activator.CreateInstance(extractionStrategyType);
pageText =
(String)extractorType.GetMethods()[0].Invoke(reader, new Object[] { reader,
page, extractionStrategy });

//code to process PDF file content and create Output rows for script
component
}

I know better way to do this is to add dll in GAC and refer from there. But in my case I have
to quickly found the solution to process few PDF document suing SSIS package, so I have
used reflection to load it.

You might also like