Does anyone have any experience reading PDFs via VBA? Is it possible? I am getting batches of PDF files, all in the same layout, that contain three pieces of information that I need to gather. Opening them individually is time-consuming, so I am looking for an alternative.
Thanks in advance for your assistance.
Ken
Read pdf file via VBA
-
- 3StarLounger
- Posts: 308
- Joined: 24 Feb 2010, 13:41
Read pdf file via VBA
Last edited by HansV on 27 Mar 2011, 12:53, edited 1 time in total.
Reason: to correct typo in subject
-
- Administrator
- Posts: 79317
- Joined: 16 Jan 2010, 00:14
- Status: Microsoft MVP
- Location: Wageningen, The Netherlands
Re: Read pdf file via VBA
If you have Adobe Reader, you may be able to use the Adobe Acrobat n.0 Type Library, but I don't know whether you can find text in a PDF file - among other things it depends on the PDF file: for example, a scanned document is basically an image that can't be searched.
Unfortunately, the documentation from Adobe is rather esoteric. And since I don't have Adobe Reader myself, I can't create or test code.
Best wishes,
Hans
-
- StarLounger
- Posts: 81
- Joined: 08 Feb 2010, 21:48
- Location: Wellington, New Zealand
Re: Read pdf file via VBA
With Word VBA it is possible to open PDF files using "Documents.Open ... Format:=wdOpenFormatText" - which is probably the same as opening them manually using the "Recover Text from Any File" option. The problem is that this doesn't give you much useful text - usually just some metadata, and not much else.
I have used this process to open multiple PDF files and extract link - hyperlink and email - details, but the success of this has been dependent on the application used to create the PDF files (my current creator, Acrobat 9 Professional, doesn't create files that "expose" this information, whereas this information is available in files created using my previous creator - Acrobat 7 Standard).
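To check what this approach returns for a particular batch of files, a minimal Word VBA sketch along these lines might help. The function name OpenPdfAsRawText is hypothetical, and the choice of read-only, invisible opening is an assumption rather than anything described above. As noted, the result is often little more than metadata, so treat it as a quick diagnostic rather than a reliable extractor.

```vba
' Sketch only: intended to run in Word VBA.
' OpenPdfAsRawText is a hypothetical helper name, not part of any library.
Public Function OpenPdfAsRawText(ByVal strFileName As String) As String
    Dim doc As Document
    ' Open the file as plain text, read-only and hidden,
    ' without asking to confirm the conversion.
    Set doc = Documents.Open(FileName:=strFileName, _
        ConfirmConversions:=False, ReadOnly:=True, _
        AddToRecentFiles:=False, Format:=wdOpenFormatText, _
        Visible:=False)
    ' Return whatever text Word managed to recover.
    OpenPdfAsRawText = doc.Range.Text
    doc.Close SaveChanges:=wdDoNotSaveChanges
End Function
```

The returned string can then be searched with InStr to see whether the labels of interest survive in the recovered text.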
-
- 2StarLounger
- Posts: 148
- Joined: 26 Dec 2010, 18:17
Re: Read pdf file via VBA
William wrote: ...the success of this has been dependent on the application used to create the PDF files (my current creator, Acrobat 9 Professional, doesn't create files that "expose" this information, whereas this information is available in files created using my previous creator - Acrobat 7 Standard).
I wonder whether this might be related to adding hidden metadata for full text indexing? See "Adobe Acrobat 9 Standard * Create and manage an index in a PDF" for how to add it (maybe it was added by default in Acrobat 7?).
-
- 2StarLounger
- Posts: 103
- Joined: 04 Feb 2010, 22:44
- Location: Melbourne Australia
Re: Read pdf file via VBA
Years ago I manipulated PDF files using VBA and it worked reasonably well.
I can't remember the sources I used for my fiddling but these links will give you some useful areas to start looking at...
http://www.adobe.com/content/dam/Adobe/ ... Script.pdf
http://diaryproducts.net/for/programmer ... javascript
http://www.adobe.com/devnet/acrobat/overview.html#IAC
http://www.planetpdf.com/developer/arti ... t&gid=6624
Andrew Lockton
Melbourne Australia
-
- 4StarLounger
- Posts: 508
- Joined: 17 Dec 2010, 03:14
Re: Read pdf file via VBA
Here's some code I've used:
Public Function ReadAcrobatDocument(strFileName As String) As String
    'Note: A Reference to the Adobe Library must be set in Tools|References!
    Dim AcroApp As CAcroApp, AcroAVDoc As CAcroAVDoc, AcroPDDoc As CAcroPDDoc
    Dim AcroTextSelect As CAcroPDTextSelect
    Dim PageNumber As Object, PageContent As Object
    Dim Content As String, i As Long, j As Long
    Set AcroApp = CreateObject("AcroExch.App")
    Set AcroAVDoc = CreateObject("AcroExch.AVDoc")
    ' The second argument is a temporary window title, so pass an empty string
    ' rather than vbNull (which is a VarType constant, not a null string).
    If AcroAVDoc.Open(strFileName, "") <> True Then Exit Function
    ' The following While-Wend loop shouldn't be necessary but timing issues may occur.
    While AcroAVDoc Is Nothing
        Set AcroAVDoc = AcroApp.GetActiveDoc
    Wend
    Set AcroPDDoc = AcroAVDoc.GetPDDoc
    For i = 0 To AcroPDDoc.GetNumPages - 1
        Set PageNumber = AcroPDDoc.AcquirePage(i)
        Set PageContent = CreateObject("AcroExch.HiliteList")
        If PageContent.Add(0, 9000) <> True Then Exit Function
        Set AcroTextSelect = PageNumber.CreatePageHilite(PageContent)
        ' The next line is needed to avoid errors with protected PDFs that can't be read
        On Error Resume Next
        For j = 0 To AcroTextSelect.GetNumText - 1
            Content = Content & AcroTextSelect.GetText(j)
        Next j
    Next i
    ReadAcrobatDocument = Content
    AcroAVDoc.Close True
    AcroApp.Exit
    Set AcroAVDoc = Nothing: Set AcroApp = Nothing
End Function
You can then call the function with code like:
Sub Demo()
    Dim strPDF As String
    ' The next ten lines and the last line in this sub can help if
    ' you get "ActiveX component can't create object" errors even
    ' though a Reference to Acrobat is set in Tools|References.
    Dim bTask As Boolean
    bTask = True
    If Tasks.Exists(Name:="Adobe Acrobat Professional") = False Then
        bTask = False
        Dim AdobePath As String, WshShell As Object
        Set WshShell = CreateObject("Wscript.shell")
        AdobePath = WshShell.RegRead("HKEY_CLASSES_ROOT\acrobat\shell\open\command\")
        AdobePath = Trim(Left(AdobePath, InStr(AdobePath, "/") - 1))
        Shell AdobePath, vbHide
    End If
    'Replace FilePath & Filename with the correct FilePath & Filename for the pdf file to be read.
    strPDF = ReadAcrobatDocument("FilePath & Filename")
    ActiveDocument.Range.InsertAfter strPDF
    If bTask = False Then Tasks.Item("Adobe Acrobat Professional").Close
End Sub
Note: This code is perhaps a little more complicated than you'll find elsewhere because I'm using Acrobat Pro 8 on Windows 7, where it isn't fully supported (Acrobat Pro 9 is the first version fully supported on Windows 7).
Paul Edstein
[Fmr MS MVP - Word]
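Since the original question mentions batches of PDF files, a reader function such as ReadAcrobatDocument from the post above could be called in a loop over a folder. The following is only a sketch: ListPdfFiles is a hypothetical helper name and the folder path is a placeholder, neither of which comes from the thread.

```vba
' Sketch only: ListPdfFiles is a hypothetical helper name.
' Returns the full paths of the files matching *.pdf in one folder.
Public Function ListPdfFiles(ByVal strFolder As String) As Collection
    Dim colFiles As New Collection
    Dim strName As String
    ' Make sure the folder path ends with a backslash.
    If Right$(strFolder, 1) <> "\" Then strFolder = strFolder & "\"
    ' The first Dir call starts the search; later calls return the next match.
    strName = Dir(strFolder & "*.pdf")
    Do While Len(strName) > 0
        colFiles.Add strFolder & strName
        strName = Dir()
    Loop
    Set ListPdfFiles = colFiles
End Function

Sub ListBatch()
    Dim vFile As Variant
    ' "C:\PDFs" is a placeholder folder; substitute your own.
    For Each vFile In ListPdfFiles("C:\PDFs")
        ' Replace this line with a call to a reader function,
        ' for example ReadAcrobatDocument(CStr(vFile)).
        Debug.Print vFile
    Next vFile
End Sub
```

Collecting the paths first, rather than reading each file inside the Dir loop, keeps the folder scan separate from the Acrobat automation, which makes it easier to see which file caused a problem.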
-
- 3StarLounger
- Posts: 308
- Joined: 24 Feb 2010, 13:41
Re: Read pdf file via VBA
Hans, William, jscher, Guessed and Paul,
THANKS! Still struggling with trying to use the various approaches to reading specific lines within the pdf file. The pdf files may be more than one page, but everything I need is in the top 10 lines or so on the first page.
The information I need will always be prefaced with the same string per field needed. For example:
Line 3 "Student #: " would precede the information I need which is the student's number which will always be 8 characters
Line 6 "Home Room #: " would precede the room number which will always be 9 characters
Line 12 "Date of Enrollment:" would always precede the date enrolled which will always be 8 characters.
So I must find a mechanism to search the PDF for these labels and then capture the XX characters that follow each one. Is that possible?
Thanks in advance for your ideas.
-
- 4StarLounger
- Posts: 508
- Joined: 17 Dec 2010, 03:14
Re: Read pdf file via VBA
Hi Ken,
If your PDFs always have the same format, what you should be able to do is to read in the data, then discard however many characters precede the start of what you're interested in, along with however many characters follow the maximum length of what you're interested in, then parse what's left for the data you're interested in. For example, you might use the line:
strPDF = Mid(strPDF, 500, 250)
to disregard anything before the 500th character in the file and anything from the 750th character onward. That leaves just 250 characters to parse. Some trial and error will be required for the Mid arguments, since the character positions in the output won't necessarily correspond with what you can see in the PDF.
If you have problems doing this, post a sample PDF and we'll see what we can do.
Paul Edstein
[Fmr MS MVP - Word]
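As an alternative to fixed character positions, Ken's idea of searching for each label and capturing the characters that follow can be sketched with InStr and Mid. This is a minimal sketch, not tested against real Acrobat output: ExtractAfterLabel is a hypothetical helper name, and the sample string in DemoExtract is made-up text standing in for whatever the PDF reader returns. The spacing in real output may differ from what is visible in the PDF, so the labels may need adjusting.

```vba
' Sketch only: ExtractAfterLabel is a hypothetical helper name.
' Finds strLabel in strText and returns the lngLength characters that
' follow it, skipping any spaces directly after the label.
Public Function ExtractAfterLabel(ByVal strText As String, _
        ByVal strLabel As String, ByVal lngLength As Long) As String
    Dim lngPos As Long
    Dim strRest As String
    ' Case-insensitive search for the label.
    lngPos = InStr(1, strText, strLabel, vbTextCompare)
    If lngPos = 0 Then Exit Function   ' label not found: returns ""
    ' Everything after the label, with leading spaces removed.
    strRest = LTrim$(Mid$(strText, lngPos + Len(strLabel)))
    ExtractAfterLabel = Left$(strRest, lngLength)
End Function

Sub DemoExtract()
    Dim strPDF As String
    ' Made-up sample text; in practice this would come from the PDF.
    strPDF = "Name: Jane Doe Student #: 12345678 " & _
             "Home Room #: RM-204-B1 Date of Enrollment: 09/01/10"
    Debug.Print ExtractAfterLabel(strPDF, "Student #:", 8)          ' 12345678
    Debug.Print ExtractAfterLabel(strPDF, "Home Room #:", 9)        ' RM-204-B1
    Debug.Print ExtractAfterLabel(strPDF, "Date of Enrollment:", 8) ' 09/01/10
End Sub
```

Searching by label avoids the trial and error of fixed offsets, at the cost of depending on the labels appearing in the extracted text exactly as typed.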