Converting PDF to Word - inserts heaps of spaces

Peter Kinross
5StarLounger
Posts: 962
Joined: 09 Feb 2010, 00:33
Location: Patterson Lakes, Victoria, Australia

Converting PDF to Word - inserts heaps of spaces

Post by Peter Kinross »

One of my business sources produces PDFs that introduce spaces within words when converting (Saving As) to Word DOCX.
See the small samples attached. Small docs like this can be fixed manually, but larger ones - nope!
I have used several converters: Phantom, Adobe and a few online converters. All give the same result.
How can I avoid this?
Avagr8day, regards, Peter

HansV
Administrator
Posts: 78230
Joined: 16 Jan 2010, 00:14
Status: Microsoft MVP
Location: Wageningen, The Netherlands

Re: Converting PDF to Word - inserts heaps of spaces

Post by HansV »

Garbage in, garbage out. The spaces are already present in the PDF file:
[Attached screenshot: S3107.png]
Word faithfully reproduces them...
Best wishes,
Hans

Peter Kinross
5StarLounger
Posts: 962
Joined: 09 Feb 2010, 00:33
Location: Patterson Lakes, Victoria, Australia

Re: Converting PDF to Word - inserts heaps of spaces

Post by Peter Kinross »

Gotta love GIGO!
Word seems to amplify the spaces, but as they are pre-existing, not much can be done.

Thanks anyway Hans.
Avagr8day, regards, Peter

Charles Kenyon
4StarLounger
Posts: 596
Joined: 10 Jan 2016, 15:56
Location: Madison, Wisconsin

Re: Converting PDF to Word - inserts heaps of spaces

Post by Charles Kenyon »

You might want to clue in your associate that converted files always carry a high overhead of garbage, even if you do not see it. They are usually an editing nightmare, and the longer they are, the worse it is.

If possible, I copy and paste as plain text and then reformat using Styles.

Peter Kinross
5StarLounger
Posts: 962
Joined: 09 Feb 2010, 00:33
Location: Patterson Lakes, Victoria, Australia

Re: Converting PDF to Word - inserts heaps of spaces

Post by Peter Kinross »

Thanks Charles. Not incredibly long docs, but heaps of formatting, so cutting and pasting as plain text is not really an option. I have never had any problem saving a PDF as a Word docx, save for this lot. The institution that outputs these PDFs is not very concerned. Very frustrating.
Avagr8day, regards, Peter

kdock
5StarLounger
Posts: 720
Joined: 21 Aug 2011, 21:01
Location: The beautiful hills of Western North Carolina

Re: Converting PDF to Word - inserts heaps of spaces

Post by kdock »

Peter Kinross wrote: Thanks Charles. Not incredibly long docs, but heaps of formatting, so cutting and pasting as plain text is not really an option. I have never had any problem saving a PDF as a Word docx, save for this lot. The institution that outputs these PDFs is not very concerned. Very frustrating.
Aggravating! Twenty-some years ago (when I was 12), our firm was faced with horribly "corrupt" (that is, very badly formatted) documents that were WordPerfect docs subjected to Word's "conversion" process, which really didn't work for many, many reasons. Point is, we were faced with hundreds (soon to be thousands) of documents that were all very frustrating.

Here's what we did. We analyzed the incoming docs to see what we could turn into styles. Then we created a template with those styles. Then we created a macro that would allow us to move from paragraph to paragraph applying those styles. We could reformat a fifty-page heavily formatted document in five minutes instead of fixing the formatting that came with it. If we came across a format that wasn't yet turned into a style, we added it to the template. If we were faced with a different type of document with different formatting, we created a new template.

Not to beat a dead horse, but I'm with Charles. IF the PDFs you get from this one frustrating source have similar formatting to each other, you could do worse than open the PDF in one window (so you know what the formatting should look like), then strip the doc down to its skivvies and start over in another window applying styles. You will ultimately save a lot of time.

Sermon over. Good luck with them whatever you do. Kim
"Hmm. What does this button do?" Said everyone before being ejected from a car, blown up, or deleting all the data from the mainframe.

HansV
Administrator
Posts: 78230
Joined: 16 Jan 2010, 00:14
Status: Microsoft MVP
Location: Wageningen, The Netherlands

Re: Converting PDF to Word - inserts heaps of spaces

Post by HansV »

I fear that applying styles won't remove spurious spaces from within words...
Best wishes,
Hans

kdock
5StarLounger
Posts: 720
Joined: 21 Aug 2011, 21:01
Location: The beautiful hills of Western North Carolina

Re: Converting PDF to Word - inserts heaps of spaces

Post by kdock »

HansV wrote: I fear that applying styles won't remove spurious spaces from within words...
No, but stripping all manual formatting will... Or do you mean the extra spaces within words? That might only be addressed with spell and grammar check... <sigh> Unfortunately, as stated above, GIGO.

K
"Hmm. What does this button do?" Said everyone before being ejected from a car, blown up, or deleting all the data from the mainframe.

Peter Kinross
5StarLounger
Posts: 962
Joined: 09 Feb 2010, 00:33
Location: Patterson Lakes, Victoria, Australia

Re: Converting PDF to Word - inserts heaps of spaces

Post by Peter Kinross »

Very interesting and full response. Thanks Kim.
As Hans says, unfortunately GIGO wins!
Ideally we need a spell checker that looks for words with spaces. Then just 2 clicks per word would fix it.
What Word's spell checker does is highlight the Approv in 'Approv al'; the correction gives 'Approval al'. No better off.
Even a perfect spell checker would need about 30 two-click fixes per page in these docs.
Hey, interestingly, the Lounge spell checker did what Word's won't, and highlighted the full 'Approv al', offering Approval as the correction. All formatting is lost, but we could copy and paste the text to the Lounge as a topic, correct the mistakes, copy and paste it back into a new Word doc and cancel the topic. Then apply Kim's solution. Although it would work, it is way too big a task for the multiple jobs at hand.
I tried opening the doc in LibreOffice Writer to see if that spell checker was better than Word's - it isn't.
Thanks all again.
If anyone wants to have a play with it, here are 2 paragraphs of the original.
Please read this document alongside y our prev ious Statement of Adv ice (SoA) which contains details of y our relev ant personal
circumstances that I hav e used to prepare this f urther adv ice. If y ou require a copy of y our prev ious SoA, please contact me.
I hav e considered y our prev ious SoA and it is the basis of my adv ice in this Record of Adv ice (RoA). It is not signif icantly dif f erent f rom that adv ice and remains appropriate based on y our personal circumstances (including y our needs and objectiv es).
Avagr8day, regards, Peter
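
The "spell checker that looks for words with spaces" idea can be sketched in a few lines of Python. This is only an illustration, not anything Word or the Lounge actually runs: it assumes the converted text has been saved as plain text (so formatting is lost), and `KNOWN_WORDS` and `rejoin` are made-up names, with a tiny hand-typed word list standing in for a real dictionary. Neighbouring fragments are joined only when the joined form is a known word and at least one fragment is not, so genuine pairs such as "a part" are left alone.

```python
# Hypothetical mini word list; a real run would load a full dictionary file.
KNOWN_WORDS = {
    "your", "you", "previous", "advice", "relevant", "have", "further",
    "significantly", "different", "from", "objectives", "approval",
    "a", "part", "apart", "of", "it", "my", "the", "our", "ice",
}

STRIP = ".,;:()'\""


def rejoin(text, words=KNOWN_WORDS, max_parts=3):
    """Join space-split fragments when the joined form is a known word."""
    tokens = text.split(" ")
    out = []
    i = 0
    while i < len(tokens):
        merged = False
        # Try the longest join first so "dif f erent" becomes one word.
        for n in range(max_parts, 1, -1):
            parts = tokens[i:i + n]
            if len(parts) < n or not all(parts):
                continue
            joined = "".join(parts)
            is_word = joined.strip(STRIP).lower() in words
            # Guard: at least one fragment must NOT be a word by itself.
            has_fragment = any(p.strip(STRIP).lower() not in words for p in parts)
            if is_word and has_fragment:
                out.append(joined)
                i += n
                merged = True
                break
        if not merged:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


if __name__ == "__main__":
    print(rejoin("Please read this document alongside y our prev ious Statement of Adv ice"))
```

With a full dictionary the guard will still miss cases where both fragments happen to be real words, so the output would need a read-through.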

kdock
5StarLounger
Posts: 720
Joined: 21 Aug 2011, 21:01
Location: The beautiful hills of Western North Carolina

Re: Converting PDF to Word - inserts heaps of spaces

Post by kdock »

I think you mentioned that the folk behind the source pdfs are not interested in changing their evil ways. I noticed they use a program called wkhtmltopdf from GitHub, a program intended to turn html into a pdf. The document also has the Arial font embedded in it. It's unusual for an html page to use Arial, but it's possible. If the source html is not displayed in Arial, that could be the source of the bad formatting in the pdf. It could also be that the font in the html page is not on the computer that does the rendering. html often specifies alternate fonts such as Arial, but that doesn't ensure the page will look good once it's rendered.
Of course that and three dollars will buy you a cup of coffee. :scratch:

Unfortunately, without the cooperation of the source folk it looks like you're stuck with some manual cleanup. For what it's worth (possibly add another three dollars to make anything of this), there seem to be recurring spacing errors, probably where kerning failed to pull the letters close enough together to be perceived as a single word: "y our" and "y ou" (words with a y), "adv ice" and other words with a v, and words with a space after an f, such as "signif icantly", "f urther" and "f rom". I wonder if you can come up with a search and replace that would handle some of these consistently mis-spaced words?
"Hmm. What does this button do?" Said everyone before being ejected from a car, blown up, or deleting all the data from the mainframe.
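
A search and replace over the consistently mis-spaced words could look something like the Python sketch below. It is a hedged illustration only: it assumes the text has been exported as plain text, the `FIXES` table and the function name `fix_known_breaks` are invented for this example, and the entries are simply the fragments visible in the two sample paragraphs posted in this thread. The same pairs could just as well be typed one by one into Word's Find and Replace (Ctrl + H).

```python
import re

# Fragments copied from the sample paragraphs in this thread; extend as needed.
FIXES = {
    "y our": "your",
    "y ou": "you",
    "adv ice": "advice",
    "prev ious": "previous",
    "relev ant": "relevant",
    "hav e": "have",
    "f urther": "further",
    "f rom": "from",
    "signif icantly": "significantly",
    "dif f erent": "different",
    "objectiv es": "objectives",
}

# Whole-fragment matches only; longest keys are listed first.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(FIXES, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def _replace(match):
    found = match.group(0)
    fixed = FIXES[found.lower()]
    # Keep an initial capital, as in "Adv ice" -> "Advice".
    return fixed.capitalize() if found[0].isupper() else fixed


def fix_known_breaks(text):
    """Replace each known broken fragment with its joined word."""
    return _PATTERN.sub(_replace, text)


if __name__ == "__main__":
    print(fix_known_breaks("If y ou require a copy of y our prev ious SoA, please contact me."))
```

The word-boundary anchors matter: without them "may our" would match the "y our" entry. A fixed table only repairs fragments it already knows, so new broken words have to be added as they turn up.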

Peter Kinross
5StarLounger
Posts: 962
Joined: 09 Feb 2010, 00:33
Location: Patterson Lakes, Victoria, Australia

Re: Converting PDF to Word - inserts heaps of spaces

Post by Peter Kinross »

Thanks Kim. I sent your very interesting and knowledgeable comments to the guilty party. Who knows, maybe common sense will prevail. Although with an investment company that is HIGHLY unlikely.
Avagr8day, regards, Peter

silverback
5StarLounger
Posts: 771
Joined: 29 Jan 2010, 13:30

Re: Converting PDF to Word - inserts heaps of spaces

Post by silverback »

Peter
I take it that these pesky spaces are just 'ordinary' spaces and not 'special' spaces that Word allows in global replace (Ctrl + H | Special) i.e. "non-breaking space" or "white space"
If they were, it would make getting rid of them very easy :smile:
Just a thought.
Silverback
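
One way to check whether the spaces are ordinary or special, outside Word, is to count the space characters by Unicode code point. The Python sketch below assumes the converted text has been pasted or saved as plain text; `whitespace_inventory` is a made-up helper name for illustration.

```python
import unicodedata
from collections import Counter


def whitespace_inventory(text):
    """Count each kind of space character found in the text."""
    counts = Counter()
    for ch in text:
        # "Zs" is the Unicode space-separator category; it covers the
        # ordinary space (U+0020) and the non-breaking space (U+00A0).
        if unicodedata.category(ch) == "Zs":
            counts[f"U+{ord(ch):04X} {unicodedata.name(ch)}"] += 1
    return dict(counts)


if __name__ == "__main__":
    sample = "y our prev ious\u00a0SoA"
    for kind, count in whitespace_inventory(sample).items():
        print(kind, count)
```

If everything comes back as U+0020 SPACE, the stray spaces cannot be told apart from the real ones by character alone, and a blanket replace would strip every space in the document.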

Peter Kinross
5StarLounger
Posts: 962
Joined: 09 Feb 2010, 00:33
Location: Patterson Lakes, Victoria, Australia

Re: Converting PDF to Word - inserts heaps of spaces

Post by Peter Kinross »

What a thought - it would make it too easy!
Sadly no, they are ordinary spaces. A replace gave a doc with no spaces.
Avagr8day, regards, Peter