Test characteristic (of a MSWord document)

ChrisGreaves
PlutoniumLounger
Posts: 15641
Joined: 24 Jan 2010, 23:23
Location: brings.slot.perky

Post by ChrisGreaves »

(Word2003)

Function blnIsWordDocument(strFullName As String) As Boolean
    ' Return True if we have a DOC or a DOT; that is, if we feel qualified to use Documents.Open on this file.
    ' Note: this looks only at the last four characters of the name, not at the file's contents.
    Select Case UCase$(Right$(strFullName, 4))
        Case ".DOC", ".DOT"
            blnIsWordDocument = True
        Case Else
            blnIsWordDocument = False
    End Select
    ' Notes copied from the help:
    ' FileType Property: The constant msoFileTypeOfficeFiles includes all files with any of the following extensions: *.doc, *.xls, *.ppt, *.pps, *.obd, *.mdb, *.mpd, *.dot, *.xlt, *.pot, *.obt, *.htm, or *.html.
    ' CanOpen Property: True if the specified file converter is designed to open files. Read-only Boolean.
    ' Open Method, ConfirmConversions: Optional Variant. True to display the Convert File dialog box if the file isn't in Microsoft Word format.
    ' ConfirmConversions Property: True if Word displays the Convert File dialog box before it opens or inserts a file that isn't a Word document or template. In the Convert File dialog box, the user chooses the format to convert the file from. Read/write Boolean.
End Function
My utility function works for now, going by the extensions listed under the FileType Property, but I wonder if anyone has found a better way to test whether a file is a valid Word document.
The file extension is a guide, but a file's name is just a name; it is not the same thing as the file's contents. Which is why we rename our passwords document with an extension like “MRG” or “PNH”, but not “94s2wxbwwvhzfl”, right?
I used to test (attached) for WP51 files with

    If blnChar(intFile, "WPC", 1) Then
        If blnHex(intFile, "01", 8) Then
            If blnHex(intFile, "0A", 9) Then
                blnWP51Doc = True ' matches on 3 known criteria
            End If
        End If
    End If
but I have not found any scheme for identifying an MSWord document by its content.
Given a few years I could probably cobble together code that uses some of the CanOpen Property[\b], Open Method[\b], or ConfirmConversions[\b], but anything that depends on whether or not a time-wasting dialogue box confronts the user is no good, because I cannot then process a batch of files unattended (“Computers are good at doing boring and repetitive tasks”).

Thanks
Chris
He who plants a seed, plants life.

HansV
Administrator
Posts: 78573
Joined: 16 Jan 2010, 00:14
Status: Microsoft MVP
Location: Wageningen, The Netherlands

Post by HansV »

Did you change [/b] to [\b] in LoungeBold? :evilgrin:
Best wishes,
Hans

Post by ChrisGreaves »

HansV wrote:Did you change [/b] to [\b] in LoungeBold? :evilgrin:
:hairout: :scratch: I am losing it here in the Cafe.
I know that I messed up in my posting about using [b] and [/b], which I had to change to {b} and {/b}. Maybe I should have written [b] and [\b] and changed it to {b} and {\b}. Or not?
Cheers
Chris

William
StarLounger
Posts: 79
Joined: 08 Feb 2010, 21:48
Location: Wellington, New Zealand

Post by William »

Chris

There used to be a WordBasic command ["FileCreator$()"] that could tell you if a file was created by Word, but it was a Mac-only thing so may be of no use to you - even if it still works.

If I were trying to do what you're doing, I wouldn't rely solely on checking a file's name. I'd also check for something that only a Word file would have.

Also, do you not need to worry about the more recent Word file types and their names? Could you not open these automatically in Word 2003 - or earlier - using the Microsoft converter?

Regards.

Jay Freedman
Microsoft MVP
Posts: 1320
Joined: 24 May 2013, 15:33
Location: Warminster, PA

Post by Jay Freedman »

The first two bytes of a Word .doc or .dot file are hex D0 CF.

The first two bytes of a .docx, .docm, .dotx, or .dotm file are hex 50 4B (which are the letters PK, the initials of Phil Katz, the author of PKZip, because the newer Office formats are really zip files with different extensions).
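
A minimal VBA sketch of that byte test, offered as an illustration rather than a finished routine. The function name strFileSignature and its return values are made up for this example. For the older format it checks four bytes rather than two, on the assumption that the file is an OLE2 compound file, whose signature begins D0 CF 11 E0. Excel .xls and PowerPoint .ppt files use the same container, so a match narrows the field but does not prove the file is a Word document.

```vba
Option Explicit

' Hypothetical helper: classify a file by its leading bytes only.
' Returns "OLE2" for D0 CF 11 E0 (the container used by .doc/.dot, but
' also by .xls, .ppt and others), "ZIP" for 50 4B (.docx and friends,
' but also any ordinary zip), or "" for anything else.
Function strFileSignature(strFullName As String) As String
    Dim intFile As Integer
    Dim bytHead(0 To 3) As Byte

    strFileSignature = ""
    intFile = FreeFile
    Open strFullName For Binary Access Read As #intFile
    If LOF(intFile) >= 4 Then
        Get #intFile, 1, bytHead          ' binary file positions are 1-based
        If bytHead(0) = &HD0 And bytHead(1) = &HCF _
           And bytHead(2) = &H11 And bytHead(3) = &HE0 Then
            strFileSignature = "OLE2"
        ElseIf bytHead(0) = &H50 And bytHead(1) = &H4B Then
            strFileSignature = "ZIP"
        End If
    End If
    Close #intFile
End Function
```

From the Immediate window, something like `? strFileSignature("C:\Temp\nhgfdhdh .iuytg566")` (a hypothetical path) reads the file as binary and never opens it as a document. A missing or locked file raises a run-time error, so an unattended batch run would want its own error handling around the call.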

Post by HansV »

Jay Freedman wrote:The first two bytes of a .docx, .docm, .dotx, or .dotm file are hex 50 4B (which are the letters PK, the initials of Phil Katz, the author of PKZip, because the newer Office formats are really zip files with different extensions).
Unfortunately, that means that we can't use those two bytes to distinguish Word files (.docx etc.) from Excel files (.xlsx etc.), PowerPoint files (.pptx etc.) and .zip files...
Best wishes,
Hans

Post by ChrisGreaves »

William wrote:If I were trying to do what you're doing, I wouldn't rely solely on checking a file's name. I'd also check for something that only a Word file would have.
Hi William.
In this sort of work I never look at filenames or extensions.
When a firm asks me to "Convert all our WP5.1 DOS documents", I have to do just that - locate all the WP5.1 DOS documents on their network.
That is why I wrote that little utility function "blnWP51Doc" - it looks at specific tell-tale bytes within the file contents.
I read each file as a binary file; I do not open it as a document.

That is why I am asking for a way to identify an MSWord document by its content.

Such a function should be able to locate my (and your!) Word document that contains all the passwords, even though it has been renamed from "MyPasswords.doc" to "nhgfdhdh .iuytg566".
Cheers
Chris

Post by ChrisGreaves »

Jay Freedman wrote:The first two bytes of a Word .doc or .dot file are hex D0 CF. The first two bytes of a .docx, .docm, .dotx, or .dotm file are hex 50 4B (which are the letters PK, the initials of Phil Katz, the author of PKZip, because the newer Office formats are really zip files with different extensions).
Thank you Jay, I shall give this a try.
Cheers
Chris

Post by ChrisGreaves »

HansV wrote:Unfortunately, that means that we can't use those two bytes to distinguish Word files (.docx etc.) from Excel files (.xlsx etc.), PowerPoint files (.pptx etc.) and .zip files...
Thanks for this, too, Hans.
I'll play around a bit.
One of my tests will be to look for all "D0CF" files that I have NOT named with a "DOC" extension. (Could be interesting!)
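
A rough sketch of how such a hunt might look in VBA. The names blnStartsD0CF and ListDisguisedD0CF are invented for illustration, the scan covers one folder only (no subfolders), and it tests just the two bytes D0 CF, so old-format Excel and PowerPoint files will show up as well.

```vba
Option Explicit

' Hypothetical helper: True if the file starts with the two bytes D0 CF.
Function blnStartsD0CF(strFullName As String) As Boolean
    Dim intFile As Integer
    Dim bytHead(0 To 1) As Byte

    intFile = FreeFile
    Open strFullName For Binary Access Read As #intFile
    If LOF(intFile) >= 2 Then
        Get #intFile, 1, bytHead          ' read bytes 1 and 2 of the file
        blnStartsD0CF = (bytHead(0) = &HD0 And bytHead(1) = &HCF)
    End If
    Close #intFile
End Function

' List the files in one folder that start with D0 CF
' but are not named .DOC or .DOT.
Sub ListDisguisedD0CF(ByVal strFolder As String)
    Dim strName As String

    If Right$(strFolder, 1) <> "\" Then strFolder = strFolder & "\"
    strName = Dir(strFolder & "*.*")      ' normal files only; hidden files are skipped
    Do While Len(strName) > 0
        Select Case UCase$(Right$(strName, 4))
            Case ".DOC", ".DOT"
                ' already named as a Word file; nothing to report
            Case Else
                If blnStartsD0CF(strFolder & strName) Then
                    Debug.Print strFolder & strName
                End If
        End Select
        strName = Dir                     ' next file in the same folder
    Loop
End Sub
```

Running `ListDisguisedD0CF "C:\Temp"` (a hypothetical folder) prints the candidates to the Immediate window. A file locked by another program would raise a run-time error in the Open statement, so a real batch run needs an error handler.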
Cheers
Chris

Post by Jay Freedman »

HansV wrote:
Jay Freedman wrote:The first two bytes of a .docx, .docm, .dotx, or .dotm file are hex 50 4B (which are the letters PK, the initials of Phil Katz, the author of PKZip, because the newer Office formats are really zip files with different extensions).
Unfortunately, that means that we can't use those two bytes to distinguish Word files (.docx etc.) from Excel files (.xlsx etc.), PowerPoint files (.pptx etc.) and .zip files...
That's true. Other than trying to open a suspect file in its alleged parent program, the only way (I think) would be to open it with a zip file manager such as 7-Zip and look for a folder named "word", "xl", "ppt", and so forth.
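
A possible unattended alternative, sketched in VBA under stated assumptions. Zip archives store their entry names uncompressed, so the folder names can be found by searching the raw bytes without unzipping anything. The function names blnFileContainsName and strZipOfficeKind are made up for this example, and the part names searched for (word/document.xml, xl/workbook.xml, ppt/presentation.xml) are the ones the Office programs normally write; files from other producers may use different names, so a miss is not proof of anything.

```vba
Option Explicit

' Hypothetical helper: True if the raw bytes of the file contain strName
' stored as one byte per character.
Function blnFileContainsName(strFullName As String, strName As String) As Boolean
    Dim intFile As Integer
    Dim lngSize As Long
    Dim bytData() As Byte
    Dim strData As String

    intFile = FreeFile
    Open strFullName For Binary Access Read As #intFile
    lngSize = LOF(intFile)
    If lngSize > 0 Then
        ReDim bytData(0 To lngSize - 1)
        Get #intFile, 1, bytData          ' read the whole file into memory
    End If
    Close #intFile

    If lngSize > 0 Then
        strData = bytData                 ' byte-for-byte copy, no conversion
        ' InStrB works on byte positions; StrConv(..., vbFromUnicode)
        ' reduces the name to one byte per character to match the file.
        blnFileContainsName = _
            (InStrB(1, strData, StrConv(strName, vbFromUnicode), vbBinaryCompare) > 0)
    End If
End Function

' Guess the Office family of a file that starts with 50 4B.
' Returns "" when none of the expected part names is found.
Function strZipOfficeKind(strFullName As String) As String
    If blnFileContainsName(strFullName, "word/document.xml") Then
        strZipOfficeKind = "Word"
    ElseIf blnFileContainsName(strFullName, "xl/workbook.xml") Then
        strZipOfficeKind = "Excel"
    ElseIf blnFileContainsName(strFullName, "ppt/presentation.xml") Then
        strZipOfficeKind = "PowerPoint"
    Else
        strZipOfficeKind = ""
    End If
End Function
```

This reads each file up to three times and holds it in memory, which is wasteful but keeps the sketch short; for a large batch, reading once and searching the same buffer for each name would be the obvious refinement.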

Post by ChrisGreaves »

Jay Freedman wrote:... open it with a zip file manager such as 7-Zip and look for a folder named "word", "xl", "ppt", and so forth.
I am so much out of the swim that I did not know that DOCX and the like were PKZip files. This knowledge opens up a scary new world for everyone but me (grin).
Initially I supposed that I might look for a PKZip-encoded version of the string “folder” or even “word”, but then there must be many such encoded strings bearing the text contents of a document.
Password-protected docx files may not be a problem, depending on the nature of the task. If the task is to convert documents, password protection is an existing problem for the client, not for the consultant. If the task is merely to identify/count documents, then the contents are irrelevant and the problem goes away.
My guess is that by now, many years after Phil's death, someone has written code to decode the basic structure of a zip file.
Cheers
Chris

P.S. This "zip" trick helped me this morning to unscramble an xlsm File. Hooray!!! :clapping:

Post by ChrisGreaves »

Jay Freedman wrote:...the only way (I think) would be to open it with a zip file manager such as 7-Zip and look for a folder named "word", "xl", "ppt", and so forth.
Hello Jay.
I suspect that this manager is an interactive application, which immediately rules it out for processing a batch of, say, 10,000 documents overnight.
(Later: I see from their web site "Powerful command line version", so I have downloaded the 64-bit version and will take a look.)


Back in The Good Old Days a client would drag a folder onto a CD until the CD was full, and then tell me to "take a look".
Can't do that interactively!

Cheers
Chris

Post by ChrisGreaves »

Jay Freedman wrote:The first two bytes of a Word .doc or .dot file are hex D0 CF.
Thanks Jay! 20191219\blnWord97Doc.txt

Cheers
Chris