A better definition of a Sentence in MSWord.

ChrisGreaves
PlutoniumLounger
Posts: 15636
Joined: 24 Jan 2010, 23:23
Location: brings.slot.perky

Post by ChrisGreaves »

Word2003/Win7
This was written as a proof-of-concept and is not exhaustively tested.

It is said that a well-written paragraph holds the idea in its first sentence, and the remaining sentences merely clarify or amplify that first sentence. Speed-readers therefore read only the first sentence of each paragraph, unless they can’t understand it, in which case they read the entire paragraph.
A writer’s clinic wants to take a submitted text and extract only the first sentence from each paragraph.
Microsoft Word’s definition of a sentence is poor, so I dreamed up a “better definition”.

A sentence is an accumulation of consecutive alphabetic atoms until and including an atom that meets or exceeds a specified length and is immediately followed by a terminator.
Means?

For a specified length of 3 (think “Dr.” or “Mr.” or “Mrs.”, which are too short to be valid sentence terminators):-
In “C. Y. O’Connor” the atom “C.” is insufficient because the atom is of length only one alphabetic character preceding the terminator. Continue to accumulate atoms.
In “C. Y. O’Connor” the atom “Y.” is insufficient because the atom is of length only one alphabetic character preceding the terminator. Continue to accumulate atoms.
In “Critics scoffed and said it would never work, and just before the scheme was opened, C. Y. O'Connor rode his horse into the surf near Fremantle and shot himself; in the surf so there'd not be a mess to clean up.” We are scanning left to right for a terminator, and the presence of “C.” and “Y.” mid-text should not stop our accumulation.
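
To make the definition concrete, here is a minimal sketch in Python rather than VBA. It is not the code in the attached template: the function name `first_sentence`, the `min_len` parameter and the choice of `.`, `!` and `?` as terminators are all assumptions made for illustration. Read literally, a three-letter atom such as "Mrs" meets a threshold of 3, so the threshold is left as a parameter.

```python
import re

def first_sentence(paragraph: str, min_len: int = 3) -> str:
    """Return the text up to and including the first terminator that
    immediately follows an alphabetic atom of at least min_len letters.

    Hypothetical helper, not the macro from the attached template.
    """
    # Find each run of letters that is immediately followed by . ! or ?
    for match in re.finditer(r"[A-Za-z]+(?=[.!?])", paragraph):
        if len(match.group()) >= min_len:
            # Include the terminator itself (one character past the atom).
            return paragraph[: match.end() + 1]
        # Atom too short ("C.", "Y.", "Dr."): keep accumulating.
    # No qualifying terminator found: fall back to the whole paragraph.
    return paragraph

sample = "C. Y. O'Connor rode his horse into the surf near Fremantle. The rest is detail."
print(first_sentence(sample))
# -> C. Y. O'Connor rode his horse into the surf near Fremantle.
```

If no atom qualifies, the sketch returns the whole paragraph. That is what happens with the full O'Connor example above at a threshold of 3, since its final atom "up" has only two letters.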

The macro “ExtractFirstSentences” in the attached template YMHA03.dot will apply itself to the ActiveDocument.
The macro can be run on the text you are reading, or on the accompanying sample document.

Cheers
Chris
There's nothing heavier than an empty water bottle

HansV
Administrator
Posts: 78524
Joined: 16 Jan 2010, 00:14
Status: Microsoft MVP
Location: Wageningen, The Netherlands

Re: A better definition of a Sentence in MSWord.

Post by HansV »

Thanks, Chris.
Best wishes,
Hans

John Gray
PlatinumLounger
Posts: 5409
Joined: 24 Jan 2010, 08:33
Location: A cathedral city in England

Re: A better definition of a Sentence in MSWord.

Post by John Gray »

Does your account cope with those enlightened areas of the world where the use of 'simplified punctuation' means that full stops are omitted following title abbreviations or initials in names?

We would use Dr Spock, Ms Take, Mrs Jones, and (in your example) C Y O'Connor.

(Some of us even use the Oxford comma, as in the previous sentence!)
John Gray

"(or one of the team)" - how your hospital appointment letter indicates that you won't be seeing the Consultant...

HansV
Administrator
Posts: 78524
Joined: 16 Jan 2010, 00:14
Status: Microsoft MVP
Location: Wageningen, The Netherlands

Re: A better definition of a Sentence in MSWord.

Post by HansV »

It's easy enough to test that. The code works fine with simplified punctuation.
Best wishes,
Hans

John Gray
PlatinumLounger
Posts: 5409
Joined: 24 Jan 2010, 08:33
Location: A cathedral city in England

Re: A better definition of a Sentence in MSWord.

Post by John Gray »

HansV wrote: It's easy enough to test that.
Only for those intellectuals who use Word macros!
John Gray

"(or one of the team)" - how your hospital appointment letter indicates that you won't be seeing the Consultant...

StuartR
Administrator
Posts: 12612
Joined: 16 Jan 2010, 15:49
Location: London, Europe

Re: A better definition of a Sentence in MSWord.

Post by StuartR »

This code might have problems with sentences that end with short words, but I won't complain about it!
StuartR


William
StarLounger
Posts: 79
Joined: 08 Feb 2010, 21:48
Location: Wellington, New Zealand

Re: A better definition of a Sentence in MSWord.

Post by William »

Chris, you might also need to tweak your code to handle quotation marks at the end of sentences. Using examples from the Additional Punctuation Rules When Using Quotation Marks page, these are the results I get using your current code:

Before: The detective said, "I am sure who performed the murder." This is sentence two.
After: "This is sentence two.

Before: Does Dr. Lim always say to her students, "You must work harder"? This is sentence two.
After: Does Dr.

You may also need to consider differences between the American and British styles when it comes to the use of punctuation marks, as described in the British versus American style page, for example - and please don't ask me how definitive that page is. :grin:

Good luck with this. I suspect that it might be tricky.
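
One possible way to handle the two examples above, sketched in Python rather than VBA: let closing quotation marks sit between the atom and the terminator, or directly after the terminator, and keep them with the sentence. This is an assumption about how the rule could be extended, not what Chris's code does; the names `CLOSERS`, `PATTERN` and `first_sentence_quoted` are made up for the illustration.

```python
import re

# Characters allowed to hug a terminator: straight and curly closing
# quotes plus a closing parenthesis (an assumed, non-exhaustive set).
CLOSERS = "\"'\u201d\u2019)"
_CLOSER_RUN = "[" + re.escape(CLOSERS) + "]*"

# atom, optional closers, terminator, optional closers
PATTERN = re.compile(r"([A-Za-z]+)" + _CLOSER_RUN + r"[.!?]" + _CLOSER_RUN)

def first_sentence_quoted(paragraph: str, min_len: int = 3) -> str:
    """Like the basic rule, but tolerant of quotes around the terminator."""
    for match in PATTERN.finditer(paragraph):
        if len(match.group(1)) >= min_len:
            return paragraph[: match.end()]
    return paragraph  # nothing qualified: return the whole paragraph

print(first_sentence_quoted(
    'The detective said, "I am sure who performed the murder." This is sentence two.'))
print(first_sentence_quoted(
    'Does Dr. Lim always say to her students, "You must work harder"? This is sentence two.'))
```

Both calls should return the first sentence with its closing quotation mark attached. British versus American placement of the full stop is covered only to the extent that the quote may fall on either side of the terminator.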

ChrisGreaves
PlutoniumLounger
Posts: 15636
Joined: 24 Jan 2010, 23:23
Location: brings.slot.perky

Re: A better definition of a Sentence in MSWord.

Post by ChrisGreaves »

John Gray wrote: Does your account cope with those enlightened areas of the world where the use of 'simplified punctuation' means that full stops are omitted following title abbreviations or initials in names?
I believe so.
I was aiming in this POC to cope with the presence of full-stops which confuse MSW.
The absence of full-stops was not a problem.
Anyway, you could always d/l the sample and complain ... er, try it out for yourself!
Cheers
Chris
There's nothing heavier than an empty water bottle

ChrisGreaves
PlutoniumLounger
Posts: 15636
Joined: 24 Jan 2010, 23:23
Location: brings.slot.perky

Re: A better definition of a Sentence in MSWord.

Post by ChrisGreaves »

William wrote: Chris, you might also need to tweak your code to...
And there I was just gingerly slipping back into the pool, toes first, and then THIS comes along.
Thanks a LOT William (huge grin).

William, I've d/l the pages for study and will have a shot at enhancements.
I rather suspect that one of Steven Pinker's students will already have solved this - Microsoft never cared to upgrade its programming logic.

Thanks again.
Chris
There's nothing heavier than an empty water bottle

macropod
4StarLounger
Posts: 508
Joined: 17 Dec 2010, 03:14

Re: A better definition of a Sentence in MSWord.

Post by macropod »

I have some boilerplate text I use to explain the limitations of VBA regarding sentences:

VBA has no idea what a grammatical sentence is. For example, consider the following:
Mr. Smith spent $1,234.56 at Dr. John's Grocery Store, to buy: 10.25kg of potatoes; 10kg of avocados; and 15.1kg of Mrs. Green's Mt. Pleasant macadamia nuts.
For you and me, that would count as one sentence; for VBA it counts as 5 sentences.
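
As a rough illustration of where a count of 5 can come from, here is a Python sketch of a naive rule that ends a sentence at every full stop followed by whitespace. That rule is my assumption for demonstration purposes, not a description of how Word or VBA actually segments text; `naive_sentences` is a made-up name.

```python
import re

TEXT = ("Mr. Smith spent $1,234.56 at Dr. John's Grocery Store, to buy: "
        "10.25kg of potatoes; 10kg of avocados; and 15.1kg of "
        "Mrs. Green's Mt. Pleasant macadamia nuts.")

def naive_sentences(text: str) -> list[str]:
    """Split after every full stop that is followed by whitespace."""
    return [part for part in re.split(r"(?<=\.)\s+", text) if part]

parts = naive_sentences(TEXT)
print(len(parts))  # -> 5
for part in parts:
    print(repr(part))
```

The breaks fall after "Mr.", "Dr.", "Mrs.", "Mt." and "nuts."; the decimal points in the numbers are not followed by a space, so they do not trigger a break under this rule.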

When I run Chris' macro on the boilerplate text, the output is:

VBA has no idea what a grammatical sentence is.
Mr.
For you and me, that would count as one sentence; for VBA it counts as 5 sentences.

Methinks 'Mr.' does not constitute a grammatical sentence.
Paul Edstein
[Fmr MS MVP - Word]

ChrisGreaves
PlutoniumLounger
Posts: 15636
Joined: 24 Jan 2010, 23:23
Location: brings.slot.perky

Re: A better definition of a Sentence in MSWord.

Post by ChrisGreaves »

William wrote: Chris, you might also need to tweak your code to handle quotation marks at the end of sentences.
William, thanks again.
I ran my little "TESTstrGetSentence" macro on the sentence:-
The detective said, "I am sure who performed the murder."

Received the result of a double-quote (a single character).

I changed the trailing double-quote to an asterisk and received an asterisk.
This suggests that there is something quite faulty in MSW's analysis.

I have thought about it some more (not finished yet) and am of the broad conclusion that in terms of "obtaining sentences" I can approach it by one of:-
(1) using MSW's definition of "Sentence" and checking the end of each sentence
(2) using MSW's definition of "Word" and building a state transition/parsing table/decision table based on an incoming stream of words
(3) using MSW's character stream and parsing on a character-by-character basis.

Immediate thoughts are that those three become increasingly expensive in terms of processing time.

BUT!

The application has an impact.
An application to "Find the first sentence of each paragraph in a document" can still proceed one paragraph at a time and abandon parsing a paragraph once that first sentence is found.
An application to "Find and report every sentence in a document" would have to examine every piece of text.

That is, the penalty of run-time cost will not be as significant for the first application as for the second.
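
The difference between the two applications can be sketched with a lazy generator, here in Python rather than VBA. Everything named below (`iter_sentences`, `first_sentences`, `all_sentences`, the newline as paragraph separator, the minimum atom length of 3) is an assumption for illustration and not taken from the template.

```python
import re

def iter_sentences(paragraph: str, min_len: int = 3):
    """Yield sentences one at a time, scanning left to right."""
    start = 0
    for match in re.finditer(r"[A-Za-z]+(?=[.!?])", paragraph):
        if len(match.group()) >= min_len:
            end = match.end() + 1          # include the terminator
            yield paragraph[start:end].strip()
            start = end
    tail = paragraph[start:].strip()       # text with no valid terminator
    if tail:
        yield tail

def first_sentences(document: str):
    """First application: stop parsing each paragraph after one sentence."""
    for paragraph in document.split("\n"):
        if paragraph.strip():
            # next() pulls a single sentence; the rest is never scanned.
            yield next(iter_sentences(paragraph))

def all_sentences(document: str):
    """Second application: every paragraph is scanned to its end."""
    for paragraph in document.split("\n"):
        yield from iter_sentences(paragraph)

doc = "First idea here. More detail follows.\nSecond idea now. And more."
print(list(first_sentences(doc)))
print(list(all_sentences(doc)))
```

Because the generator is consumed only as far as needed, the first application does work proportional to the length of each first sentence, while the second has to walk the whole text.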

I appreciate your samples of text and have incorporated them into my test bed, as Poole and Waite would have me do.
More later, if this cappuccino kicks in as it should.

Cheers
Chris
There's nothing heavier than an empty water bottle

ChrisGreaves
PlutoniumLounger
Posts: 15636
Joined: 24 Jan 2010, 23:23
Location: brings.slot.perky

Re: A better definition of a Sentence in MSWord.

Post by ChrisGreaves »

macropod wrote: For you and me, that would count as one sentence; for VBA it counts as 5 sentences. When I run Chris' macro on the boilerplate text, the output is:...
Hi Paul.
I am puzzled.
I ran my macro "TESTstrGetSentence" on your text with the results as shown in the attached PrtScr snapshot.
[/bragging]I received only one sentence.
Macropod.png
That is, my code in YMHA003.dot behaves as I expected it to behave on my first essay.

Nonetheless I appreciate your input and have absorbed your text into my testbed (please see my reply to William).

If you are really, really, really bored, might you drag the TEST macro out, decomment it, select your paragraph and run the macro "TESTstrGetSentence" on your one paragraph?
The macro analyses the first paragraph of the selected text.
Ta ever so.
And thanks again for the Field Codes!

Cheers
Chris
There's nothing heavier than an empty water bottle

macropod
4StarLounger
Posts: 508
Joined: 17 Dec 2010, 03:14

Re: A better definition of a Sentence in MSWord.

Post by macropod »

ChrisGreaves wrote: I ran my macro "TESTstrGetSentence" on your text with the results as shown in the attached PrtScr snapshot.
I received only one sentence.
I am unable to reproduce that on Word 2010. Whether nothing is selected, only the second sentence is selected, or that sentence is the only one present in the document, the result is always the same:
Mr.
Paul Edstein
[Fmr MS MVP - Word]

ChrisGreaves
PlutoniumLounger
Posts: 15636
Joined: 24 Jan 2010, 23:23
Location: brings.slot.perky

Re: A better definition of a Sentence in MSWord.

Post by ChrisGreaves »

ChrisGreaves wrote: Word2003/Win7. This was written as a proof-of-concept and is not exhaustively tested.
Many thanks to those of you who responded, tested, offered up examples.

The latest version (YMHA007, attached) deals with all the examples offered to date on my WinHP/Office2003 SP1(+) system.
I have begun to run into questions of English syntax (e.g. "In 1967 he was made head of Mktg..") and suspect that one could get embroiled in deep argument about the flexibility of English syntax and style.

If anyone can come up with new simple examples of "not-sentences" I'd be pleased to entertain them.

For my purposes (extracting First Sentences) what I have now works well, because if Jake sends me a story and I extract First Sentences and send the output to Tim, and if Tim can understand the story, then Jake's work passes the test.
If it fails the test, either my program is wonky (me to do some work) or Jake's story is wonky (he to do some work).
Cheers
Chris
There's nothing heavier than an empty water bottle

ChrisGreaves
PlutoniumLounger
Posts: 15636
Joined: 24 Jan 2010, 23:23
Location: brings.slot.perky

Re: A better definition of a Sentence in MSWord.

Post by ChrisGreaves »

macropod wrote: I am unable to reproduce that on Word 2010. ... the result is always the same: Mr.
Hi Paul, and thank you for bringing this to my attention.
I beg yet-another-favour of you: try it with YMHA007. (my post of 5 minutes ago)
I am puzzled by the failure of "Mr." (and I assume "Dr.", "Mrs." etc), if only because I believe Word95 had the sentence-problem, and that the code I have written works around the problem.

I can only guess that for some unknown reason, something in Office2010 regarding string comparison is wildly different from Office2003, but since MSW almost never fixes design problems, I struggle to believe that anything really changed between 2003 and 2010.
Gratefully
Chris Greaves
There's nothing heavier than an empty water bottle

macropod
4StarLounger
Posts: 508
Joined: 17 Dec 2010, 03:14

Re: A better definition of a Sentence in MSWord.

Post by macropod »

Success! The output is now correct (i.e. the same as the boilerplate input).
ChrisGreaves wrote: since MSW almost never fixes design problems, I struggle to believe that anything really changed between 2003 and 2010.
They can always get worse ... and often enough do just that.
Paul Edstein
[Fmr MS MVP - Word]

ChrisGreaves
PlutoniumLounger
Posts: 15636
Joined: 24 Jan 2010, 23:23
Location: brings.slot.perky

Re: A better definition of a Sentence in MSWord.

Post by ChrisGreaves »

macropod wrote: Success!
Thanks for this feedback, Paul.
There's nothing heavier than an empty water bottle