A better definition of a Sentence in MSWord.
-
- PlutoniumLounger
- Posts: 15636
- Joined: 24 Jan 2010, 23:23
- Location: brings.slot.perky
A better definition of a Sentence in MSWord.
Word2003/Win7
This was written as a proof-of-concept and is not exhaustively tested.
It is said that a well-written paragraph holds the idea in its first sentence, and the remaining sentences merely clarify or amplify that first sentence. Speed-readers therefore read only the first sentence of each paragraph, unless they can’t understand it, in which case they read the entire paragraph.
A writer’s clinic wants to take a submitted text and extract only the first sentence from each paragraph.
Microsoft Word’s definition of a sentence is poor, so I dreamed up a “better definition”.
A sentence is an accumulation of consecutive alphabetic atoms until and including an atom that meets or exceeds a specified length and is immediately followed by a terminator.
Means?
For a specified length of 3 (Think “Dr.” or “Mr.” or “Mrs.” are too short to be valid sentence terminators):-
In “C. Y. O’Connor” the atom “C.” is insufficient because the atom is of length only one alphabetic character preceding the terminator. Continue to accumulate atoms.
In “C. Y. O’Connor” the atom “Y.” is insufficient because the atom is of length only one alphabetic character preceding the terminator. Continue to accumulate atoms.
In “Critics scoffed and said it would never work, and just before the scheme was opened, C. Y. O'Connor rode his horse into the surf near Fremantle and shot himself; in the surf so there'd not be a mess to clean up.” We are scanning left to right for a terminator, and the presence of “C.” and “Y.” mid-text should not stop our accumulation.
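The definition above can be sketched in a few lines. This is a minimal Python illustration only, not the VBA in YMHA03.dot (which is not shown in this thread); the function name `first_sentence` and the parameters `min_len` and `terminators` are hypothetical. It scans left to right and stops at the first terminator that is immediately preceded by at least `min_len` alphabetic characters.

```python
def first_sentence(paragraph: str, min_len: int = 3, terminators: str = ".!?") -> str:
    """Return the text up to and including the first qualifying terminator.

    A terminator qualifies when the run of alphabetic characters immediately
    before it is at least min_len long (assumed reading of the definition).
    """
    for i, ch in enumerate(paragraph):
        if ch not in terminators:
            continue
        # Walk backwards to measure the alphabetic atom that precedes the terminator.
        j = i
        while j > 0 and paragraph[j - 1].isalpha():
            j -= 1
        if i - j >= min_len:
            return paragraph[: i + 1]
    # No qualifying terminator: treat the whole paragraph as one sentence.
    return paragraph


if __name__ == "__main__":
    sample = "C. Y. O'Connor rode his horse into the surf near Fremantle. The scheme opened later."
    print(first_sentence(sample))
```

With `min_len=3` the atoms "C." and "Y." are skipped, as intended. A literal reading of the definition also accepts "Mrs." (three letters) as a sentence end and rejects a sentence that ends in a two-letter word, so the threshold trades one kind of error for the other.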
The macro “ExtractFirstSentences” in the attached template YMHA03.dot will apply itself to the ActiveDocument.
The macro can be run on the text you are reading, or on the accompanying sample document.
Cheers
Chris
There's nothing heavier than an empty water bottle
-
- PlatinumLounger
- Posts: 5409
- Joined: 24 Jan 2010, 08:33
- Location: A cathedral city in England
Re: A better definition of a Sentence in MSWord.
Does your account cope with those enlightened areas of the world where the use of 'simplified punctuation' means that full stops are omitted following title abbreviations or initials in names?
We would use Dr Spock, Ms Take, Mrs Jones, and (in your example) C Y O'Connor.
(Some of us even use the Oxford comma, as in the previous sentence!)
John Gray
"(or one of the team)" - how your hospital appointment letter indicates that you won't be seeing the Consultant...
-
- Administrator
- Posts: 78524
- Joined: 16 Jan 2010, 00:14
- Status: Microsoft MVP
- Location: Wageningen, The Netherlands
Re: A better definition of a Sentence in MSWord.
It's easy enough to test that. The code works fine with simplified punctuation.
Best wishes,
Hans
-
- PlatinumLounger
- Posts: 5409
- Joined: 24 Jan 2010, 08:33
- Location: A cathedral city in England
Re: A better definition of a Sentence in MSWord.
HansV wrote: It's easy enough to test that.
Only for those intellectuals who use Word macros!
John Gray
"(or one of the team)" - how your hospital appointment letter indicates that you won't be seeing the Consultant...
-
- Administrator
- Posts: 12612
- Joined: 16 Jan 2010, 15:49
- Location: London, Europe
Re: A better definition of a Sentence in MSWord.
This code might have problems with sentences that end with short words, but I won't complain about it!
StuartR
-
- StarLounger
- Posts: 79
- Joined: 08 Feb 2010, 21:48
- Location: Wellington, New Zealand
Re: A better definition of a Sentence in MSWord.
Chris, you might also need to tweak your code to handle quotation marks at the end of sentences. Using examples from the Additional Punctuation Rules When Using Quotation Marks page, these are the results I get using your current code:
Before: The detective said, "I am sure who performed the murder." This is sentence two.
After: "This is sentence two.
Before: Does Dr. Lim always say to her students, "You must work harder"? This is sentence two.
After: Does Dr.
You may also need to consider differences between the American and British styles when it comes to the use of punctuation marks, as described in the British versus American style page, for example - and please don't ask me how definitive that page is.
Good luck with this. I suspect that it might be tricky.
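One possible way to cope with the two quotation-mark examples above is sketched below in Python. This is an assumption-laden illustration, not Chris's macro: the name `first_sentence_quoted`, the `CLOSERS` set, and the parameters are all hypothetical. The idea is to ignore closing quotes that sit between the last word and the terminator, and to pull any closing quotes that follow the terminator into the sentence.

```python
# Characters treated as "closers" around a terminator (an assumed, incomplete set).
CLOSERS = "\"'\u201d\u2019)"


def first_sentence_quoted(paragraph: str, min_len: int = 3, terminators: str = ".!?") -> str:
    """First sentence of a paragraph, tolerating quotes next to the terminator."""
    for i, ch in enumerate(paragraph):
        if ch not in terminators:
            continue
        # Step back over closers between the word and the terminator (harder"?).
        j = i
        while j > 0 and paragraph[j - 1] in CLOSERS:
            j -= 1
        # Measure the alphabetic atom that precedes those closers.
        k = j
        while k > 0 and paragraph[k - 1].isalpha():
            k -= 1
        if j - k < min_len:
            continue  # atom too short, e.g. "Dr."; keep accumulating
        # Include closers that follow the terminator (murder.").
        end = i + 1
        while end < len(paragraph) and paragraph[end] in CLOSERS:
            end += 1
        return paragraph[:end]
    return paragraph


if __name__ == "__main__":
    print(first_sentence_quoted('The detective said, "I am sure who performed the murder." This is sentence two.'))
    print(first_sentence_quoted('Does Dr. Lim always say to her students, "You must work harder"? This is sentence two.'))
```

Treating the apostrophe as a closer is a judgment call; a quotation that opens with a single quote directly after a terminator would be mis-attached to the preceding sentence.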
-
- PlutoniumLounger
- Posts: 15636
- Joined: 24 Jan 2010, 23:23
- Location: brings.slot.perky
Re: A better definition of a Sentence in MSWord.
John Gray wrote: Does your account cope with those enlightened areas of the world where the use of 'simplified punctuation' means that full stops are omitted following title abbreviations or initials in names?
I believe so.
I was aiming in this POC to cope with the presence of full-stops which confuse MSW.
The absence of full-stops was not a problem.
Anyway, you could always d/l the sample and
Cheers
Chris
There's nothing heavier than an empty water bottle
-
- PlutoniumLounger
- Posts: 15636
- Joined: 24 Jan 2010, 23:23
- Location: brings.slot.perky
Re: A better definition of a Sentence in MSWord.
William wrote: Chris, you might also need to tweak your code to...
And there I was just gingerly slipping back into the pool, toes first, and then THIS comes along.
Thanks a LOT William (huge grin).
William, I've d/l the pages for study and will have a shot at enhancements.
I rather suspect that one of Steven Pinker's students will already have solved this - Microsoft never cared to upgrade its programming logic.
Thanks again.
Chris
There's nothing heavier than an empty water bottle
-
- 4StarLounger
- Posts: 508
- Joined: 17 Dec 2010, 03:14
Re: A better definition of a Sentence in MSWord.
I have some boilerplate text I use to explain the limitations of VBA regarding sentences:
VBA has no idea what a grammatical sentence is. For example, consider the following:
Mr. Smith spent $1,234.56 at Dr. John's Grocery Store, to buy: 10.25kg of potatoes; 10kg of avocados; and 15.1kg of Mrs. Green's Mt. Pleasant macadamia nuts.
For you and me, that would count as one sentence; for VBA it counts as 5 sentences.
When I run Chris' macro on the boilerplate text, the output is:
VBA has no idea what a grammatical sentence is.
Mr.
For you and me, that would count as one sentence; for VBA it counts as 5 sentences.
Methinks 'Mr.' does not constitute a grammatical sentence.
Paul Edstein
[Fmr MS MVP - Word]
-
- PlutoniumLounger
- Posts: 15636
- Joined: 24 Jan 2010, 23:23
- Location: brings.slot.perky
Re: A better definition of a Sentence in MSWord.
William wrote: Chris, you might also need to tweak your code to handle quotation marks at the end of sentences.
William, thanks again.
I ran my little "TESTstrGetSentence" macro on the sentence:-
The detective said, "I am sure who performed the murder."
Received the result of a double-quote (a single character).
I changed the trailing double-quote to an asterisk and received an asterisk.
This suggests that there is something quite faulty in MSW's analysis.
I have thought about it some more (not finished yet) and am of the broad conclusion that in terms of "obtaining sentences" I can approach it by one of:-
(1) using MSW's definition of "Sentence" and checking the end of each sentence
(2) using MSW's definition of "Word" and building a state transition/parsing table/decision table based on an incoming stream of words
(3) using MSW's character stream and parsing on a character-by-character basis.
Immediate thoughts are that those three become increasingly expensive in terms of processing time.
BUT!
The application has an impact.
An application to "Find the first sentence of each paragraph in a document" can still proceed one paragraph at a time and abandon parsing a paragraph once that first sentence is found.
An application to "Find and report every sentence in a document" would have to examine every piece of text.
That is, the penalty of run-time cost will not be as significant for the first application as for the second.
I appreciate your samples of text and have incorporated them into my test bed, as Poole and Waite would have me do.
More later, if this cappuccino kicks in as it should.
Cheers
Chris
There's nothing heavier than an empty water bottle
-
- PlutoniumLounger
- Posts: 15636
- Joined: 24 Jan 2010, 23:23
- Location: brings.slot.perky
Re: A better definition of a Sentence in MSWord.
macropod wrote: For you and me, that would count as one sentence; for VBA it counts as 5 sentences. When I run Chris' macro on the boilerplate text, the output is:...
Hi Paul.
I am puzzled.
I ran my macro "TESTstrGetSentence" on your text with the results as shown in the attached PrtScr snapshot.
[/bragging]I received only one sentence. That is, my code in YMHA003.dot behaves as I expected it to behave on my first essay.
Nonetheless I appreciate your input and have absorbed your text into my testbed (please see my reply to William).
If you are really, really, really bored, might you drag the TEST macro out, decomment it, select your paragraph and run the macro "TESTstrGetSentence" on your one paragraph?
The macro analyses the first paragraph of the selected text.
Ta ever so.
And thanks again for the Field Codes!
Cheers
Chris
There's nothing heavier than an empty water bottle
-
- 4StarLounger
- Posts: 508
- Joined: 17 Dec 2010, 03:14
Re: A better definition of a Sentence in MSWord.
ChrisGreaves wrote: I ran my macro "TESTstrGetSentence" on your text with the results as shown in the attached PrtScr snapshot. I received only one sentence.
I am unable to reproduce that on Word 2010. No matter whether nothing is selected, only the second sentence is selected, or it is the only one present in the document, the result is always the same:
Mr.
Paul Edstein
[Fmr MS MVP - Word]
-
- PlutoniumLounger
- Posts: 15636
- Joined: 24 Jan 2010, 23:23
- Location: brings.slot.perky
Re: A better definition of a Sentence in MSWord.
ChrisGreaves wrote: Word2003/Win7. This was written as a proof-of-concept and is not exhaustively tested.
Many thanks to those of you who responded, tested, offered up examples.
The latest version (YMHA007, attached) deals with all the examples offered to date on my WinHP/Office2003 SP1(+) system.
I have begun to run into questions of English syntax (e.g. "In 1967 he was made head of Mktg..") and suspect that one could get embroiled in deep argument about the flexibility of English syntax and style.
If anyone can come up with new simple examples of "not-sentences" I'd be pleased to entertain them.
For my purposes (extracting First Sentences) what I have now works well, because if Jake sends me a story and I extract First Sentences and send the output to Tim, and if Tim can understand the story, then Jake's work passes the test.
If it fails the test, either my program is wonky (me to do some work) or Jake's story is wonky (he to do some work).
Cheers
Chris
There's nothing heavier than an empty water bottle
-
- PlutoniumLounger
- Posts: 15636
- Joined: 24 Jan 2010, 23:23
- Location: brings.slot.perky
Re: A better definition of a Sentence in MSWord.
macropod wrote: I am unable to reproduce that on Word 2010. ... the result is always the same: Mr.
Hi Paul, and thank you for bringing this to my attention.
I beg yet-another-favour of you: try it with YMHA007. (my post of 5 minutes ago)
I am puzzled by the failure of "Mr." (and I assume "Dr.", "Mrs." etc), if only because I believe Word95 had the sentence-problem, and that the code I have written works around the problem.
I can only guess that for some unknown reason, something in Office2010 regarding string comparison is wildly different from Office2003, but since MSW almost never fixes design problems, I struggle to believe that anything really changed between 2003 and 2010.
Gratefully
Chris Greaves
There's nothing heavier than an empty water bottle
-
- 4StarLounger
- Posts: 508
- Joined: 17 Dec 2010, 03:14
Re: A better definition of a Sentence in MSWord.
Success! The output is now correct (i.e. the same as the boilerplate input).
ChrisGreaves wrote: since MSW almost never fixes design problems, I struggle to believe that anything really changed between 2003 and 2010.
They can always get worse ... and often enough do just that.
Paul Edstein
[Fmr MS MVP - Word]
-
- PlutoniumLounger
- Posts: 15636
- Joined: 24 Jan 2010, 23:23
- Location: brings.slot.perky
Re: A better definition of a Sentence in MSWord.
macropod wrote: Success!
Thanks for this feedback, Paul.
There's nothing heavier than an empty water bottle