Simple algorithm to reduce Regular Expressions?

ChrisGreaves
PlutoniumLounger
Posts: 15498
Joined: 24 Jan 2010, 23:23
Location: brings.slot.perky

Post by ChrisGreaves »

The attached VBA macro demonstrates regular expressions for North American telephone numbers (ten-digit strings). I am not good at regEx, but am improving. The attached macro (the final expression) serves my purpose for now, so I am not looking for a universal telephone-number expression.

I built the thing stage by stage - a good learning exercise - and will probably do the same for postal codes, email addresses, web sites and so on. I have read the complex syntax descriptions for web sites and emails and have no plans to go That Far Out.

That said, I find myself thinking that there are many ways in which, at the Expert Level, I could have written this telephone routine using features of regEx.
In the meantime, if a new way of writing phone/web/email pops up, I can develop an eighth special-purpose expression and just OR/pipe it into my current seven; what do I care?

Then it struck me - there must be some straightforward techniques that humans use to recognize simple patterns, procedures one can apply to reduce complexity and condense two expressions into a single expression. There is no need to have an expression for "spaces between the numbers" and a separate expression for "hyphens between the numbers". That reduces to one expression of "hyphens OR spaces between the numbers":
\d{3} \d{3} \d{4}
and
\d{3}-\d{3}-\d{4}
can be combined into
\d{3}[- ]\d{3}[- ]\d{4}
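
A quick way to check a merge like this is to run the two originals and the combined pattern over the same samples. The sketch below uses Python's re module purely for illustration (the macro in this thread runs on the VBScript RegExp library, which may differ in details), and the sample strings are invented. One thing it shows: inside a character class no pipe is needed, and a pipe placed there is matched as a literal "|".

```python
import re

spaces_only = re.compile(r"\d{3} \d{3} \d{4}")
hyphens_only = re.compile(r"\d{3}-\d{3}-\d{4}")

# One character class listing both separators. Inside [...] the members are
# simply listed; a '|' written there would be treated as a literal pipe.
merged = re.compile(r"\d{3}[- ]\d{3}[- ]\d{4}")

# Invented test data, including a mixed form and a pipe-separated form.
samples = ["343 140 5458", "343-140-5458", "343-140 5458", "343|140|5458"]

for s in samples:
    either_original = bool(spaces_only.fullmatch(s) or hyphens_only.fullmatch(s))
    print(f"{s!r}: originals={either_original} merged={bool(merged.fullmatch(s))}")
```

Note that the merged pattern is slightly more permissive than the pair it replaces: it also accepts a mixed form such as "343-140 5458", which neither original matches on its own.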

Again, what I have works, but I am thinking about, or looking for, some very basic techniques for recognizing easy simplifications.
Thanks
Chris
An expensive day out: Wallet and Grimace

SpeakEasy
4StarLounger
Posts: 536
Joined: 27 Jun 2021, 10:46

Re: Simple algorithm to reduce Regular Expressions?

Post by SpeakEasy »

Another simplification is repeated groups. For example, with the phone number we see 3 digits - possibly followed by a separator (or separators) - twice.

So, for example (I know you were not looking for a universal telephone-number expression, but given that you now have the challenges of that in your head, you should be able to decode this expression better than if I gave an unrelated example):

(\d{3}[- )]*){2}\d{4}
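
To see that the repeated group really is just shorthand, here is a small sketch comparing it with the same pattern written out twice by hand. It uses Python's re module for illustration only (not the VBScript engine), and the sample strings are made up.

```python
import re

grouped = re.compile(r"(\d{3}[- )]*){2}\d{4}")

# The same thing with the group written out twice instead of using {2}.
expanded = re.compile(r"\d{3}[- )]*\d{3}[- )]*\d{4}")

# Invented samples; the last one has its digits grouped wrongly.
samples = ["343-140-5458", "343 140 5458", "3431405458",
           "343) 140-5458", "343-14-05458"]

results = [(s, bool(grouped.fullmatch(s)), bool(expanded.fullmatch(s)))
           for s in samples]
for s, g, e in results:
    print(f"{s!r}: grouped={g} expanded={e}")
```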

Post by ChrisGreaves »

SpeakEasy wrote:
17 Oct 2021, 18:30
Another simplification is repeated groups. For example, with the phone number we see 3 digits - possibly followed by a separator (or separators) - twice.

So, for example (I know you were not looking for a universal telephone-number expression, but given that you now have the challenges of that in your head, you should be able to decode this expression better than if I gave an unrelated example):

(\d{3}[- )]*){2}\d{4}
Thanks speakeasy, and speaking of speaking, if I were to read this out loud I would use this script:

Look for a left parenthesis
followed by three digits, 
followed by exactly one of a hyphen, a space, or a right parenthesis; exactly twice
followed by four digits.
Is that correct?

I think as I improve (we hope!) in RegEx I will start thinking in a top-down like manner, and then expressions such as the one you provided will come in a more natural fashion.
Now let me try my hand at simple email and website searches (may take a day or two) ...
Thanks again
Chris

Post by ChrisGreaves »

ChrisGreaves wrote:
19 Oct 2021, 18:08
Now let me try my hand at simple email and website searches (may take a day or two) ...
... but before that, a little confidence-builder
Untitled.png
I wrote (longhand, pencil and paper) my thoughts on the telephone number; those notes appear in the right-hand column, with what I think is the pattern for each thought in the left-hand column.
Combined, that gave me "[(]{0,1}\d{3}[)]{0,1}[- ]{0,1}\d{3}[- ]{0,1}\d{4}\D"

I liked the delimiter \D but found that I didn't need that at the front of the pattern because, by definition, the pattern starts with "the first digit found when scanning from left to right", so implicitly there must have been a preceding non-digit.
I needed the delimiter to avoid locating the phone number "343-140-5458" in the string "d-Breakfast-343140545834897"
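
The effect of the trailing \D can be checked with a short sketch. It is written in Python's re module for illustration (not the VBScript engine), and the `guarded` pattern and the "55..." sample are assumptions added here, not part of the macro. It also shows a caveat: a search may begin part-way through a run of digits, so extra digits in front of a number are not rejected by the trailing \D alone.

```python
import re

core = r"[(]{0,1}\d{3}[)]{0,1}[- ]{0,1}\d{3}[- ]{0,1}\d{4}"
without_delim = re.compile(core)
with_delim = re.compile(core + r"\D")

long_run = "d-Breakfast-343140545834897"
print(without_delim.search(long_run).group())  # a 10-digit slice of the long run
print(with_delim.search(long_run))             # None: no non-digit follows any slice

# Caveat: the engine retries at every position, including inside a digit run,
# so leading extra digits still produce a hit.
leading = "55343-140-5458 "
print(repr(with_delim.search(leading).group()))

# One possible guard (hypothetical): require start-of-text or a non-digit in
# front, end-of-text or a non-digit behind, and capture the number in group 1.
guarded = re.compile(r"(?:^|\D)(" + core + r")(?:\D|$)")
print(guarded.search(leading))
print(guarded.search("Call (343) 140-5458").group(1))
```

The guarded form also finds a number that sits at the very end of the text, which the plain trailing \D cannot, since \D needs an actual character to match.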

The detection of a phone number is not critical right now. The task is this: given a Word document containing the hits from eight search engines, analyze each block of hits (so eight blocks of text) and tally a count for each phone number found. The phone number with the highest tally is probably the phone number of the business.
(See if you can work out the business number from the attached text file)
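
The tallying idea can be sketched as follows. This is an illustration in Python rather than the VBA macro itself: the function name `tally_numbers`, the pattern and the sample blocks are all invented for the example, and the lookbehind `(?<!\d)` is a Python convenience that other engines may not offer.

```python
import re
from collections import Counter

# Ten digits with optional parentheses and separators, not embedded in a
# longer run of digits (pattern is an assumption for this sketch).
PHONE = re.compile(r"(?<!\d)[(]?\d{3}[)]?[- ]?\d{3}[- ]?\d{4}(?!\d)")

def tally_numbers(blocks):
    """Count each phone number across all blocks, ignoring formatting."""
    counts = Counter()
    for block in blocks:
        for hit in PHONE.findall(block):
            digits = re.sub(r"\D", "", hit)  # strip parentheses, spaces, hyphens
            counts[digits] += 1
    return counts

# Invented stand-ins for the blocks of search-engine hits.
blocks = [
    "Joe's Diner (343) 140-5458 open daily",
    "Call 343-140-5458 or fax 343 140 9999",
    "diner 3431405458",
]

counts = tally_numbers(blocks)
best, hits = counts.most_common(1)[0]
print(best, hits)
```

Because the formatting is stripped before counting, "(343) 140-5458" and "343-140-5458" land in the same bucket, so an occasional rogue number is outvoted.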

My first efforts were crippled by my idea that {m,n} was a minimum and maximum count, and that {n,} was a minimum, and (falsely) that {,n} was a maximum!
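
For reference, the counted forms can be checked like this. The sketch is in Python's re module, used for illustration only; the helper `accepts` is made up for the example. Engines differ on the `{,n}` form (Python's re happens to accept it as "zero to n"), so writing the lower bound explicitly as `{0,n}` is the spelling that avoids the question.

```python
import re

def accepts(pattern, text):
    """True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# {m,n} : at least m and at most n repetitions
print(accepts(r"\d{2,4}", "1"), accepts(r"\d{2,4}", "123"), accepts(r"\d{2,4}", "12345"))

# {n,}  : at least n repetitions, no upper limit
print(accepts(r"\d{2,}", "1"), accepts(r"\d{2,}", "1234567"))

# {n}   : exactly n repetitions
print(accepts(r"\d{3}", "12"), accepts(r"\d{3}", "123"))

# {0,n} : at most n repetitions, with the minimum of zero written out
print(accepts(r"\d{0,2}", ""), accepts(r"\d{0,2}", "12"), accepts(r"\d{0,2}", "123"))
```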

I am sure that I can devise a pattern for strict syntax - "If there is an opening parenthesis then there ought to be a closing parenthesis", but for this exercise it doesn't concern me that much; a rogue phone number will be swamped by the tally of identical 10-digit strings, once the formatting is stripped.

Cheers
Chris

Post by SpeakEasy »

>Is that correct?

Pretty much - although it doesn't bother looking for an opening parenthesis
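
Spelled out piece by piece, the expression reads as below. This is a sketch using Python's verbose mode, which allows comments inside a pattern; that mode is a Python feature and is not something the VBScript library is known to offer. The sample strings are invented.

```python
import re

annotated = re.compile(r"""
    (            # start of a group: grouping syntax, not a literal parenthesis
      \d{3}      # three digits
      [- )]*     # zero or more separators, each a hyphen, a space or ')'
    ){2}         # the whole group, exactly twice
    \d{4}        # then four digits
""", re.VERBOSE)

# Invented samples: no separators, one separator, several separators,
# and a number that opens with a literal parenthesis.
for s in ["3431405458", "343-140-5458", "343 - 140)5458", "(343) 140-5458"]:
    print(f"{s!r}: {bool(annotated.fullmatch(s))}")
```

So the separators are optional and may repeat, and a number that starts with a literal "(" is not matched from that parenthesis.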

Post by ChrisGreaves »

SpeakEasy wrote:
20 Oct 2021, 08:41
>Is that correct?
Pretty much - although it doesn't bother looking for an opening parenthesis
Thanks SpeakEasy. I think that you are saying that there is no need to test for an opening parenthesis. Whether that parenthesis is there or not, the telephone number must start with a decimal digit.
Untitled.png
I do test for "an optional left parenthesis".
There must be a way to check for the formal syntax that "there must be matching parentheses, or no parentheses, but not just one of the pair", and I suspect that this lies in "Lookarounds", a topic for next week (grin)
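
One possible approach that needs no lookarounds is plain alternation: spell out the two legal forms of the area code and let the engine pick one. The sketch below is in Python's re module for illustration; the names `AREA`, `STRICT` and `is_strict_phone` are invented, and it tests a whole candidate string rather than searching inside longer text.

```python
import re

# Either "(ddd)" with both parentheses, or "ddd" with neither.
AREA = r"(?:\(\d{3}\)|\d{3})"
STRICT = re.compile(AREA + r"[- ]?\d{3}[- ]?\d{4}")

def is_strict_phone(candidate):
    """True when the whole candidate is a phone number with balanced parentheses."""
    return STRICT.fullmatch(candidate) is not None

# Invented samples: two well-formed numbers, then one of each lopsided kind.
for s in ["(343) 140-5458", "343-140-5458", "(343-140-5458", "343) 140-5458"]:
    print(f"{s!r}: {is_strict_phone(s)}")
```

When searching inside running text, a lone "(" in front of the digits would simply be skipped, so this check is best applied to a candidate that has already been cut out of the text.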

I think too that I am right in looking for tips on "regEx VBA", rather than "regEx", because the interpreter in "Microsoft VBScript Regular Expressions 1.0" is different from other interpreters, so the syntax differs. I believe that I have seen "[az]" and "[a-z]" at different times.
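
On those two forms: they are not dialect variants of one another but two different character classes. "[az]" lists two members, the letters a and z, while "[a-z]" is a range covering every lowercase letter from a to z. A short sketch in Python's re module (used for illustration only; the sample word is arbitrary):

```python
import re

word = "maze"

two_letters = re.findall(r"[az]", word)    # only 'a' or 'z'
whole_range = re.findall(r"[a-z]", word)   # any lowercase letter a through z

print(two_letters)
print(whole_range)
```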
Thanks
Chris

Post by SpeakEasy »

The VBScript regular expression library used to be documented on the Microsoft site, but when they de-emphasised VBS they hid away a lot of the supplementary documentation. They retained the JavaScript documentation, which uses the same library but calls it slightly differently. The VBScript documentation is still there, just very well hidden (and not properly indexed any more, either). Here's a link: https://docs.microsoft.com/en-us/previo ... 2(v=vs.80)

Note that even with the documentation, there are one or two features that are not mentioned in the VBScript version but are in the JavaScript version (such as the ability to provide your own custom match/replacement functions).
Last edited by SpeakEasy on 21 Oct 2021, 08:46, edited 1 time in total.

Post by ChrisGreaves »

SpeakEasy wrote:
20 Oct 2021, 16:57
... Here's a link: https://docs.microsoft.com/en-us/previo ... 2(v=vs.80)
Thank you, speakeasy.
I shall give this a read-through tonight.

Word on the web is that Street Addresses are "too difficult to do properly in RegEx", so I spent today on a state transition table, then converted it to VBA. It looks for a number string followed by alpha strings etc., loads entries into a TYPE array, and if the number string and first alpha string match, the address is declared identical. My first test ran well. On 28 pages of hits from eight search engines it picked out the correct address corresponding to the 28 pages that were retrieved on the basis of a telephone number!
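
For readers who want a feel for the approach, here is a minimal sketch of a state-transition scanner of this general kind. It is written in Python rather than VBA and is not the actual table: the states, the token classes, the transition table, the function names and the sample text are all assumptions made for the illustration. It keeps only the "number string followed by alpha strings" idea and the "same number and same first word means same address" rule.

```python
import re
from collections import Counter

START, IN_NUMBER, IN_NAME = "start", "in_number", "in_name"

# (current state, token class) -> next state. Any pair not listed here
# ends the current candidate and restarts the scan (hypothetical table).
TRANSITIONS = {
    (START, "number"): IN_NUMBER,
    (IN_NUMBER, "alpha"): IN_NAME,
    (IN_NAME, "alpha"): IN_NAME,
}

def token_class(token):
    if token.isdigit():
        return "number"
    if token.isalpha():
        return "alpha"
    return "other"

def extract_addresses(text):
    """Return each 'number followed by one or more words' run as a token list."""
    found, state, current = [], START, []
    for token in re.findall(r"[0-9]+|[A-Za-z]+|[^\sA-Za-z0-9]", text):
        cls = token_class(token)
        nxt = TRANSITIONS.get((state, cls))
        if nxt is None:
            if state == IN_NAME:
                found.append(current)  # a complete candidate just ended
            # Restart, letting this token open a new candidate if it can.
            state, current = START, []
            nxt = TRANSITIONS.get((state, cls))
            if nxt is None:
                continue
        state = nxt
        current = current + [token]
    if state == IN_NAME:
        found.append(current)
    return found

def address_key(tokens):
    """Two addresses count as identical when number and first word agree."""
    return (tokens[0], tokens[1].lower())

hits = "Visit 123 Main Street, or call. Also at 123 Main St."  # invented text
addresses = extract_addresses(hits)
tally = Counter(address_key(a) for a in addresses)
print(addresses)
print(tally.most_common(1))
```

Keying on the number and the first word only is what lets "123 Main Street" and "123 Main St" be counted as the same address.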

That is, I can now send a telephone number to the search engines and extract the correct street address for that telephone number!

Cheers
Chris