Friday, June 2, 2017

The confusing RegEx

Regular Expression


Teacher: Today we will learn regex and how to use it in Java.
Jonny and Alice, screaming with fear, said in chorus: Sir, it is the most confusing thing, and it makes our lives as programmers miserable.
Teacher: (smiling) Yes, I used to think the same way when I was a student :). But it is not as hard as you think. You just need to remember some key points, the way you remember History and Geography.
I will try to point out those keys, which will break your fear.
Before that, tell me why you find regex confusing, and share your confusion with me.
Alice: Sir, the common problem is that it is hard to read and write. What we mean is this: if we have to validate a string against a complex pattern, say email validation, the regex for it is a one-liner full of backslashes, square brackets, parentheses and so on, so we are often perplexed about what it says. In simple words, just by looking at a regex solution, we don't understand what it is trying to say.
Teacher: So you mean readability, right? You prefer to write more code to avoid regex because that code feels more readable. But let me tell you, regex is a Holy Grail: if you unleash its power, you can write concise and readable code.

Jonny: Sir, another problem is that there is no fixed solution for a problem. Take the example of email validation: if you search for it on Google, you can see a ton of different solutions to validate an email. So it is hard to pick the right one.
Teacher: This is because you have not understood the crux of regex. Any other problems?
Students: (pin-drop silence)
The teacher slowly steps towards the board and starts his lesson.

What is Regex?
The teacher said: A regular expression is a technique for searching for a pattern in a String. The search pattern can range from very simple to very complex: a word, a sentence, or an expression built from the different meta-characters or symbols used in regex.
To understand regex correctly, we need to know the meta-characters and symbols and their meanings. This is the only thing you need to remember.
We find regex hard because we do not understand how the symbols are used.
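To make the idea of "searching for a pattern in a String" concrete, here is a minimal Java sketch using the standard java.util.regex classes. The class name RegexBasics, the helper names and the sample strings are only assumptions chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexBasics {
    // True when the pattern occurs anywhere inside the input.
    static boolean contains(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find();
    }

    // True only when the whole input matches the pattern.
    static boolean matchesFully(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // A plain word is already a valid pattern.
        System.out.println(contains("regex", "learning regex in Java"));     // true
        // The same word does not match the whole sentence.
        System.out.println(matchesFully("regex", "learning regex in Java")); // false
        // An expression built from symbols: one or more lowercase letters.
        System.out.println(matchesFully("[a-z]+", "regex"));                 // true
    }
}
```

The difference between find() and a full match is worth keeping in mind: a pattern that is found somewhere in a string does not necessarily describe the whole string.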
Let us take a look at the different symbols used in regex.
We can classify regex symbols into 3 groups.

  1. Meta-Characters.
  2. Ranges & reserved symbols.
  3. Quantifiers.

Meta-Characters: In regex, some reserved meta-characters have predefined meanings and express common patterns, such as a digit or a whitespace character, in a compact way.

Meta-character   Alternate expr.                 What it represents
\d               [0-9] or [^\D]                  A digit character
\D               [^0-9] or [^\d]                 Any character that is not a digit
\w               [a-zA-Z_0-9] or [^\W]           A word character
\W               [^a-zA-Z_0-9] or [^\w]          Any character that is not a word character
\s               [ \t\n\x0b\r\f] or [^\S]        Any whitespace character, such as \r, \t or \n
\S               [^ \t\n\x0b\r\f] or [^\s]       Any character that is not whitespace
\b                                               A word boundary
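The meta-characters can be tried out directly in Java. The following is only a sketch: the class name MetaCharacterDemo and the sample inputs are assumptions, and each backslash is doubled because the pattern is written inside a Java string literal.

```java
import java.util.regex.Pattern;

class MetaCharacterDemo {
    // True when the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // True when the regex is found somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(matches("\\d", "7"));   // true: a digit
        System.out.println(matches("\\D", "7"));   // false: \D rejects digits
        System.out.println(matches("\\w", "_"));   // true: underscore is a word character
        System.out.println(matches("\\W", "#"));   // true: # is not a word character
        System.out.println(matches("\\s", "\t"));  // true: a tab is whitespace
        System.out.println(matches("\\S", " "));   // false: \S rejects a space
        // \b marks the edge between a word character and a non-word character.
        System.out.println(found("\\bcat\\b", "a cat sat"));   // true
        System.out.println(found("\\bcat\\b", "concatenate")); // false
    }
}
```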

Ranges & reserved symbols: When we match a pattern, we often have to state extra conditions: whether the match must sit at the beginning or the end of the line, which set or range of characters is allowed at a position, or which alternatives may appear. We define these using ranges and reserved symbols.

Symbol       Definition                                                    Sample results
.            Any character (e.g. any character, then ha, then any one)     sham: matched; gyan: not matched
^            Beginning of the line (e.g. the line must start with sha)     sham: matched; Aha: not matched
$            End of the line (e.g. the line must end with tra)             Mitra: matched; Chakra: not matched
[xyz]        Either x or y or z                                            ax: matched; aa: not matched
[xyz][abc]   x, y or z followed by a, b or c                               sha: matched; sou: not matched
XA           Exactly X followed by A                                       sm: matched; Sa: not matched
X|A          X or A                                                        sX: matched; sZ: not matched
[^xyz]       Negation: inside square brackets, ^ means "anything except"   sam: not matched
[x-y]        A range of characters or digits; both ends are included       sx1: not matched
( )          Grouping                                                      sab1: matched; shac: not matched; syab: not matched; sabb: matched
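The symbols described above (., ^, $, square brackets, |, negation, ranges and grouping) can be sketched in Java as follows. Every pattern in this block is an illustrative assumption picked to fit the sample words, not necessarily the pattern used in the lesson, and the class name RangeSymbolDemo is hypothetical.

```java
import java.util.regex.Pattern;

class RangeSymbolDemo {
    // True when the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // True when the regex is found somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // All patterns below are assumptions made for illustration.
        System.out.println(matches(".ha.", "sham"));     // true: any char, "ha", any char
        System.out.println(matches(".ha.", "gyan"));     // false
        System.out.println(found("^sha", "sham"));       // true: starts with sha
        System.out.println(found("^sha", "Aha"));        // false
        System.out.println(found("tra$", "Mitra"));      // true: ends with tra
        System.out.println(found("tra$", "Chakra"));     // false
        System.out.println(matches("a[xyz]", "ax"));     // true: a, then x, y or z
        System.out.println(matches("a[xyz]", "aa"));     // false
        System.out.println(matches("s(X|A)", "sX"));     // true: s, then X or A
        System.out.println(matches("s(X|A)", "sZ"));     // false
        System.out.println(matches("s[^a]m", "sam"));    // false: [^a] excludes a
        System.out.println(matches("s[^a]m", "sum"));    // true
        System.out.println(matches("[a-d]", "d"));       // true: range ends are included
        System.out.println(matches("[a-d]", "e"));       // false
        System.out.println(matches("s(ab)+1", "sab1"));  // true: the group ab repeats as a unit
        System.out.println(matches("s(ab)+1", "syab"));  // false
    }
}
```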

Quantifiers: Quantifiers say how many times a pattern may occur in a String.

Symbol    Definition                                               Sample results
*         Pattern can occur zero to many times                     s    m: matched; s:m: not matched
+         Pattern can occur one to many times                      s    m: matched; sm: not matched
?         Pattern can occur zero or one time                       shha: not matched
{X}       Pattern must occur exactly X times                       s1234: not matched; s1: not matched
{X,Y}     Pattern must occur at least X and at most Y times        s1: not matched; s12345: not matched
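Here is a small Java sketch of the quantifiers. The patterns (for example s\s*m or s\d{2}) are assumptions chosen so that they agree with the sample words listed above; the class name QuantifierDemo is hypothetical.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // True when the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // * : zero to many whitespace characters between s and m
        System.out.println(matches("s\\s*m", "s    m"));    // true
        System.out.println(matches("s\\s*m", "sm"));        // true
        System.out.println(matches("s\\s*m", "s:m"));       // false
        // + : one to many whitespace characters
        System.out.println(matches("s\\s+m", "s    m"));    // true
        System.out.println(matches("s\\s+m", "sm"));        // false
        // ? : zero or one h
        System.out.println(matches("sh?a", "sha"));         // true
        System.out.println(matches("sh?a", "shha"));        // false
        // {2} : exactly two digits
        System.out.println(matches("s\\d{2}", "s12"));      // true
        System.out.println(matches("s\\d{2}", "s1234"));    // false
        System.out.println(matches("s\\d{2}", "s1"));       // false
        // {2,4} : at least two and at most four digits
        System.out.println(matches("s\\d{2,4}", "s123"));   // true
        System.out.println(matches("s\\d{2,4}", "s1"));     // false
        System.out.println(matches("s\\d{2,4}", "s12345")); // false
    }
}
```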

Email Validation :

Teacher: So Jonny, earlier you said that email validation is confusing. Now can you two tell us what the email validation below says?

^[A-Za-z0-9-\\+]+(\\.[A-Za-z0-9-]+)*@[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})$
Jonny: Yes sir. The first part, ^[A-Za-z0-9-\\+]+, says the email must start with one or more letters, digits, hyphens or plus signs. ^ denotes the start of the line and + says one or more, so the email can start with any number of these characters, but at least one.
Sir: Very good. Alice, you tell me the second part.
Alice: (\\.[A-Za-z0-9-]+)* says that the first part may be followed by a dot and then at least one letter, digit or hyphen. This whole group is optional and may repeat, because * comes at the end.
Sir: Impressive.
Jonny: @[A-Za-z0-9-]+ then strictly matches @ followed by at least one letter, digit or hyphen, because + is there.
Alice: (\\.[A-Za-z0-9]+)* is again a dot followed by at least one letter or digit, and it is optional again.
Jonny: (\\.[A-Za-z]{2,})$ says the email ends ($) with a dot followed by letters in a-z or A-Z, at least two of them.
Sir: Great. Now Alice, tell me a valid email according to this regex.
Alice: or
Sir: Good. Jonny, tell me an invalid one.
Jonny: shamik.mitra@co.i or
Sir: Well, it seems you are learning regex very quickly. Before we finish today's lesson, I will give you one tip: stick the tables above on your desk so that you can go through the regex symbols every day. Then you will remember regex easily.
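To check the students' reading of the email pattern, the five parts they read out can be joined and tried in Java. This is a sketch only: the class name EmailRegexDemo and the example.com and example.org addresses are assumptions, while shamik.mitra@co.i is the invalid address Jonny gives. The pattern is the one from this lesson, not a complete email validator.

```java
import java.util.regex.Pattern;

class EmailRegexDemo {
    // The pattern is assembled from the parts discussed in the lesson.
    private static final Pattern EMAIL = Pattern.compile(
            "^[A-Za-z0-9-\\+]+"        // starts with letters, digits, - or +
            + "(\\.[A-Za-z0-9-]+)*"    // optional groups: dot, then letters, digits or -
            + "@[A-Za-z0-9-]+"         // @, then at least one letter, digit or -
            + "(\\.[A-Za-z0-9]+)*"     // optional groups: dot, then letters or digits
            + "(\\.[A-Za-z]{2,})$");   // ends with a dot and at least two letters

    static boolean isValid(String email) {
        return EMAIL.matcher(email).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("alice@example.com"));            // true
        System.out.println(isValid("alice.smith@mail.example.org")); // true
        System.out.println(isValid("shamik.mitra@co.i"));            // false: only one letter after the last dot
        System.out.println(isValid("alice@example"));                // false: no dot part at the end
        System.out.println(isValid(".alice@example.com"));           // false: cannot start with a dot
    }
}
```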