Friday, June 2, 2017

The confusing RegEx

Regular Expression




Teacher: Today we will learn regex and how to use it in Java.
Jonny and Alice cried out in fear and said in chorus: Sir, it is the most confusing thing there is, and it makes our lives as programmers miserable.
Teacher: (smiling) Yes, I used to think the same way when I was a student :). But it is not as hard as you think. You just need to remember some key points, the way you remember History and Geography.
I will try to point out those key points, which will break your fear.
Before that, tell me why you find regex confusing, and share your confusion with me.
Alice: Sir, the common problem is that it is hard to read and write. Suppose we have to validate a string against a complex pattern, say an email address. The regex for that is a one-liner full of backslashes, square brackets, parentheses and so on, so we are often perplexed about what it says. In simple words, just by looking at a regex solution we don't understand what it is trying to say.
Teacher: So you mean readability, right? You prefer to write more code to avoid regex because the longer code feels more readable. But let me tell you, regex is a Holy Grail: if you unleash its power, you can write concise and readable code.

Jonny: Sir, another problem is that there is no single fixed solution for a problem. Take email validation: if you search for it on Google, you will find a ton of different solutions. So it is hard to pick the right one.
Teacher: That is because you have not understood the crux of regex. Any other problems?
Students: (pin-drop silence)
The teacher slowly steps towards the board and starts his lesson.

What is Regex?
The teacher said: A regular expression is a technique for searching for a pattern in a String. The search pattern can range from very simple to very complex: a word, a sentence, or an expression built from the different meta-characters and symbols used in regex.
To understand regex correctly we need to know the meta-characters/symbols and their meanings. This is the only thing you need to remember.
We find regex hard because we do not understand how the symbols are used.
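Since the lesson is about using regex in Java, here is a minimal sketch of what "searching for a pattern in a String" can look like with the standard java.util.regex package. The class and method names (RegexBasics, containsPattern, matchesEntirely) are made up for illustration. It shows the difference between finding a pattern somewhere inside the input and requiring the whole input to match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative, not part of the lesson.
class RegexBasics {

    // True if the pattern occurs anywhere inside the input (a search).
    static boolean containsPattern(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.find();
    }

    // True only if the entire input matches the pattern.
    static boolean matchesEntirely(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String text = "learning regex in Java";
        System.out.println(containsPattern("regex", text));     // true
        System.out.println(matchesEntirely("regex", text));     // false
        System.out.println(matchesEntirely("regex", "regex"));  // true
    }
}
```

Which of the two you want depends on the task: a search usually calls for find(), while validation of a complete value usually calls for matches().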
Let's take a look at the different symbols used in regex.
We can classify regex symbols into 3 categories.

  1. Meta-Characters.
  2. Ranges & reserved symbols.
  3. Quantifiers.



Meta-Characters: In regex, some reserved meta-characters have predefined meanings and express common patterns, such as a digit or whitespace, in a compact way.



Meta-character            Expression   Equivalent expression          Definition
Digit                     \d           [0-9] or [^\D]                 Matches a digit character
Anything but a digit      \D           [^0-9] or [^\d]                Matches a non-digit character
Word character            \w           [a-zA-Z_0-9] or [^\W]          Matches a word character (letter, digit or underscore)
Anything but a word char  \W           [^a-zA-Z_0-9] or [^\w]         Matches a non-word character
Whitespace                \s           [ \t\n\x0b\r\f] or [^\S]       Matches any whitespace character such as a space, \r, \t or \n
Anything but whitespace   \S           [^ \t\n\x0b\r\f] or [^\s]      Matches any non-whitespace character
Word boundary             \b           none (it matches a position)   Matches the position between a word character and a non-word character, without consuming a character
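To see the meta-characters in action, here is a small Java sketch. The class MetaCharacterDemo and its helper methods are hypothetical names chosen for this example. Keep in mind that inside a Java string literal every regex backslash has to be doubled, so \d is written "\\d".

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative, not part of the lesson.
class MetaCharacterDemo {

    // True only if the entire input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // True if the regex matches somewhere inside the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("\\d", "7"));   // true: 7 is a digit
        System.out.println(fullMatch("\\D", "7"));   // false: 7 is not a non-digit
        System.out.println(fullMatch("\\w", "_"));   // true: underscore is a word character
        System.out.println(fullMatch("\\W", "#"));   // true: # is not a word character
        System.out.println(fullMatch("\\s", "\t"));  // true: tab is whitespace
        System.out.println(fullMatch("\\S", " "));   // false: space is whitespace

        // \b consumes no character; it only checks a position in the text.
        System.out.println(found("\\bcat\\b", "a cat sat"));    // true: cat stands alone
        System.out.println(found("\\bcat\\b", "concatenate"));  // false: cat is inside a word
    }
}
```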




Ranges & reserved symbols: When we match a pattern, we often have to state more than the characters themselves: whether the match must occur at the beginning or the end of the line, which set of characters is allowed at a given position, or which alternatives are acceptable. We define these using ranges and reserved symbols. How many times a pattern may repeat is expressed with quantifiers.




Symbol       Description                                              Example                   Example result
.            Any single character                                     .ha.                      sham: matched; gyan: not matched
^            Beginning of the line                                    ^sha                      sham: matched; Aha: not matched
$            End of the line                                          tra$                      Mitra: matched; Chakra: not matched
[xyz]        Either x, y or z                                         a[xyz]                    ax: matched; aa: not matched
[xyz][abc]   x, y or z followed by a, b or c                          s[hwo][abc]               sha: matched; sou: not matched
XA           Exactly X followed by A                                  sm                        sm: matched; Sa: not matched
X|A          X or A                                                   s(X|A)                    sX: matched; sZ: not matched
[^abc]       Any character except a, b or c                           s[^abc]m                  shm: matched; sam: not matched
[a-c1-9]     One character from a to c or from 1 to 9                 s[x-z1-9]                 sy: matched; s5: matched; sa: not matched
()           Grouping                                                 (s[^yz])(a|b)([a-c1-9])   sab1: matched; sabb: matched; shac: matched; syab: not matched; sacb: not matched

Remember: when ^ is used inside square brackets it negates the set. Inside square brackets | is an ordinary character, so alternation must be written with parentheses, as in s(X|A). A range always describes a single character and includes both of its ends. A two-digit bound such as 1-10 does not mean 1 to 10; it is read as the range 1-1 plus the character 0.
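The rules for anchors, character sets, negation, alternation and grouping can be checked with a short Java sketch. SymbolDemo and its helpers are hypothetical names. The examples write alternation with parentheses and use the single-character range 1-9, because inside square brackets | is a literal character and every range covers exactly one character.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative, not part of the lesson.
class SymbolDemo {

    // True only if the entire input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    // True if the regex matches somewhere inside the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Anchors are checked with find(), since they describe only one end of the line.
        System.out.println(found("^sha", "sham"));    // true
        System.out.println(found("^sha", "Aha"));     // false
        System.out.println(found("tra$", "Mitra"));   // true
        System.out.println(found("tra$", "Chakra"));  // false

        System.out.println(fullMatch(".ha.", "sham"));      // true
        System.out.println(fullMatch("s(X|A)", "sX"));      // true: alternation in a group
        System.out.println(fullMatch("s[X|A]", "s|"));      // true: | is literal inside []
        System.out.println(fullMatch("s[^abc]m", "shm"));   // true
        System.out.println(fullMatch("s[^abc]m", "sam"));   // false: a is excluded
        System.out.println(fullMatch("s[x-z1-9]", "sy"));   // true: y lies in x-z
        System.out.println(fullMatch("s[x-z1-9]", "sa"));   // false: a is outside both ranges

        String grouped = "(s[^yz])(a|b)([a-c1-9])";
        System.out.println(fullMatch(grouped, "sab1"));  // true
        System.out.println(fullMatch(grouped, "syab"));  // false: y is excluded after s
    }
}
```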






Quantifiers: Quantifiers say how many times a pattern can be found in a String.



Quantifier   Description                                             Example       Example result
*            Pattern can occur zero or more times                    s(\s)*m       sm: matched; s    m: matched; s:m: not matched
+            Pattern can occur one or more times                     s(\s)+m       s    m: matched; sm: not matched
?            Pattern can occur zero times or once                    s(h)?a        sha: matched; sa: matched; shha: not matched
{X}          Pattern must occur exactly X times                      s(\d){3}      s123: matched; s1234: not matched; s1: not matched
{X,Y}        Pattern must occur at least X and at most Y times       s(\d){2,4}    s12: matched; s1: not matched; s12345: not matched
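The quantifier rules can be tried out with the following Java sketch. QuantifierDemo and fullMatch are hypothetical names, and every check requires the whole input to match. Note that counted repetition uses curly braces, as in \d{3}.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative, not part of the lesson.
class QuantifierDemo {

    // True only if the entire input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // * : zero or more whitespace characters between s and m
        System.out.println(fullMatch("s(\\s)*m", "sm"));      // true
        System.out.println(fullMatch("s(\\s)*m", "s    m"));  // true
        System.out.println(fullMatch("s(\\s)*m", "s:m"));     // false

        // + : at least one whitespace character is required
        System.out.println(fullMatch("s(\\s)+m", "s    m"));  // true
        System.out.println(fullMatch("s(\\s)+m", "sm"));      // false

        // ? : the h may appear once or not at all
        System.out.println(fullMatch("s(h)?a", "sha"));   // true
        System.out.println(fullMatch("s(h)?a", "sa"));    // true
        System.out.println(fullMatch("s(h)?a", "shha"));  // false

        // {3} : exactly three digits
        System.out.println(fullMatch("s(\\d){3}", "s123"));   // true
        System.out.println(fullMatch("s(\\d){3}", "s1234"));  // false

        // {2,4} : between two and four digits
        System.out.println(fullMatch("s(\\d){2,4}", "s12"));     // true
        System.out.println(fullMatch("s(\\d){2,4}", "s1"));      // false
        System.out.println(fullMatch("s(\\d){2,4}", "s12345"));  // false
    }
}
```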



Email Validation:

Teacher: So, Jonny, earlier you said that email validation is confusing. Now can you two tell us what the email validation regex below says?

^[A-Za-z0-9]+(\\.[A-Za-z0-9-]+)*@[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})$

Jonny: Yes sir. The first part, ^[A-Za-z0-9]+, says the email must start with letters or digits, and there must be at least one of them. ^ denotes the start of the line and + means one or more, so the email starts with a run of letters or digits of any length.
Teacher: Very good. Alice, you tell me about the second part.
Alice: (\\.[A-Za-z0-9-]+)* says that the first part may be followed by a dot (written \\. because the backslash is doubled in a Java string) and then at least one letter, digit or hyphen. The whole group is optional and may repeat, because of the * at the end.
Teacher: Impressive.
Jonny: @[A-Za-z0-9-]+ strictly matches @ and then at least one letter, digit or hyphen, because of the +.
Alice: (\\.[A-Za-z0-9]+)* again means a dot followed by at least one letter or digit, and again the group is optional.
Jonny: (\\.[A-Za-z]{2,})$ means the email ends ($) with a dot followed by letters in a-z or A-Z, two or more of them.
Teacher: Great. Now, Alice, tell me a valid email according to this regex.
Alice: shamik.mitra@gmail.co.in or shamik@gmail.com
Teacher: Good. Jonny, tell me an invalid one.
Jonny: shamik.mitra@co.i or .mitra@gmail.co.uk
Teacher: Well, it seems you are learning regex very quickly. Before we finish today's lesson, here is one tip: stick the tables above on your desk so you can go through the regex symbols every day. Then you will remember regex easily.
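As a closing exercise, the email regex from the lesson can be wrapped in a small Java helper and checked against the addresses Alice and Jonny suggested. EmailValidator and isValid are hypothetical names for this sketch. The pattern is the lesson's simplified one, not a complete validator for every address the email standards allow.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not part of the lesson.
class EmailValidator {

    // The lesson's regex, split across two string literals for readability.
    // Each \\. in the Java source is the regex \. which matches a literal dot.
    private static final Pattern EMAIL = Pattern.compile(
            "^[A-Za-z0-9]+(\\.[A-Za-z0-9-]+)*"
            + "@[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})$");

    // True if the whole string is an email address according to the lesson's regex.
    static boolean isValid(String email) {
        return email != null && EMAIL.matcher(email).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("shamik.mitra@gmail.co.in"));  // true
        System.out.println(isValid("shamik@gmail.com"));          // true
        System.out.println(isValid("shamik.mitra@co.i"));         // false: last part needs 2+ letters
        System.out.println(isValid(".mitra@gmail.co.uk"));        // false: cannot start with a dot
    }
}
```

Compiling the pattern once into a static constant avoids recompiling it on every call, which is a common choice when the same regex is reused many times.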