P3.NET

When Regular Expressions Go Bad

As a compiler guy I’m very comfortable with “standard” regular expressions (REs).  After all scanners for compilers are predominantly written using regular grammars (of which REs are derived).  REs have been around since the early days of computers and yet are still not truly standardized.  Each implementation adds its own twists and rules.  As a result an RE written for one implementation might or might not work (or behave) the same in another.  Therefore whenever REs are involved proper testing is important.  Excluding this detail REs are really, really useful for validating string formats.  While REs cannot validate all string formats they can handle a large subset.  Hence I tend to use REs for validating string formats whenever possible.  One of the big benefits of REs is speed.  Most parsing can be done through REs faster and easier than through hand-generated code.

In .NET The Regex class is available for writing REs.  Here is a sample piece of code I wrote to validate that a string is a properly formatted DNS name.

private static bool Validate ( string value )
{
   Match m = Regex.Match(value, @”^(?<label>[a-zA-Z]([w-]*[a-zA-Zd])?)(.(?<label>[a-zA-Z]([w-]*[a-zA-Zd])?))*$”);
   if ((m == null) || !m.Success)
      return false;

   //Validate the length of each label
   foreach (Group group in m.Groups)
   {
      if (group.Length > 63)
         return false;
   };

   return true;
}

In a nutshell a properly formatted DNS name obeys the following rules::

  • The name is divided into one or more parts. Parts are separated by a dot.
  • Each part must begin with a letter.
  • Each part must end with a letter or digit.
  • Parts may contain letters, digits or dashes (and underscores if you are supporting NetBIOS).
  • A part must be between 1 and 63 characters long.

The above code uses an RE to validate all but the last rule.  In .NET REs you can capture substrings out of the matched expression using capture groups.  In the above code I capture each part separately and then, if the RE validates, enumerate the parts to ensure they are of the appropriate length.   Overall the code is small and easy to read.  I tested the code (using an outside program) against valid and invalid strings to ensure all the rules were correct.  This code is about a year old and is used in my application as part of UI validation, amongst other things.

The other day I was writing some code that relied on the DNS validation code shown earlier.  Basically as the user typed in a textbox the string was validated and controls were enabled or disabled (along with error messages).  So far so good.  I was trying to test an aspect of the UI so I happened to enter a medium size string and accidentally added a semicolon to the end (dfafddafafadfafsadf;).  My application locked up.  Concerned I stepped through my code and found that it was locking up inside my validator (in the RE matching).  Hmm.  It appears that the RE caused an infinite loop.  To verify it was .NET and not something with my code I copied the RE and the input to an external program that happens to use .NET as well and it also locked up.

Actually it was not an infinite loop it just took a really, really long time to finish up.  That’s odd.  What makes the particular input interesting was that it was long and would ultimately fail validation because of an invalid character.  I checked my RE to determine if I was doing anything that might cause the parser to have to do backtracking.  Backtracking occurs when a parser gets into an error state and decides to back up the input until it can recover so it can try a different path.  As strings get longer this can become very time consuming.  With this particular input the parser would reach the semicolon and, in this case, deem the input invalid.  Nowhere in the RE is a semicolon legal.  In my mind the parser should not have to do any backtracking because the semicolon is not valid anywhere in the RE.  Backtracking would do nothing but waste time.  Nevertheless this does not seem to be the case with the .NET version.  Note that this doesn’t mean the parser has a bug but rather the implementation has some worse case paths for some grammars.

Theoretically I could probably modify my RE to use either aggressive parsing and/or different rules but this would require me to: look up how to do it in .NET, document this behavior for other devs who might need to maintain it and test it thoroughly since it might introduce a completely separate set of issues.  Instead I decided that the validation rules are simple enough that I could just ditch the RE and implement the logic myself.  Here is the new version sans RE.

private static bool Validate ( string value )
{
   string[] parts = value.Split(‘.’);
   foreach (string part in parts)
   {
      int len = part.Length;
      if (!Range.InRange(len, 1, 63))
         return false;

      //Validate the part
      if (!Char.IsLetter(part[0]) || !Char.IsLetterOrDigit(part, len – 1))
         return false;

      foreach (char ch in part.Skip(1))
      {
         if (!Char.IsLetterOrDigit(ch) && (ch != ‘_’;) && (ch != ‘-‘))
            return false;
      };
   };

   return true;
}

After implementing this new version I confirmed that my inputs (valid and invalid) still work.  Even better is that the performance is just as fast as the original RE.  Even better than that, the code is even more readable with no loss of functionality or time.  Note that I haven’t fully tested this code yet so if you borrow it then be sure to test it before using it.

Consider it a lesson learned.  Use REs when they meet your need.  When you do use REs be sure to create enough test cases to verify the behavior.  Specifically make sure you test invalid characters throughout strings of varying length and verify the parser does not take too long to run.  If you do run across worse case inputs then either special case these or switch to manually parsing the code.

Comments

  1. You say “Nowhere in the RE is a semicolon legal.” but actually a semicolon IS legal in your regex. Look at your regex again and see if you can find out where. You DO cause the engine to do a ton of backtracking because you have that last optional group that can be repeated multiple times and part of it will match any single character (that’s your *hint*).

  2. Ok, I can’t stand the suspense. Here’s the answer:

    replace “.” with “.” and see if that speeds things up!

  3. Good catch. You’re correct that I forgot to delimit the dot in the posting. Unfortunately I can’t remember whether I just interposed the code incorrectly when writing the post or whether my original code was wrong as well.

    Nevertheless I think my post still has merit because there are cases where REs have to do too much backtracking such that a hand-rolled algorithm is better.

    Thanks for pointing out my mistake.