Brothers In Code

...a serious misallocation of .net resources

Regular Expressions In Oracle

More than once I've needed to find all non-ascii characters in a string.  In a .net regular expression this is relatively easy:

[^\x20-\x7F]

I thought that regex would port over reletively easily to Oracle - wrong.

First, I found more than one site reference the \x sequence.  However I just couldn't get it to work.

I then tried a brute force version:

[^A-Za-z 0-9 \.,\?''""!@#\$%\^&\*\(\)\-_=\;:<>\/\\\|\}\{\[\]`~]

To my surprise that didn't work either.  After struggling with it for a while I found the regex reference page from Oracle.  It states:

The following regular expression operators are allowed within the character list, any other metacharacters included in a character list lose their special meaning (are treated as literals):

  • Range operator '-'
  • POSIX character class [::]
  • POSIX collating sequence [. .]
  • POSIX character equivalence class [= =]

So in other words, character sets are pretty limited in what they can have in it.  If a backslash is treated as a literal and not as an escape sequence, then that explained why my brute force version didn't work - sort of.  I did eventually get it to work but for some reason I had to remove the ']' within the character class.  So that wasn't perfect either.

Finally I gave it one last shot.  Rather than the hex range, I tried the character range for the same values - from space to tilde (look at an ascii table of values and this will make sense).

[^ -~]

Finally this seemed to work as expected (keep in mind thought that character ranges are affected by the NLS_SORT variable - anything other than binaray, might yield unexpected results).  I ran the following query as a test:


select regexp_replace('This is a test '||chr(191), '[^ -~]', '$') from dual;

The only thing left was to handle non-printable chars like crlf and tabs.  I new the hex code was out so I generated the first ascii character with the chr function to replace the space:


select regexp_replace('This is a test crlf'||chr(10)|| 'blah', '[^' || chr(1) ||'-~]', '$') from dual;

Update: After I wrote this article I found a section on regular-expressions.info that explains some of the oracle differences.