PHP Regular Expression Problem

Happy Holidays!

One of my current projects is a small job search website. I have built a Google-style job display page that highlights the search terms. The search terms are turned into “stems” before the search, so I want to highlight all the words that contain that stem. For example, searching for “engineer” really searches for “engin” and I want to highlight “engineer”, “engineers”, “engineering”, etc. To do this, I’ve used a regular expression:

/\b(engin(?:[[:alnum:]])*)\b/i
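Here is a minimal sketch of how that expression might be plugged into preg_replace. The function name highlight_stem and the sample sentence are made up for illustration. Only the pattern shape and the "hilite" class come from this post.

```php
<?php
// Hypothetical helper: wrap every word that starts with $stem in a highlight span.
function highlight_stem(string $text, string $stem): string
{
    // preg_quote() escapes any regex metacharacters in the stem;
    // the second argument also escapes the '/' delimiter.
    $pattern = '/\b(' . preg_quote($stem, '/') . '[[:alnum:]]*)\b/i';

    // $1 is the whole matched word (stem plus any trailing letters or digits).
    return preg_replace($pattern, '<span class="hilite">$1</span>', $text);
}

echo highlight_stem('Engineers love engineering.', 'engin'), "\n";
// <span class="hilite">Engineers</span> love <span class="hilite">engineering</span>.
```

Because the pattern is case-insensitive and $1 echoes back the matched text, the original capitalization of each word is preserved.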

The \b term represents a “word boundary,” which is the position between a word character (a letter, digit, or underscore) and anything else, like a space or a period. This expression worked really well until I wanted to highlight the term “C#”. For some reason the pound sign plays havoc with the word boundary.
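A likely explanation, sketched below: the pound sign is not a word character, so there is no word boundary between “#” and a following space, and a trailing \b can never match there. The sample strings and variable names are invented for the demonstration.

```php
<?php
// \b only matches between a word character ([A-Za-z0-9_]) and a non-word character.

// '#' followed by a space: both are non-word characters, so the trailing \b fails.
$found_plain = preg_match('/\bc#\b/i', 'I write C# daily');

// '#' followed by a letter: that IS a boundary, so the same pattern matches here.
$found_glued = preg_match('/\bc#\b/i', 'I write C#code daily');

// An ordinary word ends in a word character, so \b works as expected.
$found_word = preg_match('/\bengin[[:alnum:]]*\b/i', 'An engineer.');

var_dump($found_plain, $found_glued, $found_word);
// int(0)
// int(1)
// int(1)
```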

Somewhere on the web I found the word boundary was defined as this expression:

[^a-zA-Z0-9]*

So I added a few characters to it to help out with “C#”, “C++”, etc.:

[^a-zA-Z0-9_@\+\-#]*

And my final regular expression became:

/\b(c\#(?:[[:alnum:]])*)(?:[^a-zA-Z0-9_@\+\-#]*)/i

This will correctly highlight C# but for some reason, doing a preg_replace with this regular expression will eat the space trailing the highlighted word. It doesn’t happen with the normal \b word boundary expression. I’m not enough of a regex guru to understand why this happens, or how to fix it. Any suggestions?
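To make the symptom concrete, here is a small reproduction. It assumes the capture group opens right after the \b and that the replacement only puts back $1. The sample sentence is made up.

```php
<?php
// The homemade boundary is a normal character class, so it consumes whatever it matches.
$pattern = '/\b(c\#(?:[[:alnum:]])*)(?:[^a-zA-Z0-9_@\+\-#]*)/i';

// The full match is "C# " (term plus trailing space), but only $1 ("C#") is written back.
$result = preg_replace($pattern, '<span class="hilite">$1</span>', 'I use C# daily');

echo $result, "\n";
// I use <span class="hilite">C#</span>daily
```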

Update 2005-02-05: I played around with this some more today. It turns out that preg_replace was replacing the characters matched by my homemade word boundary expression along with the term itself, which explains why the trailing space or period would disappear. The built-in \b doesn’t have this problem because it is zero-width: it matches a position rather than a character, so there is nothing for it to eat. So I adjusted the expressions some, and this is what I came up with:

/\b(c#[a-zA-Z0-9_@\+\-#]*)([^a-zA-Z0-9_@\+\-#]*)/i

and then my preg_replace uses this:

<span class="hilite">\$1</span>\$2

The \$2 does the job of replacing my homemade word boundary. It seems to work properly. Woohoo!
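For anyone who wants to try it, here is a rough sketch of the final approach wrapped in a function. The name highlight_term, the $wordChars variable, and the sample strings are hypothetical. The pattern and replacement follow the ones above, with preg_quote added so that terms like “C++” are escaped safely.

```php
<?php
// Hypothetical helper built around the final pattern from this post.
function highlight_term(string $text, string $term): string
{
    // Characters treated as part of a "word", including @, +, - and #.
    $wordChars = 'a-zA-Z0-9_@\+\-#';

    // Group 1: the term plus any trailing word characters.
    // Group 2: the homemade boundary, captured so it can be put back.
    $pattern = '/\b(' . preg_quote($term, '/') . '[' . $wordChars . ']*)'
             . '([^' . $wordChars . ']*)/i';

    return preg_replace($pattern, '<span class="hilite">$1</span>$2', $text);
}

echo highlight_term('I use C# daily.', 'c#'), "\n";
// I use <span class="hilite">C#</span> daily.

echo highlight_term('C++ jobs', 'c++'), "\n";
// <span class="hilite">C++</span> jobs
```

Writing $2 back after the closing tag is what keeps the trailing space or period in place.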