2011-04-05

Tip: Using Regex to Find and Replace Two or More Spaces in Text

To find and, perhaps, replace text with two or more spaces use the following regex:

/s/s+

You could use this in Notepad++ or Excel along with just about any programming language.

Regex (Regular Expressions) are one of those techie things that are everywhere (e.g. Excel) but you never use unless you're are a serious nerd (and I mean serious). In order to remain cool, I have avoided their use. However, recently I have been scrubbing large text files and I decided it might be faster if I caved and used a regex.

Regex is a syntax or language for manipulating text or strings. For example, you might remember Microsoft DOS commands like "dir *.*" that would list all files with any filename or three letter extension. Regex is similar (not exactly the same) to shell commands like that.

In my quest to become a nerd, I found that the support documents are not for novices. Send me some comments if you know of some good sandboxes for building regex.

3 comments:

Ben said...

Have you tried Google Refine yet for cleaning up data? I think it has some friendly ways to play with regex but can't remember. It is also not a total nerd tool, good for jocks like yourself as well.

David Hoenig said...

If you *really* want to become a nerd you need to learn Perl, and use that language for exploring the powerful world of regular expressions. Although the initial syntax of regexps was first, Perl really built upon it and now most programs (unix ones at least :)) support "Perl-style" regular expressions. Actually, to become an UBER-nerd, you should learn Python. That would definitely give you geek cred.

By the way, I'm not familiar with regexp substitution support in Excel, but usually the syntax is /thing to replace/replace with this/ so in your example should it be /s+/s? Or, in Perl: s/\s+/\s/g. The "g" says "do this globally, and not just the first occurrence."

David Hoenig said...

Just found this, pretty cool!

http://txt2re.com/