1. Computing
SOUNDEX String Comparison
 Join the Discussion
"Join us in the forum to discuss database issues!"
Mike Chapple
 

In the late nineteenth century, United States census officials faced a dilemma.  During the process of counting the huddled masses, our public servants created a huge paperwork trail that the law required them to preserve for future historians.  With amazing forethought, they realized that people searching for records might not know the exact spelling of their ancestor's name.  Was it Smith or Smythe?  Chapple, Chapel or Chapelle?  

To ease these searches, census officials turned to the Soundex phonetic filing system.  This system uses a simple phonetic algorithm to reduce each name to a four character alphanumeric code.  The first letter of the code corresponds to the first letter of the last name.  The remainder of the code consists of three digits derived from the syllables of the word.  

Largely unused outside of the halls of government and genealogy, the Soundex system is making a comeback in modern databases.  Database developers have long struggled with the problem of matching words that might not look alike, but actually sound alike.  Let's take a look at the following database table:

EmployeeID  LastName             FirstName  ReportsTo   Salary      
----------- -------------------- ---------- ----------- ----------- 
1           Davolio              Ann        2           NULL
2           Fuller               Andrew     NULL        NULL
3           Leverling            Janet      2           NULL
4           Peacock              Margaret   2           NULL
5           Buchanan             Steven     2           NULL
6           Suyama               Michael    5           NULL
7           King                 Robert     5           NULL
8           Callahan             Laura      2           NULL
9           Dodsworth            Anne       5           NULL
Many modern databases such as Microsoft SQL Server and Oracle implement a version of the Soundex function.  (It's important to note that these implementations may vary, so you shouldn't directly compare values obtained from different database engines.)  Here are the results of running Microsoft's SOUNDEX() function on the first name column of the table above.
FirstName  SOUNDEX 
---------- ------- 
Ann        A500
Andrew     A536
Janet      J530
Margaret   M626
Steven     S315
Michael    M240
Robert     R163
Laura      L600
Anne       A500

You'll notice that you can now detect phonetic similarities in the names simply by looking at the numeric values returned by the SOUNDEX function.  Fortunately, computers are much better at comparing numbers than comparing names.  To ease this task, we're provided with the DIFFERENCE function to compare strings based upon their Soundex values.  Microsoft's function returns an integer result ranging in value from 1 (least similar) to 4 (most similar).

Let's suppose overheard a male employee talking in the hallway but didn't quite catch his name.  You might overhear a name that sounded like "Ann" but you're positive it was a male.  The DIFFERENCE function is ideal for this type of situation.  We'd probably begin by specifying a threshold value of 4 to limit the number of results returned.  Here's the query:

SELECT FirstName, DIFFERENCE(FirstName,'Ann') AS 'Difference'
FROM Employees
WHERE DIFFERENCE(FirstName,'Ann')=4
And the results:
FirstName  Difference  
---------- ----------- 
Ann        4
Anne       4
Unfortunately, none of the results were male names.  Let's try a lower threshold of 2:
SELECT FirstName, DIFFERENCE(FirstName,'Ann') AS 'Difference'
FROM Employees
WHERE DIFFERENCE(FirstName,'Ann')>=2
And the results:
FirstName  Difference  
---------- ----------- 
Ann        4
Andrew     2
Janet      2
Laura      2
Anne       4

Hmmm... looks like Andrew might be our man!  For more information on the use of SOUNDEX, consult the documentation for your favorite database platform.



Discuss in my forum

©2013 About.com. All rights reserved.