 |
|
Join the Discussion
|
"Join us in the forum to discuss database
issues!"
Mike
Chapple
|
|
 |
 |
In the late nineteenth century, United States census officials
faced a dilemma. During the process of counting the huddled masses, our
public servants created a huge paperwork trail that the law required them to
preserve for future historians. With amazing forethought, they realized
that people searching for records might not know the exact spelling of their
ancestor's name. Was it Smith or Smythe? Chapple, Chapel or Chapelle? To
ease these searches, census officials turned to the Soundex phonetic filing
system. This system uses a simple phonetic algorithm to reduce each name
to a four character alphanumeric code. The first letter of the code
corresponds to the first letter of the last name. The remainder of the
code consists of three digits derived from the syllables of the
word. Largely unused outside of the halls of
government and genealogy, the Soundex system is making a comeback in modern
databases. Database developers have long struggled with the problem of
matching words that might not look alike, but actually sound alike. Let's
take a look at the following database table:
EmployeeID LastName FirstName ReportsTo Salary
----------- -------------------- ---------- ----------- -----------
1 Davolio Ann 2 NULL
2 Fuller Andrew NULL NULL
3 Leverling Janet 2 NULL
4 Peacock Margaret 2 NULL
5 Buchanan Steven 2 NULL
6 Suyama Michael 5 NULL
7 King Robert 5 NULL
8 Callahan Laura 2 NULL
9 Dodsworth Anne 5 NULL
Many modern databases such as Microsoft SQL Server and Oracle implement a
version of the Soundex function. (It's important to note that these
implementations may vary, so you shouldn't directly compare values obtained from
different database engines.) Here are the results of running Microsoft's
SOUNDEX() function on the first name column of the table above.
FirstName SOUNDEX
---------- -------
Ann A500
Andrew A536
Janet J530
Margaret M626
Steven S315
Michael M240
Robert R163
Laura L600
Anne A500
You'll notice that you can now detect phonetic similarities in the names
simply by looking at the numeric values returned by the SOUNDEX function.
Fortunately, computers are much better at comparing numbers than comparing
names. To ease this task, we're provided with the DIFFERENCE function to
compare strings based upon their Soundex values. Microsoft's function
returns an integer result ranging in value from 1 (least similar) to 4 (most
similar).
Let's suppose overheard a male employee talking in the hallway but didn't
quite catch his name. You might overhear a name that sounded like
"Ann" but you're positive it was a male. The DIFFERENCE function
is ideal for this type of situation. We'd probably begin by specifying a
threshold value of 4 to limit the number of results returned. Here's the
query:
SELECT FirstName, DIFFERENCE(FirstName,'Ann') AS 'Difference'
FROM Employees
WHERE DIFFERENCE(FirstName,'Ann')=4
And the results:
FirstName Difference
---------- -----------
Ann 4
Anne 4
Unfortunately, none of the results were male names. Let's try a lower
threshold of 2:
SELECT FirstName, DIFFERENCE(FirstName,'Ann') AS 'Difference'
FROM Employees
WHERE DIFFERENCE(FirstName,'Ann')>=2
And the results:
FirstName Difference
---------- -----------
Ann 4
Andrew 2
Janet 2
Laura 2
Anne 4
Hmmm... looks like Andrew might be our man! For more information on the
use of SOUNDEX, consult the documentation for your favorite database platform.
|