DBA > Articles

MySQL Fuzzy Text Searching Using the SOUNDEX Function

By: Rob Gravelle
To read more DBA articles, visit http://dba.fyicenter.com/article/

Fuzzy searching has become a very prominent feature of Web search engines like Google. One of the key factors in Google’s explosive growth was its ability to locate Web pages that are likely to be relevant to your search terms even if they don’t map exactly to the content that you are looking for. The idea is not new; approximate string matching, as it’s known in computer science, is performed by algorithms that find strings that match a pattern approximately rather than exactly.

It would seem that the best place for such functionality is right in the database itself, where all the data is stored. Unfortunately, MySQL only provides some modicum of fuzzy search capability. There has always been far more choices in the realm of programming languages like PHP, Java, and what have you.

Nonetheless, I have found that as long as you are clear on what you are trying to accomplish, you can implement fuzzy text searching within your MySQL database by using a combination of built-in and user functions. In today's article, we'll start with the native SOUNDEX MySQL function. How it Works

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. SOUNDEX codes from different strings can be compared to see how similar the strings sound when spoken. The first character of the code is the first character of the expression, converted to upper case. The second through fourth characters of the code are numbers that represent the letters in the expression. The letters A, E, I, O, U, H, W, and Y are ignored unless they are the first letter of the string. All international alphabetic characters outside the A-Z range are treated as vowels. Hence, two strings that sound almost the same should have identical soundex strings. For instance, the words "Assistance" and "Assistants" both produce a soundex of “A223”.

Soundex provides a very simple way to search for misspellings because the Soundex code for strings that are misspelled are often the same. For example, “Williams” and “Wlliams” both produce a soundex value of “W452”. Here is a query that matches authors with last names that sound like “Williams”:

SELECT * FROM `author_bios` where SOUNDEX(`auth_last_name`) = SOUNDEX('Williams')

Matching a Single Search Term against Multiple Words

Soundex() works best when comparing a single word or short phrase against a string of the same length. For that reason it’s not immediately suited for full-text searches against blocks of text. However, with a little coding (or sleuthing as the case may be), we can split the search text into individual words for comparison. This code can be placed in a user function that we can then call like so:

mysql>SELECT soundex_match('minesota', 'Maine, Maryland, Massachusetts, Michigan, Minnesota, Mississippi, Missouri, Montana', ',' ) as 'Soundex Results';

Before we get to the query result, let’s take a look at the soundex_match() function.

As alluded to above, I am not the author of this excellent function. It is the work of Imranul Hoque. I cannot easily say what the function does on a line-by-line basis, but I can say with certainty that:

1 .the function takes 3 arguments:
needle: The word you are looking for
haystack: The block of text that you are searching
splitChar: The whitespace character that’ll split the string into single words. Generally it is the space(‘ ‘)

2. If any word in the haystack sounds similar to the needle, the function will return 1. Otherwise it returns 0.

CREATE DEFINER=`root`@`localhost` FUNCTION `soundex_match`(
needle varchar(128), haystack text, splitChar varchar(1))
RETURNS tinyint(4)
DETERMINISTIC
BEGIN
declare spacePos int;
declare searchLen int default 0;
declare curWord varchar(128) default '';
declare tempStr text default haystack;
declare tmp text default '';
declare soundx1 varchar(64) default '';
declare soundx2 varchar(64) default '';

set searchLen = length(haystack);
set spacePos = locate(splitChar, tempStr);
set soundx1 = soundex(needle);

while searchLen > 0 do
if spacePos = 0 then
set tmp = tempStr;
select soundex(tmp) into soundx2;
if soundx1 = soundx2 then
return 1;
else
return 0;
end if;
else
set tmp = substr(tempStr, 1, spacePos-1);
set soundx2 = soundex(tmp);
if soundx1 = soundx2 then
return 1;
end if;

set tempStr = substr(tempStr, spacePos+1);
set searchLen = length(tempStr);
end if;

set spacePos = locate(splitChar, tempStr);
end while;

return 0;
END

Hence, our query above would return a value of 1 because “minesota” still sounds like “Minnesota”.
Matching Multiple Search Terms against Multiple Words!

The above function works great for needle-in-a-haystack type searching where you have a single search term. Search engines that I’ve used are not limited to one search term. If I was looking for a restaurant called “Ed’s Easy Diner” and I spelled it as “Eds Eezy Dinner”, Google would still find it, and probably with high relevance, unless of course there was a restaurant called “Ed’s Easy Dinner”!

As mentioned earlier, there is no need for a custom function at all if you’re trying to match against the same string but minus the spelling error. In that case a simple Soundex comparison will suffice:

select soundex('eds easy diner'), soundex('Ed\'s Eesy Dinner');

In both cases, soundex() returns a value of “E32356”.

Soundex_match() works great with single word search strings against a larger block of text, but not so much for queries that contain a phrase. This statement would return zero (no match):

select soundex_match('eds eezy dinner', 'Welcome to Ed\'s Easy Diner!', ' ') as 'matches';

The solution that I devised was to split the search string into individual words and the call soundex_match() with each one. I specifically designed it that way so that missing words won’t derail your search. For instance, searching for “Eds Dinner” against “Welcome to Ed\'s Easy Diner!” still works because the function only cares whether or not words that sound like “Eds” and “Dinner” are contained within the target string.

Full article...


Other Related Articles

... to read more DBA articles, visit http://dba.fyicenter.com/article/