String Functions

SOUNDEX and DIFFERENCE

Sometimes we need to check Strings based on how the string sounds when it is spoken and wish to perform searching on the SOUND or pronunciation of the words. For such things, SQL Server provides us SOUNDEX and DIFFERENCE Function.

SOUNDEX

SOUNDEX Function returns a four-character code as an alphanumeric expression in VARCHAR datatype to compare the similarity between strings in terms of their sounds.

Syntax-

SOUNDEX ( character_expression )

The character_expression can be a constant, variable, or column and it can be nested using CONCAT(),’+’.

Algorithm-

  • The first character of the code is always the first character of character_expression, converted to UPPER case.If the first character of character_expression is anything other than 26 alphabets, It avoids other checkings and returns simply as ‘0000’.
  • The second through fourth characters of the code are numbers that represent the letters in the expression based on how the string sounds when it is spoken.
  • Except the first character of character_expression, remove all other occurrences of the letters A, E, I, O, U, W, H and Y. There are 8 vowels (Remember WHY- ‘W, H & Y’ are added as Phonetic Vowels based on sound) and therefore 18 consonants.
  • Assign numbers to the remaining (18 consonants) letters (after the first) like below:
b, f, p, v1
c, g, j, k, q, s, x, z2
d, t3
l4
m, n5
r6
  • If two or more letters with the same number were adjacent in the original name, or adjacent except for any intervening h and w, then omit all but the first.
  • Zeroes will be added at the end of the result code if neccessary to make a four-character code.

First Example for Applying Algorithm-

SELECT SOUNDEX('SQLSERVERWHY') AS RESULT; --	S426
1S QLSERVERFirst character taken
2S QLSRVRPhonetic vowels removed
3S QLSRVRepeated letters removed (after the first letter)
4S 24261Numbers assigned
5S 4261If two or more letters with the same number were adjacent(near to) in the original name,or adjacent except for any intervening h and w, then omit all but the first
6S 426Take 2 to 4 letters after the first.(4 zeroes padded at the end if necessary to bring 4-char code)

Second Example for Applying Algorithm-

SELECT SOUNDEX('COfFeE') AS RESULT; --	C100
1C OfFeEFirst character taken
2C fFPhonetic vowels removed
3C FRepeated letters removed (after the first letter)
4C 1Numbers assigned
5C 100After the first, only 1 letter.(4 zeroes padded at the end to bring 4-char code)

DIFFERENCE

DIFFERENCE Function performs a SOUNDEX Function on two strings to evaluate the similarity of two SOUNDEX codes, and returns an integer value in int datatype that indicates how similar the SOUNDEX codes are for those strings on a scale of 0 to 4.

  • 0 indicates weak or no similarity between the SOUNDEX codes.
  • 4 indicates strongly similar, same, or even identically matching of the SOUNDEX codes.

Syntax-

DIFFERENCE ( character_expression1 , character_expression2 )

Example-

SELECT DIFFERENCE('TREE','THREE') AS [Similarity_Level]; --4

NOTES while Using SOUNDEX and DIFFERENCE

  • DIFFERENCE and SOUNDEX have collation sensitivity.
  • The SOUNDEX and DIFFERENCE Function does not trims the leading and trailing spaces by nature. For more matching capability, USE LTRIM(RTRIM( Character_Expression )) or TRIM(Character_Expression), to remove leading and trailing spaces before doing SOUNDEX, DIFFERENCE Functions.
  • If the input string is NULL or empty/blank/”, then SOUNDEX Function returns NULL or 0000 and DIFFERENCE Function returns NULL or 0 respectively.
  • If the input string is two or more phrase words, SOUNDEX and DIFFERENCE works only for first phrase of the input string. (Eg.,) String 1 is ‘Dream big’ and String 2 is ‘Dream Small’. SOUNDEX and DIFFERENCE takes only first phrase as ‘Dream’ and returns both are same.

Example Code- 1. SOUNDEX, DIFFERENCE Behaviour

IF OBJECT_ID('[dbo].[Locality]') IS NOT NULL 
	DROP TABLE [dbo].[Locality];
	GO

CREATE TABLE [dbo].[Locality](
	[Locality_key] [int] IDENTITY(1,1) NOT NULL CONSTRAINT [PK_Locality] PRIMARY KEY CLUSTERED,
	[Locality] [varchar](100) NULL,
	[Custom_Locality] [varchar](100) NULL
);
GO

INSERT INTO [dbo].[Locality] ([Custom_Locality])
VALUES
('T_NAGAR'),('T-NAGAR'),('T.NAGAR'),('T NAGAR'),('TI NAGAR'),('TE NAGAR'),
('TEE NAGAR'),(' TI NAGAR'),(' T NAGAR'),(' TE   NAGAR '),('   TEE    NAGAR    '),
('TNAGAR'),('TINAGAR'),('TENAGAR'),('TEENAGAR'),(' TENAGAR '),('   TEENAGAR    ');
GO

UPDATE [dbo].[Locality] SET [Locality]='T. NAGAR';
GO

--To check SOUNDEX,DIFFERENCE Behaviour
SELECT [Locality],[Custom_Locality],
SOUNDEX(RTRIM(LTRIM([Locality]))) AS SOUNDEX_Locality,
SOUNDEX(RTRIM(LTRIM([Custom_Locality]))) AS SOUNDEX_Custom_Locality,
DIFFERENCE([Locality],RTRIM(LTRIM([Custom_Locality]))) AS DIFFERENCE_Locality_Custom_Locality
FROM [dbo].[Locality];
GO

Example Code- 2. Comparing of SOUNDEX Sound-Data Like Search Results in SQL Server

DECLARE @locality VARCHAR(MAX)='ti nagar';
SELECT DISTINCT locality FROM dbo.[Locality] 
WHERE SOUNDEX(custom_locality) = SOUNDEX(@locality);

Example Code- 3. Sorting of SOUNDEX Sound-Data Like Search Results in SQL Server

DECLARE @locality VARCHAR(MAX)='ti nagar';
SELECT locality,custom_locality
FROM dbo.[Locality]
WHERE SOUNDEX(custom_locality) = SOUNDEX(@locality)
ORDER BY (CASE 
WHEN (CHARINDEX(@locality ,custom_locality)) = 1 AND LEN(custom_locality) = LEN(@locality ) THEN -1 
ELSE CHARINDEX(@locality ,custom_locality) END),custom_locality;
GO

CONCLUSION

Use SOUNDEX and DIFFERENCE functions when you want to Compare Sound-Like Data in SQL Server. Thanks for reading!

Advertisement

One thought on “SOUNDEX and DIFFERENCE

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s