(Note: For testing we used PostgreSQL 8.3.x)
The problem
Suppose we are asked to do a flexible search by customer name. The idea is that the user can enter as search criteria a name or part of the name without having to consider uppercase and lowercase letters and accented letters.
Alternative solutions
The simplest, though incorrect, would make a simple assessment of equality operator "=" SELECT * FROM
customers WHERE name = 'MARTIN'
This strategy is basic and has many limitations. In principle, no search by name and not find a client named "SAN MARTIN".
PostgreSQL provides some comparison functions most powerful strings that are useful for these cases. Here are the most common and used. LIKE
is a standard for comparison of strings in SQL. Lets compare a string with a certain pattern. The most common format for comparison flexible use% to represent zero or more characters:
Some examples:
'martin' LIKE 'martin'> True (equal to operator =)
'martin' LIKE '% rt%'> True (%. ..% to detect overlap of texts)
'san martin' LIKE '% martin%'> True
'San Martin' LIKE '% martin%'> False
'San Martín' LIKE '% martin% '> False
Limiting LIKE what we do is that is sensitive to uppercase and lowercase, making the search very strict.
ILIKE
ILIKE SQL is not a standard but an extension of PostgreSQL. It can be used to make a search that ignores case (case insensitive) and therefore more flexible. Continuing the example above:
'San Martin' ILIKE '% martin%'> True
'San Martín' ILIKE '% martin%'> False
'San Martín' ILIKE '% Martin%'> True
The problem is that iLike is insensitive to case sensitive, but does not ignore accented characters and so the penultimate comparison returns false. So, ILIKE considered "and" equal to "E", but does not believe "and" equal to "e". Does not make a match between stressed and unstressed letters.
to_ascii
To resolve this problem we can use, in certain circumstances, to_ascii function, which allows us to convert accented letters before making the comparison. To_ascii converts all accented vowels in unaccented equivalent. Example: to_ascii ('San Martín') returns "San Martin". Accented letters thus stop generating problems comparisons.
Example of use:
to_ascii ('San Martín') ILIKE to_ascii ('% martin%')> True
Returning to the example of the principle, with the following SELECT:
name:
MARTIN, SAN MARTIN, San Martin, Martin, San Martín, Etc.
That's what we were looking for. However, this convenient feature only supports conversion from ASCII encodings LATIN1 , LATIN2 , latin9 and WIN1250 . Until recently this was not a problem because the most common encoding of the database was latin1. The problem arises when the database is UTF8 encoded , which is becoming more common, because it fails to to_ascii allows searches to ignore accents. The question is: Is there something like databases to_ascii UTF8? Comparison
ignore accents with UTF8 encoding
The issue does not seem easy from the technical point of view. In principle and what can be seen in the technical forums are not intended to incorporate PostgreSQL in the short term to solve some native function the problem (some say a new contrib module which will allow to remove accents from version 8.5, but it's all unofficial.)
There are many proposals in the forums, in some cases small variations on the same idea. The best solution I found to solve the problem is to create a function that replaces accented characters by their non-accented and use in the comparison instead of to_ascii.
The conversion function is:
CREATE OR REPLACE FUNCTION sp_ascii (character varying)
RETURNS text AS $ BODY
$
SELECT TRANSLATE
($ 1,
'áàâãäéèêëíìïóòôõöúùûüÁÀÂÃÄÉÈÊËÍÌÏÓÒÔÕÖÚÙÛÜçÇ'
'aaaaaeeeeiiiooooouuuuAAAAAEEEEIIIOOOOOUUUUcC');
$ BODY $
LANGUAGE 'sql';
and is used as follows :
This has worked well databases with UTF8 encoding recovering:
MARTIN, SAN MARTIN, San Martin, Martin, San Martín, Etc.
A note on performance. With the function SP_ASCII information retrieval becomes a little slower. Is not very noticeable but can affect whether the database is very large. To speed up the search can limit the conversion of characters that are relevant in each case. For example, the database used for tests only found a few accented letters so they were eliminated the remaining cases of the conversion function which was reduced to:
With this simple case the search is done in half the time. Should be tested in each case and make assessments. Alternatively, some users on the forums recommend creating a functional index to improve performance. References
More information on LIKE, iLike and other string comparison (SIMILAR TO, POSIX) in:
http://www.postgresql.org/docs/8.3/interactive/functions-matching. html
Information
to_ascii and limitations:
http://www.postgresql.org/docs/current/interactive/functions-string.html Discussion
the PostgreSQL community on the subject, potugués (no accents com UTF8 problem):
http://www.mail-archive.com/pgbr-geral @ listas.postgresql.org.br / msg15630.html
versions of the "conversion function" to UTF8:
http://www.mail-archive.com/pgbr-geral @ listas.postgresql.org.br/msg15632.html
http : / / wiki.postgresql.org / wiki / Strip_accents_from_strings, _and_output_in_lowercase
conversion function in perl: @ postgresql.org/msg128804.html
http://www.mail-archive.com/pgsql-general
French Blog that is the same problem:
http://code18.blogspot.com/2009/03/comparaison-de-texte-postgresql.html
comprehensive article on UNICODE:
http://www.joelonsoftware. com / articles / Unicode.html
The problem
Suppose we are asked to do a flexible search by customer name. The idea is that the user can enter as search criteria a name or part of the name without having to consider uppercase and lowercase letters and accented letters.
Alternative solutions
The simplest, though incorrect, would make a simple assessment of equality operator "=" SELECT * FROM
customers WHERE name = 'MARTIN'
This strategy is basic and has many limitations. In principle, no search by name and not find a client named "SAN MARTIN".
PostgreSQL provides some comparison functions most powerful strings that are useful for these cases. Here are the most common and used. LIKE
is a standard for comparison of strings in SQL. Lets compare a string with a certain pattern. The most common format for comparison flexible use% to represent zero or more characters:
Some examples:
'martin' LIKE 'martin'> True (equal to operator =)
'martin' LIKE '% rt%'> True (%. ..% to detect overlap of texts)
'san martin' LIKE '% martin%'> True
'San Martin' LIKE '% martin%'> False
'San Martín' LIKE '% martin% '> False
Limiting LIKE what we do is that is sensitive to uppercase and lowercase, making the search very strict.
ILIKE
ILIKE SQL is not a standard but an extension of PostgreSQL. It can be used to make a search that ignores case (case insensitive) and therefore more flexible. Continuing the example above:
'San Martin' ILIKE '% martin%'> True
'San Martín' ILIKE '% martin%'> False
'San Martín' ILIKE '% Martin%'> True
The problem is that iLike is insensitive to case sensitive, but does not ignore accented characters and so the penultimate comparison returns false. So, ILIKE considered "and" equal to "E", but does not believe "and" equal to "e". Does not make a match between stressed and unstressed letters.
to_ascii
To resolve this problem we can use, in certain circumstances, to_ascii function, which allows us to convert accented letters before making the comparison. To_ascii converts all accented vowels in unaccented equivalent. Example: to_ascii ('San Martín') returns "San Martin". Accented letters thus stop generating problems comparisons.
Example of use:
to_ascii ('San Martín') ILIKE to_ascii ('% martin%')> True
Returning to the example of the principle, with the following SELECT:
SELECT * FROM customers WHERE to_ascii (name) ILIKE to_ascii ('% martin%')
We will win back customers name:
MARTIN, SAN MARTIN, San Martin, Martin, San Martín, Etc.
That's what we were looking for. However, this convenient feature only supports conversion from ASCII encodings LATIN1 , LATIN2 , latin9 and WIN1250 . Until recently this was not a problem because the most common encoding of the database was latin1. The problem arises when the database is UTF8 encoded , which is becoming more common, because it fails to to_ascii allows searches to ignore accents. The question is: Is there something like databases to_ascii UTF8? Comparison
ignore accents with UTF8 encoding
The issue does not seem easy from the technical point of view. In principle and what can be seen in the technical forums are not intended to incorporate PostgreSQL in the short term to solve some native function the problem (some say a new contrib module which will allow to remove accents from version 8.5, but it's all unofficial.)
There are many proposals in the forums, in some cases small variations on the same idea. The best solution I found to solve the problem is to create a function that replaces accented characters by their non-accented and use in the comparison instead of to_ascii.
The conversion function is:
RETURNS text AS $ BODY
$
SELECT TRANSLATE
($ 1,
'áàâãäéèêëíìïóòôõöúùûüÁÀÂÃÄÉÈÊËÍÌÏÓÒÔÕÖÚÙÛÜçÇ'
'aaaaaeeeeiiiooooouuuuAAAAAEEEEIIIOOOOOUUUUcC');
$ BODY $
LANGUAGE 'sql';
and is used as follows :
SELECT * FROM customers WHERE SP_ASCII (name) ILIKE SP_ASCII ('% martin%')
This has worked well databases with UTF8 encoding recovering:
MARTIN, SAN MARTIN, San Martin, Martin, San Martín, Etc.
A note on performance. With the function SP_ASCII information retrieval becomes a little slower. Is not very noticeable but can affect whether the database is very large. To speed up the search can limit the conversion of characters that are relevant in each case. For example, the database used for tests only found a few accented letters so they were eliminated the remaining cases of the conversion function which was reduced to:
SELECT TRANSLATE ($ 1, 'áéíóúÁÉÍÓÚçÇ' 'aeiouAEIOUcC');
With this simple case the search is done in half the time. Should be tested in each case and make assessments. Alternatively, some users on the forums recommend creating a functional index to improve performance. References
More information on LIKE, iLike and other string comparison (SIMILAR TO, POSIX) in:
http://www.postgresql.org/docs/8.3/interactive/functions-matching. html
Information
to_ascii and limitations:
http://www.postgresql.org/docs/current/interactive/functions-string.html Discussion
the PostgreSQL community on the subject, potugués (no accents com UTF8 problem):
http://www.mail-archive.com/pgbr-geral @ listas.postgresql.org.br / msg15630.html
versions of the "conversion function" to UTF8:
http://www.mail-archive.com/pgbr-geral @ listas.postgresql.org.br/msg15632.html
http : / / wiki.postgresql.org / wiki / Strip_accents_from_strings, _and_output_in_lowercase
conversion function in perl: @ postgresql.org/msg128804.html
http://www.mail-archive.com/pgsql-general
French Blog that is the same problem:
http://code18.blogspot.com/2009/03/comparaison-de-texte-postgresql.html
comprehensive article on UNICODE:
http://www.joelonsoftware. com / articles / Unicode.html
0 comments:
Post a Comment