Searches for text fields in MySQL
Version: MySQL 5.1.x
time ago I published a post on the problem of searching for text fields using PostgreSQL:
http://romanmussi.blogspot.com/2009/12/busquedas-por-campos-de- text-en.html
Today we will consider the same question in MySQL.
The problem
Suppose we are asked to do a flexible search by customer name. The idea is that the user can enter as search criteria a name or part of the name without having to consider uppercase and lowercase letters and accented letters.
The MySQL solution
From the point of view the solution code MySQL can be very simple and almost transparent:
SELECT * FROM customers WHERE name LIKE '% martin%'
LIKE is used, standard SQL, to compare a string with a certain pattern. Y is used% which represents zero or more characters to make the comparison more flexible (allows partial search.)
This should win back customers from name:
MARTIN, SAN MARTIN, San Martin, Martin, San Martín, Etc.
Ready? Can we play again Berusky? Mmmm ... not so fast. For this search to work we need to understand better how to make MySQL ago comparisons between texts.
CHARACTER SET and COLLATE
main thing is to understand what are the character sets and the "matches" (1). A character set is a set of symbols and their corresponding encoding. A collation is a set of rules for comparing the symbols of a set of characters (2).
MySQL Server supports many character sets. The best known English-speaking world are LATIN1 and UTF8. In turn, a character set always has at least one matching, but may have several. For example, the latin1 character set has the following matches:
latin1_german1_ci (German DIN-1) latin1_swedish_ci
(Swedish / Finnish)
latin1_general_ci (Multilingual)
latin1_general_cs (Multilingual, case sensitive)
latin1_spanish_ci (English Modern)
As noted, there is a convention for collation names: start with the name of the character set to which they are associated, then include the name of the language, and end with _ci (not case-sensitive), _cs ( case-sensitive), or _bin (binary).
The comparison tells the engine how it should compare the symbols of a set of characters. Let's say you say some commonly used comparisons of the characters that should be considered equivalent to the letter "i" (lowercase i).
latin1_general_ci For comparison the "i" is equivalent to "I" (capital i). This means that the engine will only be considered as equivalent to the lowercase i's i capital.
latin1_spanish_ci For comparison the "i" is equivalent to: "I Ì Í Î Ï ì í î ï. In this case the engine is to consider the i lowercase equivalent of a much wider range of options.
Applying the above rules if the engine evaluates: 'Martin' LIKE '% martin%'
• returns false in latin1_general_ci
• returns true in latin1_spanish_ci
To find out more rules equivalence of each comparison is very useful site:
http://www.collation-charts.org/
Once we understand how MySQL performs the comparison between characters simply must make sure you use a comparison that fits our needs and that depends on the settings. There are default settings for character sets and collations at four levels: server, database, table, and connection. And there are specific settings for fields, for example. You can see in detail how the settings are set as precedence are evaluated in the MySQL Manual (2). Here we simply mention two ways to force the use of a specific matching to solve the problem of finding a simple way.
1) You can set the character set and default collation of the column
Each column of type "character" (CHAR, VARCHAR, or TEXT) has a set of characters and collation of column. In general the default settings of the column has the highest precedence when making the comparison as it solves our problems. The column definition syntax has optional clauses for specifying the character set and collation.
Example: CREATE TABLE
customers
(
name VARCHAR (20) CHARACTER SET latin1 COLLATE latin1_spanish_ci
)
can change these values \u200b\u200bin an existing column using phpMyAdmin or the command ALTER TABLE, according to your convenience.
2) You can specify the collation be used to define the query
If you want to modify the default definitions in the database you can specify the collation to use when defining the SQL query as follows:
SELECT * FROM customers WHERE name
LIKE '% martin%' COLLATE latin1_spanish_ci
In this case one must be careful that the character set matches the set in column (latin1, utf8, etc.).
With either of these two procedures achieved run queries as needed.
Notes (1) In the MySQL 5.0 Manual in English COLLATE translates as "snack." Here we have preferred to "matching" as being more appropriate or friendly. Anyway it is a matter of taste.
(2) For a thorough analysis of character sets and collation see:
MySQL Manual 5.1 (English)
Chapter 9. Internationalization and Localization
http://dev.mysql.com/doc/refman/5.1/en/internationalization-localization.html
MySQL Manual 5.0 (English, corresponds to an earlier version but still useful for the topic )
Chapter 10. Character Set Support
http://dev.mysql.com/doc/refman/5.0/es/charset.html
0 comments:
Post a Comment