After reading websec's brilliant post on optimizing data retrieval from mysql, I thought of further optimizations. The advantage of websec's "find_in_set"-method is that the binary encoding of ascii characters is effectively reduced from 7 bits to 1-6 bits. The reason is that the "find_in_set" function returns the index at which the character appears in a string, meaning that the character 'd' in the set 'a,b,c,d' is encoded as 11 instead of 1100100.
It is possible to further optimize this by ordering the characters in the set by frequency. Meaning that the most commonly used characters will be found with the least number of requests. Let's redo the example from websec's article using the letter frequency of the english alphabet as a guide line:
FIND_IN_SET(MID(table_name,1,1), 'e,t,a,o,i,n,s,h,r,d,l,
c,u,m,w,f,g,y,p,b,v,k,j,x,q,z,_,0,1,2,3,4,5,6,7,8,9,$, ,
[,],!,@,#,%,^,&,*,(,),-,+,=,\,,",\',~,`,|,{,},:,;')
String: | c | h | a | r | a | c | t | e | r | _ | s | e | t |
Alphabetical: | 3 | 8 | 1 | 18 | 1 | 3 | 20 | 5 | 18 | 37 | 19 | 5 | 20 |
By frequency: | 12 | 8 | 3 | 9 | 3 | 12 | 2 | 1 | 9 | 27 | 7 | 1 | 2 |
Even this can be further optimized by using individual frequencies for each position in the string. Even though the letter 'e' is the most common in english text, it is not the most common letter for the first character in the word. Another optimization can be achieved by gathering the character frequencies for the target text we are attempting to extract (in websec's example the name of a table). Also, when working with table names we can exclude disallowed characters ('/', '.', '\'...), otherwise we need to include all characters that can be expected in the string.