Using Index Scan for Text in Postgresql
While implementing the backend database for the Infoplease Distance Calculator, we found that Postgresql text fields used sequential scans instead of index scans. This was maddening -- I've frequently used indexed text data in Postgresql, and functional indexes of text are not that rare either. After a bit of poking around, I did find the problem. As usual, the documentation was quite plain. But it did not turn up quickly in any of the searches I did. I kept searching for things like:
postgresql text sequential scan
postgresql text force index scan
postgresql index text
But it took a long time for me to find stuff that wasn't about making sure the like operator was specified with the wildcard at the end of the string, or creating functional indexes on lower(text). Finally, I found the relevent note on the postgresql docs. In order to do the character-by-character incremental search of the index, the collation order needs to be well defined, and it is not with UTF. In our case, the database is SQL_ASCII, but that's not enough: the DBMS was initialized with UTF-8 (initdb) using the standard Fedora postgresql init scripts.
Fortunately, there is a workaround:
CREATE INDEX loc_name ON loc (lower(name) text_pattern_ops);
instead of
CREATE INDEX loc_name ON loc (lower(name));
To illustrate:
<br />db=# CREATE INDEX loc_name ON loc (lower(name));<br />CREATE INDEX<br />db=# EXPLAIN SELECT name FROM loc WHERE lower(name) LIKE lower('Empire State Building');<br /><br /> QUERY PLAN <br />-----------------------------------------------------------------------------------------<br />Seq Scan on loc (cost=0.00..19389.57 rows=3339 width=45)<br /> Filter: (lower(name) ~~ 'empire state building'::text)<br />(2 rows)<br /><br />db=# EXPLAIN SELECT name FROM loc WHERE lower(name) = lower('Empire State Building');<br /><br /> QUERY PLAN <br />--------------------------------------------------------------------------------------------------<br />Bitmap Heap Scan on loc (cost=29.08..4540.32 rows=3339 width=45)<br /> Recheck Cond: (lower(name) = 'empire state building'::text)<br /> -> Bitmap Index Scan on loc_name (cost=0.00..29.08 rows=3339 width=0)<br /> Index Cond: (lower(name) = 'empire state building'::text)<br />(4 rows)<br /></pre></blockquote><br /><br /><br />So that's our problem, "LIKE" forces a sequential scan, but "=" does not. So we need to fix it with a better index.<br /><br /><br /><blockquote><pre><br />db=# CREATE INDEX loc_name_tpo ON name (lower(name) text_pattern_ops);<br />CREATE INDEX<br />geobase=# EXPLAIN SELECT name FROM loc WHERE lower(name) LIKE lower('Empire State%');<br /> QUERY PLAN <br />------------------------------------------------------------------------------------------------------------------------<br />Index Scan using loc_name_tpo on loc (cost=0.00..4.01 rows=1 width=45)<br /> Index Cond: ((lower(name) ~>=~ 'empire state'::text) AND (lower(name) ~<~ 'empire statf'::text)) Filter: (lower(name) ~~ 'empire state%'::text) (3 rows) </pre></blockquote><br /><br /><br />To save the space of a second index, you can drop the first one with no ops code, but be careful, once you drop that you must use "LIKE" and not "="<br /><br /><br /><blockquote><pre><br />db=# DROP INDEX loc_name;<br />DROP INDEX<br />db=# EXPLAIN SELECT name FROM loc WHERE lower(name) = lower('Empire State Building');<br /><br /> QUERY PLAN <br />----------------------------------------------------------------------------------------<br />Seq Scan on loc (cost=0.00..19389.57 rows=3339 width=45)<br /> Filter: (lower(name) = 'empire state building'::text)<br />(2 rows)<br />
The alternative is to reinitdb in the C locale - which would mean dumping and reloading every DBMS we run. I'm not really up for that task, so I'm sticking with this for now, unless there substantial reasons to undertake that effort. I tend to think it would be better if Fedora used the C locale to init the DBMS, but for now I'll live with what I've got.