Sunday, October 01, 2006

PostgerSQL Index Scan for UTF8 Text

Using Index Scan for Text in Postgresql



While implementing the backend database for the Infoplease Distance Calculator, we found that Postgresql text fields used sequential scans instead of index scans. This was maddening -- I've frequently used indexed text data in Postgresql, and functional indexes of text are not that rare either. After a bit of poking around, I did find the problem. As usual, the documentation was quite plain. But it did not turn up quickly in any of the searches I did. I kept searching for things like:

postgresql text sequential scan
postgresql text force index scan
postgresql index text


But it took a long time for me to find stuff that wasn't about making sure the like operator was specified with the wildcard at the end of the string, or creating functional indexes on lower(text). Finally, I found the relevent note on the postgresql docs. In order to do the character-by-character incremental search of the index, the collation order needs to be well defined, and it is not with UTF. In our case, the database is SQL_ASCII, but that's not enough: the DBMS was initialized with UTF-8 (initdb) using the standard Fedora postgresql init scripts.

Fortunately, there is a workaround:

CREATE INDEX loc_name ON loc (lower(name) text_pattern_ops);

instead of

CREATE INDEX loc_name ON loc (lower(name));

To illustrate:

<br />db=# CREATE INDEX loc_name ON loc (lower(name));<br />CREATE INDEX<br />db=# EXPLAIN SELECT name FROM loc WHERE lower(name) LIKE lower('Empire State Building');<br /><br />                                      QUERY PLAN                              <br />-----------------------------------------------------------------------------------------<br />Seq Scan on loc (cost=0.00..19389.57 rows=3339 width=45)<br />  Filter: (lower(name) ~~ 'empire state building'::text)<br />(2 rows)<br /><br />db=# EXPLAIN SELECT name FROM loc WHERE lower(name) = lower('Empire State Building');<br /><br />                                           QUERY PLAN                         <br />--------------------------------------------------------------------------------------------------<br />Bitmap Heap Scan on loc (cost=29.08..4540.32 rows=3339 width=45)<br />  Recheck Cond: (lower(name) = 'empire state building'::text)<br />  ->  Bitmap Index Scan on loc_name  (cost=0.00..29.08 rows=3339 width=0)<br />        Index Cond: (lower(name) = 'empire state building'::text)<br />(4 rows)<br /></pre></blockquote><br /><br /><br />So that's our problem, "LIKE" forces a sequential scan, but "=" does not. So we need to fix it with a better index.<br /><br /><br /><blockquote><pre><br />db=# CREATE INDEX loc_name_tpo ON name (lower(name) text_pattern_ops);<br />CREATE INDEX<br />geobase=# EXPLAIN SELECT name FROM loc WHERE lower(name) LIKE lower('Empire State%');<br />                                                      QUERY PLAN                                                      <br />------------------------------------------------------------------------------------------------------------------------<br />Index Scan using loc_name_tpo on loc  (cost=0.00..4.01 rows=1 width=45)<br />  Index Cond: ((lower(name) ~>=~ 'empire state'::text) AND (lower(name) ~<~ 'empire statf'::text))    Filter: (lower(name) ~~ 'empire state%'::text) (3 rows) </pre></blockquote><br /><br /><br />To save the space of a second index, you can drop the first one with no ops code, but be careful, once you drop that you must use "LIKE" and not "="<br /><br /><br /><blockquote><pre><br />db=# DROP INDEX loc_name;<br />DROP INDEX<br />db=# EXPLAIN SELECT name FROM loc WHERE lower(name) = lower('Empire State Building');<br /><br />                                      QUERY PLAN                              <br />----------------------------------------------------------------------------------------<br />Seq Scan on loc (cost=0.00..19389.57 rows=3339 width=45)<br />  Filter: (lower(name) = 'empire state building'::text)<br />(2 rows)<br />



The alternative is to reinitdb in the C locale - which would mean dumping and reloading every DBMS we run. I'm not really up for that task, so I'm sticking with this for now, unless there substantial reasons to undertake that effort. I tend to think it would be better if Fedora used the C locale to init the DBMS, but for now I'll live with what I've got.

Saturday, September 30, 2006

Infoplease Distance Calculator

Distance Calculator



Infoplease has published a new distance calulator, which includes over 7.5 million location names worldwide. And to eliminate the guesswork associated with a static entry form, the application uses AJAX-based type-as-you-go inputs.

The coolest part has got to be the size of the data set, though. In addition to cities, towns, and other populated places, you can enter mountain peaks, lakes, and so on. For U.S. locations, the level of detail gets down to things like the Boston Massacre Marker, in Boston, Massachusetts.

Along with distances, the results include latitude and longitude of both points. However, there is also a separate latitude/longitude finder that uses the same data set.

Thursday, May 18, 2006

Linux iTunes Server on Fedora Core 5

Linux iTunes Server - Fedora Core 5



There are 2 components to an iTunes server on Fedora Core 5: the iTunes server daemon and the zeroconf (or bonjour, or rendezvous) multicast.

For the iTunes server, I used mt-daapd (http://www.mt-daapd.org/). I downloaded the standard SRPM from sourceforge and made a minor edit to the configure command in the spec:

./configure --prefix=$RPM_BUILD_ROOT/usr --enable-howl --with-howl-includes=/usr
/include/avahi-compat-howl

With that change, the rpm compiled cleanly, and hooked into the FC5 zeroconf multicast system, avahi.

To configure avahi, you need to add an xml fragment to /etc/avahi/services/ - it should look something like this:





Basement Tunes

_daap._tcp
3689





(man avahi.service for more info)

The file gets placed in /etc/avahi/services/ and must have an extension of ".service"

Once you do this, and start the avahi-daemon and mt-daapd services, and "Basement Tunes" will show up on your iTunes sidebar.

Keep in mind that the mDNS multicast will not typically pass through a router, and everything should work fine.