A teammate installed a new feature on CodeRanch JForum that uses a 4,515,409 row table. When dealing with over a million rows, scans become a huge performance drain, to the point where one query on its own was merely slow, but real usage with many running at the same time brought down the app. The reason the query was slow was interesting, so I asked him if I could blog about it.
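For context, here's a rough sketch of what the table and its index presumably look like. The column types and sizes are my assumptions; the only things the queries and plans below actually confirm are the table name, the column names, and the index name ip2l_low_asc_idx.

-- hypothetical reconstruction; the real definitions may differ
create table ip2location (
    low         bigint not null,  -- start of an IP range, as a number
    high        bigint not null,  -- end of an IP range, as a number
    countryCode char(2),
    countryName varchar(64),
    region      varchar(128),
    city        varchar(128)
);
create index ip2l_low_asc_idx on ip2location (low asc);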
The original query
select countryCode, countryName, region, city from ip2location where 540815125 >= low and 540815125 <= high;
Running this through explain says it uses the index with a cost of:
Bitmap Heap Scan on ip2location (cost=5949.66..54170.71 rows=219870 width=32)
That’s a really high-cost explain plan.
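As an aside, if you want actual row counts and timings rather than the planner's estimates, explain analyze will execute the query and report both (output not shown here):

explain analyze select countryCode, countryName, region, city from ip2location where 540815125 >= low and 540815125 <= high;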
My first thought was to change it to:
explain select countryCode, countryName, region, city from ip2location where 540815125 >= low and 540815125 <= high and low >= 540815125 - 1000;
Which has a much better explain plan of:
Index Scan using ip2l_low_asc_idx on ip2location (cost=0.00..8.77 rows=1 width=32)
The reason is that the first query only bounds low from above, so Postgres needs to scan the large index from the beginning until it reaches the target value. In the second, I gave it permission to start really close to the target row. I subtracted 1000, but that was arbitrary; the cushion just needs to be big enough that the scan still starts at or before the matching row's low value, so no data gets missed.
My approach also makes the lookup time consistent: it is always looking through 1000 worth of index values. (That's always fewer than 1000 rows, since each row covers a whole low-to-high range.) The original can be anything from immediate to a full index scan, depending on where the target value falls in the index.
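To generalize the hack, here's roughly what it looks like with the numeric IP as a bind parameter. The :ip placeholder and the between shorthand are mine, not necessarily how the application issues it.

-- :ip is the IP address already converted to its numeric form
select countryCode, countryName, region, city
from ip2location
where low between :ip - 1000 and :ip   -- same arbitrary 1000 cushion as above
  and :ip <= high;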
My teammate, who wrote the original query, then switched it to:
select countryCode, countryName, region, city from ip2location where low = (select max(low) from ip2location where 540815125 >= low)
This has the same explain cost as my hacky version and is clearer. The logic is the same, though: it doesn't require scanning or returning any extra results.
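One defensive tweak worth considering (my addition, not part of the teammate's query): if the IP ranges ever have gaps between them, the max(low) subselect would still match the range just below the gap, so keeping the high comparison in place confirms the address really falls inside the returned range.

select countryCode, countryName, region, city
from ip2location
where low = (select max(low) from ip2location where 540815125 >= low)
  and 540815125 <= high;  -- guards against an address that lands in a gap between ranges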