Last month, I migrated the jforum.net forum data to a coderanch jforum forum. I had a requirement/goal to update the links in our forum so they work rather than point to the jforum broken links.
First I created the mapping. (lesson: store this data as you migrate so you don’t have to do it later.) I wound up mapping on subject lines and dates. Luckily the threads were migrated in numeric order so I could fill in the gaps. But that wasn’t interesting enough to blog about. What was interesting enough to blog about was the SQL queries to update the database.
My goal
I wanted to make the changes entirely through the database – no Java code. I also wanted to avoid postgres stored procedures because I encountered some time sinks last time I wrote a postgres stored procedure. I am happy to say I achieved my goal.
Step 1 – Analysis
Noted that I need to update 375 rows. Too many to do by hand. There are just under 5000 posts in the jforum forum. Which means therea rea t most 5000 search/replace strings to check. This doesn’t seem bad for a computer. Once I know the SQL, I can write code to generate 5000 of them using my local mappings and then run the SQL script on the server.
select * from jforum_posts_text where post_text like '%http://www.jforum.net/posts/list/%'
Step 2 – Horrible performing but functional query
It’s got to work before you can tune it. My first attempt was:
explain update jforum_posts_text set post_text=replace(post_text, 'http://www.jforum.net/posts/list/5.page','http://www.coderanch.com/t/574339 ') where post_text like '%http://www.jforum.net/posts/list/5.page%';
The query plan was:
Seq Scan on jforum_posts_text (cost=0.00..132971.82 rows=1 width=459)
Filter: (post_text ~~ '%http://www.jforum.net/posts/list/5.page%'::text)
5000 of those is going to take over an hour of hitting the database hard. Not good. I could run it over the weekend when volumes are need be, but I can do better than that.
Step 3 – Trying to use an index
I know that most (95% maybe) of the updates are in the JForum forum. We had a few “legacy” links to jforum.net in other forums, but not a lot. I then tried adding a condition on forum id
explain update jforum_posts_text set post_text=replace(post_text, 'http://www.jforum.net/posts/list/5.page','http://www.coderanch.com/t/574339 ') where post_text like '%http://www.jforum.net/posts/list/5.page%' and post_id in (select post_id from jforum_posts where forum_id = 95);
The query plan was:
Nested Loop (cost=22133.48..97281.40 rows=1 width=459)
HashAggregate (cost=22133.48..22263.46 rows=12998 width=4)
Bitmap Heap Scan on jforum_posts (cost=245.18..22100.99 rows=12998 width=4)
Recheck Cond: (forum_id = 95)
Bitmap Index Scan on idx_posts_forum (cost=0.00..241.93 rows=12998 width=0)
Index Cond: (forum_id = 95)
Index Scan using jforum_posts_text_pkey on jforum_posts_text (cost=0.00..5.76 rows=1 width=459)
Index Cond: (jforum_posts_text.post_id = jforum_posts.post_id)
Filter: (jforum_posts_text.post_text ~~ '%http://www.jforum.net/posts/list/5.page%'::text)
Well, it is using the indexes now. But it is only about a 30% drop in cost for the worst case. On an untuned complex query, I usually see at least an order of magnitude performance jump on my initial tuning.
Step 4 – time for a temp table
Most of the work is finding the 375 rows that need updating in a table with 1,793,111 rows. And it has to happen for each of the 5000 times I run the query.
I decided to use a temporary table so I could run the expensive part once.
create table jeanne_test as select * from jforum_posts_text where post_text like '%http://www.jforum.net/posts/list/%';
explain update jforum_posts_text set post_text=replace(post_text, 'http://www.jforum.net/posts/list/5.page','http://www.coderanch.com/t/574339 ') where post_id in (select post_id from jeanne_test where post_text like '%http://www.jforum.net/posts/list/5.page%');
Now I’m doing the expensive part once. It still takes a couple seconds to do the first part. But the second part – the update I’m running 5000 times – drops the query plan to
Nested Loop (cost=13.50..21.99 rows=1 width=459)
HashAggregate (cost=13.50..13.51 rows=1 width=4)
Seq Scan on jeanne_test (cost=0.00..13.50 rows=1 width=4)
Filter: (post_text ~~ '%http://www.jforum.net/posts/list/5.page%'::text)
Index Scan using jforum_posts_text_pkey on jforum_posts_text (cost=0.00..8.46 rows=1 width=459)
Index Cond: (jforum_posts_text.post_id = jeanne_test.post_id
Nice. Running the script with the 5000 update statements only took a few seconds.
Conclusion
Database tuning is fun. Explain is your friend. As are different approaches. And for those who aren’t doing match, the performance jump was 3-4 orders of magnitude.