educating software developers

Reading  Bjarne Stroustrup on Educating Software Developers really got me thinking – about questions I see from students at JavaRanch/reading, how my own education fits in and what I would like to see from new college graduates.

Yet while Stroustrup agrees that Java has been used to dumb down CS programs, ultimately, “the problem is one of attitude, more than an issue of programming language.

As the quote shows, he doesn’t just talk about a simplistic view of things.  It was one of the most thought out articles I’ve read in quite some time.

My view from JavaRanch and reading

Few students see code as anything but a disposable entity needed to complete the next project and get good grades. Much of the emphasis in teaching encourages that view.

This is definitely a problem.  One of the manifestations of it is students who try to get help on one specific programming assignment (or worse – purchase a solution.)  No thought is given to how not thinking about the assignment is going to hurt the student of future assignments or classes.  It’s like math – you can’t get away with not learning how to do algebra if you expect to do calculus.  Similarly, you can’t expect to write a complex program if you can’t write a simple one.

Better are the students who do the assignment of course.  But it’s still a disposable entity in most courses.  There are a few that build it up over the course of the semester such as the LISP course that I took.  In my mind, these are the best kind because they force students to look at code they wrote weeks/months ago.  Reading code instills an awareness of the impacts of style down the road.

Essentially all had their code littered with “magic constants.” They had never been taught that was bad style – in fact they had never been taught about programming style because the department “taught computer science; not programming.” That is, programming was seen as a lowly skill that students either did not need or could easily pick up on their own.

Showing examples of “good” and “bad” code is still rare in teaching programming.  This is true in both academia and in books.  When I read the Dietel Intro to Java book, some of the examples were quite long.  (The Swing methods spanned pages.)  I saw many instances of things that could have been extracted into methods.   Now this is a book that does teach programming so the “computer science” thing isn’t an excuse.  I’ve seen this in other books as well.  If the first lengthy code students see uses the “throw tons of code into one method” approach, how are they expected to know better?

Similarly, many books that teach JSP have examples with JDBC code in the JSP.  This is something that is considered horrible practice by industry and not a stylistic thing.  When people post questions on JavaRanch mixing JSP and JDBC someone will typically point out that is bad practice.  The poster usually comes back saying he or she read it in a book or that’s what the teacher said to do.  Again, how is a beginner supposed to know not to do things like this when all signs indicate this is how things work.

My education

As an undergrad, I was in a Computer Science program.  This covered the range of topics you would expect – theory, practice, programming (in a few languages, data structures, etc), database, networking, etc.

We did have a software engineering course where we built a larger system as a class.  While it covered the phases (design, etc), I don’t remember much emphasis on clean code or testing.  To be fair, I took this class the first semester it was offered.  It likely got better after that.  The current course description reads “Principles of software engineering including the software life cycle, reliability, maintenance, requirements and specifications, design, implementation, and testing.”  So I have hope that it teaches more of the important skills now.

As a graduate student, I studied “Computer Information Technology” which was a mix of more practical topics – general industry themes, management, communication, programming (one language) and design.

What really jumped out at me in the programming class was that we were asked to peer review each other’s code.  This is a good thing – code review is an important skill.  Then I saw the checklist.  One of the things we were supposed to check was “whether the code compiles.”  I attended grad school while working full time as a Java developer.  I had enough experience to know that code that doesn’t compile isn’t useful.  It’s better to have compilable/runnable code that doesn’t do as much.  I flat out refused to review code that didn’t compile.  My “review” of such code was one line long – “doesn’t compile – doesn’t work.”  The problem here is that academia is willing to accept and grade code that doesn’t compile in some instances.  Would that fly at your job?  It wouldn’t at mine.  Nobody can claim something is “done” if it doesn’t compile.

What I’d like to see

Education should prepare people to face new challenges; that’s what makes education different from training. In computing, that means knowing your basic algorithms, data structures, system issues, etc., and the languages needed to apply that knowledge. It also means having the high-level skills to analyze a system and to experiment with alternative solutions to problems

I absolutely agree with this.  However, I don’t think think education should conflict with training.  I’m all for training entry level hires and interns.  However, I don’t think it should be necessary to re-train them to unlearn bad habits taught by a college.

So, brush up on your communication skills.

Another thing that it would be nice for schools to acknowledge.  My business minor taught communication and writing.  My computer science major barely recognized they exist.

Learn your first language well. That means trying it for difficult tasks. Don’t obsess about technical details. Focus on techniques and principles.

Learn another programming language; choose any language that’s quite different from what you are best acquainted with. You can’t be a professional in the IT world knowing only one language. No one language is the best for everyone and for everything.

Don’t just do programming. Computing is always computing something. Become acquainted with something that requires your software development skills: Mediaeval history, car engine design, rocket science, medical blood analysis, image processing, computational geometry, biological modeling, whatever seems interesting. Yes, all of these examples are real, from my personal experience.

All good advice.  Learning multiple languages certainly makes you stronger.  And learning a business domain makes you able to relate programming to the real world.

javaranch/jforum – nailing down a threading problem

Similar to the thought process for debugging a performance problem, here’s one for debugging a threading issue:

  1. In Production, we have a problem where the threads in a forum listing disappear showing only one or two threads in their place.  We found a workaround pretty quickly (clear the cache), but want to stop the problem from occurring.
  2. For the better part of a week, we found it puzzling.  A couple moderators noted that the problem seemed to crop up after moving a thread from one forum to another.
  3. We still haven’t been able to reproduce the bug on demand.  It sounds like something threading related.  These tend to be the mysterious looking bugs.
  4. All right.  I’m going to sit down and look at the relevant code now.
    • I know the problem has something to do with the TopicRepository class which is where this data is stored.
    • I see clearCache method that empties the cache is in fact called when we move threads.
    • I also see addAll resets the forum list.  This seems to not be called since we only see a post or two.  Hmm.  Let’s see where this is called.  I see; it’s called when someone requests to view the list of topics in a forum.  The code gets the topics from the cache.  If there aren’t any topics cached or the cache reports as “not loaded” for the forum, it gets them from the database and loads the cache.  Clearly this isn’t happening.
    • I also see a method that adds a single topic to the cache.  It is called when we post announcements.  This looks promising.
    • I tried locally to reproduce the bug and got it!  We weren’t posting announcements though.  Then I searched the code for other calls to this method.  There is another call when we “update board status” which happens after someone replies to an existing thread.  Eureka!  This must be what’s happening.
  5. Now I can reproduce on our sandbox server on demand.  (see path below.)
  6. How to fix.  Well the topics being non-empty doesn’t seem like something I can control.  Looking at the other part – the loaded flag – seems more promising.
  7. I see the flag isn’t cleared in the clearCache method.  I’ll add that in and see what happens.
  8. The bug goes away!  Great news!

The path that causes the bug to manifest:

  1. open two threads in the Test Forum
  2. move a thread from Test Forum to Another Test – this clears the cache
  3. in Test Forum, add a post to an existing thread – this adds one topic to the the list
  4. reload list of threads for Test Forum – now it thinks there is one topic in the list and doesn’t look for more
  5. you only see the one edited thread in the list

production is slow – tuning jforum for javaranch – part 2

A moderator reported that moving topics took a long time in production and our sandbox.  I just tried it and got 70 seconds through the GUI!  It didn’t used to take this long.  I’m comparing our development sandbox a few months ago to our development sandbox today.  Clearly something has changed.  The question is what!

I heard a couple people liked reading about the thought process for troubleshooting slow queries. If you aren’t familiar with what a database explain is used for, see that entry before reading this one.

This time the path to success was a little more meandering.  I’ve left the dead ends in here so you get a feel for the actual thought process and not a sanitized version.  if you are a student, this is a decent feel for what problems look like in the “real” world.

  1. First step – find piece of code in JForum that handles moving threads.
  2. It’s GenericForumDAO.moveTopics.  I see this method was modified by us (JavaRanch) to support leaving a copy of the thread in the original forum or creating a new copy and locking the original (for use in moving to our private “Trash Can” forum.)  We made that change a long time ago though and it wasn’t slow then.
  3. Looking at the method, I see a number of database queries.  I’m adding logging (nothing fancy – just System.out’s) to see which query is using up so much time.
  4. The method I added logging to takes 5-8 milliseconds.  (across two trials).  Yet the browser plugs away for almost a full minute after that.
  5. Ah!  The web/action layer class ModerationHelper makes a bunch of calls to clear/reload the cache.  I bet that’s where all the time is going.  Adding logging on that level and redeploying.
  6. Now we are getting somewhere.  Everything goes very quickly until the part where the hotest topics get reloaded and then that takes almost a minute.
  7. Yuck.  Table scans left and right.  No wonder it takes so long.  Three tables scans, joining on the result and THEN discarding almost all the records.  (see “original query” and “original explain” in supplemental materials section below)
  8. If I was doing this by brute force, I would add an index on topic_views and refactor the query to search on the the topics table only joining with the others on result.  This is similar to the last post where I walked through an optimization.  However, I’m not sure that makes sense here.
  9. This definition of hottest posts uses the most page views forever.  This is ok for now because all the migrated posts started with zero views against them.  Making the hottest threads really “the hottest threads since our Saturday January 3rd migration.”  What happens down the road.  When a thread stops being viewed altogether, it’s not hot anymore.  I’m thinking it would be better to put a time based constraint on things.
  10. I’ll post in our Moderators Only forum to get some feedback on my proposed definition change and go have dinner.
  11. Unanimous agreement on my proposed definition from everyone who saw the question while I was eating.  Including the site owner.  Good news.
  12. Modify the query to add a where clause by thread last modified date
  13. This modified query takes 80 thousand units instead of 253 thousand units.  Better but still way too high.
  14. Next I’ll try adding index on topic_last_post_id, topic_views
  15. Running explain again gives the exact same output.  It isn’t using the index.
  16. Dropping the index since it didn’t work.  I may re-add it later but I like to test only one variable at a time.
  17. The database seems to be favoring the table scan of the forum table first.  Probably because this table is order of magnitudes smaller than the other two in the join.
  18. Now I’ll try removing the forum table from the query and run an explain to see what happens.  It’s still 78 thousand units, but at least it is using an index (post_time) as the first step.
  19. Time to try to add the index and run explain again.  Nothing.  Still 78 thousand units.
  20. All right.  This tweaking thing isn’t getting me anywhere.  New approach.  How would I build this query if I were writing it from scratch.
  21. The base query that I want to happen first is explain select p.post_id from jforum_posts p WHERE p.post_time >= ‘2008-01-01’.  Running an explain on just this says it takes 1 thousand cost units.  That sounds high – I’ll query some details.  Over 100 thousand posts in a few days?  Oh yeah!  It’s 2009 – a New Year.  I should know that – I celebrated recently.
  22. Re-running the original query with the post_time clause gives me under two thousand cost units.  Whew.  That’s much better.
  23. Now I’m uploading the changed query to our sandbox environment to see how the move performs.  Three to five seconds.  Good.
  24. Another moderator suggested using replies and not views.  This is a good idea, but has no effect on performance so I switched it. (see “Final query” in supplemental materials)
  25. And a good techie verifies everything so running explain one last time to confirm this last change didn’t affect performance.  It didn’t.  Performance is still ok. ( see “Final explain” in supplemental materials)

Summary

Two orders of magnitude performance improvement on the database query and a web action that isn’t painful to wait for.  I’m happy with that.  Not only is it faster, but we uncovered a better business definition for the query.

I also learned the last time we tested a move on our sandbox server was before we tested migrating large quantities of data.  Making it an invalid test.

Supplemental materials

Original query

SELECT t.*, p.user_id AS last_user_id, p.post_time, p.attach AS attach, f.forum_name \
FROM jforum_topics t, jforum_posts p, jforum_forums f \
WHERE p.post_id = t.topic_last_post_id \
AND p.need_moderate = 0 \
and p.forum_id = f.forum_id \
ORDER BY topic_views DESC \
LIMIT ?

Original explain

Row #         QUERY PLAN
1       Limit (cost=253457.29..253457.39 rows=40 width=132)
2       -> Sort (cost=253457.29..254467.05 rows=403905 width=132)
3       Sort Key: t.topic_views
4       -> Hash Join (cost=26807.80..136006.35 rows=403905 width=132)
5       Hash Cond: (“outer”.forum_id = “inner”.forum_id)
6       -> Hash Join (cost=26707.81..129847.79 rows=403905 width=113)
7       Hash Cond: (“outer”.post_id = “inner”.topic_last_post_id)
8       -> Seq Scan on jforum_posts p (cost=0.00..50770.74 rows=1775019 width=24)
9       Filter: (need_moderate = 0)
10       -> Hash (cost=19386.05..19386.05 rows=403905 width=93)
11       -> Seq Scan on jforum_topics t (cost=0.00..19386.05 rows=403905 width=93)
12       -> Hash (cost=99.79..99.79 rows=79 width=27)
13       -> Seq Scan on jforum_forums f (cost=0.00..99.79 rows=79 width=27)

Final query

SELECT t.*, p.user_id AS last_user_id, p.post_time, p.attach AS attach, f.forum_name \
FROM jforum_topics t, jforum_posts p, jforum_forums f \
WHERE p.post_time >= ? \
and p.post_id = t.topic_last_post_id \
AND p.need_moderate = 0 \
and p.forum_id = f.forum_id \
ORDER BY topic_replies DESC \
LIMIT ?

Final explain:

Row #         QUERY PLAN
1       Limit (cost=1774.18..1774.23 rows=20 width=132)
2       -> Sort (cost=1774.18..1774.28 rows=40 width=132)
3       Sort Key: t.topic_replies
4       -> Hash Join (cost=99.99..1773.12 rows=40 width=132)
5       Hash Cond: (“outer”.forum_id = “inner”.forum_id)
6       -> Nested Loop (cost=0.00..1672.53 rows=40 width=113)
7       -> Index Scan using idx_posts_time on jforum_posts p (cost=0.00..678.22 rows=177 width=24)
8       Index Cond: (post_time >= ‘2009-01-05 00:00:00’::timestamp without time zone)
9       Filter: (need_moderate = 0)
10       -> Index Scan using idx_topics_lp on jforum_topics t (cost=0.00..5.61 rows=1 width=93)
11       Index Cond: (“outer”.post_id = t.topic_last_post_id)
12       -> Hash (cost=99.79..99.79 rows=79 width=27)
13       -> Seq Scan on jforum_forums f (cost=0.00..99.79 rows=79 width=27)