Stackoverflow comes through again

There is a major function in our program called the Migrator. Its purpose is to “migrate” content from the content storage to the various feature players, either manually under user control or automatically based on what is on the schedule. Unfortunately the original requirements, put in place for a potential customer who ended up going a different way, were that the user had to be able to see and control each individual file in the migration job, for each destination. This means there is a tree showing each destination, then all the top level containers (playlists), all the middle level containers below that (CPLs), and then all the individual files (track files, fonts, projector control files, etc.) below that, all wrapped up in a nice little collapsible tree. (Thanks to Sun for providing an example implementation of JTreeTable on their web site.)

I was partially responsible for the View, and entirely responsible for the Model and Controller for the GUI side, and one of our Chinese colleagues was responsible for the engine that does the actual migration. Except Kris, the guy who did the View in the GUI, also heavily modified the engine in later releases to change the way it worked. He has since left the company over issues very similar to why I’m leaving at the end of this month, and so now the GUI code is entirely my responsibility.

The migrator is now getting some heavy workouts: we are using a multi-hundred terabyte NAS for the content storage instead of the 1.5 TB local storage that we were using before, and one of our test setups has over 25 feature players (destinations). The guy testing that setup keeps complaining about how slow it is to create a migration job. He says that it starts off taking about half a second per file/destination pair, and gets slower the more pairs there are. He also says that if anything is playing on some of the feature players, things get even worse.

Our most junior programmer, Sandy, has been investigating this, and he’s convinced the problem is in my GUI code. His “proof” seemed pretty shaky to me. But coincidentally, I was reading StackOverflow yesterday (you wondered when I was going to get to mentioning StackOverflow, right? And like it’s any coincidence – I hit refresh more often on StackOverflow than all my other browser tabs put together – it’s become an obsession.) and somebody asked about Java profilers, and it was mentioned that there is a new profiler, VisualVM, built into JDK 1.6.07. We’re using JDK 1.6.05 right now, but I figured it wouldn’t be too hard to upgrade to 1.6.07 and run the profiler.

Unfortunately the profiler kind of sucks. It shows both the GUI and the engine spending 60% of their time in sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run, but it doesn’t tell me whether that time was spent waiting for RMI connections or processing something. Not very helpful. But up near the top of the profiler’s “Hot List” was “MigratorProxy_Stub.addToJob”, which is called each time a file/destination pair is added to the migration job before it’s started.

So I looked at the source code for the “addToJob” method in the engine, and found something very odd: every time you add a single file/destination pair to the job, it calls a method called “validateJob”, which goes through the entire job so far and, for each file/destination pair already in the job, does a DNS lookup on the destination to ensure that it’s a valid destination. Then it calls an RMI method to enumerate all the feature players that are currently playing, and goes through all the file/destination pairs again to see if the destination is one that is currently playing. If it is, it sets the status on that file/destination pair to “throttled”. As far as I can tell, that’s two expensive checks being done in an O(N^2 x M^2) manner: with N files and M destinations there are N x M adds, and each add re-validates every pair added so far. Even if the checks should be done at all, which is doubtful in the case of the first one, they could be done in an O(N x M) manner by doing them once in the startJob method instead.
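
To put rough numbers on that, here is a minimal sketch that just counts how many per-pair checks each approach performs. It is not the real engine code: the class MigrationJobSketch, the Pair type, the checks counter and the countChecks helper are all made up for illustration, and the "check" is only a counter increment standing in for the DNS lookup and the currently-playing test.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names come from the real engine. */
public class MigrationJobSketch {

    /** Stand-in for one file/destination pair in the job. */
    private static final class Pair {
        final String file;
        final String destination;

        Pair(String file, String destination) {
            this.file = file;
            this.destination = destination;
        }
    }

    private final List<Pair> pairs = new ArrayList<>();
    private final boolean validateOnEveryAdd;
    private long checks = 0; // number of per-pair checks performed so far

    MigrationJobSketch(boolean validateOnEveryAdd) {
        this.validateOnEveryAdd = validateOnEveryAdd;
    }

    void addToJob(String file, String destination) {
        pairs.add(new Pair(file, destination));
        if (validateOnEveryAdd) {
            validateJob(); // walks the whole job on every single add
        }
    }

    void startJob() {
        if (!validateOnEveryAdd) {
            validateJob(); // walks the whole job exactly once
        }
    }

    private void validateJob() {
        for (Pair p : pairs) {
            // Assumed placeholder for the expensive work (DNS lookup on
            // p.destination, "is it playing?" check); here we only count it.
            checks++;
        }
    }

    /** Builds a job of files x destinations pairs and returns the check count. */
    public static long countChecks(int files, int destinations, boolean validateOnEveryAdd) {
        MigrationJobSketch job = new MigrationJobSketch(validateOnEveryAdd);
        for (int f = 0; f < files; f++) {
            for (int d = 0; d < destinations; d++) {
                job.addToJob("file" + f, "fp" + d);
            }
        }
        job.startJob();
        return job.checks;
    }

    public static void main(String[] args) {
        System.out.println("validate on every add: " + countChecks(100, 25, true));
        System.out.println("validate once at start: " + countChecks(100, 25, false));
    }
}
```

For 100 files and 25 destinations that is 2,500 pairs, so validating on every add does 1 + 2 + ... + 2,500 = 3,126,250 checks, while validating once in startJob does 2,500. If each real check costs a network round trip, that difference would go a long way toward explaining a job that gets slower with every pair added.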

I can’t wait to see what happens when the junior guy tests the code with that change in it.

I’ll never feel guilty about reading StackOverflow at work again.

One thought on “Stackoverflow comes through again”

  1. Oh yeah BTW, all that checking if the FP is playing is unnecessary, since the FP is responsible for throttling now. That validateJob() crap was a China leftover.

    Good luck at Paysuxx.
