Java Barbie says “kill -3 [pid] is my new best friend”

The same problem I mentioned in Rants and Revelations » Java Thread Locking cropped up again. This time it was quite random, but repeatable. I dreaded going through the crap I went through last time to find where the lockup was happening, until I discovered a nifty new trick – if you do a “kill -3” of the java process id, it dumps a stack trace of every thread, including what locks it’s holding, to stdout.
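For cases where sending a signal to the process isn't convenient, roughly the same information can be collected from inside the JVM. This is only a sketch, assuming Java 6 or later (where ThreadMXBean.dumpAllThreads exists); the class name ThreadDumpSketch and the output format are made up for illustration and are not what "kill -3" prints.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MonitorInfo;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of the application described in this post.
class ThreadDumpSketch {
    /** Builds a text dump of every live thread: state, stack, and held monitors. */
    static String dumpAllThreads() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        // Lock details are optional JVM features, so request only what is supported.
        ThreadInfo[] infos = bean.dumpAllThreads(
                bean.isObjectMonitorUsageSupported(),
                bean.isSynchronizerUsageSupported());
        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : infos) {
            out.append('"').append(info.getThreadName())
               .append("\" state=").append(info.getThreadState());
            if (info.getLockName() != null) {
                // The lock this thread is blocked or waiting on, and who holds it.
                out.append(" waiting on ").append(info.getLockName());
                if (info.getLockOwnerName() != null) {
                    out.append(" held by \"").append(info.getLockOwnerName()).append('"');
                }
            }
            out.append('\n');
            StackTraceElement[] stack = info.getStackTrace();
            MonitorInfo[] monitors = info.getLockedMonitors();
            for (int depth = 0; depth < stack.length; depth++) {
                out.append("    at ").append(stack[depth]).append('\n');
                for (MonitorInfo monitor : monitors) {
                    // Print each monitor next to the frame that acquired it.
                    if (monitor.getLockedStackDepth() == depth) {
                        out.append("      - locked ").append(monitor).append('\n');
                    }
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpAllThreads());
    }
}
```

Like the signal approach, this only sees threads in one JVM, so a lockup that spans an RMI client and server still means taking a dump on each side and matching them up by hand.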

Going through the stack trace, I could see where one thread on the client had three locks and was calling an RMI method on the server that was locked waiting for the delete thread to finish. And the delete thread was calling a callback on the client that was waiting for one of those three locks, so the delete thread was locked as well. Not good. I removed most of the locks and things started working. Maybe eventually I will put some of the locks back.

Rohan suggests that I might have to rewrite parts of the server to take care of the next bug report on my list – the complaint is that deleting content takes too long. Unfortunately the bits he wants me to rewrite are his code, and it will take me 2 weeks just to understand it well enough to start to make the changes, and I’ve only got 10 days to clear all the bug reports off my list.

Jeppesen Responds

After receiving the email I mentioned in Rants and Revelations » Who’d have thunk it?, I responded with

I have renamed the part of the Wiki that uses the trademarked word
“NavData” to “DAFIFReplacement”. However, I am going to continue to use
the “/navdata/” part of the URL as that is a generic term and
untrademarkable and changing would break people’s bookmarks. You can
have a look at http://xcski.com/navdata/ if you wish.

I hope that meets your requirements.

Evidently their lawyers work nights, or they’ve outsourced it to India or something, because I got a response at 8:47pm:

Mr. Tomblin,

We appreciate your prompt action and reply to our notice.

While we cannot agree that the navdata term is generic, we understand the
bookmark issue and are satisfied with your action regarding this matter.

John Jaugilas
Jeppesen Intellectual Property
(303) 328-4178

Who’d have thunk it?

Well, it turns out that using the WikiWord “NavData” has upset Jeppesen Sanderson because they’ve got a product with that name, and they’ve sent me an email telling me to stop using the word or they’ll start legal action. I’m still using the word “navdata” because lots of people use it as a generic word meaning “navigation data”. So my Wiki URL is still http://xcski.com/navdata/, but any deep links you might have are broken. To fix them, replace the word “NavData” with “DAFIFReplacement”.

Java Thread Locking

Ok, maybe I was a little succinct in my previous post.

You see, we’ve got an architecture where there are 3 or 4 layers of code, each one of which calls the one below it and then gets information back in the form of callbacks. Oh, and one of the very lowest layers is accessed through an RMI interface. Also, the very lowest layer deals with content, which can be created/modified/deleted through the program, or through other programs or just by doing file system stuff, which that lowest layer finds out about through dnotify.

The front end GUI has a dialog where you can delete content. The problem was that evidently one of our customers has the fastest fingers in the world: they complained that they delete the content and then go to ingest (slurp in) new content, but the content they just deleted is still there (the deletion process takes a good 10-15 seconds), so the ingest fails due to lack of disk space. So they wanted the deletion to actually wait until it was done. And the lower level library actually provided a method called “deleteContentWaitTilDone”. So I thought it would be a simple matter to call it – once the method returned, the content would be really gone.

That’s when my problems started. I spent a week on this damn thing. The sad thing is that if Martin was still around, I could have used his Eclipse debugging skills and got this done in half the time. But when I attempted to install Eclipse on my machine, every time I fired it up, the whole machine locked up.

The problem seemed to be that the deletion process called callbacks in the higher levels, and ultimately some of them would do GUI stuff, and they’d also call down to the library. I had a hell of a time working out what was the actual problem. I ended up putting System.out.println debugging statements all over the damn place.

What I found first was a bunch of extraneous “synchronized” methods – the problem with that was the methods were synchronized to prevent different things. The class had 6 synchronized methods: 2 of them were synchronized to prevent simultaneous accesses to a variable named “childThread”, 2 of them were synchronized to prevent simultaneous accesses to the library, and 2 of them were synchronized for some other reason. Because they were all synchronized methods, they all shared one lock (the object itself), so unrelated calls blocked each other. I removed the “synchronized” from the method declarations, and then protected the important parts with different synchronization Objects, one called “childThreadSyncObject” and one called “librarySyncObject”, and it turned out the other ones didn’t have to be synchronized at all. Further digging revealed that the code one level above this, which called into it, also had a synchronization object, which was redundant, so I removed it.
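To illustrate the change, here is a minimal sketch of a class that guards two unrelated things with two separate lock objects instead of synchronizing whole methods. Only the names childThread, childThreadSyncObject and librarySyncObject come from the real code; the class SplitLockSketch and its methods are invented for the example.

```java
// Hypothetical class: only the two lock names and "childThread" come from the post.
class SplitLockSketch {
    // One lock per piece of shared state, instead of one lock (this) for everything.
    private final Object childThreadSyncObject = new Object();
    private final Object librarySyncObject = new Object();

    private Thread childThread; // guarded by childThreadSyncObject

    /** Starts the worker; only the childThread lock is taken. */
    void startChild(Runnable work) {
        synchronized (childThreadSyncObject) {
            childThread = new Thread(work);
            childThread.start();
        }
    }

    /** Reads childThread under the same lock that protects writes to it. */
    boolean isChildRunning() {
        synchronized (childThreadSyncObject) {
            return childThread != null && childThread.isAlive();
        }
    }

    /** Serializes access to the library; does not block the childThread methods. */
    void callLibrary(Runnable libraryCall) {
        synchronized (librarySyncObject) {
            libraryCall.run();
        }
    }
}
```

With synchronized methods, a slow library call would have held the lock on the whole object and stalled isChildRunning() too. With separate locks, a thread stuck in the library only blocks other library callers.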

The next thing I found was that one of the GUI level callbacks called “fireIntervalChanged” and it never returned. Ever. That’s when I had another epiphany – the callbacks aren’t called on the GUI event thread, and the event thread is currently blocked because it’s waiting on that “deleteContentWaitTilDone”. So I went through all the GUI level code and made all the callbacks do the bulk of their processing in the event thread using SwingUtilities.invokeLater. The standard way to do that is

SwingUtilities.invokeLater(new Runnable()
{
    public void run()
    {
        //..do stuff..
    }
});

but unfortunately you can’t pass arguments to run() that way, so I ended up creating a metric buttload of tiny private classes that implement Runnable but take arguments in the constructor.
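One of those tiny classes might look something like the sketch below. The names InvokeLaterArgsSketch, IntervalChangedTask, onIntervalChanged and the first/last arguments are all hypothetical, and a List<String> stands in for whatever GUI state the real callbacks touch. The second method shows an alternative that may save some typing: an anonymous class can read the enclosing method's parameters as long as they are declared final.

```java
import java.util.List;
import javax.swing.SwingUtilities;

// Hypothetical names throughout; the real callbacks and their arguments differ.
class InvokeLaterArgsSketch {
    /** Tiny Runnable that receives its arguments through the constructor. */
    static class IntervalChangedTask implements Runnable {
        private final List<String> log;
        private final int first;
        private final int last;

        IntervalChangedTask(List<String> log, int first, int last) {
            this.log = log;
            this.first = first;
            this.last = last;
        }

        public void run() {
            // In a real GUI this is where the model or widgets would be updated.
            log.add("interval changed: " + first + "-" + last);
        }
    }

    /** Callback assumed to arrive on a non-GUI thread; hands the work to the event thread. */
    static void onIntervalChanged(List<String> log, int first, int last) {
        SwingUtilities.invokeLater(new IntervalChangedTask(log, first, last));
    }

    /** Alternative: final parameters are visible inside the anonymous class. */
    static void onIntervalChangedInline(final List<String> log, final int first, final int last) {
        SwingUtilities.invokeLater(new Runnable() {
            public void run() {
                log.add("interval changed: " + first + "-" + last);
            }
        });
    }
}
```

Either way, invokeLater only queues the Runnable, so nothing runs until the event thread is free again. That is what lets the callback return right away even while the event thread is stuck waiting on the delete.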

After all that work, I finally had stuff working. But unfortunately I neglected something that’s probably important – I didn’t give any sort of dialog or busy cursor or anything while that processing is going on. Oh well, maybe next time.