Thursday, August 24, 2006

Back to the basics of isolation levels and locking

In one of my older projects, we used the WebSphere Scheduler to run tasks either at regular intervals or once at a specified time. The Scheduler exposes certain APIs and mandates that a task be implemented in a particular way. The business logic to run at the scheduled time (i.e. what you want the task to do) goes in the "process" method of a stateless session bean that implements the "TaskHandler" interface. The information about this bean is registered with WebSphere when the task is scheduled, and WebSphere in turn calls the process method at the scheduled time. WebSphere stores the information about tasks in its own tables, the main one being the TASK table.

For a very long time, we faced a peculiar but consistent problem when tasks were actually started at their scheduled time: the WebSphere server would hang whenever the Scheduler tried to start a task. The only way out was to restart the server and tell everyone around - hey, you can test the rest of the application all you want, but please do not schedule any tasks to run immediately!

We did figure out the problem later on, and it turned out to be a matter of computer science basics.

It had to do with locking of the TASK table. At the time of starting a scheduled task, WebSphere Scheduler does the following:

1. Gets a "ROW" level "write" lock on the row of the TASK table that contains information about the task to be executed, and updates that row with the new state and the next fire time of the task.
2. Invokes the task's process method in order to actually execute the business logic therein.
3. Once the task completes and the process method returns, the WebSphere Scheduler commits the transaction and then releases the lock.

Thus, all the calls to the Scheduler API are transactional.

Now, what was happening in our case was:

In our process() method implementation, we were trying to update our internal data structures with the next fire time of the task, and so we read the task information from the TASK table. That read requested a "ROW" level "read" lock on the very row that the WebSphere Scheduler had already write-locked. Since the Scheduler's transaction was not yet complete, our read was blocked. The Scheduler could not commit until process() returned, and process() could not return until the lock was released. It was a typical "deadlock" situation, ultimately resulting in the server threads getting hung.
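The sequence described above can be imitated in plain Java without a database. This is only a rough sketch under my own assumptions: RowLockDemo, taskRowLock, fireTask() and process() are hypothetical stand-ins, not WebSphere code, and a one-permit Semaphore plays the role of the row lock because, like a lock held by another transaction, it cannot be re-acquired by the caller. The read uses a timeout so that the demo reports the block instead of hanging the way our server did.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical simulation of the scenario; none of this is WebSphere API.
class RowLockDemo {
    // One permit models the exclusive lock on a single TASK row.
    static final Semaphore taskRowLock = new Semaphore(1);

    // Models our process() method trying to read the locked row.
    // Returns true if the read could obtain the lock within the timeout.
    static boolean process() {
        try {
            boolean gotLock = taskRowLock.tryAcquire(200, TimeUnit.MILLISECONDS);
            if (gotLock) {
                taskRowLock.release();
            }
            return gotLock;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Models the scheduler: lock the row, run the task, then "commit".
    static boolean fireTask() {
        taskRowLock.acquireUninterruptibly(); // step 1: write lock on the row
        try {
            return process();                 // step 2: invoke the task
        } finally {
            taskRowLock.release();            // step 3: commit and release
        }
    }

    public static void main(String[] args) {
        System.out.println("read inside the task succeeded: " + fireTask());
        System.out.println("read after the commit succeeded: " + process());
    }
}
```

Run as is, the first line prints false and the second prints true: the read only goes through once the "scheduler" has released the row. Without the timeout, the first call would simply never return.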

So the solution to our problem was as simple as not reading from the TASK table at that point and moving that code to a later point in time, once the Scheduler's transaction had completed. It really all boiled down to the basics of isolation levels, transactions, and lock modes and granularity.
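One way to picture "moving that code to a later point" is to have process() only note that a read is needed and to perform the read after the lock is gone. The sketch below is an assumption about the shape of such a fix, not our actual code: DeferredReadDemo, afterCommit, readTaskRow() and runDeferredReads() are made-up names, and a Semaphore again stands in for the row lock.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Semaphore;

// Hypothetical sketch of deferring the TASK read; not WebSphere API.
class DeferredReadDemo {
    static final Semaphore taskRowLock = new Semaphore(1); // models the row lock
    static final Queue<Runnable> afterCommit = new ArrayDeque<>();
    static int readsCompleted = 0;

    // Models reading the TASK row; fails fast if the row is still locked.
    static void readTaskRow() {
        if (!taskRowLock.tryAcquire()) {
            throw new IllegalStateException("TASK row is still locked");
        }
        try {
            readsCompleted++;
        } finally {
            taskRowLock.release();
        }
    }

    // The task does its business logic but only queues the read.
    static void process() {
        afterCommit.add(DeferredReadDemo::readTaskRow);
    }

    // Models the scheduler: lock the row, run the task, commit.
    static void fireTask() {
        taskRowLock.acquireUninterruptibly();
        try {
            process();
        } finally {
            taskRowLock.release();
        }
    }

    // Called by the application at a later point, outside the scheduler's transaction.
    static void runDeferredReads() {
        Runnable read;
        while ((read = afterCommit.poll()) != null) {
            read.run();
        }
    }

    public static void main(String[] args) {
        fireTask();
        runDeferredReads();
        System.out.println("reads completed: " + readsCompleted);
    }
}
```

The point of the design is only the ordering: nothing inside process() touches the locked row, so the scheduler's transaction can finish, and the read happens afterwards when no lock is in the way.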

Overall - time well spent that made me refresh my "database locking" concepts once again!
