Monday 9 May 2016

ODP.NET Connection Pooling - The Black Box

I hate not being able to step into code and work out what's going on. Unfortunately, with ODP.NET connection pooling that's the exact situation that we're all in. To make it worse, documentation is rubbish. No surprise there.

When errors inevitably start occurring, finding the cause can be a tad difficult. I thought I'd share my latest debug battle in hope that it may help others.

As you probably know, connection pooling is a method of improving performance in regards to taking out connections to a database. If enabled, when you open a given connection, you're not opening a connection directly with the database, instead, you're grabbing an open connection from a pool of connections. If there isn't a spare connection then the pool will request a connection from the database on your behalf and then return that to you. When you're done with it and you close the connection, you don't actually close the connection to the database. Instead, the open connection is returned to the pool for use later.

This is a well known practice with some very misunderstood configuration options which leads to unexpected behaviour. To make matters worse, it's behaviour that's difficult to track as it's part of the "black box" which is the connection pool. All you'll know is your system will throw an exception randomly and you'll spend hours even days trying to find out what the cause is.

I want to cover the two configuration options that have caused me massive headaches. Connection Timeout and Connection Lifetime.

The options are set in the connection string, e.g.

Data Source=MyDatabase; User Id=username; Password=password; Connection Lifetime=15; Connection Timeout=30; Max Pool Size=100; Incr Pool Size=5; Decr Pool Size=3; Validate Connection=false;

Connection Timeout

This timeout value refers to the amount of time the system should wait for a connection to be grabbed from the connection pool in seconds. How long this actually takes depends on the state of the pool. 
There are three possible paths:
  1. The pool already has a spare connection so just returns it. This is by far the fastest option 
  2. The pool doesn't have a spare connection so has to request one from the database. For performance reasons, the pool won't just request one connection, it'll request an entire batch. The number of connections it requests as part of this batch is defined by the Incr Pool Size attribute of the connection string. This is obviously slower and if you've got network problems or the database server is busy doing something else then it may be that the timeout value gets hit. This will cause a "Connection Request Timed Out" exception to be thrown.
  3. It may be that the pool doesn't have a spare connection and that the pool already has the maximum number of connections that it's allowed (as defined by the Max Pool Size attribute). In this case, the pool will wait until a connection is freed up (more about that later) so it can either retirn that or make the request for a new connection without breaking the maximum pool size limit. If the time it waits exceeds the timeout value then it to will produce the "Connection Request Timed Out" exception.

Connection Lifetime

This value refers to how long the connection will live for in seconds. We're now talking about the actual connection to the database. Now think about this for a moment, the purpose of the pool is to reduce the amount of connections we open to the database. Opening a new connection is expensive and time consuming so the less time we spend doing that the better. But here we have an option that allows us to kill off connections after a given period of time. If you set this value to a low number, it severely impacts the effectiveness of the pool. You'd open the connection, the lifetime value kicks in, closes the connection to the database and nothing has actually been pooled. No matter what you set it to, you'll be limiting the effectiveness of the pool, the lower the value, the more you lose, the worse the performance.
So, why does it exist? It's all to do with clustering, there's a very good post regarding it found here. The summary of it is, do not use this unless you're connected to a cluster.

Which is where my problems began. It turns out an application I've been using had this value set to 15 seconds. The application doesn't use clustering and each user in this application has their own connection string and therefore their own connection pool. Running some rough figures then, for 5 users using the system continously for an hour, instead of opening 5 connections and using them (ignoring the incr pool size attribute here) it opens 1200 connections. Yep. That's right. 1200. Instead of 5. That's 1195 more times the application has to sit and wait for a connection to be established severely affecting performance. That's 1195 more requests creating unnecessary network traffic. And it's 1195 more times the database has to open a new connection when quite frankly, it's got better things to be doing (like returning data to your users!). That just for 5 users for 1 hour. Scale it up for 100 or 200 users over several hours and you can see how the situation gets out of hand pretty quickly. In this case we were seeing random connections fail with the "Connection request timed out" error for seemingly no reason at all... when you think about it, these are errors on an action that shouldn't even be performed. 

The obvious thing to do then was to remove the attribute, setting all connections to last forever. This works really well. The connections are taken out when the application starts up and then that's it, it just uses those connections over and over again. If it turns out that a connection isn't active, perhaps that user logs off for example, then the connection pool service which runs every 3 minutes, will close the connection to the database for you. Brilliant.
Until... a connection becomes invalid. At which point you have a connection in the pool that'll error as soon as you try to use it. This can happen in a variety of circumstances. Perhaps there's a network error or a firewall has killed off your connection at the TCP level. Perhaps the database crashed. Whatever the reason, your connection pool now has a load of connections that can't be used and unless the user logs off for 3 minutes, won't ever be cleared down. Your application won't recover.

What now? The validate connection property to the rescue! In the example connection string above, you'll see it's set to false and this is the recommended default value. If set to true, every time you grab a connection from the connection pool, the connection pool will ensure the connection is valid. To do that, it has to go back to the database server and run a query (it runs SELECT 'x' FROM DUAL) and if everything comes back okay then it assumes the connection is valid and returns it to you for use. If not, it ditches the connection and opens up a new, valid connection.
Great. We have a full solution then. Ensure the connection lifetime is set to a rather large number so that it never dies and set validate connection to true so if any connection does become invalid, it will fix it self by detecting the invalid connection, getting rid of it and opening a new valid connection. However, let's think about performance for a minute... every time we request a connection from the connection pool, we have to go off to the database to ensure the connection is still valid. If you're requesting a lot of connections then that's a large overhead. Going to the database server is expensive, running a query is even more expensive. Is there a better solution?

I've only thought of one and it's not suitable for everyone. If you use the connection as you normally would, without validating it, you can detect the failure when you run the next database query. After detection you can clear the connection pool, take out a new connection and try again. I made a central method to handle such a thing which went a bit like this.

private static readonly CONNECTION_INVALID_CODE = new int[]{ 28, 3113 };

public T ConnectionValidate(Func<T> databaseMethod, OracleConnection conn){
    int retryCount = 0;
    while(true){
        try {
            return databaseMethod();
        }
        catch(OracleException ex){
            if(CONNECTION_INVALID_CODE.Contains(ex.Number)){ 
                if(retryCount == 5)
                    throw;

                conn.Close();                 OracleConnection.ClearPool(conn);
                retryCount++;
            }
            else {
                throw;
            }

        }
    }
}

Unfortunately it's not quite as simple as that, you need to take into account if you were in the middle of a transaction or not (in which case you probably just want to re-throw the exception) but let's run with it for the time being...

The method does a few things. First, we define the oracle error codes that relate to the connection being invalid. ORA-00028 is thrown when your session has ended and ORA-03113 is for when the connection has ended prematurely. I'm sure there are others but that's a start.
We then define a method which as a parameter takes a function which returns a generic. Usually this is something like the number of rows affected by the statement, or the number of rows returned by the select. etc, using a generic just makes the method more... well, generic, we could for example, populate a DataTable and return that instead. We try and run that method which contains the oracle statement, if it succeeds then great. If it fails, we check to see why. If it's failed because of one of the errors mentioned before then we close the connection clear the connection pool which will remove all inactive connections. Then we try the whole thing again. The next time it runs, the connection grabbed will hopefully be a new connection from the database, rectifying the problem. If it fails another 4 times, we re-throw the exception and assume there's a bigger problem that we need to deal with.

You can call that method like so:

using (OracleConnection connection = new OracleConnection())
{
    ConnectionValidate(() => {
        connection.Open();
       
        OracleCommand command = connection.CreateCommand();
        string sql = "SELECT * FROM myTableName";
        command.CommandText = sql;

        OracleDataReader reader = command.ExecuteReader();
        while (reader.Read())
        {
            string myField = (string)reader["MYFIELD"];
            Console.WriteLine(myField);
        }
    }, connection);
}

Note that you have to open the connection as part of the inner function for this to work (you could move it around so that you open the connection as part of the ConnectionValidate method but I haven't).

Now we have a solution where we only validate the connection if it's absolutely necessary (i.e. it's failed when used). We only open connections to the database if we absolutely need it, reducing the network traffic, the load on the database and more importantly, we're now reducing the chance of failure (due to network connections or otherwise) when opening a new connection to the database.

As I said, my solution isn't very innovative but it seems to work. This is still very much code in testing so I could find issues with it.

How do other people handle this scenario? It seems like a fundamental issue with a lot of confusion around regarding what configuration options do what and how to handle various scenarios. Perhaps I'm missing something? I'd love to know your thoughts.