
Monday, 9 May 2016

ODP.NET Connection Pooling - The Black Box

I hate not being able to step into code and work out what's going on. Unfortunately, with ODP.NET connection pooling that's the exact situation we're all in. To make it worse, the documentation is rubbish. No surprise there.

When errors inevitably start occurring, finding the cause can be a tad difficult. I thought I'd share my latest debugging battle in the hope that it may help others.

As you probably know, connection pooling is a technique for improving the performance of opening connections to a database. If enabled, when you open a given connection you're not opening a connection directly with the database; instead, you're grabbing an open connection from a pool of connections. If there isn't a spare connection then the pool will request a connection from the database on your behalf and then return that to you. When you're done with it and you close the connection, you don't actually close the connection to the database. Instead, the open connection is returned to the pool for use later.

This is a well known practice with some very misunderstood configuration options which lead to unexpected behaviour. To make matters worse, it's behaviour that's difficult to track as it's part of the "black box" that is the connection pool. All you'll know is that your system throws an exception randomly, and you'll spend hours, even days, trying to find the cause.

I want to cover the two configuration options that have caused me massive headaches: Connection Timeout and Connection Lifetime.

The options are set in the connection string, e.g.

Data Source=MyDatabase; User Id=username; Password=password; Connection Lifetime=15; Connection Timeout=30; Max Pool Size=100; Incr Pool Size=5; Decr Pool Size=3; Validate Connection=false;

Connection Timeout

This timeout value is the amount of time, in seconds, the system should wait for a connection to be grabbed from the connection pool. How long this actually takes depends on the state of the pool.
There are three possible paths:
  1. The pool already has a spare connection so just returns it. This is by far the fastest option.
  2. The pool doesn't have a spare connection so has to request one from the database. For performance reasons, the pool won't just request one connection, it'll request an entire batch. The number of connections it requests as part of this batch is defined by the Incr Pool Size attribute of the connection string. This is obviously slower, and if you've got network problems or the database server is busy doing something else then the timeout value may get hit. This will cause a "Connection Request Timed Out" exception to be thrown.
  3. It may be that the pool doesn't have a spare connection and already has the maximum number of connections it's allowed (as defined by the Max Pool Size attribute). In this case, the pool will wait until a connection is freed up (more about that later) so it can either return that or request a new connection without breaking the maximum pool size limit. If the time it waits exceeds the timeout value then it too will produce the "Connection Request Timed Out" exception.
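To make the three paths concrete, here's a toy Python sketch of the acquisition logic. This is purely illustrative: the class, its attribute names and its timings are my assumptions about the behaviour described above, not real ODP.NET internals.

```python
import collections
import time


class PoolTimeoutError(Exception):
    """Stands in for the "Connection Request Timed Out" exception."""


class ToyPool:
    """A toy model of the three acquisition paths. Not ODP.NET code."""

    def __init__(self, max_size=100, incr_size=5, timeout=30):
        self.max_size = max_size    # Max Pool Size
        self.incr_size = incr_size  # Incr Pool Size
        self.timeout = timeout      # Connection Timeout (seconds)
        self.idle = collections.deque()
        self.total = 0              # connections currently owned by the pool

    def _open_db_connection(self):
        # Pretend this is the expensive round trip to the database.
        self.total += 1
        return object()

    def acquire(self):
        # Path 1: a spare connection is sitting in the pool. Fastest.
        if self.idle:
            return self.idle.popleft()
        # Path 2: room to grow, so open a batch of Incr Pool Size connections.
        if self.total < self.max_size:
            batch = min(self.incr_size, self.max_size - self.total)
            for _ in range(batch):
                self.idle.append(self._open_db_connection())
            return self.idle.popleft()
        # Path 3: at Max Pool Size, so wait for a release until the timeout.
        deadline = time.monotonic() + self.timeout
        while time.monotonic() < deadline:
            if self.idle:
                return self.idle.popleft()
            time.sleep(0.01)
        raise PoolTimeoutError("Connection Request Timed Out")

    def release(self, conn):
        # "Closing" a pooled connection just hands it back to the pool.
        self.idle.append(conn)
```

The third path is the one that surfaces as the "Connection Request Timed Out" exception when the pool is exhausted.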

Connection Lifetime

This value refers to how long the connection will live for, in seconds, and here we're talking about the actual connection to the database. Now think about this for a moment: the purpose of the pool is to reduce the number of connections we open to the database. Opening a new connection is expensive and time consuming, so the less time we spend doing that the better. But here we have an option that allows us to kill off connections after a given period of time. If you set this value to a low number, it severely impacts the effectiveness of the pool. You'd open the connection, the lifetime value kicks in, the connection to the database is closed and nothing has actually been pooled. No matter what you set it to, you'll be limiting the effectiveness of the pool: the lower the value, the more you lose and the worse the performance.
So, why does it exist? It's all to do with clustering; there's a very good post regarding it found here. The summary of it is: do not use this unless you're connected to a cluster.

Which is where my problems began. It turns out an application I've been using had this value set to 15 seconds. The application doesn't use clustering, and each user in this application has their own connection string and therefore their own connection pool. Running some rough figures then, for 5 users using the system continuously for an hour, instead of opening 5 connections and reusing them (ignoring the Incr Pool Size attribute here) it opens 1200 connections. Yep. That's right. 1200. Instead of 5. That's 1195 more times the application has to sit and wait for a connection to be established, severely affecting performance. That's 1195 more requests creating unnecessary network traffic. And it's 1195 more times the database has to open a new connection when, quite frankly, it's got better things to be doing (like returning data to your users!). That's just for 5 users for 1 hour. Scale it up to 100 or 200 users over several hours and you can see how the situation gets out of hand pretty quickly. In this case we were seeing random connections fail with the "Connection Request Timed Out" error for seemingly no reason at all... and when you think about it, these are errors on an action that shouldn't even be performed.
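As a sanity check, those rough figures can be reproduced with some quick arithmetic (a sketch using the example's own numbers):

```python
# Back-of-the-envelope check of the figures above.
lifetime_seconds = 15        # Connection Lifetime from the connection string
session_seconds = 60 * 60    # one hour of continuous use
users = 5                    # one connection pool per user

# With a 15-second lifetime, each user's pool churns through a fresh
# connection roughly every 15 seconds instead of reusing one connection.
connections_per_user = session_seconds // lifetime_seconds   # 240
total_connections = connections_per_user * users             # 1200
extra_connections = total_connections - users                # 1195

print(total_connections, extra_connections)  # 1200 1195
```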

The obvious thing to do then was to remove the attribute, setting all connections to last forever. This works really well. The connections are taken out when the application starts up and then that's it, it just uses those connections over and over again. If it turns out that a connection isn't active, perhaps that user logs off for example, then the connection pool service which runs every 3 minutes, will close the connection to the database for you. Brilliant.
Until... a connection becomes invalid. At which point you have a connection in the pool that'll error as soon as you try to use it. This can happen in a variety of circumstances. Perhaps there's a network error or a firewall has killed off your connection at the TCP level. Perhaps the database crashed. Whatever the reason, your connection pool now has a load of connections that can't be used and unless the user logs off for 3 minutes, won't ever be cleared down. Your application won't recover.

What now? The validate connection property to the rescue! In the example connection string above, you'll see it's set to false and this is the recommended default value. If set to true, every time you grab a connection from the connection pool, the connection pool will ensure the connection is valid. To do that, it has to go back to the database server and run a query (it runs SELECT 'x' FROM DUAL) and if everything comes back okay then it assumes the connection is valid and returns it to you for use. If not, it ditches the connection and opens up a new, valid connection.
Great. We have a full solution then. Ensure the connection lifetime is set to a rather large number so that connections never die, and set validate connection to true so that if any connection does become invalid, the pool will fix itself by detecting the invalid connection, getting rid of it and opening a new, valid one. However, let's think about performance for a minute... every time we request a connection from the connection pool, we have to go off to the database to ensure the connection is still valid. If you're requesting a lot of connections then that's a large overhead. Going to the database server is expensive; running a query is even more expensive. Is there a better solution?
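As a rough illustration of that validate-on-borrow behaviour, here's a small Python sketch with stand-in connections. The function and its parameters are hypothetical, not ODP.NET code; the point is simply that every checkout pays a validation round trip.

```python
def checkout(idle_connections, open_new_connection, is_valid):
    """Pop connections from the pool until one passes validation; if none
    do, open a fresh one. Every checkout pays at least one validation
    round trip (ODP.NET's check runs SELECT 'x' FROM DUAL)."""
    while idle_connections:
        conn = idle_connections.pop(0)
        if is_valid(conn):  # the per-checkout cost discussed above
            return conn
        # Invalid: ditch it and try the next pooled connection.
    return open_new_connection()
```

In the real pool the validation query runs against the database server, which is exactly the overhead the rest of this post tries to avoid.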

I've only thought of one and it's not suitable for everyone. If you use the connection as you normally would, without validating it, you can detect the failure when you run the next database query. After detection you can clear the connection pool, take out a new connection and try again. I made a central method to handle such a thing which went a bit like this.

private static readonly int[] CONNECTION_INVALID_CODES = new int[]{ 28, 3113 };

public T ConnectionValidate<T>(Func<T> databaseMethod, OracleConnection conn){
    // Needs using System.Linq for Contains().
    int retryCount = 0;
    while(true){
        try {
            return databaseMethod();
        }
        catch(OracleException ex){
            if(CONNECTION_INVALID_CODES.Contains(ex.Number)){
                // After five failed attempts, assume a bigger problem and give up.
                if(retryCount == 5)
                    throw;

                // Ditch this connection and clear the pool of dead connections
                // so the next attempt grabs a fresh one.
                conn.Close();
                OracleConnection.ClearPool(conn);
                retryCount++;
            }
            else {
                throw;
            }
        }
    }
}

Unfortunately it's not quite as simple as that; you need to take into account whether you were in the middle of a transaction (in which case you probably just want to re-throw the exception), but let's run with it for the time being...

The method does a few things. First, we define the Oracle error codes that relate to the connection being invalid. ORA-00028 is thrown when your session has been killed and ORA-03113 when the connection has ended prematurely. I'm sure there are others, but that's a start.
We then define a method which takes, as a parameter, a function returning a generic type. Usually this is something like the number of rows affected by the statement or the number of rows returned by the select; using a generic just makes the method more... well, generic. We could, for example, populate a DataTable and return that instead. We try to run that function, which contains the Oracle statement; if it succeeds then great. If it fails, we check to see why. If it failed because of one of the errors mentioned before then we close the connection and clear the connection pool, which will remove all inactive connections. Then we try the whole thing again. The next time it runs, the connection grabbed will hopefully be a new connection from the database, rectifying the problem. If it fails another four times, we re-throw the exception and assume there's a bigger problem that we need to deal with.

You can call that method like so:

using (OracleConnection connection = new OracleConnection())
{
    int rowsRead = ConnectionValidate(() => {
        connection.Open();

        OracleCommand command = connection.CreateCommand();
        command.CommandText = "SELECT * FROM myTableName";

        // Return the number of rows read so the lambda satisfies Func<T>.
        int count = 0;
        OracleDataReader reader = command.ExecuteReader();
        while (reader.Read())
        {
            string myField = (string)reader["MYFIELD"];
            Console.WriteLine(myField);
            count++;
        }
        return count;
    }, connection);
}

Note that you have to open the connection as part of the inner function for this to work (you could move it around so that you open the connection as part of the ConnectionValidate method but I haven't).

Now we have a solution where we only validate the connection when it's absolutely necessary (i.e. it has failed in use). We only open connections to the database if we absolutely need to, reducing network traffic and the load on the database and, more importantly, reducing the chance of failure (due to network problems or otherwise) when opening a new connection to the database.

As I said, my solution isn't very innovative but it seems to work. This is still very much code in testing so I could find issues with it.

How do other people handle this scenario? It seems like a fundamental issue, with a lot of confusion around what the configuration options do and how to handle various scenarios. Perhaps I'm missing something? I'd love to know your thoughts.

Sunday, 9 December 2012

Web Apps and Pagination Queries

Paging controls are found on many popular web apps.
Amazon is a great example.
In a fair few web applications out there, you'll see the concept of pagination. That is, you have a list of results and they're separated into pages which show a limited number of results (say 25 records per page). You can then usually go forward a page, go back a page, or jump to the first or last page. This is a common UI practice and I'm sure you've seen it before.

But, how is this implemented and what is the best way of doing so?

Well, the results shown on the page are usually held against a database. So the question quickly becomes:

"How can I query the database to bring back a given page of results?"

Now, before I carry on, let's first state that this blog post consists of database queries that have been run against an Oracle database (both the 10g and 11g versions). There's a strong possibility that they will not run against SQL Server, or any other database for that matter, but I suspect the general concepts remain the same. I should also state that I am by no means an Oracle expert. Yes, I know SQL reasonably well, but an expert? Far from it. So, if you find a better way of implementing the following features then it'd be great to hear from you.

Now, back to the task in hand. We need to write a SQL command to run on an Oracle database that will bring back a "page" of data of a particular table and as an extra condition, the data returned must be in a given order. Let's also say that a page consists of 25 records.

You'll be surprised at how many ways there are to achieve this. In the 4.5 years that Qube have been using this functionality, our pagination queries have changed no less than four times. Sometimes we've made the odd small change to optimize the query; other times we've re-written the whole thing and used a different technique. In each case, we've improved performance over the previous iteration, with the goal of making our web application as fast as possible. Finally, we've now got to the stage where one technique is fastest on one table while a different technique is fastest on another. This means there is no technique that's best for all tables. It depends entirely on the amount of data you have in each table, the indexes you've set up and the way in which your database is optimized.

I'm now going to take you through our little journey of finding the best possible pagination query. In all of the example queries, the innermost SELECT is the base query (the query which we want to paginate). Generally, the following methods wrap that base query up in order to implement paging. You'll need to change the base query to be the actual query you want to use pagination on.

Revision One - Retrieve All

So, our first option was to retrieve all records and let the application handle the pagination. If we have a table that consists of 100,000 records, we retrieve all of them and the application displays 25 at a time. The bonus of this method is that it's extremely easy to implement. Also, moving from one page to the next is extremely quick: all the records are held in memory, so navigating between them is simple and fast. There are, however, some major downsides. For starters, retrieving all records is slow, so the initial load time for the user is severely affected. If that's not bad enough (and it should be), the memory usage of such a technique is outrageously large, especially when you consider that it's highly unlikely a user will want to navigate through the entire 100,000 records anyway, so you're using valuable memory on records that are never going to be shown. Not cool! With just these two things in mind, this technique is going to scale very badly.

Example Query
SELECT COLUMN_ONE, COLUMN_TWO, COLUMN_THREE 
FROM TABLE_NAME 
ORDER BY PRIMARY_KEY

Revision Two - Nested Queries

Bringing back everything clearly isn't the way forward. So, with a little bit of use of ROWNUM (an Oracle feature which brings back the number of the row in a dataset, e.g. the first row has a number 1, the second has a number 2, etc.) we can bring back 25 rows at a time which will give us the paging functionality that we need.

Using this method the application doesn't need to store all the possible rows, which lowers the memory consumption of the application significantly. It also improves the performance of the first page load. Instead of bringing back thousands of rows, the database only returns 25, and there are obvious performance benefits to that. The downside, however, is that the performance of the next/previous/first/last page buttons will be affected, as the query will need to be re-run for the next page of data. If all the rows were stored on the application from the original query then this additional database query wouldn't need to be run.

Example Query
SELECT * FROM (SELECT rownum as f2n_rownum, f2n_table.* 
               FROM (SELECT COLUMN_ONE,
                            COLUMN_TWO,
                            COLUMN_THREE 
                     FROM TABLE_NAME 
                     ORDER BY PRIMARY_KEY) f2n_table 
               WHERE rownum <= 25) 
WHERE f2n_rownum >= 1

Revision Three - Nested Queries Using WITH

As always, speed is key!
We've now got an implementation of our pagination query, but is it the best implementation? Did you know that your average user expects your web application to load in two seconds or less, and that up to 40% of your users will leave your site if it hasn't responded after three seconds?1 That means these queries need to be as quick as possible; every second counts. When you're dealing with possibly thousands of records, it can be difficult to bring back results in that time frame.

So, we looked to see if we could improve the performance of our pagination query. If we can, that's a performance improvement across nearly all of our pages. As it turns out... there is a better way! Kind of.

We could improve the performance in two different ways. Firstly, we can use the WITH clause. This is known as subquery factoring.2 The second improvement is that we can tell the Oracle optimizer how many rows we intend to use, which allows it to choose a faster execution plan for what we want. We do this with an optimizer hint in the query, telling the optimizer that we want the first 25 rows brought back first (or however many rows are contained in your "page" of data). For more information on this try here: www.orafaq.com.

Example Query
SELECT /*+ FIRST_ROWS(25) */
       pageouter.* 
FROM (WITH page_query AS (SELECT COLUMN_ONE,
                                 COLUMN_TWO,
                                 COLUMN_THREE
                          FROM TABLE_NAME
                          ORDER BY PRIMARY_KEY)
      SELECT page_query.*, 
             ROWNUM AS innerrownum 
      FROM page_query 
      WHERE rownum <= 25) pageouter 
WHERE pageouter.innerrownum >= 1

Revision Four - Nested Queries With ROW_NUMBER

Ok, so we're now using optimizer hints and we're using subquery factoring. All good stuff. But can we do more?

Well, we can. Kind of.

There is a SQL function called ROW_NUMBER(). It serves much the same purpose as ROWNUM in Oracle, but it works in both Oracle and SQL Server and, oddly, can perform better than our previous queries, though only in certain scenarios.

The problem here is that I can't tell you why it performs better in certain scenarios; I can't even tell you in which scenarios it performs better. But here is what I have found:

  • It seems to perform better than our previous methods if the query is modified to have a complex 'where' clause.
  • It seems to perform better than our previous methods if the data is ordered by a column that is not uniquely indexed.
  • The performance gains can be dramatic. In the previous examples, changing from one method to another may have seen an improvement ranging from nothing to a second or two. I've seen this method improve some queries by up to 5-8 seconds, especially on queries that order data by columns that aren't indexed.
Now, I suspect all of this is very much dependent on the indexes you have set up on your tables, the amount of data in your tables, how you've got your database optimized and probably a fair few other factors that I have no idea about. So, the best way to know how this will perform for your queries is to test it.

Example Query
SELECT /*+ FIRST_ROWS(25) */
       *
FROM ( SELECT COLUMN_ONE,
              COLUMN_TWO,
              COLUMN_THREE,
              row_number() OVER(ORDER BY PRIMARY_KEY) innerrownum
       FROM   TABLE_NAME
     )
WHERE innerrownum BETWEEN 1 AND 25

Conclusion

I've shown you three different ways of implementing pagination within the database query. There are other ways which I haven't discussed. For example, you could follow this process:
  1. Run the query for ALL records (no paging) but insert the results of that query into a temporary table.
  2. Query that temporary table for the "page" of data that you want, using one of the methods above.
  3. When implementing the next/previous page function, you can then query the temporary table directly. 
Assuming that your original query isn't bringing back the entire table, you'll be selecting from a subset of the original data which should make the next/previous functionality faster. However, your original page load time will be slower as you'll need to insert the records into the temp table so, there's a trade off. 

I would imagine there's loads of other ways of doing this, if you find any that perform better than the above then let me know, it'd be great to hear from you!

And finally.... SQL Server

I couldn't end without mentioning the latest version of SQL Server and the good work Microsoft have been doing in this area. Microsoft have cottoned on to the fact that this paging functionality is now widely used and, as you can tell from this article, it isn't straightforward. They've gone out of their way to simplify it, building the functionality straight into the language, making it very simple and, I would hope, a whole lot quicker than anything we can write in standard SQL.

I can't say I've had the pleasure of testing this but, according to the documentation, the feature is implemented by the introduction of two new keywords, OFFSET and FETCH NEXT, and they're used in the following way:

SELECT COLUMN_ONE,
       COLUMN_TWO,
       COLUMN_THREE
FROM   TABLE_NAME
ORDER BY PRIMARY_KEY
OFFSET 0 ROWS
FETCH NEXT 25 ROWS ONLY

This tells the database to bring back the first 25 rows. To bring back the next page, you'd increase the offset by your page size (in our example, 25). For more info, check out raresql.com.
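The offset arithmetic follows directly from the page size; here's a quick sketch (a hypothetical helper of my own, not part of SQL Server):

```python
def page_window(page_number, page_size=25):
    """Return the (offset, fetch) pair for OFFSET ... ROWS
    FETCH NEXT ... ROWS ONLY, with pages numbered from 1."""
    offset = (page_number - 1) * page_size
    return offset, page_size

# Page 1 -> OFFSET 0 ROWS, page 2 -> OFFSET 25 ROWS, and so on.
```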

And it's that simple.

The sooner Oracle implement this functionality the better!


1 - Forrester Consulting, “eCommerce Web Site Performance Today: An Updated Look At Consumer Reaction To A Poor Online Shopping Experience” A commissioned study conducted on behalf of Akamai Technologies, Inc., August 17, 2009
2 - For more information on subquery factoring, see www.dba-oracle.com