ASE 15 introduces completely new Optimizer and Query Processing engines, incorporating many fundamental changes. With these improvements comes the requirement that you give the optimizer as much accurate information about your data and databases as you possibly can. All that the optimizer needs to know is found in the ‘statistics’.
This paper will not go in depth into the technical side of the optimizer statistics describing each and every facet of the subject and muddying the waters for you. Instead, it will give you the information about the statistics that you need in order to insure that you can reach and maintain the highest level of performance for ASE 15.
Before we continue let’s quickly review exactly what statistics are available to the optimizer. There are two types of statistics – column level and object level. The column level statistics describe the distribution of values in the column; they consist of the column’s histogram and density values and are updated when an index is created or an ‘update statistics’ command is run. The object level statistics describe a table and its indexes and include values such as number of rows and pages in the table and/or index (es), the number of empty pages, and the cluster ratios among others. Some of the object level statistics are updated automatically by ASE, others when ‘update statistics’ is run.
Why Creating and Maintaining Accurate Statistics is Important to ASE 15
The optimizer has always been dependant on the statistics because it is ‘cost based’. That is, it makes its decisions about which query plan to use based on the estimated resource costs of executing it. Without them it’s flying blind and can only guess at which is the most efficient query plan.
ASE 15 is no exception to this. In fact, the statistics have become even more critical to good performance. Why is this? Because many of the optimizer’s new features and functionality can be very I/O intensive, especially if an inefficient query plan is used. Some of the new features and functionality include new methods for handling groupings, unions and all query level sorting operations.
New and improved join processing that is sensitive to the accuracy of the statistics has also been added. Hash joins are new to ASE and Sort-Merge joins have been improved and are turned on by default in ASE 15. If there are no useful indexes or if the statistics tell the optimizer that an index would not be efficient to use, then a worktable has to be created for the join. Joining values are moved into the worktable where they are sorted into the order required by the join and then merged with rows from the other joining table. All this, especially the sorting of the values in the worktable requires a great deal of I/O. Since both of these join methods can include sorting steps it is imperative that efficient query plans are chosen by the optimizer. The bottom-line is that large, unnecessary sorts are the bane of good query performance. It can’t be over emphasized how important it is to keep accurate statistics available for the optimizer to take advantage of when estimating the costs of joins. Even though ASE 15 is designed to avoid worktables that were often used in earlier versions, inaccurate statistics can lead to query plans that revert to using them.
One new piece of functionality added to ASE 15 in order to deal directly with a long-standing join performance issue is Join Histograms.
Join Histograms -
The Problem - Prior to ASE 15 the optimizer could only use a column’s density value, if it was available, to estimate the cost of joining one column to another. The density value is a ‘weighted average’ of the distribution of values in the column. If there was no density value, the optimizer used a preset selectivity value, a sort of ‘magic number’ that was based on the join operator. Since the most common join is an equi-join, the ‘magic number’ used was 0.10 (the optimizer believed that 10% of the rows in the column qualified for the join). As you might imagine most joins don’t qualify exactly 10% of either column.
When the column contained a fairly even distribution of values, the density value was accurate. However, when the column contained any degree of data skew (many values occupying a small number of rows each and a few values occupying many rows each) the density value was not accurate. When the optimizer used a density value of a skewed column it would lead the optimizer to believe that a join of the skewed table would be more efficient than it actually was. This in turn resulted in some very poorly performing joins.
The ASE 15 Solution - The join histograms of ASE 15 always give the optimizer an accurate view of the table. For example if the where clause of a join statement disqualifies a highly duplicated value it is not included in the join histogram; why estimate the cost of retrieving the skewed values if they aren’t needed for the join? How do join histograms work? Very simply – If there are statistics on a column that is being joined and there is a search argument (SARG) in the query then a histogram will be built on the fly containing only the values that the SARG has qualified.
A quick example: customerID table contains 100 distinct customer ids, orderID table contains all the orders placed by the 100 customers, 100K rows. Let’s say that of the 100 customers 3 have placed 50% of all the orders and of the 3 all have placed close to the same number of orders; the orderID table will contain data skew while the customerID table will be evenly distributed.
In pre-ASE 15, the density value would be larger due to the highly duplicated values for the three big customers. There is a good chance that the optimizer would estimate that an index on the custID column in the orders table would be expensive to use and call for a table scan. In fact, the index would be very selective for the query. In ASE 15, the join histogram would not include the highly duplicated values thus accurately making the index access look cheaper than a table scan.
Of course, for join histograms to be accurate and useful, accurate statistics need to be in place for the joining columns.
Here’s another situation that missing or inaccurate statistics can cause – The optimizer having to examine more query plans than is necessary.
The Problem – In ASE 15 the new functionality in the optimizer can lead to the optimizer having MANY more query plans to examine than in earlier versions. Add to this the large and complex queries common today and the optimizer is working much harder than ever before. The more query plans to estimate costs for, the more time it takes to find the best plan to execute. In fact, there is a possibility that the optimizer can take more time optimizing the query than it takes to actually execute it and return the results.
The ASE 15 Solution - It is for this reason that ‘timeout’ was included in the ASE 15 optimizer. Put simply, once the optimizer has reached the timeout limit it will choose the cheapest (most efficient) query plan it has examined up to that point. This plan will be used to execute the query whether it’s the best or not. Accurate statistics can result in the optimizer having to examine far fewer query plans and finding the most efficient plan before the timeout limit is reached. This in turn will result in less proc cache usage for the optimizer’s search. When there are no statistics or the statistics are inaccurate the optimizer can overestimate the cost of what is actually the best query plan and go on to examine many more.
How to Create and Maintain Accurate Statistics
Statistics are kept up to date by regularly running the update statistics command, or by dropping and recreating indexes, which can be very time and resource consuming. In the past, there were a number of ‘rules of thumb’ floating around among DBAs about when to update statistics. Some of the more popular were –
* “when 5-10% of the data has changed”
* “once a week”
* “every day”
* “only when queries start performing badly”
* “only when you have time to do it”
* “I never run update statistics”, etc.
The honest answer to the question was, is, and will always be - “It all depends on your data and your queries”.
With the release of ASE 11.9, many more options on how to update, read and even directly write the statistics are offered. You can update statistics as you always have, on the leading column of an index. Or, you can also put statistics on all the columns of an index or on columns that are not part of an index. You can even put statistics on all the columns of a table; all this by issuing a variety of ‘update statistics’ commands. You can specify how many ‘cells’ will be in each column’s histogram, thus making the statistics more granular….choices, choices everywhere--and, each choice has a potential effect on performance; either for good or bad.
As we’ve seen, the ‘why’ of updating statistics is obvious and clear-cut; to provide the optimizer with the most accurate picture of your data as possible. The ‘how’ of maintaining accurate statistics offers you the opportunity to apply a little creativity to the process. By this point in your work with ASE, you’ve likely developed your own statistics maintenance practices based on your system and experience. In most cases, there shouldn’t be a need to make many changes to this for ASE 15. One change that you should consider very seriously is putting and maintaining statistics on all columns of all your composite indexes, and on all columns that are referenced in queries (especially joins) that do not belong to any index.