Aug 22

Risk Analytics: Methods and Benefits

This spring I took a course in Risk Analytics for the Master of Science in Predictive Analytics (MSPA) degree that I’m working towards.  I wrote up an overview for my employer’s blog.

Jul 31

Showing Error Magnitude Using a Banded Column Chart in SSRS

Forecasting is never completely accurate, and there are always differences between the predicted value and the actual value.  Users often want to see the errors displayed so that they can judge for themselves how accurate the forecast process is.  This article will show how to create an error chart in SSRS with a visual indicator to judge the size of the errors.

The data set that we’ll be using for this example will contain twelve months of error values as well as the standard error for the data set.  The standard error would be provided by the statistical software that generates the forecasts.  Here is the sample data set:

Raw Data Set
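
If you want to follow along without the original file, a script along these lines can stand in for it.  The column names (Month, PredictionError, StandardError) match the example, while the table name and the sample values are only illustrative.

CREATE TABLE dbo.ErrorData
(
    [Month] DATE NOT NULL
    ,PredictionError DECIMAL(10,2) NOT NULL
    ,StandardError DECIMAL(10,2) NOT NULL
)
GO

INSERT INTO dbo.ErrorData ([Month], PredictionError, StandardError)
VALUES
    ('2014-01-01', 12.5, 25.0)
    ,('2014-02-01', -31.2, 25.0)
    ,('2014-03-01', 8.7, 25.0)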

Next we’ll create a Dataset in SSRS that selects this data from SQL Server.

Data Set
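
The dataset query itself can be a straight SELECT against that table, assuming the table name used above:

SELECT
    [Month]
    ,PredictionError
    ,StandardError
FROM
    dbo.ErrorData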

We’ll add a Chart to a blank SSRS report and then select a Column Chart for the type.

Select Chart

Now we’ll add the fields from the ErrorData dataset to the chart and see what we get with the default settings.  Month goes in the “Category Groups” area and PredictionError goes in the “Values” area.

Initial Chart

Initial Chart Preview

This column chart shows the months along the time axis and the error values along the y axis.  The chart gives the analyst some information, but it doesn’t give a sense of whether the errors for this time series are large or small.  We see the magnitude of the errors compared to each other, but not to the standard error.  In order to show that we’ll add colored bands to the column chart.

The first step to do this is to open the Vertical Axis Properties dialogue box.  We’ll change some of the default settings to match what we want to do.

Vertical Axis Properties

Notice that we’ve checked the “Use interlacing color” option as well as selected an appropriate color.  This is the color that will appear in horizontal bands across the column chart.  We’ve also changed the “Interval” to be equal to the StandardError column from the dataset.  This means that each colored band will be one standard error tall.  The “Interval type” has also been changed to Number.

The final change that we need to make is to the “Minimum” and “Maximum” fields.  If we leave these to adjust automatically, then the bands won’t be centered around zero.  This will make it hard to compare a column to the colored band.  So instead we’ll change these to be three standard errors above and below zero.

Minimum Expression

Maximum Expression
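
The exact expressions depend on how the chart is bound, but one form that works is to aggregate the StandardError field over the ErrorData dataset, for example:

Minimum:  =Max(Fields!StandardError.Value, "ErrorData") * -3
Maximum:  =Max(Fields!StandardError.Value, "ErrorData") * 3
Interval: =Max(Fields!StandardError.Value, "ErrorData")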

Using a range of three standard errors will cover roughly 99.7% of the error values, assuming the errors are approximately normally distributed.  If you want even greater certainty that all of the error values lie in this range, then you can increase the expressions to four standard errors.

Here is what the final chart looks like once we have made all of these changes:

Final Chart

Now it is much easier to estimate how serious the errors are.  We can see that there are a few errors that are greater than one standard error, but none that are greater than two standard errors.  We can feel confident that the forecast model is reasonably accurate since errors are within the expected range.

Jun 08

Displaying Tick Marks on SSRS Line Chart

I was recently working with a client to display some of their sales data over the last several fiscal years.  We were using a line graph in an SSRS report to display the data.  The odd thing was that the monthly tick marks were showing up at irregular intervals.  It seemed that there was space for more, but SSRS wasn’t displaying them.  I’ve created a similar data set to demonstrate the problem and the solution.

The data set is a simple time series with sales by month.

DataSet
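
The query behind the dataset is a simple select of the three columns used by the chart; the table name here is hypothetical:

SELECT
    FiscalYear
    ,FiscalMonth
    ,Sales
FROM
    dbo.MonthlySales
ORDER BY
    FiscalYear
    ,FiscalMonth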

The report contains a line graph with FiscalYear and FiscalMonth as the Category Groups and Sales as the Value.

ReportSetup

When the report displays two years of data with the default settings, however, it only shows a tick mark for every fifth month even though there is plenty of room for more.  This makes it confusing for users trying to find a specific month on the graph.

DefaultInterval

I wanted to display more tick marks – at least every other month.  After playing around with the settings a bit I didn’t see an easy way to do it.  Some web searches found a page on how to Specify an Axis Interval on MSDN.  According to it, all I had to do was alter the Axis Properties.  I right clicked on the Horizontal Axis on the Design tab and opened the Horizontal Axis Properties dialogue box.  I changed the Interval setting from Auto to 2.

HorizontalAxisProperties

Now the chart displayed the way I wanted.

UpdatedInterval

Apr 29

Forecasting Software Comparison

Introduction

At a client I was asked to help select software for a sales forecasting project. The most important factors were the accuracy of the forecasts, how well the software integrated into the technical environment and the level of vendor support.  We wanted to forecast sales throughout the sales hierarchy, so the final result would include hundreds of predictive models.  The project involved only a small amount of data, so it was unnecessary to worry about whether the software could scale. The main packages we looked at were the SAS programming language, SAS Forecast Studio for Desktop, R from Revolution Analytics, Autobox and SQL Server Data Mining.  Here’s how they stacked up.

SAS Programming Language

SAS is the gold standard of corporate statistical software.  It is developed by the SAS Institute, which holds the largest market share for analytical software.  The forecasting methods are powerful and have extensive documentation on their use.  The software is guaranteed by the vendor and has an excellent product support group.

The strength of SAS comes at a price, making it one of the highest-cost options among the candidates.

Advantages

  • Complete statistical computation platform
  • Full featured forecasting tools
  • Can completely automate forecasting process
  • Large community knowledge of forecasting techniques in SAS programming language
  • Long history of use
  • Technical support from large and well-established vendor

Disadvantages

  • High cost
  • More technical and statistical expertise required to implement
  • Requires custom programming to compare multiple forecasting algorithms

SAS Forecast Studio for Desktop

SAS Forecast Studio for Desktop is a GUI-driven forecasting tool.  It is built on top of the SAS programming language, so the statistical foundation is solid.  The GUI makes the tool accessible to business users who don’t have statistical programming experience.  One of its critical features is that it can fit a separate forecast model for each member of a hierarchy, such as a product sales hierarchy.

Advantages

  • Compares multiple forecasting algorithms including ARIMA, exponential smoothing and unobserved components to get best model
  • Can model each member of a hierarchy individually
  • Aggregates forecast models up the hierarchy
  • Configured using simple graphical user interface
  • Business events and holidays can be integrated into forecasts
  • Provided by large and well-established vendor

Disadvantages

  • Must be run on desktop workstation / virtual desktop
  • Needs manual user intervention to generate forecasts
  • System functionality limited to forecasting only

R from Revolution Analytics

R is an open source statistical system which is used heavily in the academic and research communities.  Many of the statistical features have been created by top academic researchers.  The forecasting methods in R are powerful and flexible.

Since R is open source, there is no support or warranty for the product by itself.  Revolution Analytics, however, provides fee-based support for R.

Advantages

  • Complete statistical computation platform
  • Full featured forecasting tools
  • Straight-forward integration and automation
  • Low cost option
  • R is among the most popular and fastest-growing statistical platforms
  • Large community of knowledgeable users on Internet

Disadvantages

  • Open source software
  • Will need to hire third party like Revolution Analytics to support at enterprise level
  • More technical and statistical expertise required to implement
  • Requires custom programming to compare multiple forecasting algorithms

Autobox

Autobox is published by a small vendor that has been in the forecasting business for decades.  They provide both desktop and server versions of their software.  The software includes many statistical options, but it can also create forecasts automatically without any configuration by the user.

Advantages

  • Unique forecasting algorithm takes many factors into consideration including outliers and level changes
  • Business events and holidays can be integrated into forecasts
  • Models generated using graphical user interface
  • Vendor has long history of forecasting systems and will spend considerable effort helping with set up

Disadvantages

  • Adding business events and holidays takes a larger time investment
  • Tiny community of support on Internet
  • Vendor is smallest in study

Microsoft SQL Server Data Mining

Microsoft SQL Server Data Mining is a feature of the SQL Server database product.  It is already integrated into the database engine and the tools to perform predictive analytics are included with the licensing fees.  However, the forecasting tools have limited functionality and produce models which are obviously incorrect.

Advantages

  • Automation is straight-forward

Disadvantages

  • Forecasting features are limited and flawed

Conclusion

We ended up selecting SAS Forecast Studio for Desktop because of its ability to automatically generate a large number of forecast models.  It also automates the selection of models from a variety of algorithms, which produces more accurate forecasts.  Finally, it was able to integrate into the client’s current Business Intelligence environment with a minimum amount of hassle.

Feb 05

A Simple Method to Compare Financial Results

The sales department at my current client was making claims about the sales cycle that were hard to believe.  In a post on my employer’s blog, I discuss an easy statistical test to determine whether their claims were consistent with reality.

Nov 29

Data Type Results from UNION Queries

Sometimes when we are exporting data from one database to another we need to combine data sets at the source. One example of this is moving data in a data warehouse from a landing area where all of the tables are replicas of the source tables to a staging area where the tables are combined into a single data model.

One way to extract the data is through a view or stored procedure which uses UNION operators. If the source tables have different data types, the question then becomes what data types come out of the UNION query. This information is necessary so that we can build our target table appropriately.

When there are different data types for the same column between two or more UNION’d queries, the assignment of data type follows SQL Server Data Type Precedence. Data types at the top of this list have the highest precedence and data types at the bottom have the lowest. So, for example, if one column of a query is a NVARCHAR and the same corresponding column of a UNION’d query is a VARCHAR, the column will become an NVARCHAR since that data type has higher precedence.

Let’s look at a few examples of this. We’re going to create a temp table from the UNION query so that we can see which of the data types are selected. First we’ll create a UNION query that combines two base queries and has two columns which are both different data types.

USE Sandbox
GO

-- Combine two queries whose columns have different data types and
-- capture the result in a new table so we can inspect the chosen types.
SELECT
    src.Col1
    ,src.Col2
INTO
    dbo.TestDataTypes
FROM
(
    SELECT
        CAST('1' AS NVARCHAR(10)) AS Col1
        ,CAST(2 AS INT) AS Col2
    UNION
    SELECT
        CAST(1 AS INT) AS Col1
        ,CAST(2 AS DATETIME) AS Col2
) AS src

Let’s look at the results of the script:

TestDataTypes Columns
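
If you prefer a query to the table designer, the catalog views report the data types that SQL Server chose:

SELECT
    c.name AS ColumnName
    ,t.name AS DataType
    ,c.max_length AS MaxLength
FROM
    sys.columns AS c
    INNER JOIN sys.types AS t ON t.user_type_id = c.user_type_id
WHERE
    c.object_id = OBJECT_ID('dbo.TestDataTypes')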

The data types in the table are as we expected.  INT has a higher precedence than NVARCHAR, so Col1 turns out to be type INT.  However, for Col2, DATETIME has a higher precedence than INT, so that’s the data type that prevails.

One final question is what will happen when data lengths conflict. For example, what if we have a high precedence NVARCHAR of length 10 matched up with a lower precedence VARCHAR of length 20? Will the VARCHAR be truncated or will the lower precedence type win because of data length?

USE Sandbox
GO

SELECT
	src.Col3
INTO
	dbo.TestStringLengths
FROM
(
	SELECT
		CAST('teststring' AS NVARCHAR(10)) AS Col3
	UNION
	SELECT
		CAST('longerteststring' AS VARCHAR(20)) AS Col3
) AS src

TestStringLengths Columns

It looks like we get the best of both worlds. The higher precedence data type is selected, but the column is lengthened to accommodate the lower precedence data type.

In conclusion, SQL Server follows the data type precedence list when determining the data type of a column in a UNION query, but makes smart decisions about the data size to accommodate larger data fields with lower precedence.

Oct 27

Correcting Excel Data Types in the Import Wizard

A common task that business intelligence professionals have to perform is importing data from Microsoft Excel into SQL Server.  This is usually an unnecessarily challenging project with many pitfalls.  One of the pitfalls is getting the SQL Import Export Wizard to correctly determine the data types.

One pattern that you occasionally see in data files is a column with a sequential list of whole numbers, followed at the bottom by some strings.  This pattern can form when the data file was created in sorted order, which puts all the alphanumeric strings at the bottom.

010 - Data List 1

020 - Data List 2

The way that the SQL Import Export Wizard determines data types of the Excel columns is to take a sample of the values.  Unfortunately, this sample is taken from a limited number of rows in the worksheet.  In this sample data file you can see that there are approximately 100 rows of numeric data followed by a string in the NaturalKey column.  The Import Wizard identifies both of the input columns as floats, which is obviously wrong.

030 - Incorrect Column Mapping 1

Now we have the clever idea of moving the record with the string up to the top of the file.  This is so that the Import Wizard will see the string in the sample of rows it takes.

040 - String at top

However, now the Import Wizard decides that there aren’t enough values of type Varchar to set the data type to Varchar.  Instead it only imports the data that matches the majority.  The string value ‘ABC123’ is omitted from the import.

030 - Incorrect Column Mapping 1

050 - Missing Data

This is one choice that SQL Server makes that is definitely wrong.  As a data professional, I would much rather have no data and an error than most of the data without getting any warning.  This is the kind of event that will cause much hand-wringing and forehead-scratching as you trace the missing data back from the final destination to the source.

One way to avoid this is to insert some dummy records at the top of the Excel file.  After some experimentation, it seems that the Import Wizard needs to see at least five rows with the correct data type before it will map the column to that type automatically.

060 - Dummy Variables Added

070 - Correct Column Mapping

After the import you’ll have to delete the extra rows from the destination table.

In conclusion, you sometimes have to use black magic to get the Import Wizard to correctly set data types on the import of an Excel file.

Oct 27

Experience at SQL Saturday Fargo

SQL Saturday is a free conference organized by PASS organizations around the country and the world.  Fargo, ND hosted their first SQL Saturday on April 27, 2013.  The day was full of learning, socializing and a surprise celebrity appearance.

SQL Saturday was originally organized by PASS to give beginning presenters a venue to speak at.  Some speakers at these conferences are presenting for the first time while others tour nationally.  The conferences are a great way to improve your technical knowledge while meeting other professionals in the same field.

A colleague and I made the road trip up to Fargo from Minneapolis, MN.  The conference started early on Saturday morning on the Microsoft campus and ran the entire day.  There were four tracks running simultaneously to provide variety for the 150 attendees.  Many of the other speakers were also from Minnesota and there were even a few who had flown in from other parts of the country.

My presentation was the first of the morning.  This was a great time slot because it meant I could enjoy the rest of the day learning from other presenters.  I could also follow an interesting subplot that developed: an attendee told me that Bill Gates was on campus.

My topic was “Data Mining Deep Dive: Clustering”.  SQL Server provides tools for predictive analysis including many popular data mining algorithms.  Few people use these features and even fewer are aware of how to optimize their results.  My talk was meant to give people more familiarity with these algorithms so they could use them effectively.  There were about ten people in the presentation, which I thought was a good turnout considering how specialized the topic was.

After my talk I enjoyed several other presentations during the day.  I learned about more exotic index types from Jes Borland and automating SQL Server administration using PowerShell from Jim Dorame.  As the sessions went by we noticed a podium and folding chairs being set up in an open area adjacent to the conference area.

While I was waiting for a presentation by Ross McNeely on data analysis in F#, I walked out into the hall and through a pair of double doors.  As I tried to get back to the talk, I saw through the double doors an entourage coming up the stairs.  Then suddenly Bill Gates was in the hall, chatting with people.  I was star-struck.  Security quickly ushered Bill into a meeting room to prepare for his presentation and I was directed to return to Ross’ talk.

Unfortunately, by the time we were out in the hall again Bill’s talk was almost finished.  He had presented on his work with the Bill & Melinda Gates Foundation and how they are striving to help reduce suffering in the developing world.  After finishing, he shook some hands in the crowd as he walked out and then disappeared into the twilight of a perfect spring evening.

Apr 26

Speaking at SQL Saturday #175 Fargo

I’ll be traveling with the rest of the Minnesota SQL Server gang up to Fargo, ND for their first SQL Saturday. My presentation is on the Microsoft Clustering data mining algorithm.

Microsoft SQL Server makes it simple to apply data mining algorithms to a wide variety of data. Applying the results to business decisions without a thorough understanding of how the algorithms work is dangerous to the bottom line of the business, though. This session will take one of the algorithms, the Microsoft Clustering Algorithm, and do a deep dive into the mechanics of how it works. The algorithm is valuable for analyzing data in the fields of marketing, social networks and many others. The session will also examine the types of data that are valid for clustering. A demonstration of building a clustering model using SQL Server Analysis Services and viewing the model using the Excel Data Mining Add-In will be given.

The slides are online too if you’d like to take a look.

Apr 01

Extracting Data from the Weather Underground API

Many companies are beginning to bring external data into their data infrastructure.  One of the most common and easily available types of external data is weather data.

There are several excellent sources of weather data.  If you need historical weather data then Fetch Climate is a great resource.  The data has been combined from many sources and patched together into a consistent whole.  The Microsoft Research project was conceived of and led by Drew Purves.

Fetch Climate doesn’t include current weather data, though. Recently I was working with a client that needed a daily average temperature inserted into a table in their SQL Server database.  Another resource on the web, Weather Underground, publishes current daily weather data. It also has an API that you can program against.  Since there isn’t much documentation out there about how to connect to the Weather Underground API using a .Net client, I decided to document how I accessed it and inserted the data into SQL Server using a CLR assembly.

The first step is to sign up for a developer account on Weather Underground.  This will give you a personal key that you can use to authenticate against the API.  The free Developer tier allows 500 API calls per day, which should be plenty to get started, and it’s easy to upgrade later if your business case justifies it.

You call the API with an HTTP request, and it returns either a JSON or an XML document with the requested weather data.  Since we’ll be accessing the API from .Net, XML will be the easiest to parse.  We’ll need to create a C# Class Library to hold the code for the SQL Server CLR assembly.  I used Visual Studio 2010 as the development environment.

Class Library Dialogue

The end result that we want to achieve is a SQL function that takes a date and a Weather Underground key and returns a decimal with the average temperature for that day.  We’ll use the .Net XML library as well as the SQL Server API to achieve this end.  The code that can fetch the data looks like:

using System;
using System.Collections.Generic;
using System.Data.SqlTypes;
using System.Linq;
using System.Text;
using System.Xml;
using Microsoft.SqlServer.Server;

public partial class WeatherUndergroundFunctions
{
    // URL template for the Weather Underground history API: {0} is the API key, {1} is the date in yyyyMMdd format.
    public const string weatherUndergroundAddress = "http://api.wunderground.com/api/{0}/history_{1}/q/MN/Minneapolis.xml";

    // XPath to each individual temperature observation in the XML response.
    public const string tempXPath = "/response/history/observations/observation/tempi";

    [SqlFunction()]
    public static SqlDecimal getDailyAverageTemperature(SqlDateTime tempDate, SqlString weatherUndergroundKey)
    {
        string formattedDate = tempDate.Value.ToString("yyyyMMdd");
        decimal totalTemp = 0;
        decimal observationCount = 0;

        // Retrieve the day's history as XML directly from the API URL.
        XmlDocument xmlTemp = new XmlDocument();
        xmlTemp.Load(string.Format(weatherUndergroundAddress, weatherUndergroundKey.ToString(), formattedDate));

        // Sum every temperature observation for the day.
        foreach (XmlNode obs in xmlTemp.SelectNodes(tempXPath))
        {
            totalTemp += Convert.ToDecimal(obs.InnerText);
            observationCount += 1;
        }

        // Return the average of the observations.
        return new SqlDecimal(totalTemp / observationCount);
    }
}

Let’s walk through a few parts of this code.  The weatherUndergroundAddress variable stores the HTTP address used to fetch the weather for Minneapolis.  It’s simple to get weather data for any major U.S. city by substituting its state code and name into the URL.  We’ll substitute the key and date into the string as parameters.  The tempXPath constant stores the XPath used to extract the temperature readings from the XML.  There are multiple temperature readings per day, which the code averages.
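
For reference, the portion of the response that this XPath walks looks roughly like the snippet below.  It is trimmed to just the elements the code reads, the temperature values are illustrative, and the real document contains many more fields.

<response>
  <history>
    <observations>
      <observation>
        <tempi>28.9</tempi>
      </observation>
      <observation>
        <tempi>30.1</tempi>
      </observation>
    </observations>
  </history>
</response>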

The function first creates a new XmlDocument to hold the retrieved XML.  The document is then populated by calling Load with the formatted HTTP address.  A foreach loop then totals the temperature observations for the day, and finally the average is returned to the caller.

At this point I created an application to wrap the library for testing.  It was simply a C# Console project with a reference to the class library.  I’ll provide the code, but won’t go into detail on it.  It just displays the average temperature in the console window.

using System;
using System.Collections.Generic;
using System.Data.SqlTypes;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Threading;
using System.Xml;
using Microsoft.SqlServer.Server;

namespace WeatherUndergroundTestApp
{
    class WeatherUndergroundTestApp
    {
        static void Main(string[] args)
        {
            // Fetch yesterday's average temperature and write it to the console.
            SqlDateTime yesterday = new SqlDateTime(DateTime.Today.AddDays(-1));
            SqlString key = new SqlString("<your_api_key_here>");
            SqlDecimal avgTemp = WeatherUndergroundFunctions.getDailyAverageTemperature(yesterday, key);
            Console.WriteLine("Average Temp: " + avgTemp.ToString());

            // Pause briefly so the output stays visible when launched from Visual Studio.
            Thread.Sleep(5000);
        }
    }
}

Once the class library has been tested successfully, all we need to do is install the assembly into SQL Server as a CLR assembly.  Be sure to enable CLR integration before you take the next steps.  We also need to make sure that the database will run the CLR code securely.  Two ways to do this are to cryptographically sign the assembly or to set the database’s TRUSTWORTHY property to ON.  Be sure to understand the implications of this choice in your own environment.  We’ll use the TRUSTWORTHY property to keep things simple.
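
If CLR integration isn’t already enabled on the instance, the standard sp_configure option turns it on:

EXEC sp_configure 'clr enabled', 1
RECONFIGURE
GO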

USE Sandbox
GO

ALTER DATABASE Sandbox SET TRUSTWORTHY ON
GO

-- EXTERNAL_ACCESS is required because the assembly reaches outside the database to make HTTP calls.
CREATE ASSEMBLY WeatherUnderground
FROM 'C:\CLR Code\WeatherUnderground\WeatherUndergroundFunctions.dll'
WITH PERMISSION_SET = EXTERNAL_ACCESS
GO

Next we’ll create a function that uses the assembly.  This is essentially a wrapper for the CLR function.

CREATE FUNCTION dbo.getDailyAverageTemperature(@tempDate DATETIME, @weatherUndergroundKey NVARCHAR(30))
RETURNS DECIMAL(10,2)
AS EXTERNAL NAME WeatherUnderground.WeatherUndergroundFunctions.getDailyAverageTemperature

Finally we’ll test the function.

SELECT dbo.getDailyAverageTemperature(DATEADD(DAY,-1,GETDATE()), N'<your_api_key_here>')

Run CLR Function

Once the function has been created you can use it like any other function in SQL Server.
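
For example, a nightly job could persist yesterday’s value into a logging table.  The dbo.DailyAverageTemperature table below is hypothetical; substitute whatever destination fits your model.

INSERT INTO dbo.DailyAverageTemperature (TempDate, AvgTemp)
SELECT
    CAST(DATEADD(DAY, -1, GETDATE()) AS DATE)
    ,dbo.getDailyAverageTemperature(DATEADD(DAY, -1, GETDATE()), N'<your_api_key_here>')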