
How to analyze a data set

Updated: Feb 15, 2019


I have always liked numbers. In my first business ten years ago, a software company named Submittal Exchange, I began spending time analyzing data from our sales team’s activities to understand where our revenue growth was headed. I found it intriguing enough that I started a second company, FunnelWise, which provided sophisticated sales and marketing analytics software. After FunnelWise ended, I began spending time studying how analytics can be applied to financial transactions and stock trades.


In my experience, there is a common set of principles that can be applied to almost any data analysis project. I am sharing the high-level list here to help others and to illustrate how I personally approach analysis projects.


1. Get the right sample size and quality


First make sure that you have an appropriate quantity of data in the set that you are going to analyze. It’s great if you have the entire set in one file, for example all transactions over all time. But sometimes that is not possible if there are too many data points to include in one easily manageable file or database, or if there is a cost or time delay to gathering or extracting all of the data from its source.


If the data set is too large, you can select a smaller sample from it. How large a sample do you need? That is a key question. If you search for “sample size calculator,” you will find a number of good options to help with this. Here is one sample size calculator that I have used frequently. The language on the calculator is written for surveys, but really it can be applied to almost any data analysis exercise.
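For readers who like to see the math behind those calculators, here is a minimal sketch of the standard calculation (Cochran’s formula with a finite population correction). The 95% confidence level (z = 1.96), 5% margin of error and p = 0.5 are illustrative defaults, not values tied to any particular project:

```python
import math

def sample_size(population, z=1.96, margin_of_error=0.05, p=0.5):
    """Approximate sample size for a given confidence level and margin of error.

    Cochran's formula with a finite population correction. z=1.96 corresponds
    to 95% confidence, and p=0.5 is the most conservative assumption about
    the underlying proportion.
    """
    n0 = (z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    return math.ceil(n0 / (1 + (n0 - 1) / population))

print(sample_size(100_000))  # roughly 383 records for +/-5% at 95% confidence
```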


I will often start with a small sample to test ideas or answer basic questions. This is particularly helpful for projects where there is a cost (in dollars or time) to obtain a larger data set. If I run an analysis on a relatively small sample and it suggests an outcome I am interested in, then I can re-run the analysis on a larger set. But using an initial smaller sample can make it easier to test multiple theories in less time or at a lower cost. Sometimes it may be beneficial to use a multi-step progression with the sample sizes. Start very small and run a quick test. If the result points in the right direction, then re-run it on a larger sample, say anywhere from 2x to 10x larger than the first one. If it passes the test again, then continue to expand to even larger samples or to the entire data set.
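Here is a rough sketch of what that progression can look like in code. The starting size, the growth factor and the evaluate function (a stand-in for whatever quick test you are running) are hypothetical placeholders, not part of any real library:

```python
import pandas as pd

def progressive_test(df, evaluate, start_n=500, growth=4, max_rounds=4, seed=42):
    """Run a hypothetical `evaluate` check on progressively larger random samples."""
    n = start_n
    for _ in range(max_rounds):
        sample = df.sample(n=min(n, len(df)), random_state=seed)
        if not evaluate(sample):
            return False   # the idea failed on a cheap sample; stop early
        if n >= len(df):
            return True    # already tested against the entire data set
        n *= growth        # expand the sample and test again
    return True
```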


Quality of the sample is also very important. The sample needs to be representative of the entire data set, to the extent possible. Using random selection to pull sample data from the larger set is generally preferable, but double-check the sample’s distribution to make sure it still roughly aligns with the larger set.
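As a sketch of that quality check, here is one way to pull a random sample with pandas and compare its distribution to the full set. The file and column names (transactions.csv, amount, region) are just examples:

```python
import pandas as pd

full = pd.read_csv("transactions.csv")

# Pull a random sample, then spot-check that it looks like the full set.
sample = full.sample(n=5_000, random_state=42)

# Numeric attribute: compare summary statistics.
print(full["amount"].describe())
print(sample["amount"].describe())

# Categorical attribute: compare the share of each category.
print(full["region"].value_counts(normalize=True))
print(sample["region"].value_counts(normalize=True))
```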


2. Probabilities are more reliable with larger sets

Another key principle is that statistical probabilities will be more accurate with larger data sets. There is a lot of power in working with aggregate numbers. For example, if I look at one individual event and predict its probability at 60%, the probability can be correct, but ultimately the event either happens or it does not. It can be challenging to predict the occurrence of any one event with a high degree of certainty. But if you are predicting 100 events or 1,000 events, then the probabilities are much more likely to be accurate at that aggregate level.
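A quick simulation illustrates the point. With a single event predicted at 60%, any one outcome is hard to call, but across 1,000 events the observed rate lands close to the prediction:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.60  # predicted probability for each event

# One event either happens or it does not, so a single prediction is
# frequently "wrong" even when the probability itself is well calibrated.
single_event = rng.random() < p
print(single_event)   # True or False, hard to call in advance

# Across 1,000 independent events, the observed rate converges toward
# the predicted probability, so the aggregate forecast is reliable.
events = rng.random(1_000) < p
print(events.mean())  # typically close to 0.60
```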


3. Slice, and slice again

Slicing or segmenting is the next step in my approach to data analysis. I prefer to break the data into groups based on different attributes, then analyze each segment and compare them. The comparisons often lead to very important conclusions that drive the overall results and assist greatly with developing forecasting algorithms.


When I talk about slicing or segmenting, I always think of a simple example from sales forecasting in my first business. In that business we had two types of customers: brand-new customers (they had never purchased our software before) and repeat customers. The close/win rates and average sales cycles on opportunities from new customers behaved completely differently from those of repeat customers. It was a critical dimension to understand in order to build accurate projections for that business.
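As an illustrative sketch of that kind of comparison, here is how a new-versus-repeat split might be examined with pandas. The file and column names (opportunities.csv, customer_type, won, sales_cycle_days) are assumptions for the example, not the actual fields from that business:

```python
import pandas as pd

opps = pd.read_csv("opportunities.csv")

# Compare win rate and average sales cycle for each customer segment.
segments = opps.groupby("customer_type").agg(
    win_rate=("won", "mean"),
    avg_cycle_days=("sales_cycle_days", "mean"),
    opportunities=("won", "size"),
)
print(segments)
# If the "new" and "repeat" rows differ meaningfully, the dimension is
# worth modeling separately instead of using one blended rate.
```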


Which and how many segments to test depends on the data set, what you know about it, and what the goals of the project are. I personally prefer to test anywhere from 10 to 50 different slices or segmentation approaches on most projects, if the data allows for it. I view it as an investigative journey. Oftentimes most of the segments will not show outcomes any different from the larger set. But if I can uncover one or several dimensions that make a key difference, then the effort is well worth it.


4. Test a forward-looking projection using only backward-looking data

Once I develop a predictive algorithm or statistical model using data segments, my next step is to test it by attempting a forward-looking projection using only historical data. In other words, can I accurately forecast future data using only the information given to me by past data? This is significantly different and more challenging than retrofitting a model to the data when the outcome is already known.


An example of this is if I have three years’ worth of data. A novice approach might be to build a model that accurately predicts the outcomes from the data. This is relatively easy because we already know what happened; it is just a matter of fitting an algorithm to it. But that does not guarantee that the algorithm will work for future events.


The better approach is to break the data into different time periods and test the predictiveness of the model using only backward-looking segments. I might build a model based solely on the first year’s data. Then I test the predictiveness of that model by comparing it to the second year’s results. Did the probabilities from the first year allow me to accurately forecast what happened in year two? Usually some refinement is needed; then I repeat the test again for year three’s outcomes, basing it only on data from the prior years. How this looks in practice can vary a lot depending on the data set, the available time period and the project goals, but the conceptual approach remains the same.
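Here is a simplified sketch of that walk-forward test, using segment win rates as a stand-in for whatever model is actually being evaluated. The file layout and column names are hypothetical:

```python
import pandas as pd

opps = pd.read_csv("opportunities.csv", parse_dates=["close_date"])
opps["year"] = opps["close_date"].dt.year

years = sorted(opps["year"].unique())
for train_end, test_year in zip(years[:-1], years[1:]):
    # Build the "model" (here, just segment win rates) on prior years only.
    train = opps[opps["year"] <= train_end]
    test = opps[opps["year"] == test_year]
    rates = train.groupby("customer_type")["won"].mean()

    # Forecast the test year's wins from backward-looking rates alone,
    # then compare against what actually happened.
    predicted = test["customer_type"].map(rates).sum()
    actual = test["won"].sum()
    print(test_year, round(predicted, 1), actual)
```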


5. Look for caveats


Finally, I like to look for specific practical limitations, or “caveats,” that may impact the outcomes of my data model. This requires a good understanding of the business functions that generate the data and how they work in the real world.


One example of this is with sales team data. When I started building algorithms for sales forecasting, I found that my initial model could predict quantities with reasonable accuracy but was not as accurate for actual revenue. As I dug further, I uncovered that the reason was the practical reality of sales discounting. Sales reps would enter their opportunities into the CRM at list price, but they would frequently end up discounting the price before a deal was closed. As a result, building projections based only on open dollar amounts in the CRM resulted in inaccurate forecasts. But when the data model was adjusted to incorporate likely discounting, it became much more accurate.
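A simplified sketch of that adjustment might look like the following, where expected_discount is an assumed column derived from historical closed-won deals rather than a real CRM field:

```python
import pandas as pd

pipeline = pd.read_csv("open_opportunities.csv")

# Naive forecast: open list-price amounts weighted only by win probability.
naive = (pipeline["amount"] * pipeline["win_probability"]).sum()

# Adjusted forecast: also apply the discounting observed historically.
adjusted = (
    pipeline["amount"]
    * pipeline["win_probability"]
    * (1 - pipeline["expected_discount"])
).sum()

print(naive, adjusted)
```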


Another example is with stock market data. In recent months I have spent time analyzing stock data and exploring a couple of possible strategies. My initial approach showed promising potential returns, but it relied on a mix of long and short trades. (In stock market terminology, a short is a bet that a stock’s price will go down.) A caveat with the model was that some stocks are easier to short than others, because placing a short requires borrowing the shares. In fact, some stocks may be impossible to short with certain brokers. I re-examined the model and focused on stocks that appear on the easy-to-borrow list, so they would be more likely to be available to short. The revised approach still generates solid returns, but the caveat was a key learning point.
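Conceptually, the fix was an extra filter before backtesting or placing the short side of the strategy. A hedged sketch, with made-up file and column names, might look like this:

```python
import pandas as pd

# Candidate short positions produced by the model (hypothetical file).
candidates = pd.read_csv("short_candidates.csv")

# Easy-to-borrow list published by the broker (hypothetical file).
easy_to_borrow = set(pd.read_csv("easy_to_borrow.csv")["symbol"])

# Keep only shorts that are likely to be available to borrow, so the
# backtested results reflect trades that could actually be placed.
tradable = candidates[candidates["symbol"].isin(easy_to_borrow)]
print(len(candidates), len(tradable))
```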


Data always tells a story


It’s amazing the insights that can be obtained from thoughtful examination of interesting data sets. I enjoy data analysis that uncovers or validates practical business and financial strategies. If you’d like to speak more about projects that you have considered or data sets that might contain valuable storylines, please feel free to reach out to me.
