Does More Data Equal Better Analytics?
In our modern world of everything digital and big data, organizations are flush with data assets that can be used for analytics. We will set aside the problem of engineering all this data for the moment and look at a different problem (or benefit, if you’re a glass-half-full type of person): how does more data equal better analytics?
Google’s Research Director Peter Norvig has been quoted many times on the value of more data in analytics:
“We don’t have better algorithms. We just have more data.”
“More data beats clever algorithms, but better data beats more data.”
“Simple models and a lot of data trump more elaborate models based on less data.”
And once you hear statements such as these from a world-renowned expert, the logical next two questions are “where?” and “how?” Peter Norvig’s statements are heavily slanted towards data science, as that was his main job and area of expertise. But more data can deliver better results in BOTH data science and traditional analytics. Let’s explore these.
More Data = More Features
Let’s start in the world of data science. The first and perhaps most obvious way in which more data delivers better results in data science is the ability to expose more features to feed your data science models. In this case, accessing and using more data assets can lead to “wider datasets” containing more variables.
Uniting more datasets into one helps the feature engineering process in two ways. First, it gives you more raw variables that can be used as features. Second, it gives you more fields that you can combine to make derived variables.
It is important to note that the brute-force approach of throwing more features at a model is NOT the objective. That would be over-engineering the model. The objective is to explore as many features as possible, see how well each fits the problem at hand, and choose the best ones.
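To make the idea of a wider dataset concrete, here is a minimal sketch in plain Python. The two datasets, their field names (order_count, total_spend, tickets) and the derived variables are all hypothetical and chosen only for illustration. The sketch joins two sources on a shared customer key, which yields more raw variables, and then combines fields from both sources into derived variables.

```python
# Two hypothetical datasets keyed by customer id (illustrative values only).
orders = {
    "c1": {"order_count": 4, "total_spend": 200.0},
    "c2": {"order_count": 1, "total_spend": 35.0},
}
support = {
    "c1": {"tickets": 2},
    "c2": {"tickets": 0},
}


def widen(left, right):
    """Join two keyed datasets into one wider dataset (inner join on the key)."""
    return {key: {**left[key], **right[key]} for key in left.keys() & right.keys()}


def add_derived(rows):
    """Add derived variables that combine fields from the joined sources."""
    for row in rows.values():
        n = row["order_count"]
        # Guard against division by zero for customers with no orders.
        row["avg_order_value"] = row["total_spend"] / n if n else 0.0
        row["tickets_per_order"] = row["tickets"] / n if n else 0.0
    return rows


wide = add_derived(widen(orders, support))
```

Each row in wide now carries three raw variables and two derived ones. These are candidates to evaluate, not features to feed into a model blindly.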
More Data = Better Training
AI and machine learning models are only as good as the data used to train them. To most people, the natural conclusion is that the larger the volume of data (“longer datasets”) you throw at a model, the better the model will “learn.” While this is a good goal, one needs to be careful here as well and explore two areas: variance and bias.
A situation called high variance can occur when we have added too many features to the model (over-engineered it, as discussed above) and don’t have enough data volume to train the model well. The model then fits quirks of the training data rather than the underlying pattern. This situation can be fixed by simplifying the model, throwing greater data volumes at it, or both.
Another case is high bias, where the model is too simple, with not enough variables or relevant features. In this situation, throwing more data at the model will not make things better. The better approach is to do as prescribed above: explore more data to find the right features, and then throw more data at the model.
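One common rule of thumb for telling the two cases apart is to compare the model's error on the training data with its error on held-out validation data. The sketch below is a rough heuristic, not a definitive test. The function name diagnose and both thresholds (target_error, gap_tolerance) are assumptions made for illustration and would need to be set per problem.

```python
def diagnose(train_error, validation_error, target_error=0.10, gap_tolerance=0.05):
    """Label a model as high bias, high variance, or ok from two error rates.

    The thresholds are hypothetical defaults. Bias is checked first, so a model
    that suffers from both problems is reported as high bias.
    """
    if train_error > target_error:
        # The model cannot even fit the data it was trained on: too simple.
        return "high bias"
    if validation_error - train_error > gap_tolerance:
        # Fits the training data well but does not generalize.
        return "high variance"
    return "ok"


# Suggested next step for each diagnosis, following the discussion above.
NEXT_STEP = {
    "high bias": "explore more data to find better features, then add more rows",
    "high variance": "simplify the model and/or train on more rows",
    "ok": "no change needed",
}

overfit = diagnose(train_error=0.02, validation_error=0.20)
underfit = diagnose(train_error=0.30, validation_error=0.32)
```

In this example overfit comes out as "high variance" (low training error, large gap) and underfit as "high bias" (high error on both), which maps onto the two remedies described above.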
More Data = More Dimensions and Measures to Explore
In traditional analytics, more data can help as well. In ad-hoc analytics, you are trying to answer new questions that the business is asking, or re-ask questions whose answers vary widely depending on the situation. Bringing more data together allows you to explore it in much more depth to find the right answer.
By uniting more datasets and creating wider data, you have more dimensions to explore and a greater number of measures that can be rolled up. More data can also give you a greater number of values to explore in particular fields. This combination lets you “fail fast”: you explore an analysis path rapidly, and if it doesn’t produce the answer, you quickly explore another path until the best answer is produced.
Be careful, however, as some BI tools have limitations on the size of datasets and the number of variables one can explore. Excel, the most popular analysis tool in the world, certainly has its limitations. Large-scale data exploration requires a robust data infrastructure that can handle the volume of data.
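As a small illustration of rolling up a measure along different dimensions, the sketch below sums one measure over any chosen combination of dimensions. The sales rows, the dimension names (region, channel, product) and the roll_up helper are all hypothetical. Trying a different analysis path is just a matter of passing a different list of dimensions.

```python
from collections import defaultdict

# Hypothetical wide dataset: three dimensions and one measure per row.
sales = [
    {"region": "East", "channel": "web", "product": "A", "revenue": 100.0},
    {"region": "East", "channel": "store", "product": "A", "revenue": 50.0},
    {"region": "West", "channel": "web", "product": "B", "revenue": 70.0},
    {"region": "West", "channel": "web", "product": "A", "revenue": 30.0},
]


def roll_up(rows, dimensions, measure):
    """Sum a measure for every combination of the given dimensions."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)


# Two analysis paths over the same data.
by_region = roll_up(sales, ["region"], "revenue")
by_region_channel = roll_up(sales, ["region", "channel"], "revenue")
```

Every dimension added to the dataset multiplies the number of paths like these that can be tried, which is where the "fail fast" benefit comes from.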
More Data = Wider Purview
Adding more data to an analysis can also help you gain a broader and more complete perspective on a business problem. The more data you add from different aspects of a problem, the more complete your view. It can help create what is often referred to in the analytics world as a 360-degree view.
Prime examples of this are in customer analytics: customer experience, customer behavior, customer retention. In each of these use cases, if you only have data from some channels but not all, you have blind spots that may be keeping you from getting the most accurate answers. The more data you add, the broader your purview of the problem, which increases accuracy and trust in the results.
More Data = More Detailed Results
Many of the new analytic questions that come from the business are trying to answer “why” and “how” questions. Perhaps a dashboard showed metrics that varied greatly from the norm. So immediately, the business wants answers that explain why or how the situation is happening. And they also want “actionable results” telling them what to do about the situation.
This requires adding a great deal of detailed data to the analytics to dig deep and find the in-depth answers the business is seeking. In this case, we are creating a wider dataset to explore more variables and find the set of variables that is influencing the situation, producing actionable results that explain not just the why and how but, more importantly, the “what”: what to do.
A prime example of this comes up in marketing analytics. A dashboard may show which marketing campaigns are performing better than others and which are performing poorly. Making adjustments is not as simple as continuing the good ones and shutting down the bad ones.
In this case the business wants the detailed aspects of the campaigns analyzed to determine the best course of action. Are there aspects of the marketing channel that are making campaigns succeed or fail? Demographic characteristics of the targets? Features of the offers?
Armed with these details, the business can create an action plan to adjust the marketing mix. Getting very fast answers, within hours, can also eliminate the costs wasted on poor marketing campaigns while the business waits for answers.
More Data = Better Segmentation
Related to the problem above, adding more data to the mix helps create better segmentation models in general. This is done both with wider data and longer data.
Creating wider data will add more variables to the equation that can be used for segmentation. Teams can explore these algorithmically (e.g. clustering, decision trees) or visually. Using longer data will add a longer span of history to the analysis and help improve the accuracy of the metrics used within the segmentation.
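For the algorithmic route, here is a minimal sketch of clustering-based segmentation using plain k-means (Lloyd's algorithm) written in standard-library Python. The customer values, the two variables (monthly spend and monthly visits) and the choice of starting centroids are assumptions for illustration only. A real project would more likely use an established library implementation.

```python
def nearest(point, centroids):
    """Return the index of the centroid closest to the point (squared distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return distances.index(min(distances))


def kmeans(points, centroids, iterations=20):
    """Plain k-means: alternate assignment and update steps until stable."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        updated = [
            tuple(sum(v) / len(cluster) for v in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Hypothetical customers described by (monthly spend, monthly visits).
customers = [
    (20.0, 1.0), (25.0, 2.0), (22.0, 1.0),
    (200.0, 9.0), (220.0, 10.0), (210.0, 8.0),
]
centroids, labels = kmeans(customers, [customers[0], customers[3]])
```

The labels list assigns each customer to one of two segments. Note that the variables here are unscaled, so spend dominates the distance calculation. In practice the variables are usually standardized first, and a wider dataset simply means longer tuples per customer.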
As we have seen, adding more data to your analysis will help you produce better results. This comes not just from broadly adding more data, but also from finding the right data to fit your problem and produce a trusted result. Adding more data will help in data science problems to improve accuracy, and in traditional analytics to explore detailed why and how questions, produce actionable results, and gain a wider purview on various analytic situations.
The Neebo Virtual Analytics Hub is a cloud-first solution allowing analytics teams to find, create, collaborate and publish trusted analytic data assets in complex hybrid landscapes. Neebo provides unified access across data silos, increases use of data assets and furthers data knowledge to build trust and rapidly answer new business questions. Neebo allows analytics teams to easily access and consume more data in their analytics and find the right data to produce the most accurate and detailed results. To learn more, visit the Neebo website or test drive Neebo by registering for a free 14-day trial.