Predictions are only as good as the data used to calculate them. But when it comes to quality, not all data is good data. As they say, “garbage in, garbage out.” Of course, the best models should digest and analyze a wide variety of data sets. But even though quantity matters, it’s the “cleanliness” of the data that’s ultimately paramount.
Quality is king
Some data sources are of such low quality that they should be excluded from forecasting models altogether. Volume does not redeem them: four trillion incomplete and mismatched data points are still incomplete and mismatched.
Just as high-quality ingredients are essential to a world-class meal, clean data sets are vital to the predictive models that produce insightful, actionable demand forecasts. One spoiled ingredient can ruin an entire dish. Similarly, if the data used to predict product demand is “dirty” (incomplete, inconsistent, or incorrect), the quality and accuracy of the forecast will suffer.
The data-cleaning process
Far too often, CPG companies employ less sophisticated forecasting techniques, such as manually crunching aggregated monthly data. It is therefore essential to have access to technology that incorporates what can be thought of as an inspection and cleaning process. Before data is pushed to predictive models, cleaning algorithms are applied to a number of data sources, detailed below, to close any gaps that may exist. Only then can the data be put to proper use.
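To make the inspection-and-cleaning idea concrete, here is a minimal sketch in Python. It assumes a daily sales series stored as a dictionary keyed by date; the function names and the choice of linear interpolation are illustrative assumptions, not a description of any particular vendor's cleaning algorithm.

```python
from datetime import date, timedelta


def find_missing_days(observed):
    """Inspection step: list calendar days between the first and last
    record that have no entry at all."""
    if not observed:
        return []
    days = sorted(observed)
    missing = []
    current = days[0]
    while current <= days[-1]:
        if current not in observed:
            missing.append(current)
        current += timedelta(days=1)
    return missing


def fill_gaps_linear(observed):
    """Cleaning step (one assumed strategy): fill each missing day by
    interpolating linearly between the nearest recorded days."""
    days = sorted(observed)
    filled = dict(observed)
    for left, right in zip(days, days[1:]):
        span = (right - left).days
        for step in range(1, span):
            change = (observed[right] - observed[left]) * step / span
            filled[left + timedelta(days=step)] = observed[left] + change
    return filled


# Hypothetical series with two missing days (Jan 2 and Jan 3).
daily_units = {date(2024, 1, 1): 10.0, date(2024, 1, 4): 16.0}
gaps = find_missing_days(daily_units)
cleaned = fill_gaps_linear(daily_units)
```

Interpolation is only one option. A true zero-sales day and a day with a missing record look identical in raw data, so in practice the inspection step matters as much as the fill.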
Remember, when supply chain forecasts rely on ample data that’s clean—well organized, timely, precise, consistent, and complete—then accuracy is more likely to follow. The sheer amount of data that’s necessary to feed the best machine-learning models is enormous, requiring enterprise-grade servers and the work of a team of data engineers.
Data sets vital to predictive models and forecast accuracy
Traditionally, demand forecasts have primarily been calculated using historical shipment data; in other words, using the past to identify trends and predict the future. But the most capable predictive models take into account historical sales plus a variety of data sets, including:
- Historical sales (shipments at the SKU level by customer, ship-to and ship-from locations)
- Own and competitive marketing and trade investments (e.g., promotions)
- Online search behavior and social media mentions
- Point-of-sale (POS) consumption data at both chain and store levels (foot traffic)
- Seasonality (and other fluctuations related to time)
- Weather patterns
- Ingredient and health trends
Data cleaning in action: Common gaps and errors
The more data there is, the higher the likelihood of error existing within the data pool. Of course, data cleanliness levels vary from company to company. Some brands have executed a plan to maintain consistent, complete data throughout the supply chain, from their financial statements and products to their inventory at warehouses and fulfillment vendors. But most CPG brands have significant errors within their business data, often because they do not have the proper technological infrastructure in place.
Here are some examples of where gaps in data sets commonly occur, which can ultimately harm demand forecast accuracy.
Historical Sales and Seasonality
Historical sales order data is among the most important data sources for a supply chain forecast. The more granular and accurate the data, the more accurate forecasts tend to be. The best sales order data contains details about every order going back many years, which makes it possible to ascertain the dynamic relationships between seasonal trends and long-term growth trends. Because there is an immense wealth of information contained in day-to-day sales patterns alone, it's vital for brands to possess substantial data science computing capabilities designed to process raw daily sales data.
It's critical to gain a deep understanding of the relationship between orders and inventory over time, which helps identify inventory problems relative to sales. For instance, comparing quantities ordered with quantities shipped shows the sales that could have occurred against the sales that actually occurred, with the difference attributable to insufficient inventory. This data is vital to producing accurate inventory predictions, which help brands plan optimum inventory levels to match (seasonal) sales demand.
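The ordered-versus-shipped comparison can be sketched in a few lines of Python. The record layout (SKU, quantity ordered, quantity shipped) and the function name are assumptions made for illustration; real order data will carry many more fields.

```python
def unmet_demand(order_lines):
    """Summarize ordered vs. shipped quantities per SKU.

    order_lines: iterable of (sku, qty_ordered, qty_shipped) tuples.
    Returns a dict of per-SKU totals, the shortfall, and the fill rate.
    """
    summary = {}
    for sku, ordered, shipped in order_lines:
        rec = summary.setdefault(sku, {"ordered": 0, "shipped": 0})
        rec["ordered"] += ordered
        rec["shipped"] += shipped
    for rec in summary.values():
        # Shortfall approximates sales lost to insufficient inventory.
        rec["short"] = max(rec["ordered"] - rec["shipped"], 0)
        rec["fill_rate"] = (
            rec["shipped"] / rec["ordered"] if rec["ordered"] else None
        )
    return summary


# Hypothetical order lines.
lines = [("A", 100, 80), ("A", 50, 50), ("B", 40, 40)]
report = unmet_demand(lines)
```

A model trained only on shipped quantities would learn the constrained number for SKU A (130 units) rather than the 150 units customers actually asked for, which is why the shortfall is worth tracking separately.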
SKU codes and/or UPC numbers often change during a product’s lifetime. This informational trail provides vital data about a product’s history, especially during the process of establishing and analyzing time relationships between SKUs and UPCs.
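One way to keep that trail usable is to stitch the history of superseded codes onto the code currently in use. The sketch below assumes a simple "old code to replacement code" lookup; the table, the code values, and the function names are hypothetical.

```python
def resolve_current_code(code, successors):
    """Follow the chain of replacements until reaching a code that
    has no successor. Raises if the lookup table contains a loop."""
    seen = {code}
    while code in successors:
        code = successors[code]
        if code in seen:
            raise ValueError(f"circular code history at {code!r}")
        seen.add(code)
    return code


def merge_history(units_by_code, successors):
    """Roll sales recorded under retired codes into the current code."""
    merged = {}
    for code, units in units_by_code.items():
        current = resolve_current_code(code, successors)
        merged[current] = merged.get(current, 0) + units
    return merged


# Hypothetical history: SKU-1 was replaced by SKU-2, then by SKU-3.
successors = {"SKU-1": "SKU-2", "SKU-2": "SKU-3"}
history = merge_history(
    {"SKU-1": 10, "SKU-2": 5, "SKU-3": 7, "SKU-9": 1}, successors
)
```

Without this step, a model would see three short-lived products instead of one product with a continuous sales history.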
Products often become discontinued or obsolete. Data detailing these product sunsets is vital to understanding the demand of comparable products.
New Products and Distribution
It’s essential for brands to have (and utilize) a list of future product launches, as well as to establish a clear vision of expected increases in pipeline shipments and distribution gains. Machine-learning models can’t be applied to a potential new product unless they’re programmed to do so. Put simply, it's quite difficult to produce accurate predictions for new products unless it is known what there is to predict.
For some products, sales are heavily driven by price discounts. It's vital to establish the dynamic numerical relationship between promos and changes (lift/drop) in sales.
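As a rough illustration of what a numerical promo relationship looks like, the sketch below compares average daily units on promoted days with average daily units on non-promoted days. This is a deliberately naive estimate that ignores seasonality, trend, and post-promo dips; the function name and inputs are assumptions.

```python
from statistics import mean


def promo_lift(daily_units, on_promo):
    """Naive lift: average promoted-day units relative to the
    average non-promoted-day units, expressed as a fraction.

    daily_units: list of unit sales per day.
    on_promo:    list of booleans, True when a promo was running.
    """
    promo_days = [u for u, p in zip(daily_units, on_promo) if p]
    base_days = [u for u, p in zip(daily_units, on_promo) if not p]
    if not promo_days or not base_days:
        raise ValueError("need both promoted and non-promoted days")
    baseline = mean(base_days)
    if baseline == 0:
        raise ValueError("baseline sales are zero; lift is undefined")
    return mean(promo_days) / baseline - 1


# Hypothetical week: days 3 and 4 were promoted.
units = [100, 100, 150, 150, 100]
flags = [False, False, True, True, False]
lift = promo_lift(units, flags)
```

A positive value indicates lift and a negative value indicates a drop. Missing or mislabeled promo flags feed directly into this number, which is one reason promo calendars need the same cleaning attention as sales data.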
Misspellings, human entry mistakes, and other errors can slow down the pace and quality of data analysis. Cleaning algorithms enable data mapping, which helps to identify various common trends in human error.
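A small sketch of how misspelled names might be mapped back to a master list, using the approximate string matching in Python's standard `difflib` module. The customer names, the 0.8 similarity cutoff, and the function name are illustrative assumptions; production systems typically use more specialized matching.

```python
from difflib import get_close_matches


def normalize_name(raw, canonical_names, cutoff=0.8):
    """Map a hand-entered name to its canonical spelling.

    Returns the canonical name, or None when nothing is similar
    enough and the entry should be reviewed by a person.
    """
    # Collapse whitespace and case before comparing.
    cleaned = " ".join(raw.strip().lower().split())
    lookup = {name.lower(): name for name in canonical_names}
    if cleaned in lookup:
        return lookup[cleaned]
    matches = get_close_matches(cleaned, list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None


# Hypothetical master list of customer names.
customers = ["Acme Foods", "Best Grocers"]
fixed = normalize_name("  acme  fods ", customers)
unknown = normalize_name("Zeta Mart", customers)
```

Returning `None` rather than guessing keeps low-confidence matches out of the cleaned data, and logging those cases over time is one way to surface recurring entry mistakes.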
Data cleaning in action: The importance of using clean mapping data
Top-performing machine learning models work to clean and associate, or map, countless invaluable data points. Some data sources should undergo what can loosely be described as an associative data mapping process in order to become more useful to predictive models. Here are some examples:
Retail and Wholesale Data
In order to fully incorporate store-level consumption data, predictive models should utilize a mapping key to match the case SKU with an individual unit UPC. From a machine learning perspective, this allows us to understand how retail customer demand drives wholesale case orders.
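A mapping key of this kind can be as simple as a lookup from unit UPC to case SKU and units per case. The sketch below assumes that layout; the codes and pack sizes are made up for illustration.

```python
def pos_units_to_cases(pos_units_by_upc, mapping_key):
    """Convert store-level unit sales into case equivalents.

    pos_units_by_upc: dict of unit UPC -> units sold at retail.
    mapping_key:      dict of unit UPC -> (case SKU, units per case).
    Returns (case equivalents by case SKU, list of unmapped UPCs).
    """
    cases = {}
    unmapped = []
    for upc, units in pos_units_by_upc.items():
        if upc not in mapping_key:
            # Flag the gap instead of silently dropping the sales.
            unmapped.append(upc)
            continue
        case_sku, units_per_case = mapping_key[upc]
        cases[case_sku] = cases.get(case_sku, 0) + units / units_per_case
    return cases, unmapped


# Hypothetical mapping key and one week of POS data.
key = {"0001": ("CASE-A", 12), "0002": ("CASE-B", 6)}
pos = {"0001": 36, "0002": 9, "0003": 5}
case_demand, missing_upcs = pos_units_to_cases(pos, key)
```

The list of unmapped UPCs is as useful as the conversion itself, since every unmapped code is retail demand that the wholesale forecast cannot see.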
Wholesale and Warehouse Fulfillment Data
Many brands understandably want to optimize their forecasts by fulfillment warehouse. To do this, it’s essential to map key data between wholesale customers and the specific warehouses that fulfill those orders. Doing so enables a high level of accuracy for these types of forecasts.
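Mapping data needs cleaning too. The sketch below checks a customer-to-warehouse mapping for conflicts, under the assumption that each customer ship-to location should be served by exactly one warehouse. That assumption, the row layout, and the identifiers are all illustrative and will not hold for every brand.

```python
def find_mapping_conflicts(mapping_rows):
    """Find customer ship-to locations mapped to more than one warehouse.

    mapping_rows: iterable of (customer_id, ship_to, warehouse) tuples.
    Returns a dict of (customer_id, ship_to) -> sorted warehouse list,
    containing only the keys that have conflicting entries.
    """
    warehouses_by_key = {}
    for customer, ship_to, warehouse in mapping_rows:
        warehouses_by_key.setdefault((customer, ship_to), set()).add(warehouse)
    return {
        key: sorted(warehouses)
        for key, warehouses in warehouses_by_key.items()
        if len(warehouses) > 1
    }


# Hypothetical mapping rows; C1/NYC appears under two warehouses.
rows = [
    ("C1", "NYC", "WH-EAST"),
    ("C1", "NYC", "WH-WEST"),
    ("C2", "LA", "WH-WEST"),
    ("C2", "LA", "WH-WEST"),
]
conflicts = find_mapping_conflicts(rows)
```

Exact duplicate rows are harmless here, but a genuine conflict means orders for that location could be credited to the wrong warehouse forecast until someone resolves it.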
In short, brands that possess clean internal and external data across the board, plus the necessary computing power, will reap the benefits of forecast accuracy. The trend of CPG brands relying on clean, ample data, together with the continuous development of machine learning models that analyze that data to produce accurate forecasts across the supply chain, will continue to enjoy plenty of wind at its back.