Exploratory data analysis: Data science in industry
The advancement of smart industries is inevitable and has the power to transform the current state of the art in industrial technology. Despite this apparent growth, the new production model must be implemented strategically if it is to generate results. According to the article “Quality 4.0: a review of big data challenges in manufacturing” (2021), 92% of the leaders interviewed are investing in Big Data and Artificial Intelligence. However, many of them are implementing these technologies without a clear structure and, as a consequence, report problems in taking proper advantage of them.
To keep up with these changes, new concepts have been created to further improve the performance of industries in this era of digital transformation. “Quality 4.0” is one example: it uses data science to improve quality management by supervising manufacturing and distribution processes. The idea is based on finding improvements analytically and collecting data in real time, which leads to strategic decisions.
Data science plays an essential role in these and other processes, serving as a foundation for new insights. To identify potential improvements, data scientists use different methods of analysis, such as the ones explained in this article. These methods pave the way for accurate results, forming the basis for implementing disruptive concepts in industry, such as “Quality 4.0”.
In summary, the analysis methods covered below are:
- Exploratory Data Analysis (EDA): the use of graphical or quantitative techniques to gain a better understanding of a data set;
- Explanatory Data Analysis: it comes after exploratory analysis and focuses on confirming existing theories or hypotheses;
Exploratory Data Analysis: unlocking results in industry
Although data culture has grown considerably in industry over the past few years, the concept itself is not recent. The term “Exploratory Data Analysis” was introduced by John W. Tukey, a renowned statistician, in the 1970s. With the arrival of Big Data, its application in industry has become more prevalent over the last two decades, making it an important phase of data analysis.
Exploratory Data Analysis (EDA) helps to find the best way to manipulate the information collected in order to reach the answers needed in each case. It makes it easier for data scientists to identify anomalies and patterns, test hypotheses, or check assumptions. It is used to explore what data can reveal beyond formal modeling, as well as to understand the variables in the data set and how they relate to each other.
The steps for applying EDA in industry are:
- Data collection and cleaning: Data is collected directly from machines or industrial processes. Before the analysis starts, it is important to clean the data by handling missing values and outliers and correcting possible errors. This process ensures the accuracy of the analysis;
- Exploration of variables: Analysis of each variable's characteristics, such as distribution, central tendency, and dispersion. This is done through descriptive statistics such as arithmetic mean, median, mode, and standard deviation;
- Identification of outliers: Outliers are extreme or unusual values that can distort the analysis and its results. EDA makes it possible to decide whether they should be removed, treated, or kept, depending on the context of the industrial plant and the purpose of the analysis;
- Correlation analysis: EDA makes it possible to see how the variables relate to each other and to identify connections that can be useful when building predictive models. This information can provide important insights when it comes to developing new strategies;
- Visualization of results: This analysis method makes it possible to create graphical visualizations of the data, which are more accessible and easier to understand. Charts such as histograms, scatter plots, and boxplots can reveal patterns and trends that were previously hidden;
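The steps above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the sensor readings, the variable names, and the decision to simply drop incomplete rows are assumptions made for the example, not a prescribed workflow. Outlier handling and charts are left out to keep the sketch short.

```python
import math
import statistics

# Hypothetical sensor readings (temperature in °C, vibration in mm/s);
# None marks a value the data logger failed to record.
raw_rows = [
    (70.1, 2.1), (71.3, 2.4), (None, 2.2), (69.8, 2.0),
    (72.5, 2.9), (73.0, None), (74.2, 3.3), (75.1, 3.6),
]

def clean(rows):
    """Data cleaning: drop rows that contain a missing value."""
    return [row for row in rows if None not in row]

def describe(values):
    """Exploration of one variable: central tendency and dispersion."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }

def pearson(xs, ys):
    """Correlation analysis: Pearson coefficient of two equally long sequences."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rows = clean(raw_rows)
temperature = [t for t, _ in rows]
vibration = [v for _, v in rows]

print(describe(temperature))
print(describe(vibration))
print(f"correlation: {pearson(temperature, vibration):.2f}")
```

In a real plant, dropping rows is only one option; missing values may instead be interpolated or flagged, depending on the purpose of the analysis.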
Different Analysis Techniques
There are four main types of EDA:
- Non-graphical univariate: It is the simplest form of data analysis, as it looks at one variable at a time to understand its distribution and identify patterns or anomalies. It does not deal with causes or connections, and its main purpose is to describe the data and monitor its behavior;
- Non-graphical multivariate: Analysis of two or more variables together to understand how they are connected. Non-graphical EDA techniques usually show the interaction between variables through cross-tabulation or statistics;
- Univariate graphing: Non-graphical methods do not provide a complete picture of the data collected, so graphical methods are required. Common types of univariate charts include histograms, box plots, and stem-and-leaf plots;
- Multivariate graphing: Multivariate graphing uses charts to display connections between two or more variables. The most commonly used type is the grouped (clustered) bar chart, in which each group of bars represents a value of one variable and each bar within the group represents a value of the other variable;
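As a rough illustration of two of these types, the sketch below builds a cross-tabulation (non-graphical multivariate) and a text histogram (a very simple univariate chart). The inspection records, cycle times, and function names are hypothetical, and the histogram helper assumes integer values.

```python
from collections import Counter

# Hypothetical inspection records: (production line, inspection result).
records = [
    ("line_a", "ok"), ("line_a", "ok"), ("line_a", "defect"),
    ("line_b", "ok"), ("line_b", "defect"), ("line_b", "defect"),
    ("line_a", "ok"), ("line_b", "ok"),
]

# Hypothetical cycle times in whole seconds.
cycle_times = [41, 43, 44, 47, 48, 48, 49, 52, 53, 58]

def cross_tab(pairs):
    """Count how often each combination of two categorical values occurs."""
    return Counter(pairs)

def text_histogram(values, bin_width):
    """Group integer values into fixed-width bins and draw one bar per bin."""
    bins = Counter((v // bin_width) * bin_width for v in values)
    return [f"{start}-{start + bin_width - 1}: {'#' * bins[start]}"
            for start in sorted(bins)]

for (line, result), count in sorted(cross_tab(records).items()):
    print(f"{line:8} {result:8} {count}")

for bar in text_histogram(cycle_times, bin_width=5):
    print(bar)
```

In practice a plotting library would replace the text bars, but the binning logic behind a histogram is the same.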
It is also possible to use statistical and visualization techniques as a flexible and wide-ranging form of exploration that enables data scientists to delve deeper into the data, beyond preconceived ideas. Some examples are:
- Descriptive Statistics: This technique involves calculating measures of central tendency (such as the mean), dispersion (range, variance, standard deviation), and shape (skewness, kurtosis) for each variable in the data set;
- Clustering: Clustering techniques such as K-means clustering, hierarchical clustering, and DBSCAN are used to group similar data points together;
- Outlier detection: Techniques such as the Z-score and the IQR method are used to detect outliers in the data;
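The two outlier detection techniques named above can be sketched as follows. The pressure readings are invented for the example, and the thresholds (3 standard deviations, 1.5 times the IQR) are common conventions rather than fixed rules.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Values lying more than `threshold` sample standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical pressure readings (bar) with one suspicious spike.
pressure = [5.0, 5.1, 4.9, 5.2, 5.0, 4.8, 5.1, 5.0, 9.7, 5.1, 4.9, 5.0]

print("z-score:", zscore_outliers(pressure))
print("IQR:    ", iqr_outliers(pressure))
```

Note that the Z-score is itself pulled by the outlier, since the spike inflates both the mean and the standard deviation. With small samples the IQR method is often the more robust of the two.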
Analytical methods that bring results
Exploratory data analysis is an important step for the implementation of data science in industry, and its positive impact can be perceived in different forms.
Data collected in real time is used as a basis for meaningful insights that go beyond traditional production optimization methods. Efficiency increases as a result, thanks to the quality of the data collected and the clear way it is visualized during analysis. This technique also enables more accurate analysis, which reduces the waste of supplies used in operation.
Additionally, EDA can help anticipate machine failures and improve maintenance schedules, reducing downtime and increasing production. Through data analysis, it is possible to implement predictive maintenance instead of corrective maintenance. This supports better production line planning and helps managers make better decisions. The result is both tighter control of the supply chain and scheduled shutdowns that do not disrupt production.
Finally, exploratory analysis, used to identify patterns and anomalies, enables the quick identification of instabilities, contributing to increased product quality.
Several companies are already making use of this method of data analysis, each with a different purpose. General Electric, for example, uses it primarily to improve its customer experience by reducing the number of defective products. On a broader scale, Nike uses data analysis to track the performance of athletes during training, helping them improve.
Comprehensive Data Analysis: Explanation of the data found
Comprehensive data analysis (or “explanatory” data analysis) is a technique concerned with making inferences from the data collected, aiming to explain the patterns in the data after hypothesis testing. It is used when the data scientist identifies a specific issue that needs to be communicated to an audience. In summary, this type of analysis is a statistical approach that involves explaining the insights found in a data set. In an industrial context, it is used to explain data and provide new insights, which help improve performance, efficiency, and productivity.
This process happens after Exploratory Data Analysis and uses data visualization, statistics, and transformation methods to explain the core features found. In this explanatory phase, the data scientist can use many techniques to clarify how the input variables (or features) are related to the output variable (setpoint). Some examples are:
- Regression analysis: It is used to model the connection between a dependent variable and one or more independent variables. It is useful for understanding what factors influence the outcome of a process;
- Time Series Analysis: It is used to analyze data collected over time, such as production rates or quality measurements;
- Pareto analysis: This technique is used to identify the most significant factors in a data set;
- Statistical Process Control (SPC): SPC methods such as attribute and variable control charts, individuals and moving range (I-MR) charts, run charts, and pre-control are used to monitor and explain the manufacturing process;
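Two of the techniques in this list lend themselves to a short sketch: Pareto analysis and the control limits of an individuals chart from SPC. The defect causes, measurements, and function names below are hypothetical. The constant 2.66 is the usual factor for individuals charts (3 divided by d2 = 1.128 for a moving range of two points).

```python
from collections import Counter
import statistics

def pareto(causes, cutoff=0.8):
    """Most frequent causes that together account for at least `cutoff` of all events."""
    counts = Counter(causes).most_common()
    total = sum(count for _, count in counts)
    vital_few, cumulative = [], 0
    for cause, count in counts:
        vital_few.append(cause)
        cumulative += count
        if cumulative / total >= cutoff:
            break
    return vital_few

def individuals_chart_limits(values):
    """Lower limit, center line, and upper limit of an individuals (I) chart."""
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    center = statistics.mean(values)
    spread = 2.66 * statistics.mean(moving_ranges)
    return center - spread, center, center + spread

# Hypothetical defect log and part diameters (mm).
defects = (["scratch"] * 12 + ["misalignment"] * 5
           + ["crack"] * 2 + ["discoloration"] * 1)
diameters = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8]

print("vital few causes:", pareto(defects))
lcl, center, ucl = individuals_chart_limits(diameters)
print(f"LCL={lcl:.3f}  CL={center:.3f}  UCL={ucl:.3f}")
```

Points falling outside the lower and upper limits would be candidates for investigation; the limits describe the process's own variation, not the product specification.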
The process of hypothesis testing and measuring results may also involve statistical significance (indicating whether the results are unlikely to be due to chance alone). In addition, effect size (the magnitude of the difference or relationship) is another important measure in explanatory analysis.
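To make the distinction between significance and magnitude concrete, here is a small sketch that compares two groups. It estimates a p-value with a permutation test (one of several possible tests, chosen here because it needs only the standard library) and measures effect size with Cohen's d. The defect rates and configuration names are invented for the example.

```python
import random
import statistics

def cohens_d(a, b):
    """Effect size: difference of means in units of the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * statistics.variance(a)
                  + (n_b - 1) * statistics.variance(b)) / (n_a + n_b - 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / pooled_var ** 0.5

def permutation_p_value(a, b, rounds=5000, seed=0):
    """Two-sided p-value for the difference in means, by random relabeling."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:len(a)])
                   - statistics.fmean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (rounds + 1)

# Hypothetical defect rates (%) under two assembly line configurations.
config_a = [2.1, 2.4, 2.0, 2.3, 2.2, 2.5, 2.1, 2.3]
config_b = [1.6, 1.8, 1.5, 1.9, 1.7, 1.6, 1.8, 1.7]

print(f"p-value:   {permutation_p_value(config_a, config_b):.4f}")
print(f"Cohen's d: {cohens_d(config_a, config_b):.2f}")
```

A small p-value says the difference is unlikely to be chance alone; Cohen's d says how large the difference is relative to the natural variation, which is what matters when deciding whether a change is worth making.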
Data analysis used in different types of industry
As previously mentioned, both analysis techniques are widely used in manufacturing industries, and their positive impact can vary according to the sector. For example, in the automotive industry, exploratory data analysis can be applied to monitor the health of equipment and predict failures. At the same time, explanatory analysis is used to identify the impact of different assembly line configurations on vehicle defect rates.
In the food and beverage industry, exploratory data analysis can identify variations in ingredient consistency. Explanatory analysis, on the other hand, can explain the relationship between processing parameters (such as temperature and mixing speed) and the quality of the final product.
Finally, in electronics manufacturing, exploratory data analysis enables the industry to visualize component failure rates over time to identify trends, while explanatory analysis helps to understand how variations in environmental conditions (such as humidity and temperature) affect solder joint quality.
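Relationships like the ones in these examples, where a process parameter is set against a quality measure, are often explored with a simple regression. The sketch below fits a straight line by ordinary least squares; the temperatures, quality scores, and function names are assumptions for illustration, and a real process may well need more variables or a non-linear model.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(xs, ys, slope, intercept):
    """Share of the variance in y explained by the fitted line."""
    mean_y = statistics.fmean(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical processing temperatures (°C) and quality scores (0 to 10).
temperature = [60, 65, 70, 75, 80, 85]
quality = [7.1, 7.6, 8.0, 8.4, 9.1, 9.4]

slope, intercept = fit_line(temperature, quality)
print(f"quality = {intercept:.2f} + {slope:.3f} * temperature")
print(f"R^2 = {r_squared(temperature, quality, slope, intercept):.3f}")
```

The slope estimates how much the quality score changes per degree, and R² indicates how much of the variation the line accounts for. Neither proves causation on its own.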
As seen, these two data analysis methods play a crucial role in various industrial sectors, as they provide insights that contribute to process optimization. With them, data scientists can identify patterns and improve the productivity and strategic management of the production line. Exploratory analysis focuses on understanding and visualizing data, while comprehensive analysis explains interactions between variables and tests hypotheses. Together, these techniques are part of a constantly growing digitalization process and help to make so-called smart industries a reality.