Exploratory data analysis (EDA) and its application
Exploratory data analysis (EDA) is a method of analyzing data under as few prior assumptions as possible. It explores existing data (especially raw data from surveys or observations) for structure and regularities by means of plotting, tabulation, equation fitting, computing summary statistics, and so on. It is particularly effective when we have little experience with the information in the data and do not know which traditional statistical methods to apply. EDA was proposed and named in the 1960s by the eminent American statistician John Tukey.
EDA is used mainly in the preliminary analysis of data, when routine statistical analysis is often not yet possible. If the analyst first explores the data, discerns its patterns and characteristics, and mines it systematically, he or she can flexibly select and adjust a suitable analysis model and reveal the various ways in which the data deviate from common models. On this basis, statistical techniques based on significance tests and confidence interval estimation can then be used to evaluate the observed patterns or effects scientifically.
To sum up, data analysis can be divided into two stages: exploration and verification. The exploration stage emphasizes flexibly searching for clues and evidence, to find valuable information hidden in the data; the verification stage focuses on evaluating that evidence and studying specific questions with relative precision. The verification stage relies mainly on traditional statistical methods, while the exploration stage relies mainly on EDA, which we now explain in more detail.
EDA has three characteristics. First, in its analytical approach it lets the data speak for themselves rather than forcing them into a preconceived mold. Traditional statistical methods usually assume a model first, for example that the data follow a certain distribution (most often the normal distribution), and then fit, analyze, and predict with methods suited to that model. In practice, however, most data (especially experimental data) cannot be guaranteed to satisfy the assumed theoretical distribution, so the results of traditional methods are often unsatisfactory and their applicability is limited. EDA instead starts from the raw data and explores their internal regularities in depth, rather than starting from assumptions, importing theoretical conclusions, and clinging to a model.
Second, EDA is flexible in its methods rather than bound to traditional statistical procedures. Traditional statistics is grounded in probability theory and relies on hypothesis tests and confidence intervals with a strict theoretical basis. EDA processes data in flexible and varied ways: the choice of method is driven entirely by the data, and whatever technique serves the purpose of exploration and discovery is used. EDA also pays more attention to the robustness and resistance of a method than to precision in the probabilistic sense.
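The "resistance" mentioned above can be illustrated with a minimal sketch (illustrative numbers, not from the article's data): with a single gross outlier, the mean shifts badly while the median barely moves, which is why EDA often prefers resistant summaries.

```python
import statistics

# Eight well-behaved measurements plus one corrupted observation.
clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
dirty = clean + [100.0]  # one gross outlier

# The mean is dragged far from the bulk of the data by the outlier,
# while the median (a resistant statistic) is essentially unchanged.
print("mean:  ", statistics.mean(clean), "->", statistics.mean(dirty))
print("median:", statistics.median(clean), "->", statistics.median(dirty))
```

Here the mean jumps from 10.0 to 20.0, while the median stays at 10.0, which is the kind of stability EDA values when the data cannot be trusted to follow an assumed distribution.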
Third, EDA's tools are simple and intuitive, which makes them easier to popularize. Traditional statistical methods are abstract and abstruse, and hard for non-specialists to master. EDA emphasizes intuition and data visualization, along with diversity and flexibility of method, so that analysts can clearly see the valuable information hidden in the data, exhibit the general regularities and distinctive features it follows, make discoveries, gain insight, and satisfy their varied requirements. This is EDA's main contribution to data analysis.
It is worth mentioning that, because EDA emphasizes intuition and graphical display, it has adopted many innovative visualization techniques, and these techniques now have a good software carrier. Currently the most mainstream exploratory data analysis software is JMP, a statistical discovery tool known for its strong graphics, interactivity, and ease of learning and use. Even analysts without a statistical background can use JMP to find regularities in data, examine fits and residuals, arrive at unexpected findings, and gain ideas and direction for subsequent analysis.
Next, a typical small case illustrates the practical application of EDA.
To study the development trend of the global economy and the business status of the world's top companies, we can download data from a public website (such as the Forbes 2000 list) and, after a little tidying in JMP, obtain the data table shown in Table 1. It contains nine variables: company name, industry, country, year of listing, rank, market value, assets, sales, and profit, for a total of 14,000 records (2,000 records per year over the seven years 2004-2010). The question now is: what valuable information is hidden in these data, and how can we find it?
Someone might say: since the data are continuous and include a time variable, they should be analyzed with time series methods! Indeed, a time series can tell us how variables change over time. In practice, however, the valuable information we want, and can get, goes far beyond "change over time". Moreover, the users who need to analyze such business data often do not know what "time series analysis" is.
Others might say: can't we explore the data with traditional charting tools such as line charts, bar charts, and pie charts? This seems feasible, but the data contain many categorical variables, each with many levels (the years span 7 values, there are 30 industries, 75 countries, and as many as 3,505 company names). Charting alone could exhaust us; where would the "data exploration" even begin?
Table 1. Forbes 2000 ranking data compiled in JMP software
What method, then, will let us explore these data well and find the important information we expect, or do not expect, and where should we look for it? Let us try the modern EDA visualization technique known as the "bubble chart". With JMP we can quickly obtain a graph like Figure 1: the horizontal axis represents a company's market value, the vertical axis its sales, the size of a bubble its profit, and the color of a bubble its industry. Most strikingly, the bubbles are not static: their positions and sizes change dynamically with the year, and the historical track of the whole change is also shown in the figure.
Figure 1. Dynamic bubble chart generated with JMP software
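A static sketch of the encoding that Figure 1 uses can be reproduced with pandas and matplotlib instead of JMP (JMP animates the bubbles across years; here a single year is plotted). The column names and the tiny stand-in values below are assumptions for illustration, not the real Forbes figures.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend: render to a file, no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Tiny illustrative stand-in for the Forbes 2000 table (made-up values).
df = pd.DataFrame({
    "company":      ["General Electric", "ExxonMobil", "PetroChina"],
    "year":         [2008, 2008, 2008],
    "market_value": [330.0, 465.0, 270.0],
    "sales":        [172.0, 358.0, 114.0],
    "profit":       [22.0, 40.0, 19.0],
    "industry":     ["Conglomerates", "Oil & Gas", "Oil & Gas"],
})

year = df[df["year"] == 2008]
# One color code per industry, as in the JMP chart's color legend.
colors = year["industry"].astype("category").cat.codes

plt.scatter(year["market_value"], year["sales"],
            s=year["profit"] * 20,  # bubble area encodes profit
            c=colors, alpha=0.6)
plt.xlabel("Market value")
plt.ylabel("Sales")
plt.savefig("bubble_2008.png")
```

The dynamic, year-by-year animation and track lines described in the article are features of JMP's interactive bubble chart; this sketch only shows the static position/size/color encoding.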
In this way we can intuitively spot some obvious features of the data. Take the two well-known companies marked in the figure: General Electric's operating performance is relatively stable, while ExxonMobil's is relatively volatile. Despite their obvious differences, the market value of both has fallen significantly since 2008, which is presumably related to the economic crisis then sweeping the world.
Having discovered these features, some will come up with new ideas: General Electric and ExxonMobil are American companies, so how do Chinese companies perform? We can invoke JMP's "data filter" function alongside the bubble chart to obtain an interface like Figure 2.
Figure 2. Dynamic bubble chart combined with the data filter in JMP software
It can be clearly observed that in the seven years from 2004, a total of 392 Chinese companies appeared on the Forbes list. Although there is still a gap between them and the world's top companies in number, market value, sales, and other business indicators, and the reasons for this are various, a group of large state-owned enterprises represented by PetroChina and Sinopec has developed rapidly and attracted worldwide attention.
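The filtering step behind Figure 2 can be sketched with pandas instead of JMP's interactive data filter. The rows below are toy stand-ins (the real table has 14,000 records), and the column names are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the Forbes table: one row per company per year.
df = pd.DataFrame({
    "company": ["General Electric", "PetroChina", "Sinopec", "PetroChina"],
    "country": ["United States", "China", "China", "China"],
    "year":    [2004, 2004, 2004, 2005],
})

# Keep only the Chinese companies, as the data filter does in Figure 2.
china = df[df["country"] == "China"]

print(len(china))                  # listings by Chinese companies
print(china["company"].nunique())  # distinct Chinese companies
```

In JMP the same filter is applied interactively and the bubble chart updates in place; the pandas version makes the subsetting logic explicit for readers without the software.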
In fact, exploratory data analysis goes far beyond this. At the initial stage of analysis, analysts can give free rein to their imagination, unconstrained by too many theoretical conditions, and visually explore the regularities in the data from multiple angles and levels. New clues often emerge naturally, laying a good foundation for the next stage of refined analysis such as statistical modeling and prediction.
In short, exploratory data analysis emphasizes flexibly searching for clues and evidence, focusing on finding the valuable information that may be hidden in the data, such as distribution patterns, trends, possible interactions, and abnormal changes, while traditional statistical methods focus on evaluating the evidence found, which usually requires the analyst to have some statistical grounding. Choosing between the two kinds of techniques according to business purpose and data resources, or using them together, lets us obtain more discoveries faster. For the many people in enterprises (market analysts, quality managers, and so on) who lack statistical training but face ever more data analysis tasks, attending to, learning, and making good use of exploratory data analysis can often achieve twice the result with half the effort. (End)
Copyright © 2011 JIN SHI