Previously we discussed what data science is, the roles and responsibilities of data experts, and how data projects are conducted. We also mentioned data science pipelines and examined their main steps on the surface. But it is impossible to describe every aspect of each step in just a few sentences; every step would require an article or two just to grasp the basics.
So, here we continue to talk about data science projects and start with Data Exploration - the initial step of Data Mining.
Data Exploration is the initial step of data mining and data analysis. Its role is to help data science specialists (us) inspect the data, understand its structure and nature, and identify patterns or anomalies. Our goal is to extract useful, meaningful insights that will later either be presented to the end user (clients or, perhaps, your executives) or used to build intelligent automated systems.
Extracting insights is a complex yet crucial task that requires a deep understanding of the problem at hand. To find something, you first need to know what you are looking for. If you are trying to predict future sales, you will likely be interested in temporal patterns in your data, such as seasonality, holiday effects, or overall growth trends. On the other hand, if you are building a customer support bot, those features are of no use, and you would rather spend your time on sentiment and context analysis. And if you are building a recommendation system of some sort, you will need to develop user and product profiles. The list goes on.
Yet, once you know what you are looking for and what the target variable is, half the work is already done. All that is left is to apply the common Data Exploration steps, techniques, and practices. For most projects you will face, they are fairly universal and differ only in implementation details. We will discuss them in the next part.
Though the Data Exploration process depends heavily on the domain and the final goals of the project, there are some common steps and practices you will encounter most of the time. We will cover them with small examples.
Data cleaning is the process of removing irrelevant and/or useless data points from the dataset at hand. Without proper cleaning, we usually face multiple problems during the next steps, especially during ML model training, where data quality is the main determiner of model performance.
This step intersects with data ingestion and preparation and in some cases may already be done by this point. The common practices are removing duplicates and noise, treating missing values, correcting typos and formatting, and fixing other “inconveniences” in your data. Data cleaning saves you time in the next steps and ensures you use only relevant data for analysis. It requires you to either inspect a data sample manually or to use IDEs, online solutions, or programming libraries that help you automatically detect and separate “trash” data.
Let’s look at an example of processing the Titanic dataset. It has twelve columns describing the characteristics of the passengers of the infamous Titanic, including whether each passenger survived the catastrophe or not.
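Below is a minimal sketch of how such a first inspection could look in pandas; the file name titanic_train.csv is illustrative and refers to the common Kaggle version of the dataset:

```python
import pandas as pd

# Load the Titanic passenger data (the file name is illustrative).
df = pd.read_csv("titanic_train.csv")

# First look at the data: a few rows, column types, and missing values.
print(df.head())
print(df.info())        # column types and non-null counts
print(df.isna().sum())  # number of missing values per column
```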
We can see that some columns contain missing values. That is not good, because many ML models cannot properly process nulls, so we need to treat them. There are multiple ways to do it depending on the nature of the data, the business task at hand, how important the feature is, the amount of data, etc. Let’s take a brief look at the most common techniques:
Removing values - the simplest approach: drop the rows (or entire columns) that contain missing data. It works when the affected rows are few or the column is almost empty; otherwise you lose valuable information.
Filling values - this approach requires a little more work and a semantic understanding of each feature. There are multiple ways to fill values: with a constant, with the column’s mean, median, or mode, or with values predicted from the other features.
Let’s process missing values in our dataset:
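One possible way to do it with pandas is shown below; the column names (Age, Embarked, Cabin) follow the standard Titanic schema, and the exact imputation choices are a judgment call rather than the only correct option:

```python
# Fill the numeric "Age" with the median, which is robust to outliers.
df["Age"] = df["Age"].fillna(df["Age"].median())

# Fill the categorical "Embarked" with the most frequent port.
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# "Cabin" is missing for most passengers, so we drop the column entirely.
df = df.drop(columns=["Cabin"])

print(df.isna().sum())  # verify that no missing values remain
```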
After this easy step, the data is already much cleaner and won’t break the ML model:
Data wrangling is the next step: transforming and preparing the data for analysis so that models can process the features effectively. This step, or parts of it, may also already be done during the data ingestion or cleaning steps. Data wrangling includes, but is not limited to, practices such as converting data types, encoding categorical features, binning numeric values, scaling and normalization, and merging data from multiple sources.
Going back to our example, we have a column called “Age”. If you remember the history, women and children had priority during the evacuation. In such cases, humans tend to judge a numeric value in categorical terms: age can be split into “age groups”, like children (<18 years old) and adults (≥18 years old). At this step it makes sense to categorize age this way, which reduces the chances of the model fixating on specific numerical patterns that carry no useful signal.
Additionally, we can delete unrelated columns such as names or ticket numbers. If we don’t remove them, they may cause our future model to overfit and simply memorize the names of the survivors, which would make the model useless in a production environment.
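A possible sketch of these two transformations in pandas, with the 18-year threshold taken from the age groups described above:

```python
# Turn the numeric "Age" into two categories: children (<18) and adults (>=18).
df["AgeGroup"] = pd.cut(
    df["Age"],
    bins=[0, 18, df["Age"].max() + 1],
    labels=["child", "adult"],
    right=False,  # [0, 18) -> child, [18, max] -> adult
)

# Drop identifier-like columns that carry no generalizable signal.
df = df.drop(columns=["Name", "Ticket"])
```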
After these wrangling steps, our dataset looks like this:
Descriptive statistics is a branch of statistics that quantitatively describes and summarizes features, with the goal of extracting useful information, knowledge, and insights during further analysis. At this step you calculate summary statistics: mean, median, mode, and standard deviation; percentiles and ratios; you analyze distributions; and so on. Most data can be described reasonably well with statistics alone, though not always - chaotic data requires totally different approaches and qualifications.
Overall, summary statistics are useful because they provide additional information that helps identify trends and patterns. They can be used to measure the central tendency and variability of the data, spot unusual data points, and create clusters and groups. Distribution analysis provides insights into the nature of the data: whether it is normally or uniformly distributed, or whether it is skewed.
The easiest and fastest way is to build a data summary table. It is often provided out of the box in programming and software solutions, and usually looks something like this:
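In pandas, for instance, a single call builds such a table from the dataframe we prepared above (a sketch):

```python
# Summary statistics for the numeric columns:
# count, mean, std, min, max and the 25th/50th/75th percentiles.
print(df.describe())

# include="all" also covers categorical columns
# (count, number of unique values, most frequent value).
print(df.describe(include="all"))
```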
Even these simple statistics provide useful insights into our data. For example, we can see that more than 50% of the passengers were in passenger class #3, the lowest class available. On average, people travelled “single” or as “couples”, with only 0.52 family members per passenger, though some passengers had a family size of 8. There were also infants and elderly people on board. Finally, only 38% of the passengers survived the tragedy; this value can be used as a baseline that shows whether the model’s metrics are better than random guessing.
Data visualization is the practice of representing data in a graphical format, usually in the form of various charts. Unlike computers, humans are visual creatures. It is hard for us to analyze thousands or millions of raw data points, but when we put them on a canvas, the long evolution of our sight helps us find additional patterns. Scatter plots help us spot relationships between variables, box plots show statistical information about a variable split by groups in a condensed view, heatmaps help us identify correlations between multiple variables, and so on. Visualization is especially useful when you need to present results to non-technical people.
Data visualization is a large and complex topic that deserves a separate discussion, which we would like to have in the future. For now, let’s go through a few examples of how visualization can help us analyze the data.
1. How histograms can be used to analyze distributions.
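As a sketch, the age distribution can be plotted with seaborn/matplotlib, using the numeric Age column from the earlier steps:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram of passenger ages, using 20 bins.
sns.histplot(df["Age"].dropna(), bins=20)
plt.xlabel("Age")
plt.ylabel("Number of passengers")
plt.title("Distribution of passenger ages")
plt.show()
```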
We can see that most people are young adults in the 18-35 age range, but there is also a significant number of small children younger than 10. This suggests that many families consisted of two adults aged 18-35 and one or two small kids.
Also, the distribution is actually quite close to a negative binomial distribution, which is very common in statistics.
2. How to analyze distributions with pie charts.
A pie chart is more suitable for categorical values, and it can also be multi-level. Here we can see the distribution of passenger classes and the distribution of survivors among them:
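A minimal single-level sketch with matplotlib is shown below; a multi-level (nested) version would add a second ring with the survivors per class:

```python
import matplotlib.pyplot as plt

# Share of passengers in each class.
class_counts = df["Pclass"].value_counts().sort_index()

plt.pie(
    class_counts,
    labels=[f"Class {c}" for c in class_counts.index],
    autopct="%1.1f%%",  # print the percentage on each slice
)
plt.title("Passengers by class")
plt.show()
```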
3. Use heat maps to analyze the magnitude of the relationship between values.
Another useful chart is the heat map: a matrix that shows the magnitude of a specific “phenomenon” at the intersection of values of different variables. With our data, we can use a heat map to count how gender relates to whether a passenger survived the Titanic or not:
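A sketch of how this can be done: cross-tabulate Sex against Survived with pandas and render the counts with seaborn.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Count passengers for every (gender, survived) combination.
counts = pd.crosstab(df["Sex"], df["Survived"])

sns.heatmap(counts, annot=True, fmt="d", cmap="Blues")
plt.title("Survival counts by gender")
plt.show()
```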
We can see that the largest group is men who did not survive, while most of the survivors are indeed women. Such a visualization is also very useful in the next step of the EDA.
4. Visualize dataset statistics with box and violin plots.
Box plots and violin plots are useful for showing the main statistical metrics of numerical columns: median, min/max, various percentiles, distribution shape (in the violins), and outliers. An example can be seen below:
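For instance, the age distribution per passenger class can be drawn side by side as a box plot and a violin plot (a sketch using seaborn):

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Box plot: median, quartiles, whiskers and outliers per class.
sns.boxplot(data=df, x="Pclass", y="Age", ax=axes[0])
axes[0].set_title("Age by passenger class (box plot)")

# Violin plot: the same statistics plus the estimated distribution shape.
sns.violinplot(data=df, x="Pclass", y="Age", ax=axes[1])
axes[1].set_title("Age by passenger class (violin plot)")

plt.show()
```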
Correlation analysis is a common statistical technique that measures correlation: the degree to which two variables are related. The correlation coefficient measures the strength and direction of the relationship on a scale from -1 to 1. A value of -1 indicates a perfect negative correlation: when one variable increases, the other decreases. A value of 1 is the reverse: when one variable increases, the other follows along. A coefficient of 0 indicates no linear relationship between the variables, though there can still be other types of relationships. A well-known example of correlation: the more you smoke, the higher your chances of developing cancer.
Running correlation analysis on our data will indeed show a negative correlation between being male and surviving the crash, and between being in the lowest class and surviving:
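A sketch of such an analysis with pandas: encode gender as a number and compute the Pearson correlation matrix for a few columns of interest.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Encode gender numerically so it can enter the correlation matrix.
numeric = df.copy()
numeric["Sex"] = numeric["Sex"].map({"male": 1, "female": 0})

# Pearson correlation between the selected numeric columns.
corr = numeric[["Survived", "Pclass", "Sex", "Age", "Fare"]].corr()

sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation matrix")
plt.show()
```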
Identifying variables means pinpointing the predictors, both input and output variables, in order to explore the data more deeply. Depending on the requirements, the data type of a variable can be modified.
Single-variable (univariate) exploration examines each variable individually. The approach depends on the nature of the variable: continuous or categorical.
Dual-variable (bivariate) analysis helps uncover the connection between two distinct variables. It is applicable to various combinations of categorical and continuous variables and employs different techniques for each combination: categorical with categorical, categorical with continuous, and continuous with continuous.
Variable transformation modifies variables through specific functions. Three primary methods are logarithmic shifts, grouping (binning), and root transformations (square or cube root). Transformations can reshape a variable’s distribution or its relationship to other variables. They are useful when you need to change a variable’s scale, simplify its interpretation, or convert intricate non-linear associations into linear ones. A symmetrical distribution is often preferred over a skewed one, since it simplifies interpretation and inference. Transformations are also frequently applied for purely practical, implementation-driven reasons.
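As an illustration on the Titanic data, the heavily skewed Fare column could be log-transformed and binned; the new column names are illustrative:

```python
import numpy as np
import pandas as pd

# Logarithmic shift: compresses the long right tail of ticket prices.
# log1p is used because some fares are exactly 0.
df["LogFare"] = np.log1p(df["Fare"])

# Grouping (binning): split fares into four equally populated quartile groups.
df["FareBand"] = pd.qcut(df["Fare"], q=4, labels=["low", "mid", "high", "top"])
```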
Feature creation entails building new variables from the existing ones. The primary goal is to surface connections hidden in the raw variables. There are various ways to craft or introduce new features, including derived variables and dummy variables.
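On the Titanic data, a classic derived variable is family size, and dummy variables can replace a categorical column such as the port of embarkation (a sketch; the new column names are illustrative):

```python
import pandas as pd

# Derived variable: total size of the family travelling together
# (siblings/spouses + parents/children + the passenger themselves).
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Dummy (one-hot) variables for the port of embarkation.
df = pd.get_dummies(df, columns=["Embarked"], prefix="Port")
```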
Data Exploration is a complex task that is often done incorrectly, even by experienced specialists, and that can render all the work you have done meaningless or ineffective. Here are some common mistakes to avoid:
We continue our series of articles about data science and its components. Here we have covered the basics of one of the most important steps in any data science and data mining project: Data Exploration. This step is important because the insights it generates are the key to any successful machine-learning model. Common wisdom says that you can’t teach another person a subject you do not understand yourself. The same is true when we try to teach machines: if we, the smartest creatures on Earth with billions of years of evolution behind us, can’t understand at least the basics of the data and the goal, how can we expect the same from computers, which have not yet had the chance to evolve as we did?
We plan to continue the ‘Introduction to Data Science’ article series and dig deeper into the other steps of an analytics project.
Check out the previous article where we have discussed the expertise required to tackle a data science project, as well as the stages and roles involved: Data Science: Bridging the Gap between Raw Data and Business Insights