Previously, we discussed what Data Science is, the roles and responsibilities of data experts, and how data projects are conducted. We also mentioned data science pipelines and examined their main steps on the surface. But it is impossible to describe all aspects of each of these steps within just a few sentences; every step would require an article or a few just to grasp the basics.
So, here we continue to talk about data science projects and start with Data Exploration - the initial step of Data Mining.
What is Data Exploration?
Data Exploration is the initial step of data mining and data analysis. Its role is to help Data Science specialists (us) to inspect data, understand its structure and nature, and identify patterns or anomalies. Our goal is to extract useful and meaningful insights that will later be either presented to the end user (to clients or, maybe, to your executives) or used for the future development of intelligent automated systems.
Extracting insights is a complex yet crucial task that requires a deep understanding of the problem at hand. To find something, you first need to know what you are looking for. If you try to predict future sales, you will likely be interested in finding temporal patterns in your data, like seasonality, holiday influence, or overall growth trends. On the other hand, if you try to create a customer support bot, the listed features are of no use and you would rather spend time developing sentiment and context analysis. Or if you build a recommendation system of some sort, you will need to develop user and product profiles. The list goes on.
Yet, once you know what you are looking for and what the target variable is, half the work is already done. All that is left is to apply common Data Exploration steps, techniques, and practices. They are, for most projects you will face, somewhat universal and will just differ in implementation details. We will discuss them in the next part.
Data exploration steps and practices
Though the Data Exploration process is highly dependent on the domain and the final goals of the project, there are some common steps and practices you may encounter most of the time. We will cover them with small examples.
Data cleaning
Data cleaning is the process of removing irrelevant and/or useless data points from the dataset at hand. Without proper cleaning, we usually face multiple problems during the next steps, especially during ML model training, where the quality of the data is the main determinant of the model's performance.
This step intersects with data ingestion and preparation and in some cases may already be done by this point. The common practices are to remove duplicates and noise, treat missing values, correct typos and formatting, and other “inconveniences” within your data. Data cleaning saves you time on the next steps and ensures you use only relevant data for analysis. This step requires you to either manually look at the data sample or to use IDEs, online solutions, or programming libraries that will help you automatically detect and separate “trash” data.
Let’s look at the example of processing the Titanic Dataset. It has twelve columns that describe the passengers of the infamous Titanic ship, including whether each passenger survived the catastrophe or not.
We can see that some columns contain missing values - that is not good, because many ML models do not properly process nulls. So we need to treat them. There are multiple ways to do it depending on the nature of the data, the business task at hand, how important this feature is, the amount of data, etc. Let’s take a brief look at the most common techniques:
Remove values:
- Fully remove the column with missing values - this is, perhaps, the easiest option but also very dangerous if done carelessly. This action is advised only in cases where a column is not important (let’s say, all values are unique or represent non-general features like the user’s name) and/or the fraction of missing values is too large;
- Remove rows with missing values - this option can be utilized when the fraction of missing values is small relative to the overall dataset. But you should be cautious with that approach - sometimes missing values can represent an important category without which a model will not be able to represent the full data and will fail to deliver efficient predictions;
Filling values - this approach requires a little more work and a semantic understanding of each feature. There are multiple ways to fill values:
- fill with a constant or with a specific value - useful when a missing value represents a separate category or you have a “default” value for such cases, also you can have a conditional filling which depends on other features;
- fill with a function - you can fill the value by using different statistical functions like average, min, and mode for numbers, or top (most common option) value for strings/categories;
- backward and forward fill - technically that is also a function but it behaves in a slightly different manner. These functions are used in ordered datasets (usually such datasets have dates) with temporal dependencies between neighboring rows, e.g. in predicting the stock market or other time series. In these cases, you might want to fill in missing values from either the previous row or the next one. Basically, you are trying to extrapolate the data.
- using ML techniques - a more advanced approach to filling the missing values, but one of the most effective. Other techniques utilize only general statistics of the dataset, which often don’t properly characterize every data point. The best example is when we try to create user profiles - while many users share characteristics with one another and can form some type of group, it is not wise to give them just an average value during missing value treatment - you wouldn't want to recommend that a teenager purchase alcohol because your average user is legally allowed to purchase it. This is where ML-based techniques are useful: we find the most similar data points and treat missing values according to them. An interesting example is the use of “neighbors” ML models (like K-nearest neighbors) to find spatially close data points and check what value they have for the feature we miss.
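The simpler filling options from the list above can be sketched with pandas. This is only a minimal illustration on made-up values; the series names (prices, ports) are hypothetical, and the ML-based approach is left out on purpose.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series, ordered in time, with two gaps.
prices = pd.Series([10.0, np.nan, 14.0, np.nan, 18.0])

filled_constant = prices.fillna(0.0)        # constant / "default" value
filled_mean = prices.fillna(prices.mean())  # statistical function (mean of 10, 14, 18)
filled_forward = prices.ffill()             # copy the previous known value
filled_backward = prices.bfill()            # copy the next known value

# Categorical example: fill with the most common value (the mode).
ports = pd.Series(["S", "C", None, "S"])
filled_mode = ports.fillna(ports.mode()[0])
```

Which variant is appropriate depends on the meaning of the feature: forward and backward fill only make sense when the row order carries information.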
Let’s process missing values in our dataset:
- “Age” is a numerical column, so an average is a reasonable filling value.
- “Cabin” is a categorical/string feature with a very big fraction of missing values (77.10 %). This column also contains very instance-specific and non-general values - basically, each cabin should contain a small number of people - and would cause the model to overfit, so we can safely delete the entire column.
- The last feature with missing values is “Embarked”. It only has 0.22 % of such values. We can either fill them with a constant or simply delete them.
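The three decisions above might look like this in pandas. The rows below are invented and only mimic the Titanic column layout, so the percentages differ from the real dataset; filling “Embarked” with the most common port is one possible choice of constant, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that only mimic the Titanic column layout (not real records).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan],
    "Cabin": [None, "C85", None, None, "C123"],
    "Embarked": ["S", "C", "S", None, "S"],
})

# Share of missing values per column, in percent.
missing_pct = df.isna().mean() * 100

# "Age": fill gaps with the column average.
df["Age"] = df["Age"].fillna(df["Age"].mean())

# "Cabin": mostly empty and instance-specific, so drop the whole column.
df = df.drop(columns=["Cabin"])

# "Embarked": very few gaps, so fill with the most common port
# (dropping those rows with df.dropna(subset=["Embarked"]) is the alternative).
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
```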
After this easy step, the data is already much cleaner and won’t crash the ML model:
Data wrangling
Data wrangling is the next step representing transforming and preparing data for analysis to make it easier for models to effectively process the features. This step, or parts of it, may also be already done during the data ingestion or cleaning steps. Data wrangling includes but is not limited to such practices as:
- converting data types - let’s say your date is stored as a number or you want to remove the timezone from it;
- creating new features - for example, if we work with recommendation systems or customer analysis we would rather know the age group of the user instead of the exact date they were born. This “artificial” feature is easy to calculate and helps us unite our user base in specific categories;
- feature selection or dimensionality reduction - the reverse of the previous step. Many features might not be useful enough and just divide the attention. Simplifying the input data helps both humans and machines extract information more efficiently;
- outlier treatment - outliers can significantly alter statistics if left unhandled. Sometimes you need to remove them; in other cases you would rather mark them and use the mark as a separate feature.
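Two of the practices above, type conversion and outlier marking, can be sketched as follows. The table and its column names (order_date, amount, is_outlier) are hypothetical, and the 1.5 * IQR rule is just one common convention for flagging outliers, chosen here as an assumption.

```python
import pandas as pd

# Hypothetical order log: dates stored as integers, one suspicious amount.
orders = pd.DataFrame({
    "order_date": [20230105, 20230106, 20230107, 20230108, 20230109],
    "amount": [20.0, 22.0, 21.0, 23.0, 500.0],
})

# Convert the numeric date into a real datetime column.
orders["order_date"] = pd.to_datetime(
    orders["order_date"].astype(str), format="%Y%m%d"
)

# Mark outliers with the 1.5 * IQR rule instead of deleting them.
q1, q3 = orders["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
orders["is_outlier"] = (
    (orders["amount"] < q1 - 1.5 * iqr) | (orders["amount"] > q3 + 1.5 * iqr)
)
```

Keeping the flag as a column leaves the decision open: the rows can still be removed later, or the flag can be used as a feature.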
Going back to our example, we have a column called “Age”. If you remember history, during the evacuation women and children had a priority. In such cases, humans tend to judge a numeric value in categorical space. Age can be categorized into “age groups”, like children (<18 years old) and adults (≥18 years old). In the current step, it is very logical to categorize age in such a way, so that it reduces the chances of the model fixating on specific numerical patterns that are of no use.
Additionally, we can delete unrelated columns such as names or ticket numbers. If we don’t remove them, they may cause our future model to overfit and just remember the names of the survivors, which will make the model useless in production environments.
After this wrangling, our dataset looks like this:
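A possible pandas sketch of both wrangling steps is shown here. The rows are invented, the column names follow the public Titanic dataset, and “AgeGroup” is a hypothetical name for the new feature.

```python
import pandas as pd

# Hypothetical rows in the Titanic column layout (not real passenger records).
df = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B", "Passenger C"],
    "Ticket": ["T-001", "T-002", "T-003"],
    "Age": [8.0, 18.0, 42.0],
    "Survived": [1, 1, 0],
})

# Bins are left-closed: [0, 18) is "child", [18, inf) is "adult".
df["AgeGroup"] = pd.cut(
    df["Age"],
    bins=[0, 18, float("inf")],
    right=False,
    labels=["child", "adult"],
)

# Drop identifier-like columns the model could simply memorize.
df = df.drop(columns=["Name", "Ticket"])
```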
Descriptive statistics
Descriptive statistics is a branch of statistics that describes and summarizes features quantitatively, with the goal of further data analysis to extract useful information, knowledge, and insights. During this step, you will calculate summary statistics such as mean, median, mode, and standard deviation, percentiles and ratios, and analyze distributions. Most data can be easily described using only statistics, though not always - chaotic data require totally different approaches and qualifications.
Overall, summary statistics are useful because they provide additional information that will be used to identify trends and patterns. They can be used to identify central tendencies and variability of the data, spot unusual data points, and create clusters and groups. Distribution analysis provides insights into the nature of the data: whether it is normally or uniformly distributed, or whether the data is skewed.
The easiest and fastest way is to build a data summary table. It is often provided out-of-the-box in programming and software solutions. It usually looks something like this:
Even these simple statistics can provide useful insights into our data. For example, we can see that most people (more than 50% of passengers) traveled in passenger class 3, the lowest class available. Most people traveled alone or as couples, with only 0.52 family members on average, though there were people with a family size of 8. Also, there were infants and elderly people. Finally, only 38% of people survived this tragedy. This value can be used as a baseline that will show us whether the model’s metrics are better than a random generator.
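In pandas, such a summary table is available through describe(). The sketch below runs on a handful of invented rows that reuse the Titanic column names, so the numbers it produces are not the ones quoted above.

```python
import pandas as pd

# Hypothetical rows in the Titanic column layout (not real passenger records).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 3, 2],
    "Age": [22.0, 38.0, 26.0, 35.0, 2.0],
    "SibSp": [1, 1, 0, 0, 3],
})

# count, mean, std, min, quartiles and max for every numeric column.
summary = df.describe()

# The mean of a 0/1 column is the share of ones, here the survival rate.
survival_rate = df["Survived"].mean()

# A median equal to the maximum means at least half the rows share that value.
median_class = df["Pclass"].median()
```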
Data visualization
Data visualization is a practice of data representation in a graphical format, usually in the form of various charts. Unlike computers, humans are visual creatures. It is hard for us to analyze thousands or millions of data points. But when we put them on the canvas - billions of years of the evolution of our sight will help us find additional patterns. Scatter plots will help us spot relationships between multiple variables, box plots are useful to show statistical information about a variable divided by groups in a condensed view, and heatmaps will help us identify correlations between multiple variables and so on. It is especially useful when you need to present results to non-technical people.
Data visualization is a hard and vast topic that deserves a separate discussion, which we would like to have later. For now, let’s go through a few examples of how visualization can help us analyze the data.
1. How histograms can be used to analyze distributions.
We can see that most people are young adults in the age range of 18-35, but there is also a significant number of small children younger than 10. This may indicate that most families consisted of two adults aged 18-35 and one or more small kids.
Also, it is actually quite close to the negative binomial distribution, which is very common in statistics.
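A histogram is nothing more than counts per bin, which can be computed without any plotting library. The ages below are invented for illustration, and the ten-year bin width is an arbitrary assumption.

```python
import numpy as np

# Hypothetical ages (not real passenger records).
ages = np.array([2, 4, 9, 19, 22, 24, 27, 29, 31, 33, 40, 58])

# Ten-year-wide bins: [0, 10), [10, 20), ..., [50, 60].
counts, edges = np.histogram(ages, bins=np.arange(0, 70, 10))

# Crude text rendering of the histogram, one row per bin.
for left, right, n in zip(edges[:-1].tolist(), edges[1:].tolist(), counts.tolist()):
    print(f"{left:>2}-{right:<2} {'#' * n}")
```

A plotting library would draw the same counts as bars; the binning logic stays identical.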
2. How to analyze distributions with pie charts.
A pie chart is more suitable for categorical values, and it can also be multi-level. We can see a distribution of passenger classes and a distribution of the survivors among them:
3. Use a heat map to analyze the magnitude of the relationship between values.
The last example we will give today is a heat map. It is a matrix that shows the magnitude of a specific “phenomenon” where values of different variables intersect. With our data, we can use a heat map to count how gender determines whether a passenger survived the Titanic or not:
We can see that the count of men who died is the largest, while most of the survivors are indeed female. Such a visualization is also very useful for the next step of the EDA.
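The matrix behind such a heat map is a plain contingency table. A minimal sketch with pandas, on invented rows that reuse the “Sex” and “Survived” column names of the Titanic dataset:

```python
import pandas as pd

# Hypothetical rows (not real passenger records).
df = pd.DataFrame({
    "Sex": ["male", "male", "male", "male", "female", "female", "female"],
    "Survived": [0, 0, 0, 1, 1, 1, 0],
})

# Rows are genders, columns are outcomes, cells are passenger counts.
counts = pd.crosstab(df["Sex"], df["Survived"])
```

Coloring each cell by its count is all a heat map adds on top of this table.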
4. Visualize dataset statistics with box and violin plots.
Boxplots and violinplots are useful to show all the main statistical metrics of the numerical columns. Among these metrics are median, min/max, different percentiles, distributions (in the violins), and outliers. An example can be seen below:
Correlation analysis
Correlation analysis is a common statistical technique that measures correlation, the degree to which two variables are related. The correlation coefficient expresses the strength and direction of the relationship in the range between -1 and 1. A value of -1 indicates a perfect negative correlation - when one variable increases, the second one decreases. A value of 1 is the reverse situation - when one variable increases, the second one follows along. A coefficient of 0 indicates no linear relationship between the variables, though there still can be other types of relationships. An example of a well-known correlation: the more you smoke, the higher your chances of developing cancer.
Using correlation analysis on our data will indeed show a negative correlation between being a man and surviving the crash, or between being in the lowest class and surviving:
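As a rough sketch, the coefficients can be computed with pandas. The rows are invented, so the values differ from the real dataset; “IsMale” is a hypothetical helper column, and Pearson correlation is simply the default of corr().

```python
import pandas as pd

# Hypothetical rows in the Titanic column layout (not real passenger records).
df = pd.DataFrame({
    "Sex": ["male", "male", "male", "male", "female", "female", "female", "female"],
    "Pclass": [3, 3, 2, 1, 1, 2, 1, 3],
    "Survived": [0, 0, 0, 1, 1, 1, 1, 0],
})

# Correlation needs numbers, so encode gender as 1 = male, 0 = female.
df["IsMale"] = (df["Sex"] == "male").astype(int)

# Pearson correlation of every numeric column with the target.
corr_with_target = df[["IsMale", "Pclass", "Survived"]].corr()["Survived"]
```

Because a higher “Pclass” number means a lower class, a negative coefficient here reads as "lower class, lower chance of survival".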
Identifying Variables
The process of identifying variables focuses on pinpointing the predictor (input) variables and the target (output) variable before delving deeper into data exploration. Depending on requirements, the data type of a variable can be modified.
Single-Variable Exploration
Single-variable exploration delves into each variable individually. The approach taken in this exploration is influenced by the nature of the variable, be it continuous or categorical.
Dual-Variable Analysis
Dual-variable analysis is instrumental in uncovering the connection between two distinct variables. This analytical method is applicable to various combinations of categorical and continuous variables. The analysis employs multiple techniques to address different variable combinations, such as categorical with categorical, categorical with continuous, and continuous with continuous.
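For the categorical-with-continuous combination, one simple option is to compare a numeric summary across groups. The sketch below assumes invented rows with the Titanic columns “Pclass” and “Fare”; the group mean is only one of several summaries that could be compared.

```python
import pandas as pd

# Hypothetical rows (not real passenger records).
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Fare": [80.0, 60.0, 20.0, 30.0, 8.0, 7.0],
})

# Categorical with continuous: average fare within each passenger class.
mean_fare_by_class = df.groupby("Pclass")["Fare"].mean()
```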
Altering Variables
This concept involves modifying variables through specific functions. Three primary methods for altering variables include Logarithmic shifts, Grouping (Binning), and Root transformations (Square or Cube). Altering variables can reshape their relationship or distribution in relation to other variables. Such transformations are beneficial when there's a need to adjust a variable's scale, simplify understanding, or convert intricate non-linear associations into linear ones. A symmetrical distribution is often preferred over an asymmetrical one since it simplifies interpretation and inference-making. From a practical standpoint, variable alterations are also undertaken.
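The three kinds of transformation can be tried out on a skewed toy variable. The values, the bin edges, and the labels below are all assumptions made for illustration; log1p is used instead of a plain logarithm so that zero values stay defined.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values, e.g. ticket prices.
fare = pd.Series([0.0, 7.0, 8.0, 15.0, 80.0, 512.0])

log_fare = np.log1p(fare)   # log(1 + x), defined for zero values too
sqrt_fare = np.sqrt(fare)   # square root, a milder compression
cbrt_fare = np.cbrt(fare)   # cube root

# Binning with left-closed intervals: [0, 10), [10, 100), [100, inf).
fare_bin = pd.cut(
    fare,
    bins=[0, 10, 100, float("inf")],
    right=False,
    labels=["low", "mid", "high"],
)
```

All three numeric transforms pull the large value 512 much closer to the rest, which is what makes the distribution more symmetrical.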
Crafting Variables or Features
This entails the creation of fresh variables by leveraging existing data variables. The primary goal is to underscore connections between concealed variables. There are various methods to craft or introduce new features, including the use of derived variables and the introduction of dummy variables.
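Both methods can be sketched on a few invented rows. “SibSp”, “Parch”, and “Embarked” are column names from the public Titanic dataset, while “FamilySize” is a hypothetical derived feature introduced only for this example.

```python
import pandas as pd

# Hypothetical rows (not real passenger records).
df = pd.DataFrame({
    "SibSp": [1, 0, 3],
    "Parch": [0, 0, 2],
    "Embarked": ["S", "C", "S"],
})

# Derived variable: total family size on board, including the passenger.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Dummy variables: one 0/1 column per category value.
df = pd.get_dummies(df, columns=["Embarked"], dtype=int)
```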
Common mistakes and misconceptions
Data Exploration is a complex task that often is done incorrectly even by the most experienced specialists. This can make any work you have done meaningless or not effective. Here are some common mistakes to avoid:
- Skipping data cleaning: This step is notoriously boring and unpleasant. Yet it is like your clothes - you can have the most luxurious wardrobe, but if your clothes are dirty, no one will give you a compliment for them. “Dirty” data is useless data, and it will ruin all your efforts.
- Not understanding the data: It was mentioned at the beginning of the article that you first need to understand what you are trying to find and only then start looking. But you also need to understand the features you have. For example, you have a feature called “salary” - is it in dollars or cents? If it is in cents, but you don’t know it, you might think that every user earns tens of thousands of dollars in a month making them quite wealthy, when in fact their real salary is just average.
- Focusing too much on outliers: Outliers are important, but focusing on them too much can lead to incorrect conclusions. Don’t forget to look for them, mark or remove them depending on the need, but most importantly - follow the overall trends in the data.
- Not using visualizations: Again, we are humans, not computers. You can scroll for days through your data with hundreds of features trying to find every pattern, or you can utilize your fine-tuned visual system to spot patterns in a few minutes or seconds.
- Not understanding your visualizations: Visualization is a powerful tool for our monkey brains. But you also need to understand what they show and if they are built correctly. Make sure to take the time to carefully analyze and interpret your visualizations, and don't hesitate to seek clarification or further explanation if needed.
- Confusing correlation with causation: Correlation does not imply causation. Two variables can move together because of a third factor or by coincidence, so carefully consider the relationship between variables and avoid making incorrect assumptions.
- Not documenting the process: Documenting the Data Exploration process is essential for replicating and verifying results. Many attempts can fail or you may just forget the idea behind the action. Make sure to keep detailed notes and record all the steps you take. Also, it will help you present the results to the end user.
- Overlooking Potential Outcomes: In their zeal, data aficionados sometimes neglect to consider all potential outcomes for a solution. It's a misconception to assume that input A will always yield output B. Decisions should be informed by a spectrum of potential outcomes, not just one.
- Overemphasizing Data Alone: Relying solely on data is a common pitfall. After gathering data, some jump into implementation without contemplating the broader implications for the project. Beyond just data, understanding statistical methodologies and other parameters is crucial for an effective data exploration process.
Summary
We continue our sequence of articles about data science and its components. Here we have covered the basics of one of the most important steps in any data science and data mining project - Data Exploration. This step is important because the insights generated by it are the key to any successful machine-learning model. Common wisdom says that you can’t teach another person a subject you do not understand yourself. The same is true for us trying to teach machines - if we, the smartest creatures on Earth with billions of years of evolution, can’t understand at least the basics of the data and the goal, then how can we ask the same from computers, which have not yet had the opportunity to evolve as we did?
We are planning to continue the ‘Introduction to Data Science’ article sequence and dig deeper into other steps of the analytics project.
Check out the previous article where we have discussed the expertise required to tackle a data science project, as well as the stages and roles involved: Data Science: Bridging the Gap between Raw Data and Business Insights