The data science life cycle consists of 7 phases. In this post, we will go through each of them briefly.
The following infographic depicts the different phases in the data science life cycle.
This phase is one of the most important aspects of any data science project. Any data science and machine learning project typically begins with problem definition: before anything else, we need a clear understanding of the problem we are trying to solve using data science.
Based on the identified problem, we then select an approach suitable for solving it.
Now that we have clearly defined the problem, we need to collect the required data for our data science project.
We need to collect as much relevant data as possible.
We may need to acquire data from a variety of sources: querying a database, using publicly available datasets, web scraping, or calling web APIs.
In some circumstances, the data may not yet be available; in such cases, we need to capture and store the data.
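As a small sketch of the database route, here is how collection might look with Python's built-in sqlite3 module. The table name, columns, and values are invented for illustration; in a real project you would connect to your own database.

```python
import sqlite3

# Hypothetical example: an in-memory database standing in for a real one.
# The "sales" table and its rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.5), ("north", 95.0)],
)

# Pull the raw records we want to analyze into Python.
rows = conn.execute("SELECT region, amount FROM sales").fetchall()
print(rows)
```

The same pattern applies whatever the source is: issue a query (or an API call), and land the raw records somewhere you can preprocess them.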
This phase is also referred to as the data cleaning phase. Once we have gathered all the necessary data, the next step is to preprocess it.
It’s essential to understand that the data collected is rarely available in the desired form.
Some of the data we have collected may be redundant, contain errors, or be duplicated.
We need to clean the data and make it into a suitable form.
In many cases, the collected data is spread across multiple datasets and may be in different formats. The first thing we need to do then is to transform the data and merge it into a single dataset.
All this being said, data preparation is one of the most tedious and time-consuming steps.
A survey of data scientists found that they spend the majority of their time (almost 60%) cleaning and organizing data.
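A minimal sketch of two of the preparation steps mentioned above, deduplicating records and merging two sources into one dataset, using plain Python. The datasets here are invented for illustration.

```python
# Two toy data sources that need to be combined. The records are made up.
customers = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Bob"},
    {"id": 2, "name": "Bob"},   # a duplicated record
]
orders = {1: 3, 2: 5}           # id -> order count, from a second source

# Step 1: drop exact duplicates while preserving the original order.
seen, cleaned = set(), []
for row in customers:
    key = tuple(row.items())
    if key not in seen:
        seen.add(key)
        cleaned.append(row)

# Step 2: merge the two sources into a single dataset keyed on "id".
merged = [{**row, "orders": orders[row["id"]]} for row in cleaned]
print(merged)
```

In practice a library such as pandas handles both steps (dropping duplicates, joining tables) with far less code, but the underlying operations are the same.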
Now that we have cleaned our data, it is ready to be used for analysis and model building.
In this phase, we extract the information hidden in our data to find interesting patterns and relationships. We try to get an insight into how our variables relate to each other and what their distributions look like, building a good overview of what's going on in our data.
This Wikipedia page has a list of various visualization tools for exploratory data analysis.
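Before reaching for visualization tools, a first look at a variable's distribution can be as simple as a few summary statistics. A sketch using Python's built-in statistics module, on invented sample values:

```python
import statistics

# A quick numeric look at one variable: central tendency, spread, and
# range. The sample ages are made up for illustration.
ages = [23, 25, 29, 31, 31, 35, 40, 62]

print("mean:   ", statistics.mean(ages))
print("median: ", statistics.median(ages))
print("stdev:  ", round(statistics.stdev(ages), 2))
print("min/max:", min(ages), max(ages))
```

A mean noticeably above the median, as here, already hints at a right-skewed distribution worth plotting.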
This phase is a crucial part of data science and machine learning, as it directly influences the performance of predictive models.
Feature engineering requires expertise and excellent domain knowledge of the data as it involves transforming raw data into more informative features.
Examples of feature engineering include normalizing numerical data so that the inputs are on the same scale, and transforming text data into numerical form.
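The two examples above can be sketched in a few lines of plain Python. The values and vocabulary here are invented; this is the idea, not a production pipeline.

```python
# Min-max normalization: rescale numeric values into the [0, 1] range
# so that features with different units end up on the same scale.
incomes = [30_000, 45_000, 60_000, 90_000]   # made-up values
lo, hi = min(incomes), max(incomes)
scaled = [(x - lo) / (hi - lo) for x in incomes]
print(scaled)  # [0.0, 0.25, 0.5, 1.0]

# Bag-of-words: represent text as word counts over a fixed vocabulary,
# turning a sentence into a numeric feature vector.
vocab = ["data", "science", "model"]         # made-up vocabulary
sentence = "data science uses data to build a model"
counts = [sentence.split().count(w) for w in vocab]
print(counts)  # [2, 1, 1]
```

Libraries such as scikit-learn provide these transforms ready-made (e.g. min-max scalers and count vectorizers), but the features they produce are exactly these.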
Several machine learning algorithms help us model the data. However, which model to build is another big question.
There are mainly two types of modeling techniques: supervised and unsupervised.
In supervised learning, the training set is labeled, meaning it has both the input data and the desired output.
In unsupervised learning, by contrast, the data points are uncategorized, i.e., they have no target attribute associated with them. Instead, we work through the observations to find structure in the data.
We need to experiment with different machine learning algorithms to find out which works best for our data.
Some of the factors affecting the choice of the model are
- accuracy of the model
- amount of data in hand
- time and space constraints
- scalability of the model
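As a toy supervised example, here is a 1-nearest-neighbor classifier written from scratch: fit on labeled data, then predict a label for a new point by copying the label of its closest training example. The data is invented, and in practice you would experiment with several algorithms via a library such as scikit-learn rather than hand-rolling one.

```python
def predict_1nn(train, point):
    """Return the label of the training example closest to `point`
    (squared Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    features, label = min(train, key=lambda ex: dist(ex[0], point))
    return label

# Labeled training set: (features, label) pairs, made up for illustration.
train = [((1.0, 1.0), "small"), ((1.2, 0.9), "small"),
         ((8.0, 8.5), "large"), ((9.0, 7.5), "large")]

print(predict_1nn(train, (1.1, 1.0)))  # small
print(predict_1nn(train, (8.5, 8.0)))  # large
```

Swapping this model for another (a decision tree, a linear model) while keeping the same train/predict interface is exactly the kind of experimentation the factors above guide.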
In this phase, we need to communicate our key findings with the stakeholders very clearly.
The findings are of little value if only the person who made the analysis can interpret them.
Stakeholders may not always be familiar with data science parlance.
Merely showing the numbers may not make any sense to them.
So we need to communicate the results visually using the most suitable visualization techniques for effective interpretation by stakeholders.
In this post, we have briefly discussed the different phases in the data science life cycle. Let's review all seven phases:
Problem Definition: Define the problem you are trying to solve using data science.
Data Collection: Collect as much relevant data as possible.
Data Preparation: Clean the data and make it into a desirable form.
Exploratory Data Analysis: Use visualization tools to explore the data and find interesting patterns.
Feature Engineering: Transform raw data into more informative features.
Model Building: Experiment with different machine learning algorithms to find out which works best for our data.
Communicate Results: Communicate the results visually using the most suitable visualization techniques for effective interpretation by stakeholders.