Data Science Workflow

There is no fixed framework or template for solving data science problems.
The strategy changes with every new problem and project, but the steps we follow to solve the problem are broadly similar across many different problem statements.

This is the high-level workflow that is widely used in industry for all types of problem statements.

Here we will list out the steps:


  • Problem Statement
  • Import Data
  • Data exploration and Data Cleaning
  • Modelling
  • Model Adequacy
  • Report Build

Step 1 : Problem statement

The problem statement is the key to your modelling. If you have no idea what you want to do with your data, you cannot proceed any further with your data set. Defining your problem is the first requirement for moving ahead.

Step 2 : Importing the data

The next step is importing your data into the platform where you will analyse it or build models on it. Where can you import data from? You can import it from any database or from a CSV file located anywhere on your system. Data can be structured or unstructured.

Structured Data: As the name suggests, structured data is arranged in a defined format, for example ordered by date or by size (increasing or decreasing), so information can be read from it quickly. Data in an Excel sheet, for example, is structured data.

Unstructured Data: Data whose information is scattered, or which is hard to read without prior knowledge (e.g. emails, images, text documents). On its own it does not give complete information, and it does not fit neatly into the rows and columns of a relational database (e.g. SQL).
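As a rough illustration of the import step, the sketch below reads the same small structured data set once from CSV text and once from a database. It uses only the Python standard library so that it runs anywhere. The column names x and y, the table name marketing, and the helper names are made up for this example, and an in-memory SQLite database stands in for whatever database you actually use.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content; in practice this would be a file on your system.
CSV_TEXT = "x,y\n1,1\n2,1\n3,2\n4,2\n5,4\n"


def load_from_csv(text):
    """Parse CSV text into a list of rows with numeric values."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"x": float(row["x"]), "y": float(row["y"])} for row in reader]


def load_from_database(conn):
    """Read the same columns from a (hypothetical) table named marketing."""
    cursor = conn.execute("SELECT x, y FROM marketing ORDER BY x")
    return [{"x": float(x), "y": float(y)} for x, y in cursor]


# Build a throwaway in-memory database so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE marketing (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO marketing VALUES (?, ?)",
    [(1, 1), (2, 1), (3, 2), (4, 2), (5, 4)],
)

csv_rows = load_from_csv(CSV_TEXT)
db_rows = load_from_database(conn)
```

Both sources end up as the same list of rows, which is the point of the import step: once the data is in one common shape, the rest of the workflow does not care where it came from.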

Step 3: Data exploration and Data cleaning

Data Exploration:

Now that the data is imported into the platform, let's check the data frame to see what kinds of variables it contains. While exploring the data frame, three questions come up, depending on the behaviour of the variables: is it a classification problem or a regression problem? Is it supervised or unsupervised learning? Is the goal prediction or inference?

Let's briefly discuss each of these:

Classification or Regression problem: To decide this, we can check the output variable directly to see whether it is continuous or categorical. Categorical data takes a limited set of values (for example binary or Boolean values), while continuous data is numerical. If the output variable is categorical, we go for classification; if it is continuous, it is a regression problem.
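The check on the output variable can be sketched in a few lines. This is a deliberately simple heuristic and not a complete rule: the function name problem_type is our own, and the sketch assumes that categories are stored as strings or Booleans. Categories coded as integers (0, 1, 2, ...) would be misread as continuous and need to be handled separately.

```python
from numbers import Number


def problem_type(target_values):
    """Return 'classification' or 'regression' based on the output variable."""
    for value in target_values:
        # bool is a subclass of int in Python, so test for it before numbers.
        if isinstance(value, (bool, str)):
            return "classification"
        if not isinstance(value, Number):
            raise TypeError(f"unsupported target value: {value!r}")
    # Every value was numeric, so treat the output as continuous.
    return "regression"


churned = ["yes", "no", "no", "yes"]        # categorical output
clicked = [True, False, True]               # Boolean output
sales = [12.5, 14.1, 9.8, 20.3]             # continuous output

kinds = [problem_type(churned), problem_type(clicked), problem_type(sales)]
```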

Supervised or unsupervised learning: In supervised learning our variables are labelled as either dependent or independent. If a dependent variable is defined in our data frame, then we are doing supervised learning, because the model learns from the given examples. A supervised learning problem can be a regression or a classification problem. In unsupervised learning there is no defined dependent variable; the data contains only independent variables, so we go for clustering or association models.

Prediction or Inference: For regression problems, suppose we want to predict the value of the dependent variable for a new value of the independent variable. Let's take an example of marketing data:

X (independent variable)    Y (dependent variable)
1                           1
2                           1
3                           2
4                           2
5                           4

Suppose we feed our model x = 6 and want to predict the value of y for that x. Given this x value, our model returns a predicted value, say y = 7. In inference, on the other hand, we want to know how X affects our dependent variable.
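To make the difference concrete, here is a minimal sketch that fits a straight line by ordinary least squares. It assumes the marketing data above reads as the pairs (1, 1), (2, 1), (3, 2), (4, 2), (5, 4), and the function name fit_line is our own. The predicted value always depends on the model you choose. With this particular line, the prediction for x = 6 comes out at roughly 4.1.

```python
xs = [1, 2, 3, 4, 5]   # X, independent variable (assumed reading of the table)
ys = [1, 1, 2, 2, 4]   # Y, dependent variable


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(xs, ys)

# Prediction: what y do we expect for a new x?
predicted_y = intercept + slope * 6

# Inference: how does X affect Y? The slope is the answer for this model.
change_in_y_per_unit_x = slope
```

The slope of roughly 0.7 is the inference part: in this fitted line, each extra unit of X is associated with about 0.7 more Y.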

Data Cleaning

Data cleaning means checking the data types and getting all the values into the correct format. This can involve stripping characters from strings or converting floats to integers. There may also be missing values in the data set, which you have to handle by filling in (imputing) or deleting values.
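The cleaning operations mentioned above can be sketched as follows. The raw values and the helper names (clean_price, fill_missing, drop_missing) are invented for illustration, and filling gaps with the mean is only one of several possible choices.

```python
def clean_price(raw):
    """Turn strings such as ' $1,200 ' into 1200.0; blanks become None."""
    text = raw.strip().replace("$", "").replace(",", "")
    return float(text) if text else None


def fill_missing(values):
    """Replace missing values (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def drop_missing(values):
    """Alternative strategy: delete the missing values instead."""
    return [v for v in values if v is not None]


raw_prices = [" $1,200 ", "950", "", "$1,450"]   # hypothetical raw column
parsed = [clean_price(p) for p in raw_prices]

filled = fill_missing(parsed)
dropped = drop_missing(parsed)

# Converting floats to integers where the column is really a count.
units = [int(u) for u in [3.0, 5.0, 2.0]]
```

Whether to fill or delete depends on how much data is missing and why; the sketch only shows that both are a line or two of code once the values have been parsed.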

Step 4 : Modelling
The choice of model depends on the type of data, continuous or categorical: if the output is continuous, apply regression modelling; if it is categorical, apply classification models such as logistic regression.

As a data scientist you will try lots of models to find the best-fitting one. I prefer to build linear regression models on continuous data, and logistic regression or other classification models on categorical data. I also like K-Nearest Neighbors (K-NN) for classification. If you are not getting satisfactory results, you can move on to neural networks, which can be used for both supervised and unsupervised learning.
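Since K-NN comes up here, this is a minimal sketch of how it classifies a new point: find the k closest training points and take a majority vote over their labels. The toy points, the labels "low" and "high", and the function name knn_predict are assumptions for illustration; a real project would normally rely on a tested library implementation.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Two well-separated groups of hypothetical customers.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["low", "low", "low", "high", "high", "high"]

near_first_group = knn_predict(points, labels, (2, 2))
near_second_group = knn_predict(points, labels, (6, 5))
```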

Other problems can occur while building models. In a regression model there can be a multicollinearity problem, where the independent variables are strongly correlated with each other, and there are various techniques for dealing with it. In classification there can be a multi-class problem, where the output variable has more than two categories.
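One simple way to screen for multicollinearity is to look at the pairwise correlation between the independent variables. The sketch below flags pairs whose Pearson correlation is very high. It is only one check among several, and the feature names, the toy numbers and the 0.9 threshold are assumptions chosen for illustration.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlated_pairs(features, threshold=0.9):
    """Return (name, name, r) for every pair with |r| at or above threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(features.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Hypothetical predictors: the second is nearly a rescaled copy of the first.
features = {
    "tv_spend": [1, 2, 3, 4, 5, 6],
    "tv_spend_k": [2.1, 3.9, 6.2, 8.0, 9.8, 12.1],
    "store_size": [3, 1, 2, 3, 1, 2],
}

flagged = correlated_pairs(features)
```

Here only the pair tv_spend and tv_spend_k is flagged, which suggests keeping just one of the two in a regression model.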

Several other problems can occur while building models, which I am not going to cover here.

Step 5 : Model Adequacy check
After building models we need to check how adequate each model is, that is, how well it fits our data. For a regression model, for example, there are several methods for selecting the best model:

  • Forward selection method
  • Backward selection method
  • Step-wise selection method

Again, I am not going into detail about how these methods work.
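For a rough picture of the idea, though, here is a small sketch of forward selection: start with no variables, repeatedly add the variable that reduces the residual sum of squares (RSS) the most, and stop when the gain becomes negligible. The stopping rule (a minimum gain of 1% of the total sum of squares), the function names and the toy data are assumptions for this example. Statistical packages usually use criteria such as p-values or AIC instead.

```python
def fit_rss(columns, y):
    """Least-squares fit of y on the columns plus an intercept; return the RSS."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented normal equations: (X'X | X'y).
    A = [
        [sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
        + [sum(X[i][r] * y[i] for i in range(n))]
        for r in range(p)
    ]
    # Gaussian elimination with partial pivoting.
    for k in range(p):
        pivot = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[pivot] = A[pivot], A[k]
        for r in range(k + 1, p):
            factor = A[r][k] / A[k][k]
            for c in range(k, p + 1):
                A[r][c] -= factor * A[k][c]
    # Back substitution for the coefficients.
    coef = [0.0] * p
    for k in reversed(range(p)):
        known = sum(A[k][c] * coef[c] for c in range(k + 1, p))
        coef[k] = (A[k][p] - known) / A[k][k]
    return sum(
        (y[i] - sum(X[i][j] * coef[j] for j in range(p))) ** 2 for i in range(n)
    )


def forward_selection(features, y, min_gain=0.01):
    """Greedily add the feature that lowers the RSS most, until gains are small."""
    total = fit_rss([], y)  # intercept-only model: total sum of squares
    selected, best_rss = [], total
    remaining = list(features)
    while remaining:
        scores = {
            name: fit_rss([features[f] for f in selected + [name]], y)
            for name in remaining
        }
        best_name = min(scores, key=scores.get)
        if best_rss - scores[best_name] <= min_gain * total:
            break
        selected.append(best_name)
        remaining.remove(best_name)
        best_rss = scores[best_name]
    return selected


# Toy data constructed so that y = 2 * tv + radio; "noise" is unrelated.
features = {
    "tv": [1, 2, 3, 4, 5, 6],
    "radio": [2, 1, 4, 3, 6, 5],
    "noise": [3, 1, 2, 3, 1, 2],
}
y = [4, 5, 10, 11, 16, 17]

chosen = forward_selection(features, y)
```

On this toy data the procedure picks tv first, then radio, and leaves noise out. Backward selection works the other way round, starting from all variables and removing them one at a time, and step-wise selection combines both moves.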

Step 6 : Report Build
After all the steps above, we need to pull the results together and build a report (or presentation) that shows our progress to the people concerned.

That's it for now. Stay tuned for upcoming articles on neural networks, which are on their way. To get the latest updates, follow my blog.

For any questions, please comment below or send me an email at khanirfan.khan21@gmail.com.
