Machine Learning in a Nutshell :-
According to Arthur Samuel (1959), Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed.
Steps in Machine Learning:
1) Data Extraction
2) Data Cleansing & Transformations
3) Data Preparations
4) Model Selection
5) Train the Model (Train data & Test data)
6) Measure accuracy of the model
7) Deploy the model
8) Tune/Rebuild the model
1) Data Extraction:
Data can come from social media, RDBMS, NoSQL stores, streaming data, data files, web crawling data, and clickstream data.
2) Data Cleansing:
This is the process of removing unwanted data, for example:
a) Removing duplicate records.
b) Cleaning null values.
c) Transformations: Raw data coming from different sources does not always fit the model, and the variables (labels) may not fit the model either, so they need to be transformed into a form the model can use.
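The three cleansing steps above can be sketched in plain Python. This is only a minimal illustration: the records, the column names, and the GENDER_CODES mapping are made up for the example and do not come from any real data set.

```python
# Hypothetical raw records; None marks a missing (null) value.
raw_rows = [
    {"age": 34, "gender": "Male", "income": 52000},
    {"age": 34, "gender": "Male", "income": 52000},   # exact duplicate
    {"age": 29, "gender": "Female", "income": None},  # null value
    {"age": 41, "gender": "Female", "income": 61000},
]


def remove_duplicates(rows):
    """Keep only the first occurrence of each identical row."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def drop_nulls(rows):
    """Drop every row that has at least one missing value."""
    return [row for row in rows if all(v is not None for v in row.values())]


def encode_gender(rows, codes):
    """Transform a text column into a number the model can use."""
    return [{**row, "gender": codes[row["gender"]]} for row in rows]


GENDER_CODES = {"Male": 0, "Female": 1}  # assumed mapping, for illustration only

clean_rows = encode_gender(drop_nulls(remove_duplicates(raw_rows)), GENDER_CODES)
print(clean_rows)
```

Dropping rows is only one way to clean null values; filling them with a default or an average is another option, depending on the data.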
3) Data Preparations:
The data is prepared differently depending on whether the model is supervised or unsupervised:
–> A supervised model needs input variables & target labels
–> An unsupervised model needs only input variables, without labels
Train set, Test set, Validation set :-
Suggestions for the Train set:-
1. Provide as many input examples as possible
2. Don’t put all the data into the train set (use 80% or 70% of it)
3. Keep the remaining 20% or 30% of the data in the test set
4. Before training the model, shuffle the train set (so that the order of the rows does not bias the model)
5. Normalize variables (for non-negative values, divide every value by the max value to bring it into the 0 to 1 range)
6. Reduce dimensionality (remove unnecessary variables)
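Suggestions 2 to 5 can be sketched as follows. The helper names shuffle_and_split and max_normalize are hypothetical (they are not library functions), and the fixed seed is there only to make the shuffle repeatable.

```python
import random


def shuffle_and_split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows, then cut it into train and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def max_normalize(values):
    """Divide every value by the max; assumes non-negative values and a positive max."""
    largest = max(values)
    return [v / largest for v in values]


rows = list(range(1, 11))  # ten made-up examples
train, test = shuffle_and_split(rows)
print(len(train), len(test))         # 8 2
print(max_normalize([10, 20, 40]))   # [0.25, 0.5, 1.0]
```

In practice the max value would usually be taken from the train set only and then reused for the test set, so that no information from the test set leaks into training.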
1) Regression models :- The target variable/label is a continuous variable (the value can be any number)
2) Classification models :- The target label is a class (any one of the given options – ex: Male/Female, Yes/No)
Types of the Regression models:
1) Linear Regression
2) Non Linear Regression
3) Decision Tree
4) Random forest
5) Lasso Regression
6) Ridge Regression
Note1:- Linear, Lasso and Ridge regression can be fitted with gradient descent (a parameter tuning technique – Batch gradient, Stochastic); the tree-based models (Decision Tree, Random forest) are built by splitting the data rather than by gradient descent.
Note2:- In all regression models the input features/variables and the target label should be continuous & numeric.
Note3:- If any variable is a character (text) value, it has to be transformed into scores/numbers.
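To make the gradient descent note concrete, here is a minimal sketch of batch gradient descent fitting a straight line y = w*x + b by minimizing the mean squared error. The data, the learning rate, and the number of epochs are assumptions chosen for the example, and fit_line is a hypothetical helper name.

```python
def fit_line(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w*x + b with batch gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Batch: every example contributes to each update.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # made-up data following y = 2x + 1
w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```

The stochastic variant would update w and b after each single example instead of after a full pass over the data.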
Classification Model Types:- Predicting a class (any one of the given options).
1) Logistic Regression
2) NaiveBayes Classification
3) Decision Tree
4) Random forest
5) SVM Classifier (support Vector machines)
Note1:- If there are only 2 classes, it’s Binary classification; if there are more than 2, it’s Multinomial (multi-class) classification
Why are there many models for classification?
Because each model has its own purpose and limitations. Let us see.
1) Logistic Regression :-
Used when:- It classifies very well when all input variables are continuous.
Challenge:- Not good when there are categorical variables in the data (accuracy tends to decrease unless they are first encoded as numbers)
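As a rough sketch of how logistic regression handles a continuous input, the code below fits a one-variable model with gradient descent on the log loss. The data (hours studied vs. passed) is invented, and fit_logistic and predict are hypothetical helper names, not a library API.

```python
import math


def sigmoid(z):
    """Squash any number into the 0 to 1 range."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, labels, learning_rate=0.1, epochs=5000):
    """Fit p(y=1) = sigmoid(w*x + b) with batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, labels):
            error = sigmoid(w * x + b) - y
            grad_w += error * x
            grad_b += error
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b


def predict(x, w, b):
    """Return class 1 when the predicted probability is at least 0.5."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0


hours = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]  # made-up continuous input
passed = [0, 0, 0, 0, 1, 1, 1, 1]                 # made-up binary target
w, b = fit_logistic(hours, passed)
print(predict(1.0, w, b), predict(4.0, w, b))
```

The model only ever sees numbers, which is why a categorical input has to be turned into numeric columns before it can be used here.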
2) Decision Tree :-
Used when:- This is the best choice when there are categorical variables in the data; it can deal with both categorical & continuous variables.
Challenge:- This is a highly iterative algorithm (with huge data, the processing time grows).
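The iterative nature comes from the way a tree is grown: at every node it scores every candidate feature and keeps the best split. Below is a small sketch of that single step for categorical features, using Gini impurity as the score (one common choice; the notes do not say which criterion is meant). The rows and feature names are invented for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 when all labels are the same class."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def split_impurity(rows, feature, target):
    """Weighted Gini impurity after grouping the rows by one categorical feature."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    total = len(rows)
    return sum(len(g) / total * gini(g) for g in groups.values())


def best_feature(rows, features, target):
    """Pick the feature whose split leaves the purest groups."""
    return min(features, key=lambda f: split_impurity(rows, f, target))


rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]
print(best_feature(rows, ["outlook", "windy"], "play"))  # outlook
```

A full tree repeats this search inside every group it creates, which is where the processing time goes on large data.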
3) Random forest :-
Used when:- The decision tree's accuracy is low or it over-fits; then we should go for random forest.
Challenge:- This is an even more iterative model than a decision tree, so it consumes more computing power (with huge data it may even call for GPUs)
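The extra cost is easy to see in a sketch: a forest trains many trees, each on its own random sample of the rows, and lets them vote. The code below is a simplified illustration, not a full random forest: each "tree" is only a one-split stump on a single numeric feature, and the random feature selection a real random forest also does is left out (so strictly this is bagging). All names and data are hypothetical.

```python
import random
from collections import Counter


def train_stump(rows):
    """Hypothetical base learner: the best single threshold on feature x."""
    majority = Counter(y for _, y in rows).most_common(1)[0][0]
    best = (len(rows) + 1, None, majority, majority)
    for threshold in sorted({x for x, _ in rows}):
        left = [y for x, y in rows if x <= threshold]
        right = [y for x, y in rows if x > threshold]
        if not right:
            continue  # nothing on the right side, not a real split
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    if threshold is None:
        return lambda x: majority
    return lambda x: left_label if x <= threshold else right_label


def train_forest(rows, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(rows) for _ in rows]) for _ in range(n_trees)]


def forest_predict(trees, x):
    """Majority vote over all the trees."""
    return Counter(tree(x) for tree in trees).most_common(1)[0][0]


data = [(1.0, "no"), (2.0, "no"), (3.0, "no"), (6.0, "yes"), (7.0, "yes"), (8.0, "yes")]
forest = train_forest(data)
print(forest_predict(forest, 0.5), forest_predict(forest, 9.0))
```

Training cost grows with n_trees, since every tree repeats the full split search on its own sample.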
4) NaiveBayes :-
Used when:- All input variables are categorical and the target variable is categorical. (This problem can also be solved by decision tree / random forest, but these are highly iterative algorithms.)
In this case NaiveBayes is the best choice: it is a less iterative algorithm, which takes less computing power.
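A minimal sketch shows why NaiveBayes is cheap on categorical data: training is a single counting pass, and prediction multiplies the counted frequencies (done here with logs, plus add-one smoothing so an unseen value does not zero out a class). The function names and the weather-style data are made up for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """One pass over the data: count classes and feature values per class."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, label) -> value -> count
    feature_values = defaultdict(set)    # feature index -> set of seen values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            feature_values[i].add(value)
    return class_counts, value_counts, feature_values


def predict_naive_bayes(model, row):
    """Return the class with the highest log probability for this row."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # prior of the class
        for i, value in enumerate(row):
            # Add-one (Laplace) smoothing
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(feature_values[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("sunny", "hot"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes", "no", "yes"]
model = train_naive_bayes(rows, labels)
print(predict_naive_bayes(model, ("sunny", "hot")))   # no
print(predict_naive_bayes(model, ("rainy", "cool")))  # yes
```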
5) SVM (Support Vector Machines) :-
Used when:- The classes share many common features and only a few features distinguish them; it can make complex predictions (ex: female dog & male dog).
It constructs the widest possible margin (a big bridge) between the two classes. SVM can also deal with non-linear class boundaries (by using kernels).
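The "big bridge" idea can be illustrated without training anything. For a linear SVM, a separating line w·x + b that puts every point at a decision value of at least 1 (or at most -1) has a margin of width 2/||w||, and the SVM prefers the candidate with the widest margin. The sketch below only compares two hand-picked candidates on made-up points; it is not an SVM solver, and all names are hypothetical.

```python
import math


def decision_value(w, b, x):
    """Signed score of point x; its sign is the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def hinge_loss(w, b, points, labels):
    """Average hinge loss; 0.0 means every point is outside the margin."""
    losses = [max(0.0, 1 - y * decision_value(w, b, x)) for x, y in zip(points, labels)]
    return sum(losses) / len(points)


def margin_width(w):
    """Width of the margin (the bridge) for weights w."""
    return 2 / math.sqrt(sum(wi * wi for wi in w))


points = [(1.0, 1.0), (2.0, 1.0), (4.0, 1.0), (5.0, 1.0)]
labels = [-1, -1, 1, 1]

candidate_a = ((1.0, 0.0), -3.0)  # boundary at x = 3, wide margin
candidate_b = ((2.0, 0.0), -5.0)  # boundary at x = 2.5, narrow margin

for w, b in (candidate_a, candidate_b):
    print(hinge_loss(w, b, points, labels), margin_width(w))
```

Both candidates separate the points with zero hinge loss, but candidate_a has the wider margin (2.0 against 1.0), so it is the one an SVM would prefer.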
Simple prediction & complex prediction means:-
In a simple prediction the common features are few & the distinguishing features are many (ex: difference between cat & dog)
In a complex prediction the common features are many and the distinguishing features are few (ex: female dog & male dog)
Computation-wise comparison :-
a) Logistic -> All inputs are continuous & the target is a class -> Less computing power
b) NaiveBayes -> All inputs are categorical & the target is a class -> Less computing power
c) Decision Tree -> Inputs can be both continuous & categorical & the target is a class -> High computing power
d) Random forest -> Inputs can be both continuous & categorical & the target is a class -> Very high computing power