
DATA PREPROCESSING


Data preprocessing is a very important phase in building a machine learning model.

There are 6 phases to take care of:

1.IMPORTING REQUIRED LIBRARIES

2.IMPORTING DATASET

3.TAKING CARE OF MISSING VALUES

4.ENCODING CATEGORICAL DATA

  • encoding the independent variables
  • encoding the dependent variable
5.SPLITTING THE DATA INTO THE TRAINING SET AND TEST SET

6.FEATURE SCALING


IMPORT THE LIBRARIES

There are three libraries which are very important to import:
1.numpy (used for numerical arrays and operations on them)
2.pandas (used for data manipulation and analysis)
3.matplotlib (used to plot graphs)
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

IMPORTING THE DATASETS 

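To perform operations on a dataset, we first import it. For this we use the pandas library: its read_csv function loads the file into a DataFrame. We then put the independent variables (every column except the last) into X and the dependent variable (the last column) into y.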

dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values #independent variable
y = dataset.iloc[:, -1].values #dependent variable
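
To make the split into X and y concrete, here is a minimal sketch. The contents of Data.csv are not shown in this post, so the column names and rows below (Country, Age, Salary, Purchased) are made up for illustration, and the file is read from an in-memory string instead of from disk.

```python
import io

import pandas as pd

# Hypothetical stand-in for Data.csv; the real file's columns are an assumption.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,,No
Spain,,61000,Yes
"""

dataset = pd.read_csv(io.StringIO(csv_text))

# Every column except the last one -> independent variables
X = dataset.iloc[:, :-1].values
# Only the last column -> dependent variable
y = dataset.iloc[:, -1].values

print(X.shape)               # rows x feature columns
print(y.shape)               # one label per row
print(dataset.isna().sum())  # number of missing values in each column
```

Checking dataset.isna().sum() right after loading is a quick way to see which columns will need the missing-value handling described next.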

TAKING CARE OF MISSING VALUES

We need to take care of missing data in the dataset (if we ignore it, it may lead to bad predictions).

There are many ways to handle missing values. Some of them are:

  • we can replace the null values with the mean, median, or mode of the column

For this, we need to import a class from the scikit-learn library
(scikit-learn is a popular machine learning library).
The class name is SimpleImputer.

We can go through the documentation of the SimpleImputer class:

https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer

code:

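from sklearn.impute import SimpleImputer
# Replace every NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Columns 1 and 2 are the numeric columns here; column 0 is left untouched
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])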
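
As a self-contained check of what mean imputation does, here is a small sketch on a made-up array (the values are hypothetical, not taken from Data.csv). It uses fit_transform, which combines the fit and transform steps shown above into one call.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: column 0 is text, columns 1 and 2 are numeric with gaps.
X = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, np.nan],
    ["Spain", np.nan, 61000.0],
], dtype=object)

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Work only on the numeric columns, converted to float so NaN is recognised.
numeric = X[:, 1:3].astype(float)
X[:, 1:3] = imputer.fit_transform(numeric)

print(X)
```

The missing age becomes the mean of 44, 27 and 30, and the missing salary becomes the mean of the three known salaries. Passing strategy="median" or strategy="most_frequent" instead would use the median or the mode of the column.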

ENCODING CATEGORICAL DATA 

  1. ENCODING INDEPENDENT VARIABLE

    This technique is not only for independent variables; we can use it for a dependent variable too. Here I am showing it for an independent variable.
    If a categorical variable has three categories, a single 0/1 column cannot represent all of them, and numbering the categories 0, 1, 2 would suggest an order between them that does not exist.
    So we use one-hot encoding (the OneHotEncoder class in scikit-learn), which creates one 0/1 column per category.
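
A minimal sketch of one-hot encoding, assuming the categorical column is column 0 and holds three made-up country names. ColumnTransformer applies OneHotEncoder to that column only and passes the other columns through unchanged. For a two-class dependent variable, LabelEncoder is used here as one common choice.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical data: column 0 is categorical with three categories.
X = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, 54000.0],
    ["Spain", 38.0, 61000.0],
], dtype=object)
y = np.array(["No", "Yes", "No", "Yes"])

# One-hot encode column 0; keep the remaining columns as they are.
ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)
X_encoded = np.array(ct.fit_transform(X))

# Encode the dependent variable as integers (No -> 0, Yes -> 1).
le = LabelEncoder()
y_encoded = le.fit_transform(y)

print(X_encoded)
print(y_encoded)
```

The single country column turns into three 0/1 columns, ordered alphabetically by category (France, Germany, Spain), followed by the untouched numeric columns.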
     
