**Machine Learning Foundations**

by Kaustubh M H


**About this Roadmap**

Machine Learning Foundations is a Roadmap that helps you gain a solid understanding of the core concepts involved in implementing some basic Machine Learning algorithms from scratch.

**Milestone 1 :** Understanding Artificial Intelligence, Machine Learning & Data Science

**Milestone 2 :** Introduction to Data Preprocessing

**Milestone 3 :** Regression - Working with Continuous Data

**Milestone 4 :** Classification - Understanding grouped data

**Milestone 5 :** Unsupervised Machine Learning

Essential Math for Data Science — ‘Why’ and ‘How’

Introduction to Python is a resource for beginners who want to learn Python.

You can follow this roadmap to get a brief overview of the tools used in ML.

After completing this roadmap, you will:

- Understand what Machine learning, artificial intelligence and data science mean.
- Be able to pre-process the data in the dataset.
- Build regression algorithms from scratch.
- Work with classification models and build them from scratch.
- Be able to build models for unsupervised machine learning algorithms from scratch.

Milestone: Understanding Artificial Intelligence, Machine Learning & Data Science

Can you think of what AI is? I know you would all answer "Artificial Intelligence". Even a 10th-standard kid could tell me that.

But what actually is AI? Think of some examples.

If you have already thought of some, great! Here are some more examples.

- Google Translate uses AI
- So do Siri, Alexa & Cortana
- YouTube's video recommendation system uses AI
- Netflix's movie recommendation system uses AI

If we go on thinking about more examples, we'll get a ton of them, but let's stop here. So what is Artificial Intelligence then?

Learning what AI is might give you an understanding of which all jobs might get automated.

First, you should know that there are different categories in Artificial Intelligence itself. They are

- Narrow Artificial Intelligence,
- Artificial General Intelligence and
- Artificial Super Intelligence

Since there are different categories in AI, it becomes difficult to define it. You can define AI as a branch of Computer Science that deals with building systems that show intelligence.

In general, AI is "**Cognitive Intelligence of Machines**"

Can you think of when Artificial Intelligence was discovered?

There is a lot of history behind the discovery of Artificial Intelligence. If you want to learn more about it, you can visit this link which gives you a brief timeline of how AI came into being.

The major breakthrough happened in a workshop that went on for about two months. The attendees were some of the most renowned researchers of the time: John McCarthy (Dartmouth College), Marvin Minsky (Harvard University), Allen Newell (CMU), Herbert Simon (CMU), Arthur Samuel (IBM) and Claude Shannon (Bell Telephone Laboratories). The workshop was held in July & August of 1956 at Dartmouth College.

They and their students produced programs that the press described as "astonishing". Computers were learning checkers strategies, solving word problems in algebra, proving logical theorems and speaking English.

By the middle of the 1960s, research in the U.S. was heavily funded by the Department of Defense, and laboratories had been established. AI's founders were optimistic about the future. Herbert Simon predicted, "machines will be capable, within twenty years, of doing any work a man can do". Marvin Minsky agreed, writing, "within a generation ... the problem of creating 'artificial intelligence' will substantially be solved".

From Left - **Trenchard More**, **John McCarthy**, **Marvin Minsky**, **Oliver Selfridge** and **Ray Solomonoff**

They failed to recognize the difficulty of some of the remaining tasks. To give you a clear understanding of the different types of Tasks, you can classify them into three different categories.

- Expert Tasks
- Mundane Tasks
- Formal Tasks

Some of the examples of each of these tasks are:

- Expert Tasks
- Financial Analysis
- Scientific Analysis
- Medical Diagnosis
- Engineering

- Mundane Tasks
- Perception
- Common Sense
- Reasoning
- Natural Language Processing

- Formal Tasks
- Maths
- Games

Researchers felt that it would be difficult to train computers on Expert Tasks and easy to train them on Mundane Tasks. It turned out the other way round: computers proved comparatively easy to train on expert tasks, while mundane tasks remained difficult.

Thus, progress slowed. Researchers also did not have anywhere near the computation power we have now.

Around the time of the workshop, John McCarthy coined the term "**Artificial Intelligence**". He also came to be known as the father of Artificial Intelligence.

John McCarthy defined Artificial Intelligence as:

"*The science and engineering of making intelligent machines, especially intelligent computer programs.*"

This might be the definition you will learn in your college textbooks.

Earlier you saw the three broad categories of Artificial Intelligence. To get to know about them in a better way, here are their explanations in a way you might like:

**Narrow Artificial Intelligence**

This is the most basic form of AI. It perceives its situation and acts on what it sees. It doesn't have a concept of a wider world, can't form memories, and can't draw on past experiences to affect current decisions. It specializes in only one area.

For example, AlphaGo defeated the best human players of the Chinese game Go, yet even this system is considered Narrow AI.

Another example is IBM's Deep Blue, which defeated Garry Kasparov in chess.

Self-driving vehicles, chatbots and the examples I gave you earlier are all part of Narrow Artificial Intelligence.

**Artificial General Intelligence**

AGI is intelligence that can match human beings: it can perform the intellectual tasks that we humans can.

Defence is one industry looking at the feasibility of AGI as a next step from the narrow AI systems already in use.

Though AGI has not been implemented, we have seen such systems in movies: Sonny from I, Robot, Ava from Ex Machina, C-3PO and R2-D2 from Star Wars, and many more.

**Artificial Super Intelligence**

Artificial Super Intelligence is also termed Superintelligence.

Superintelligence is an AI far surpassing that of the brightest and most gifted human minds.

While AGI is hardly on the horizon, superintelligence is even more uncertain. A godlike AI seems like a huge leap, but many scientists caution that the moment we unlock AGI, the exponential power of AI could rocket from AGI to superintelligence. Many warn that this is something that needs to be approached carefully as we barrel down the path of ever more sophisticated AI.

Now, Artificial Intelligence has become one of the most researched domains in Computer Science. This has led to AI being categorized into different fields. They are:

- Natural Language Processing - In this field, you work on the interactions between computers and human languages.
- Computer Vision - In this field, you will work with images and videos. The goal is to build a system that can work like the human vision system.
- Audio Processing - In this field, you will work with audio. The main goal is to develop a system where computers can understand human voice and respond accordingly. It is closely related to NLP.
- Neural Networks - You will learn about this in this course.
- Fuzzy Logic Systems - How systems such as consumer electronics reason with degrees of truth rather than strict true/false values.

Here are some extra resources to get a clear understanding of the history of Artificial Intelligence.

Here's a short video on History of Artificial Intelligence.

Here's a list of articles that shows you the complete history of how Artificial Intelligence came into being.

Here's a talk by Chris Bishop, Laboratory Director at Microsoft Research Cambridge, Professor of Computer Science at the University of Edinburgh and a fellow of Darwin College, Cambridge. It's a detailed one, but worth spending the hour.

Here's an article that gives you a deeper understanding of the different fields of Artificial Intelligence.

Milestone: Understanding Artificial Intelligence, Machine Learning & Data Science

Now that you have understood what AI is, let's get into understanding what Machine Learning is.

Can you give an example of Machine Learning?

Let me help you with that.

- Credit Card Fraud Detection
- House Pricing Estimation
- Detecting the type of Cancer

This list can go on and on, but we are not here just to think of examples, right?

Now that you know a few examples of Machine Learning, can you define Machine Learning?

**Arthur Samuel** described machine learning as:

"**The field of study that gives computers the ability to learn without being explicitly programmed.**"

For a computer to do some task, you normally have to write a program specifying the functionality. But this definition says Machine Learning is a field of study that gives computers the ability to learn by themselves without being explicitly programmed. So what does that mean? It doesn't mean that we write no code and the computer writes it by itself. I hope you never thought of that explanation.

All learning happens through experience. When a computer builds this experience by itself, you say that it's Machine Learning. In Machine Learning, you don't give the computer the steps of how to learn; you only give the input and the output. The computer trains itself to produce that output. This process is called training a model. You'll learn more about this later.

To get a better understanding of what Machine Learning is, let's go through a definition coined by Tom Mitchell.

**Tom Mitchell** describes Machine Learning as:

"**A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if the performance of tasks in T, as measured by P, improves with experience E.**"

Well, this would be the definition you'd learn in your textbooks. But it's pretty simple if you try not to get overwhelmed by the definition.

When you set up an ML task, you give the input and what the output should be. Earlier, I said that the computer trains itself to produce the output. At every step it takes towards that output, it builds up experience, and how close its output gets is the performance measure.

So, to tie this back to the definition: as the computer learns from the inputs, it gets closer to the output value, because the performance of the program improves with experience.

In general terms, ML involves teaching a computer to recognize patterns by example. These patterns are found within the data. It is about creating algorithms that learn functions from data and make predictions with them.

This is a form of Narrow AI.

I understand you have a lot of questions in your head, but let us keep them aside and go further. We will understand in depth about Machine Learning in the further chapters.

**Here's an article that will help you understand more about Machine Learning - A Beginner's Guide to Machine Learning**

Machine Learning is classified into four types: supervised, unsupervised, reinforcement and semi-supervised learning.

**Supervised Learning**

The majority of practical machine learning algorithms use supervised learning.

Supervised learning is where you have input variables and an output variable, and you use an algorithm to learn a function that maps the input to the output. This function is represented by an equation, and this equation is the model that gets trained. You'll learn more about models later.

In supervised learning, you are going to write an algorithm that gives you feedback. You get the feedback based on the predicted value by the algorithm and the actual output. Learning stops when the algorithm achieves an acceptable level of performance.

You can categorize supervised learning into two types which are:

- **Regression:** Here we are trying to predict results within a continuous output. The output variable is real-valued data.
- **Classification:** Here we are trying to predict results in a discrete-valued form. The outputs are values like Yes or No, True or False, etc.
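To make the distinction concrete, here is a minimal sketch using scikit-learn; the toy numbers are made up for illustration and are not taken from this roadmap's dataset:

```python
# Contrasting the two supervised tasks on tiny made-up data.
from sklearn.linear_model import LinearRegression, LogisticRegression

X = [[1], [2], [3], [4]]

# Regression: the target is continuous (e.g. a price)
y_continuous = [1.0, 2.1, 2.9, 4.2]
reg = LinearRegression().fit(X, y_continuous)
print(reg.predict([[5]]))   # a real-valued prediction

# Classification: the target is discrete (e.g. Yes/No encoded as 1/0)
y_discrete = [0, 0, 1, 1]
clf = LogisticRegression().fit(X, y_discrete)
print(clf.predict([[5]]))   # a class label, 0 or 1
```

The same input `X` is used for both; only the nature of the target changes which kind of model is appropriate.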

**Unsupervised Learning**

Unsupervised learning is where you only have input data (X) and no output values.

The goal of unsupervised learning is to model the distribution of the data. This helps us to learn more about the data.

In unsupervised learning, there is no feedback provided to the algorithm. Hence, the best-fit answer gets decided by the algorithm itself.

You can classify unsupervised learning problems into two types.

- **Clustering**: A clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behaviour.
- **Association**: An association rule learning problem is where you want to discover rules that describe large portions of your data, such as "people that buy X also tend to buy Y".
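As a minimal clustering sketch, scikit-learn's KMeans can discover groupings in unlabeled points; the two blobs below are made up purely for illustration:

```python
# KMeans discovers groups without any labels being provided.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # one group
                   [8.0, 8.0], [7.8, 8.2], [8.1, 7.9]])  # another group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)  # same label within each discovered group
```

Note that the algorithm only finds the groupings; it does not name them, which is exactly the "no output values" setting described above.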

**Reinforcement Learning**

In Reinforcement Learning, you train the model by trial and error. When the algorithm gets a correct answer there is a reward, and a punishment when the answer is wrong.

The program learns by performing the activity many times before it settles on an output. We reward the program when it makes a good move, which strengthens the connections that led to that move. When it loses, we give no reward (or a negative reward). Over time, it learns to maximize its reward without being given the rules. This can lead to performance better than a human's.

A recent event in the gaming world saw a computer defeat champions at DotA 2, currently one of the most complex games. A company called OpenAI trained a model that defeated some of the top DotA 2 players.

**Semi-supervised Learning**

Problems where you have a large amount of input data (X) and only some of it has output values (Y) are called semi-supervised learning problems.

These problems sit between both supervised and unsupervised learning.

A good example is a photo archive where only some of the images are labelled, (e.g. dog, cat, person) and the majority are unlabeled.

Many real-world machine learning problems fall into this area. This is because it can be expensive or time-consuming to label data as it may need access to domain experts. Whereas unlabeled data is cheap and easy to collect and store.

You can use unsupervised learning techniques to discover and learn the structure in the input variables.

You can also use supervised learning techniques to make predictions for the unlabeled data, feed that data back into the supervised learning algorithm as additional training data, and use the resulting model to make predictions on new unseen data.
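That feed-back loop is exactly what scikit-learn's `SelfTrainingClassifier` automates; here is a minimal sketch where targets of -1 mark unlabeled points (the tiny dataset is made up for illustration):

```python
# Self-training: a base classifier labels the unlabeled points it is
# confident about, then retrains on the enlarged labeled set.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X = np.array([[0.5], [1.0], [1.5], [8.0], [8.5], [9.0]])
y = np.array([0,     -1,    -1,    1,     -1,    -1])  # only two labels known

model = SelfTrainingClassifier(LogisticRegression()).fit(X, y)
print(model.predict([[1.2], [8.7]]))  # labels inferred for new points
```

In practice the labeled fraction would be far larger than two points; this only illustrates the mechanism.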

Milestone: Understanding Artificial Intelligence, Machine Learning & Data Science

Data is a commodity, but without ways to process it, its value is questionable. Data science is a multidisciplinary field whose goal is to extract value from data in all its forms. This article helps you explore the field of data science.

Data science is a process. That's not to say it's mechanical and void of creativity. But, when you dig into the stages of processing data, you see that unique steps are involved in transforming raw data into insight.

In exploratory data analysis, you might have a cleansed data set that's ready to import, and you visualize your results but don't deploy a model to production. In another environment, you might be dealing with real-world data that you first have to run through data merging and cleansing processes.

Data comes in three forms:

1. structured,
2. semi-structured and
3. unstructured.

Structured data is highly organized data that exists within a repository. It might be a database or a comma-separated values (CSV) file. The data is accessible. The format of the data makes it appropriate for queries and computation.

Unstructured data lacks content structure (for example, an audio stream or natural language text).

In the middle is semi-structured data: it can include metadata, or data that you can process more easily than unstructured data. It is not fully structured because the lowest-level contents might still need some processing.

The **Machine Learning Engineer** partners with the **Data Scientist** to take the ML model the Data Scientist prototyped and make it work well in a production environment at scale (i.e. lots of concurrent users), usually by coding it in a more robust language like Scala, Java or C++ and utilizing faster data piping and parallel processing (Spark, MapReduce, etc.).

The Data Scientist is typically trained to be stronger in statistics, while the ML Engineer is typically trained to be stronger in computer science; however, each usually knows a lot about what the other does, and the two work together to iterate and optimize.

Milestone: Introduction to Data Preprocessing

Milestone: Regression - Working with Continuous Data

To understand linear regression, let's take an example and work it out. We are going to consider an example of House price prediction.

Let us consider the price of three different houses. These prices get affected by the number of rooms that are present in each house.

Let's say the first house has 2 rooms and is priced at $100k. The second house has 3 rooms at $240k, and the third has 8 rooms at $550k.

If we are going to plot this on a graph, the graph looks something like this.

If you want to buy a house that has about 5 rooms, how would you decide on the price?

First, you plot a line that covers all three points and read off the price, as I have shown in the graph below.

But how likely is it that this green line is the best-fit line for the three dots? How do you know that the price you are going to pay is the right one?

The price of the 5-room house comes out somewhere around $400k. But is this the best price you can get? How do you know? Let's think of it this way.

Is considering only three people who have bought houses good enough, or do you have to consider more data?

Yes, you have to get more data. There is a possibility that the person who paid $550k badly overpaid, or got an unusually good deal, so that one price may not be representative.

Now, let's plot the line that covers all this data.

Here you can see that you've got a better price for the 5-room house. It is also much lower than the previous price you got.

Is this Machine Learning?

You have done everything by yourself; you haven't told the computer to do anything. So this is not Machine Learning. What is Machine Learning then?

Is getting the computer to plot the line that best fits the data Machine Learning? Maybe. Let's look into it.

You think that the yellow line is the best-fit line, right? To tell the computer to plot that line, we have to convert it into an equation. Why?

Because the computer can only understand numbers, and equations help us calculate numbers.

Which equation do you think would be best to represent that straight line?

Yes, the best one would be the equation of a straight line. The equation is given as:

y=m*x+c

But does that represent our line? No it does not.

The equation to represent the line that we've drawn would be:

h(x)=a*x+b

You can call this equation the **hypothesis equation**.

Why did we call the above equation as the hypothesis equation?

We call it the hypothesis equation because it represents the hypothesis we've assumed to best fit our data.

Is this machine learning? No, it's not.

So far you've seen how to draw a straight line, and that equation can draw a lot of lines. But how do you know which is the best-fit line?

To understand how you can get the best fitting line, let's consider the following graph.

Here you can see that we've drawn 5 lines. Visually you might say the blue line is the best-fit one. But how would you calculate that?

I'll simplify it for you.

With a lot of data it becomes harder to see this simple concept, so we'll consider only the three points we got from the first three house prices.

Let's draw a line using the hypothesis equation with some initial **a** and **b** values.

You can see that this line is not the best fit, but it helps illustrate how the best fit is found. Let's calculate the distance of each data point from the line we've drawn; the total distance over all the points would be huge. You can see that in the graph below.

When this total distance over all the points is the least, you can say that the line is the best fit. Here's one more plot to help you see it.

This total distance which is calculated from each of the points is called the **error** or the **loss**.

The expression h(x) gives the price of each house according to the hypothesis equation, whereas the data contains the actual price of the houses, which you can represent by y.

Hence the error for one data point would be given as,

h(x)-y

As you'd want the sum of all the errors of all points, you have to perform summation over the above equation. The limits of the summation would be the total length of the data set. If there were 100 data values the limit would be from 1 to 100.

Thus the expression to calculate the error would become,

Σ (h(x_i) - y_i), summed over i = 1 to m

Suppose that while calculating the error function you get a negative value: what would you do? (You'd get a negative value when the data point is below the line.)

The best solution is to square each error term, so negative and positive errors don't cancel. The summed squared values can be large, so to bring the value down we take the mean of the summed squared errors to get the actual error value.

In Machine Learning, the error is denoted by **J**.

Thus, the final error equation would be,

J = (1 / (2m)) * Σ (h(x_i) - y_i)², summed over i = 1 to m

This error equation is also called the Mean Squared Error.
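As a concrete sketch, this mean squared error can be computed in a few lines of Python, using the 1/(2m) convention that the implementation later in this milestone also uses; the example numbers are the three toy houses from the text:

```python
# Mean squared error J for a candidate line h(x) = a*x + b.
import numpy as np

def error(a, b, x, y):
    m = len(y)
    predictions = a * x + b          # h(x) for every data point
    return (1 / (2 * m)) * np.sum((predictions - y) ** 2)

x = np.array([2.0, 3.0, 8.0])        # rooms
y = np.array([100.0, 240.0, 550.0])  # prices (in $1000s)
print(error(75.0, -50.0, x, y))      # J for one candidate line
```

A smaller J means the candidate line sits closer to the data points overall.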

If we plot this error against the line parameters, we get a parabola. Each line that you plot corresponds to a point on the parabola.

In the above graph, you can relate the color of the lines that we've drawn and the point that we've plotted on the parabola.

You say that a line is the best fit if it is at the global minima of the parabolic curve. You can see that the blue dot is not at the global minima.

The definition of a minima: when the slope is 0 at a given point, the curve is said to have reached the minima. Another term for slope is "**Gradient**".

You have already figured out that each point on the parabola represents a line, that can be drawn on the data.

You can find the slope of the line by differentiating the error equation, but you have to differentiate with respect to the variables a and b.

When you differentiate the error equation with respect to a, you get the following equation:

dJ/da = (1/m) * Σ (h(x_i) - y_i) * x_i

When you differentiate the error equation with respect to b, you'll get:

dJ/db = (1/m) * Σ (h(x_i) - y_i)

With the help of these gradient equations, we can see how to move towards the minima. The goal of all this is to reach the minima so that we get the best-fitting line.

Remember the definition of Machine Learning given by Tom Mitchell?

"**A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if the performance of tasks in T, as measured by P, improves with experience E.**"

Now, you are going to keep this as the basis to understand how you can use the gradient values to reach the minima.

Here's a graph that will help you understand how you are going to reach the global minima.

In the graph above, the yellow points represent how the error reduces. The error reduces whether you move in from the left side or the right side of the graph.

To reach the global minima, the error should ideally become zero. That's very difficult since the points are spread out, but you can get close to the minima.

So how do you reach it? You have to start at some point. One thing to keep in mind is that the slope value changes as the value of 'a' changes.

So let's start at a random point 1. You can see that the slope is negative at point 1, so to move towards the global minima you have to increase 'a' a little. Now take a similar point on the right side: there the slope is positive, and to move towards the minima you have to decrease 'a' a little.

If you have to put this into equation form, it would be

a = a - (dJ/da)

This might confuse you, but when the slope is negative, the point is on the left side of the minima, so when you subtract a negative value you are essentially adding. Likewise, you subtract a positive value when the point is on the right side of the minima. That should clear the doubt.

Similarly, the equation for b would be,

b = b - (dJ/db)

With this, you can say that you are taking steps from the initial point towards the minima. But how do you know your step is not too large? If your step is too large, you might overshoot the minima and jump from its left side to its right side.

To avoid such cases, we multiply the gradients in the above two equations by a parameter that controls how large each step will be.

This parameter is called the step size or learning rate, and is denoted by alpha. Thus, the equations for a and b become,

a = a - alpha * (dJ/da)
b = b - alpha * (dJ/db)

These equations define what is called **Gradient Descent**. The name comes from descending down the gradients to reach the minima.

So is this machine learning?

Yes! You have told the computer how it can reach the minima by itself. But there are a few things that you have to tell the computer to help it reach the minima.

You are going to define functions for the hypothesis, the error, the step gradient (how each step is taken) and finally gradient descent. You will give arbitrary initial values for a and b, and then you will see how the computer reaches the minima and gives you the model.
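Putting the pieces together, here is a minimal end-to-end sketch of gradient descent on the toy house data from earlier; the learning rate and iteration count are arbitrary choices for this illustration, not values prescribed by the roadmap:

```python
# Gradient descent for the line h(x) = a*x + b on the toy house data.
import numpy as np

def gradient_descent(x, y, learning_rate=0.01, iterations=5000):
    a, b = 0.0, 0.0                      # arbitrary starting values
    m = len(x)
    for _ in range(iterations):
        residual = (a * x + b) - y       # h(x) - y
        a -= learning_rate * (1 / m) * np.sum(residual * x)  # a step on a
        b -= learning_rate * (1 / m) * np.sum(residual)      # a step on b
    return a, b

x = np.array([2.0, 3.0, 8.0])            # rooms
y = np.array([100.0, 240.0, 550.0])      # prices (in $1000s)
a, b = gradient_descent(x, y)
print(a, b)  # slope and intercept of the fitted line
```

If the learning rate were much larger, the updates would overshoot the minima exactly as the text warns; with this small rate the parameters settle close to the least-squares line.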

Here you have to understand what a model is.

A model is the equation that represents the algorithm itself, and the algorithm you just learnt is called **Linear Regression with 1 variable**. You can also call it **Simple Linear Regression**.

This diagram below gives you a glimpse of how the learning happens.

The training set is the data with which you are going to train the model.

This gives us a good explanation of how Linear Regression works. Now, let us implement a project that will help us understand this much better.

Milestone: Regression - Working with Continuous Data

You are going to work on a data set which is called the Boston House Pricing Data Set.

To give a brief description: this data set contains the prices of houses in the Boston area along with 13 factors that affect them. Let's see how you are going to work with it.

There are a few terms that you have to get familiarized with now.

- Features - Features are the factors that affect the output of the algorithm
- Target - Target is the output that is observed. This is usually the last column in the data set.

First you have to perform a bit of data pre-processing to work with it.

To check the data, first import the dataset from the sklearn library, which ships with the Boston housing prices. To work with the dataset, you also have to import the other packages we discussed earlier. (Note: `load_boston` has been removed from recent scikit-learn releases, so this code requires an older version of the library.)

`In [ 1 ]:`

```
# Importing required libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
#Importing Data Set from scikit learn
from sklearn.datasets import load_boston
%matplotlib inline
```

`In [ 2 ]:`

```
# taking the content of the boston housing price dataset and loading it to a new variable boston
boston = load_boston()
```

sklearn.datasets provides a description for every dataset present in its modules. To understand what the dataset is about, we are going to execute the following command and study our dataset.

The contents of the data set are stored in the form of a dictionary-like object. You will get to know its contents by performing,

`In [ 3 ]:`

```
print(boston.keys())
```

`In [ 4 ]:`

```
print(boston.DESCR)
Boston House Prices dataset
===========================
Notes
------
Data Set Characteristics:
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive
:Median Value (attribute 14) is usually the target
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
<http://archive.ics.uci.edu/ml/datasets/Housing>
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression
problems.
**References**
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
- many more! (see <http://archive.ics.uci.edu/ml/datasets/Housing>)
```

Next, we are going to convert the feature columns and the target column into DataFrames with the help of pandas and concatenate them so that we can work on the dataset.

Converting the data from dictionary format to a DataFrame helps us visually see the relation between the features and the target.

In our data set, the features are one key-value pair and the target is another. To understand the relation between the features and the target, we have to concatenate them column-wise. The following code converts the dictionary entries into DataFrames and then concatenates them together.

`In [ 5 ]:`

```
features = pd.DataFrame(boston.data, columns=boston.feature_names)
target = pd.DataFrame(boston.target, columns=['TARGET'])
data = pd.concat([features, target], axis=1)
```

To perform feature selection, we have to understand which features are highly related to the target values, i.e. which features contribute the most to the output value. To get this relation, we are going to compute the Pearson correlation on the DataFrame.

`In [ 6 ]:`

```
data2 = data.corr('pearson')
abs(data2.loc['TARGET']).sort_values(ascending=False)
```

`Out [ 6 ]:`

Based on this correlation value, you can decide which feature to choose as the input data.

`In [ 7 ]:`

```
X = data['RM']
Y = data['TARGET']
```

The values of different features vary over very different ranges. To bring them onto the same scale, so that comparisons become meaningful, we are going to perform normalization. There are different ways to normalize:

- Dividing by the mean value
- Dividing by the median value
- Min-max scaling: subtracting the minimum and dividing by the range (this is the method implemented below)

`In [ 8 ]:`

```
X = np.array((X - X.min())/(X.max() - X.min()))
Y = np.array((Y - Y.min())/(Y.max() - Y.min()))
```
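For comparison, the first two normalization options listed above can be sketched as follows; the numbers here are made up for illustration:

```python
import numpy as np

values = np.array([10.0, 20.0, 30.0, 40.0])

# Option 1: divide by the mean (scaled values then average to 1)
by_mean = values / values.mean()

# Option 2: divide by the median
by_median = values / np.median(values)

# Option 3: min-max scaling, the method used in this notebook
min_max = (values - values.min()) / (values.max() - values.min())
print(min_max)  # every value now lies in [0, 1]
```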

After normalizing, you are going to split the dataset into training and testing samples, so that we can train a model on one part and test it on the other. Data splitting is usually carried out in a ratio of 4:1, 7:3, 3:2 or 1:1; the most common choice is 4:1, i.e. an 80/20 split.

`In [ 9 ]:`

```
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2)
```
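One note: `train_test_split` shuffles the rows randomly, so each run produces a different split. While experimenting, you may want a reproducible split; passing `random_state` (any fixed integer seed) achieves that. A tiny sketch with made-up data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10, dtype=float)
Y = 2 * X

# random_state fixes the shuffle, so repeated runs give the same split
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
print(len(x_train), len(x_test))  # 8 2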

To see what the training and testing data sets look like, plot them with the following code snippets.

`In [ 10 ]:`

```
plt.plot(x_train, y_train, 'r.')
```

`In [ 11 ]:`

```
plt.plot(x_test, y_test, 'r.')
```

This completes the data preprocessing part. Below, you will use the equations that we defined earlier to train a model and test it out.

`In [ 12 ]:`

```
def hypothesis(a, b, x):
    return a * x + b
```

`In [ 13 ]:`

```
def error(a, b, x, y):
    e = 0
    m = len(y)
    for i in range(m):
        e += np.power((hypothesis(a, b, x[i]) - y[i]), 2)
    return (1 / (2 * m)) * e
```

`In [ 14 ]:`

```
def step_gradient(a, b, x, y, learning_rate):
    grad_a = 0
    grad_b = 0
    m = len(x)
    for i in range(m):
        grad_a += 1/m * (hypothesis(a, b, x[i]) - y[i]) * x[i]
        grad_b += 1/m * (hypothesis(a, b, x[i]) - y[i])
    a = a - (grad_a * learning_rate)
    b = b - (grad_b * learning_rate)
    return a, b
```
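As an aside, the per-sample loop in `step_gradient` can be replaced by NumPy array operations; this sketch computes the same (1/m)-scaled gradients in one shot:

```python
import numpy as np

def step_gradient_vectorized(a, b, x, y, learning_rate):
    # residual = h(x) - y for every sample at once
    residual = (a * x + b) - y
    grad_a = (residual * x).mean()   # (1/m) * sum((h(x) - y) * x)
    grad_b = residual.mean()         # (1/m) * sum(h(x) - y)
    return a - learning_rate * grad_a, b - learning_rate * grad_b

x = np.array([0.0, 1.0, 2.0])
y = 2 * x + 1
a, b = step_gradient_vectorized(0.0, 0.0, x, y, 0.1)
```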

`In [ 15 ]:`

```
def descend(initial_a, initial_b, x, y, learning_rate, iterations):
    a = initial_a
    b = initial_b
    for i in range(iterations):
        e = error(a, b, x, y)
        if i % 1000 == 0:
            print(f"Error: {e}, a: {a}, b: {b}")
        a, b = step_gradient(a, b, x, y, learning_rate)
    return a, b
```

`In [ 16 ]:`

```
a = 0
b = 1
learning_rate = 0.01
iterations = 10000
final_a, final_b = descend(a, b, x_train, y_train, learning_rate, iterations)
```

When you execute the above code, the error is printed every 1000 iterations. If the error keeps decreasing, gradient descent is converging, which suggests the chosen hypothesis equation fits the data.

Next, to compare the error before and after training, and on unseen data, execute the code below:

`In [ 17 ]:`

```
print(error(a,b,x_train,y_train))
print(error(final_a, final_b, x_train, y_train))
print(error(final_a, final_b, x_test, y_test))
```

After checking the error, we are going to plot the hypothesis equation for both train data and test data and check if the model best fits the curve.

`In [ 18 ]:`

```
plt.plot(x_train, y_train, 'r.', x_train, hypothesis(a, b, x_train), 'g', x_train, hypothesis(final_a, final_b, x_train), 'b', )
```

`Out [ 18 ]:`

`In [ 19 ]:`

```
plt.plot(x_test, y_test, 'r.', x_test, hypothesis(final_a, final_b, x_test), 'g')
```

`Out [ 19 ]:`

`In [ 20 ]:`

`print(str((1 - error(final_a, final_b, x_test, y_test)) * 100) + " %")`

Milestone: Regression - Working with Continuous Data

The explanation for Multivariate Linear Regression is the same as Simple Linear Regression. You will see that only a few components of the equations are going to change.

- You will use two features here. You could use all the features in the data set, but as the number of features increases, so does the number of dimensions, and visualizing the output becomes difficult.
- The Hypothesis Equation and the Step Gradient Equation changes.
- The error equation remains the same.

Since you are considering two features and one target, the hypothesis equation becomes

h(x1, x2) = a*x1 + b*x2 + c

If you consider more than two features, the hypothesis equation generalizes to

h(x1, ..., xn) = a1*x1 + a2*x2 + ... + an*xn + c

The error equation remains the same.

**Step Gradient Equations:**

Now you have the hypothesis equation and the error equation defined. Differentiating the error equation with respect to '**a**', '**b**' and '**c**' gives the gradient descent equations:

grad_a = (1/m) * Σ (h(x1, x2) - y) * x1
grad_b = (1/m) * Σ (h(x1, x2) - y) * x2
grad_c = (1/m) * Σ (h(x1, x2) - y)

Milestone: Regression - Working with Continuous Data

In this project, we are going to use RM and LSTAT as two input features and TARGET becomes our output feature for training our model. For multivariate linear regression, we will have to import the following modules.

`In [ 1 ]:`

```
# Importing Required Packages
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from mpl_toolkits import mplot3d
# Importing Dataset
from sklearn.datasets import load_boston
%matplotlib inline
```

The data preprocessing is similar to linear regression. The following code highlights the changes compared to linear regression.

`In [ 2 ]:`

```
boston = load_boston()
features = pd.DataFrame(boston.data, columns=boston.feature_names)
target = pd.DataFrame(boston.target, columns=['TARGET'])
data = pd.concat([features, target], axis=1)
print(boston.DESCR)
Boston House Prices dataset
===========================
Notes
------
Data Set Characteristics:
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive
:Median Value (attribute 14) is usually the target
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
<http://archive.ics.uci.edu/ml/datasets/Housing>
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression
problems.
**References**
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
- many more! (see <http://archive.ics.uci.edu/ml/datasets/Housing>)
```

The above code loads the data and concatenates the data features and the target feature.

For loading the required feature,

`In [ 3 ]:`

```
X1 = data['LSTAT']
X2 = data['RM']
Y = data['TARGET']
# Normalizing the data
X1 = np.array((X1 - X1.min())/(X1.max() - X1.min()))
X2 = np.array((X2 - X2.min())/(X2.max() - X2.min()))
Y = np.array((Y - Y.min())/(Y.max() - Y.min()))
```

Next, we are going to split the data for training and testing.

`In [ 4 ]:`

```
x1_train, x1_test, x2_train, x2_test, y_train, y_test = train_test_split(X1, X2, Y, test_size=0.2)
```

To visualize the training data and test data,

`In [ 5 ]:`

```
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot(x1_train, x2_train, y_train, 'g.')
```

`Out [ 5 ]:`

`In [ 6 ]:`

```
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot(x1_test, x2_test, y_test, 'g.')
```

`Out [ 6 ]:`

Next we will see what changes we are going to make to our hypothesis equation, cost function and step gradient functions.

`In [ 7 ]:`

```
def hypothesis(a, b, c, x1, x2):
    return a * x1 + b * x2 + c
```

Similarly, our cost function equation becomes,

`In [ 8 ]:`

```
def error(a, b, c, x1, x2, y):
    e = 0
    m = len(x1)
    for i in range(m):
        e += np.power((hypothesis(a, b, c, x1[i], x2[i]) - y[i]), 2)
    return (1 / (2 * m)) * e
```

`In [ 9 ]:`

```
def step_gradient(a, b, c, x1, x2, y, learning_rate):
    grad_a = 0
    grad_b = 0
    grad_c = 0
    m = len(x1)
    for i in range(m):
        # 1/m factor, consistent with the gradient of the 1/(2m) cost
        grad_a += 1/m * (hypothesis(a, b, c, x1[i], x2[i]) - y[i]) * x1[i]
        grad_b += 1/m * (hypothesis(a, b, c, x1[i], x2[i]) - y[i]) * x2[i]
        grad_c += 1/m * (hypothesis(a, b, c, x1[i], x2[i]) - y[i])
    a = a - (grad_a * learning_rate)
    b = b - (grad_b * learning_rate)
    c = c - (grad_c * learning_rate)
    return a, b, c
```

`In [ 10 ]:`

```
def descend(initial_a, initial_b, initial_c, x1, x2, y, learning_rate, iterations):
    a = initial_a
    b = initial_b
    c = initial_c
    for i in range(iterations):
        e = error(a, b, c, x1, x2, y)
        if i % 1000 == 0:
            print(f"Error: {e}, a: {a}, b: {b}, c: {c}")
        a, b, c = step_gradient(a, b, c, x1, x2, y, learning_rate)
    return a, b, c
```

`In [ 11 ]:`

```
a = 0
b = 1
c = 1
learning_rate = 0.01
iterations = 10000
final_a, final_b, final_c = descend(a, b, c, x1_train, x2_train, y_train, learning_rate, iterations)
```

Printing the error is similar to how we found out the error in linear regression.

`In [ 12 ]:`

```
print(error(a, b, c, x1_train, x2_train, y_train))
print(error(final_a, final_b, final_c, x1_train, x2_train, y_train))
print(error(final_a, final_b, final_c, x1_test, x2_test, y_test))
```

Milestone: Regression - Working with Continuous Data

In Univariate Polynomial Regression, you are going to make the hypothesis equation a polynomial based on the shape of the curve. For the feature **'RM'**, you are going to choose a quadratic equation, so the hypothesis equation becomes

h(x) = a*x + b*x^2 + c

The data preprocessing for univariate polynomial regression is the same as univariate linear regression, hence it would be great if you refer back to the data preprocessing in univariate linear regression.

Moving forward, our code for the hypothesis equation becomes,

`In [ 7 ]:`

```
def hypothesis(a, b, c, x):
    return a * x + b * np.power(x, 2) + c
```
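As a sanity check on what gradient descent should converge towards, NumPy can fit the same quadratic in closed form with `np.polyfit` (an aside, not part of the original notebook); note that `polyfit` returns coefficients from the highest degree down:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * np.power(x, 2) + 3 * x + 1  # a known quadratic: b=2, a=3, c=1 in this notebook's naming

b_fit, a_fit, c_fit = np.polyfit(x, y, 2)  # x^2 coefficient, x coefficient, constant
```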

The cost function remains the same as univariate linear regression, but the parameters which are passed to our hypothesis changes.

`In [ 8 ]:`

```
def error(a, b, c, x, y):
    e = 0
    m = len(x)
    for i in range(m):
        e += np.power((hypothesis(a, b, c, x[i]) - y[i]), 2)
    return (1 / (2 * m)) * e
```

The step gradient equations become

grad_a = (1/m) * Σ (h(x) - y) * x
grad_b = (1/m) * Σ (h(x) - y) * x^2
grad_c = (1/m) * Σ (h(x) - y)

Implementing these in code,

`In [ 9 ]:`

```
def step_gradient(a, b, c, x, y, learning_rate):
    grad_a = 0
    grad_b = 0
    grad_c = 0
    m = len(x)
    for i in range(m):
        grad_a += 1/m * (hypothesis(a, b, c, x[i]) - y[i]) * x[i]
        grad_b += 1/m * (hypothesis(a, b, c, x[i]) - y[i]) * np.power(x[i], 2)
        grad_c += 1/m * (hypothesis(a, b, c, x[i]) - y[i])
    a = a - (grad_a * learning_rate)
    b = b - (grad_b * learning_rate)
    c = c - (grad_c * learning_rate)
    return a, b, c
```

`In [ 10 ]:`

```
def descend(initial_a, initial_b, initial_c, x, y, learning_rate, iterations):
    a = initial_a
    b = initial_b
    c = initial_c
    for i in range(iterations):
        e = error(a, b, c, x, y)
        if i % 1000 == 0:
            print(f"Error: {e}, a: {a}, b: {b}, c: {c}")
        a, b, c = step_gradient(a, b, c, x, y, learning_rate)
    return a, b, c
```

`In [ 11 ]:`

```
a = 0
b = 1
c = 1
learning_rate = 0.5
iterations = 10000
final_a, final_b, final_c = descend(a, b, c, x_train, y_train, learning_rate, iterations)
```

`In [ 12 ]:`

```
print(error(a, b, c, x_train, y_train))
print(error(final_a, final_b, final_c, x_train, y_train))
print(error(final_a, final_b, final_c, x_test, y_test))
```

`In [ 13 ]:`

```
plt.plot(x_train, y_train, 'r.', x_train, hypothesis(a, b, c, x_train), 'g.', x_train, hypothesis(final_a, final_b, final_c, x_train), 'b.', )
```

`Out [ 13 ]:`

`In [ 14 ]:`

```
plt.plot(x_test, y_test, 'r.', x_test, hypothesis(final_a, final_b, final_c, x_test), 'g.')
```

`Out [ 14 ]:`

Milestone: Regression - Working with Continuous Data

In multivariate polynomial regression, the changes are similar to univariate polynomial regression, but the hypothesis equation combines more than one feature, with at least one raised to a higher power. To relate this to our multivariate linear regression, we are going to use the same two features, RM and LSTAT.

The data preprocessing for multivariate polynomial regression is the same as for multivariate linear regression, so refer back to the data preprocessing part of that algorithm.

To start with the hypothesis equation,

`In [ 7 ]:`

```
def hypothesis(a, b, c, x1, x2):
    return a * x1 + b * np.power(x2, 2) + c
```

The error function is also going to be same as that of the error function in multivariate linear regression

`In [ 8 ]:`

```
def error(a, b, c, x1, x2, y):
    e = 0
    m = len(x1)
    for i in range(m):
        e += np.power((hypothesis(a, b, c, x1[i], x2[i]) - y[i]), 2)
    return (1 / (2 * m)) * e
```

The gradient descent equations are going to become,

`In [ 9 ]:`

```
def step_gradient(a, b, c, x1, x2, y, learning_rate):
    grad_a = 0
    grad_b = 0
    grad_c = 0
    m = len(x1)
    for i in range(m):
        grad_a += 1/m * (hypothesis(a, b, c, x1[i], x2[i]) - y[i]) * x1[i]
        # b multiplies x2^2 in the hypothesis, so its gradient term is x2^2
        grad_b += 1/m * (hypothesis(a, b, c, x1[i], x2[i]) - y[i]) * np.power(x2[i], 2)
        grad_c += 1/m * (hypothesis(a, b, c, x1[i], x2[i]) - y[i])
    a = a - (grad_a * learning_rate)
    b = b - (grad_b * learning_rate)
    c = c - (grad_c * learning_rate)
    return a, b, c
```

`In [ 10 ]:`

```
a = 0
b = 1
c = 1
learning_rate = 0.5
iterations = 10000
final_a, final_b, final_c = descend(a, b, c, x1_train, x2_train, y_train, learning_rate, iterations)
```

`In [ 11 ]:`

```
print(error(a, b, c, x1_train, x2_train, y_train))
print(error(final_a, final_b, final_c, x1_train, x2_train, y_train))
print(error(final_a, final_b, final_c, x1_test, x2_test, y_test))
```

Milestone: Classification - Understanding grouped data

Logistic Regression was first used in the 20th century in biological sciences. It was also used in many social science applications. In today's world, you see that there are a lot of problems that deal with binary answers, like,

If you want to classify an email as either **spam** or **not spam**,

If you want to classify a cancer tumor as **benign** or **malignant**,

If you want to classify a transaction as **fraudulent** or **not**.

If the output is either a **YES** or a **NO**, in such scenarios you are going to use classification algorithms. Logistic Regression is one such algorithm which helps you in classifying outputs.

The logistic function is also called the sigmoid function. It is defined as

σ(z) = 1 / (1 + e^(-z))

The graph of the sigmoid function looks like the letter "S", as shown below.

To understand how logistic regression works, you are going to classify the different types of breast cancer.

You know that a cancer tumour has two states: Malignant and Benign. Benign means the tumour is not harmful, and Malignant means it is harmful.

To perform logistic regression, you should first see how the data gets visualized. For that, you are going to plot the data with respect to one feature and the output class.

Imagine, you were trying to fit a straight line to this data, then the model would have looked something like this.

But if there was a blue point farther away, the plot of the best-fit model would look something like the green line.

The purpose of a model is that it should represent any kind of data that you'd see. But here you see that the model keeps changing. That means that the straight line is not a great representation of our model. So you need something better.

For that, you will use the logistic function to depict the data.

This logistic function seems to be a better representation of our data. So how would you get such a curve to represent our model?

Let's start by taking simple equations that will help us achieve this.

As you did for the linear regression algorithm, you should have a hypothesis equation, an error equation, the step gradient equation and the gradient descent function.

The explanations below will help you get these equations.

We learnt that the hypothesis function is built from the sigmoid function, so its output lies between 0 and 1.

The hypothesis function is defined as

h(x) = σ(θx)

where θ represents the parameters of our model. The sigmoid function, as you saw earlier, is

σ(z) = 1 / (1 + e^(-z))

For a single feature, θx is defined as

θx = a*x + b

Substituting the value of θx into the sigmoid function, the hypothesis becomes

h(x) = 1 / (1 + e^-(a*x + b))

The hypothesis function gives us the probability that the output is 1. If the output of the hypothesis function is 0.7, that means there is a 70% chance that our output is 1. If the output of the hypothesis function is 0.2, that means there is a 20% chance that our output is 1. Based on these outputs, you can decide to which class an output belongs to.

To decide, which class the output belongs to, you have to consider a parameter called the decision boundary. This will help you in classifying the outputs you will get.

If the hypothesis function has a value greater than 0.5, you can say it belongs to class 1. If the value of the hypothesis function is less than 0.5, it belongs to class 0.

Thus, the decision boundary is the line that separates the area where y = 0 and where y = 1. It gets created by our hypothesis function.
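The decision rule described above is a one-liner in code; this small sketch assumes the 0.5 threshold just discussed:

```python
def classify(probability, threshold=0.5):
    # probability is the hypothesis output, i.e. P(y = 1)
    return 1 if probability >= threshold else 0

print(classify(0.7))  # 70% chance of class 1 -> predicted class 1
print(classify(0.2))  # 20% chance of class 1 -> predicted class 0
```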

When you take the hypothesis function to be a straight-line equation, the graph of the cost function is a smooth bowl with a single minimum.

But when the hypothesis is the output of a sigmoid function, the squared-error cost has many minima, called local minima.

With such a graph, it's difficult for the computer to find the best-fit model, because gradient descent can get stuck in a local minimum. To overcome this problem, you use the log function: feeding the hypothesis into the log function yields a well-behaved cost again. The graphs are shown below.

When the true output y is 1, you use the cost -log(h(x)): it is 0 when h(x) = 1 and grows without bound as h(x) approaches 0.

When the true output y is 0, the cost -log(h(x)) would penalize the wrong thing, so you consider the log function as -log(1 - h(x)) instead: it is 0 when h(x) = 0.

Combining both graphs gives the cost function for logistic regression:

cost(h(x), y) = -log(h(x))      if y = 1
cost(h(x), y) = -log(1 - h(x))  if y = 0

The problem is that you can't write a piecewise equation like this directly in our program, so you have to simplify it further by combining both cases into one expression.

Since y is always 0 or 1, multiplying the first expression by y and the second by (1 - y) switches the right term on. The cost function then becomes

J = -(1/m) * Σ [ y*log(h(x)) + (1 - y)*log(1 - h(x)) ]

The gradient descent step is the same in form as that of linear regression. Differentiating the cost function gives the slope for each parameter:

grad_a = (1/m) * Σ (h(x) - y) * x
grad_b = (1/m) * Σ (h(x) - y)

Milestone: Classification - Understanding grouped data

Here we are going to work with Wisconsin Breast Cancer Data Set. Our goal is to group the target into two groups which are the two phases of breast cancer namely malignant and benign.

First we are going to perform the usual data preprocessing and then proceed for the training of the model.

After the model is trained we are going to use that model to predict the output for a particular set of data and check for the accuracy.

`In [ 1 ]:`

```
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline
```

`In [ 2 ]:`

```
cancer = load_breast_cancer()
print(cancer.DESCR)
Breast Cancer Wisconsin (Diagnostic) Database
=============================================
Notes
-----
Data Set Characteristics:
:Number of Instances: 569
:Number of Attributes: 30 numeric, predictive attributes and the class
:Attribute Information:
- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension ("coastline approximation" - 1)
The mean, standard error, and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.
- class:
- WDBC-Malignant
- WDBC-Benign
:Summary Statistics:
===================================== ====== ======
Min Max
===================================== ====== ======
radius (mean): 6.981 28.11
texture (mean): 9.71 39.28
perimeter (mean): 43.79 188.5
area (mean): 143.5 2501.0
smoothness (mean): 0.053 0.163
compactness (mean): 0.019 0.345
concavity (mean): 0.0 0.427
concave points (mean): 0.0 0.201
symmetry (mean): 0.106 0.304
fractal dimension (mean): 0.05 0.097
radius (standard error): 0.112 2.873
texture (standard error): 0.36 4.885
perimeter (standard error): 0.757 21.98
area (standard error): 6.802 542.2
smoothness (standard error): 0.002 0.031
compactness (standard error): 0.002 0.135
concavity (standard error): 0.0 0.396
concave points (standard error): 0.0 0.053
symmetry (standard error): 0.008 0.079
fractal dimension (standard error): 0.001 0.03
radius (worst): 7.93 36.04
texture (worst): 12.02 49.54
perimeter (worst): 50.41 251.2
area (worst): 185.2 4254.0
smoothness (worst): 0.071 0.223
compactness (worst): 0.027 1.058
concavity (worst): 0.0 1.252
concave points (worst): 0.0 0.291
symmetry (worst): 0.156 0.664
fractal dimension (worst): 0.055 0.208
===================================== ====== ======
:Missing Attribute Values: None
:Class Distribution: 212 - Malignant, 357 - Benign
:Creator: Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian
:Donor: Nick Street
:Date: November, 1995
This is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets.
<https://goo.gl/U2Uwz2>
Features are computed from a digitized image of a fine needle
aspirate (FNA) of a breast mass. They describe
characteristics of the cell nuclei present in the image.
Separating plane described above was obtained using
Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
Construction Via Linear Programming." Proceedings of the 4th
Midwest Artificial Intelligence and Cognitive Science Society,
pp. 97-101, 1992], a classification method which uses linear
programming to construct a decision tree. Relevant features
were selected using an exhaustive search in the space of 1-4
features and 1-3 separating planes.
The actual linear program used to obtain the separating plane
in the 3-dimensional space is that described in:
[K. P. Bennett and O. L. Mangasarian: "Robust Linear
Programming Discrimination of Two Linearly Inseparable Sets",
Optimization Methods and Software 1, 1992, 23-34].
This database is also available through the UW CS ftp server:
ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/
References
----------
- W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction
for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on
Electronic Imaging: Science and Technology, volume 1905, pages 861-870,
San Jose, CA, 1993.
- O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and
prognosis via linear programming. Operations Research, 43(4), pages 570-577,
July-August 1995.
- W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques
to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994)
163-171.
```

`In [ 3 ]:`

```
features = pd.DataFrame(cancer.data, columns=cancer.feature_names)
target = pd.DataFrame(cancer.target, columns=["TARGET"])
data = pd.concat([features, target], axis=1)
```

`In [ 4 ]:`

```
a = data.corr('pearson')
abs(a.loc['TARGET']).sort_values(ascending=False)
```

`Out [ 4 ]:`

`In [ 5 ]:`

```
x = np.array(data['worst concave points'])
y = np.array(data['TARGET'])
x = x/x.mean()
plt.plot(x, y, 'r.')
```

`Out [ 5 ]:`

`In [ 6 ]:`

```
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)
```

The Hypothesis Function is defined below:

`In [ 7 ]:`

```
def hypothesis(z):
    return 1 / (1 + np.exp(-z))
```
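One practical caveat: for large negative z, `np.exp(-z)` can overflow and raise runtime warnings. A common guard, sketched here as an optional variant, is to clip z before exponentiating; the bound of 500 is an arbitrary safe choice, since the sigmoid saturates long before that:

```python
import numpy as np

def hypothesis_stable(z):
    # np.exp stays within double-precision range for |z| <= 500,
    # and sigmoid(+/-500) is already indistinguishable from 0 or 1
    z = np.clip(z, -500, 500)
    return 1 / (1 + np.exp(-z))

hypothesis_stable(np.array([-1000.0, 0.0, 1000.0]))
```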

The error function is written below:

`In [ 8 ]:`

```
def error(a, b, x, y):
    e = 0
    m = len(x)
    for i in range(m):
        z = a*x[i] + b
        e += y[i]*np.log(hypothesis(z)) + (1-y[i])*np.log(1-hypothesis(z))
    return (-1/m) * e
```

The step gradient is described below:

`In [ 9 ]:`

```
def step_gradient(a, b, x, y, learning_rate):
    grad_a = 0
    grad_b = 0
    m = len(x)
    for i in range(m):
        z = a*x[i] + b
        grad_a += 1/m * (hypothesis(z) - y[i]) * x[i]
        grad_b += 1/m * (hypothesis(z) - y[i])
    a = a - (grad_a * learning_rate)
    b = b - (grad_b * learning_rate)
    return a, b
```

The descend function is written below:

`In [ 10 ]:`

```
def descend(initial_a, initial_b, x, y, learning_rate, iterations):
    a = initial_a
    b = initial_b
    for i in range(iterations):
        e = error(a, b, x, y)
        if i % 1000 == 0:
            print(f'Error:{e}')
        a, b = step_gradient(a, b, x, y, learning_rate)
    return a, b
```

The accuracy function which is written below helps us understand how accurate our algorithm is:

`In [ 11 ]:`

```
def accuracy(theta, a, b, x, y):
    # theta is the precomputed linear combination a*x + b for all samples
    count = 0
    predictions = hypothesis(theta)
    for j in range(len(x)):
        # 0.5 is the decision boundary discussed earlier
        if predictions[j] > 0.5:
            z = 1
        else:
            z = 0
        if y[j] == z:
            count += 1
    acc = count / len(y)
    print(f"Accuracy is {acc * 100} %")
```

`In [ 12 ]:`

```
a = 1
b = 1
learning_rate = 0.01
iterations = 10000
final_a, final_b = descend(a, b, x_train, y_train, learning_rate, iterations)
```

`In [ 13 ]:`

```
f = final_a * x_train + final_b
plt.plot(x_train, y_train, 'r.', x_train, sigmoid(f), 'b+')
```

`Out [ 13 ]:`

`In [ 14 ]:`

```
g = final_a * x_test + final_b
plt.plot(x_test, y_test, 'g.', x_test, hypothesis(g), 'co')
```

`Out [ 14 ]:`

`In [ 15 ]:`

```
accuracy(f, final_a, final_b, x_train, y_train)
accuracy(g, final_a, final_b, x_test, y_test)
```

This completes the univariate logistic regression.

Milestone: Classification - Understanding grouped data

In Multivariate Logistic Regression, you are going to use more than one feature in the hypothesis equation. As you already know, the hypothesis equation is

h(x) = σ(θx)

and defining θx for two features,

θx = a*x1 + b*x2 + c

the hypothesis equation becomes

h(x1, x2) = 1 / (1 + e^-(a*x1 + b*x2 + c))

This defines our hypothesis equation.

The cost function remains the same as there are no parameters that are affecting the equation.

The gradient descent equations become

grad_a = (1/m) * Σ (h(x1, x2) - y) * x1
grad_b = (1/m) * Σ (h(x1, x2) - y) * x2
grad_c = (1/m) * Σ (h(x1, x2) - y)

Since you have understood the flow of what all changes are to be made, go ahead with the code now.

Milestone: Classification - Understanding grouped data

We will again be working on the same data set, the Wisconsin Breast Cancer Data Set, but this time using the multivariate hypothesis equation discussed above.

`In [ 1 ]:`

```
# Importing required Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.model_selection import train_test_split
# Importing the dataset
from sklearn.datasets import load_breast_cancer
%matplotlib inline
```

`In [ 2 ]:`

```
# Loading the data into a variable
cancer=load_breast_cancer()
```

`In [ 3 ]:`

```
# Converting the dictionary-style data to DataFrames and concatenating features and target for comparison
features = pd.DataFrame(cancer.data, columns=cancer.feature_names)
target = pd.DataFrame(cancer.target, columns=['TARGET'])
data = pd.concat([features, target], axis=1)
```

`In [ 4 ]:`

```
# Figuring out which feature is highly correlated with the target
data2=data.corr('pearson')
data2.loc['TARGET'].sort_values(ascending=False)
```

`In [ 5 ]:`

```
x1=np.array(data['worst concave points'])
x2=np.array(data['worst perimeter'])
y=np.array(data['TARGET'])
# Normalizing data to bring the values to a same scale
x1=(x1-x1.min())/(x1.max()-x1.min())
x2=(x2-x2.min())/(x2.max()-x2.min())
```

`In [ 6 ]:`

```
fig=plt.figure()
ax=fig.add_subplot(1,1,1,projection='3d')
ax.plot(x2,x1,y,'r.')
plt.show()
```

`Out [ 6 ]:`

`In [ 7 ]:`

```
x1_train, x1_test, x2_train, x2_test, y_train, y_test = train_test_split(x1, x2, y, test_size = 0.2)
```

`In [ 8 ]:`

```
def hypothesis(f):
    return 1 / (1 + np.exp(-f))
```

`In [ 9 ]:`

```
def error(a, x1, x2, b, y, c):
    e = 0
    m = len(y)
    for i in range(m):
        f = a*x1[i] + b*x2[i] + c
        e += -y[i]*np.log(hypothesis(f)) - (1-y[i])*np.log(1-hypothesis(f))
    return (1/m) * e
```

`In [ 10 ]:`

```
def step_gradient(a, x1, x2, b, y, learning_rate, c):
    grad_a = 0
    grad_b = 0
    grad_c = 0
    m = len(x1)
    for i in range(m):
        f = a*x1[i] + b*x2[i] + c
        # Gradients are accumulated as sums here (no 1/m factor);
        # the small learning rate used below compensates for the scaling
        grad_a += (hypothesis(f) - y[i]) * x1[i]
        grad_b += (hypothesis(f) - y[i]) * x2[i]
        grad_c += (hypothesis(f) - y[i])
    a = a - (grad_a * learning_rate)
    b = b - (grad_b * learning_rate)
    c = c - (grad_c * learning_rate)
    return a, b, c
```

`In [ 11 ]:`

```
def descend(initial_a, initial_b, initial_c, x1, x2, y, learning_rate, iterations):
    a = initial_a
    b = initial_b
    c = initial_c
    for i in range(iterations):
        e = error(a, x1, x2, b, y, c)
        a, b, c = step_gradient(a, x1, x2, b, y, learning_rate, c)
        if i % 1000 == 0:
            print(f"Error: {e}, a: {a}, b: {b}, c: {c}")
    return a, b, c
```

`In [ 12 ]:`

```
a,b,c=0,0,0
learning_rate=0.0001
iterations=10000
final_a,final_b,final_c=descend(a,b,c,x1_train,x2_train,y_train,learning_rate,iterations)
```

`In [ 13 ]:`

```
f = (final_a * x1_train) + (final_b * x2_train) + final_c
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1, projection='3d')
ax.view_init(45, 45)
ax.plot3D(x2_train, x1_train, y_train, 'r.')
ax.plot3D(x2_train, x1_train, hypothesis(f), 'g.')
ax.set_xlabel("worst perimeter")
ax.set_ylabel("worst concave points")
ax.set_zlabel("TARGET")
plt.show()
```

`Out [ 13 ]:`

`In [ 14 ]:`

```
def accuracy(a, x1, x2, b, y, c):
    correct = 0
    for i in range(len(x1)):
        f = a*x1[i] + b*x2[i] + c
        prediction = hypothesis(f)
        if prediction > 0.5:
            z = 1
        else:
            z = 0
        if y[i] == z:
            correct += 1
    print("Accuracy: {}".format(correct/len(y)))
```

`In [ 15 ]:`

```
accuracy(final_a,x1_train,x2_train,final_b,y_train,final_c)
accuracy(final_a,x1_test,x2_test,final_b,y_test,final_c)
```

Milestone: Classification - Understanding grouped data

You have a problem where you have to figure out whether a given sample of wine is red wine or white wine, given some data about the sample.

So, how would you classify the wine sample? Here we are considering only two types, but in real life there are hundreds of varieties of wine. We can't always use an algorithm like logistic regression, as it classifies data into only two categories.

There is another classification algorithm called K-Nearest Neighbour. It helps to classify data into many groups.

For now, let's take two different types of wine, whose categories we know. For example, Red Wine and White Wine.

So, how can we use K-Nearest Neighbour algorithm to figure out the category of an unknown wine?

KNN (K-Nearest Neighbour) works in a very simple way. It calculates the Euclidean distance of the sample data with all the known data points. It then classifies the data based on the principle of the least distance. That means, it considers a few nearest data points and then checks their classes. Then, it classifies our data point based on which classes' samples are nearer to our sample data.

This might seem a bit confusing right now, but it will become clearer as we go further.

Consider that the wine data is spread as you see in the plot below:

On the graph, the x-axis represents one feature and the y-axis represents a second feature. The data points are also colour coded for the sake of simplicity. The red dots indicate red wine while the blue dots indicate white wine.

Now that you have the sample of the unknown wine, plot it on the graph.

In the graph above, the star represents the unknown wine sample. With visual inspection, you can say that the sample belongs to the category of white wines. But a computer cannot "see" the plot the way we do. For that purpose, we calculate the Euclidean distances.

What you have to do is to tell the computer to calculate the distance of the unknown wine sample to all the samples. To help you out, here's the formula for Euclidean distance.
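
In code, the Euclidean distance between two points is just the square root of the summed squared differences of their features. Here is a minimal sketch (the name `euclidean_distance` is my own; the milestone code later defines an equivalent `euclid_distance`):

```python
import numpy as np

def euclidean_distance(p, q):
    # square root of the sum of squared differences along each feature
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(np.sum((p - q) ** 2)))

# the classic 3-4-5 triangle: distance from (0, 0) to (3, 4) is 5
print(euclidean_distance([0, 0], [3, 4]))
```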

After calculating the distances, you have to sort all the values in ascending order. That way you get the least distances at the top.

The name of the algorithm is K-Nearest Neighbour. So what is K? We know what the nearest neighbour is: it is the data point closest to our sample. What K tells us is how many of the nearest neighbours we have to consider.

So what is the advantage of K? Considering several neighbours instead of just one gives you a more reliable basis for classifying the sample data.

Here's an example. Let's consider the K-value to be 5. When you consider the 5 nearest data points, you get 3 white wine points and 2 red wine points. Thus, you can say that the sample data belongs to white wine. It's because 3 of the 5 points near our sample data are white wine points.

This is what the K-NN algorithm does.
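
The "pick the majority class among the K neighbours" step above can be sketched in a few lines (a hedged illustration using `collections.Counter`; the milestone code later does the same thing with `max(set(op), key=op.count)`):

```python
from collections import Counter

def majority_vote(neighbour_labels):
    # the most common label among the k nearest neighbours wins
    return Counter(neighbour_labels).most_common(1)[0][0]

# 3 white-wine neighbours vs 2 red-wine neighbours -> classified as white
print(majority_vote(["white", "red", "white", "red", "white"]))
```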

Here, I gave you the K value to use, but how do you choose one yourself?

Here's an explanation for it.

There might be cases where the neighbours split evenly between classes. Let us take the K value to be 6. If there are 3 white wine data points and 3 red wine data points among the neighbours, you can't classify the unknown data point. To overcome this problem, choose the K value so that it is not a multiple of the number of categories. For example, if we have 3 categories, we should choose a K value that is not a multiple of 3. Similarly, if the number of categories is 4, we should avoid K values that are multiples of 4.

The K-value is also decided based on the number of data points we have. Suppose we have 200 data points and 4 classes; then the K-value should not be more than 7. If we use values greater than 7, the prediction might become biased.
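
As a quick sanity check on these heuristics, here is a hedged helper (`valid_k_values` is my own name, not a standard function) that lists candidate K values up to a cap, skipping multiples of the class count:

```python
def valid_k_values(n_classes, k_max=7):
    # candidate K values that are not multiples of the number of classes
    return [k for k in range(1, k_max + 1) if k % n_classes != 0]

# 3 wine categories, K capped at 7 -> 3 and 6 are skipped
print(valid_k_values(3))   # [1, 2, 4, 5, 7]
print(valid_k_values(4))   # 4 is excluded when there are 4 categories
```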

To give a mathematical understanding,

**NOTE**: This equation is just to help you out. It's not a standard equation

Next, let's write the code for our K-Nearest Neighbour algorithm.

Milestone: Classification - Understanding grouped data

We are going to use the wine data set from the sklearn datasets library to implement the K-NN algorithm from scratch.

`In [ 1 ]:`

```
# Importing required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_wine
%matplotlib inline
```

`In [ 2 ]:`

```
# Function to calculate the euclidian distance
def euclid_distance(train_point, given_point):
    distance = np.sum((train_point-given_point)**2)
    return np.sqrt(distance)
```

`In [ 3 ]:`

```
# Function to calculate the distance of each point from the set of remaining points in the dataset
def calc_distance_from_all(all_points, given_point, predictions):
    all_distances = []
    for i, each in enumerate(all_points):
        distance = euclid_distance(each, given_point)
        all_distances.append((distance, int(predictions[i])))
    all_distances.sort(key=lambda tup: tup[0])
    return all_distances
```

`In [ 4 ]:`

```
# Function to obtain the neighbours based on the K-Value
def get_neighbours(distances, count):
    return distances[:count]
```

`In [ 5 ]:`

```
# Function to predict the output of a given point
def predict(all_points, given_point, predictions):
    distances = calc_distance_from_all(all_points, given_point, predictions)
    neighbours = get_neighbours(distances, 4)
    op = [row[-1] for row in neighbours]
    prediction = max(set(op), key=op.count)
    return prediction
```

`In [ 6 ]:`

```
# Function to calculate the accuracy of the predictions
def accuracy(basex, basey, testx, testy):
    correct = 0
    for i in range(len(testx)):
        p = predict(basex, testx[i], basey)
        if p == testy[i]:
            correct += 1
    return f"Accuracy: {correct*100/len(testy)}%"
```

`In [ 7 ]:`

```
# Data Preprocessing
wine = load_wine()
X = pd.DataFrame(wine.data, columns=wine.feature_names)
Y = pd.DataFrame(wine.target, columns=['target'])
```

`In [ 8 ]:`

```
# Normalization of data
X = (X-X.min()) / (X.max()-X.min())
# Splitting the dataset for train and test values
xtrain, xtest, ytrain, ytest = train_test_split(X,Y, test_size=0.3)
wine.feature_names
```

`Out [ 8 ]:`

['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']

`In [ 9 ]:`

```
f1 = 'hue'
f2 = 'proline'
basex = np.array(xtrain[[f1, f2]])
basey = np.array(ytrain['target'])  # flatten to 1-D so each label is a scalar
xtest = np.array(xtest[[f1, f2]])
ytest = np.array(ytest['target'])
```

`In [ 10 ]:`

```
x = pd.DataFrame(basex)
y = basey
plt.scatter(x.iloc[:,0], x.iloc[:,1], c=y, s=15)
plt.scatter(0.25, 0.2, c='red', marker='x', s=100)
```

`Out [ 10 ]:`

`In [ 11 ]:`

```
# Prints the accuracy of the test data set
print(accuracy(basex,basey,xtest,ytest))
```

This completes the code for K-NN from scratch.

Milestone: Unsupervised Machine Learning

Till now, you have learnt about supervised learning. In all those algorithms, you knew what the output should be, so you had to train a model that would give good accuracy. But in unsupervised learning, you won't have any output values. You have to train a model using only the feature values, and you can't compare the result against anything to measure the accuracy of the model you train.

There are two types of unsupervised learning:

- Clustering
- Dimensionality Reduction

Let's understand in brief, what clustering is all about.

Clustering is the process of grouping similar data together. The goal is to find similarities among the data points and group them accordingly.

**Why Clustering?** Grouping similar data together helps you generate attributes for the groups, and gives you insights into the patterns of the different groups that get created. There are many applications of grouping unlabelled data. For example, you can identify different groups of people to help a company do better marketing, or group together documents that belong to similar topics.

Clustering also helps you summarise data: when you have lots of data points, each group can be represented by a single representative point.

There are many algorithms that do clustering, but for now let's stick to two popular ones: K-Means Clustering and Hierarchical Clustering.

Milestone: Unsupervised Machine Learning

K-Means clustering is one of the simplest clustering algorithms. It's like K-NN, but the difference is that we don't have the output classes in K-Means. Well, that is the whole purpose of using K-Means clustering!

K-Means works on the principle of centroids. If you remember the centroids of triangles from 7th-grade math, K-Means is simple to understand. K-Means groups data into different classes, and each group that forms has a centroid that marks the centre of the group. Using this principle, you will write the code for K-Means.

To perform the clustering, we go through the following steps:

- First, select how many clusters you want to create. Before you decide, it's better to plot the data and choose after taking a look at it. If you decide on 3 classes, select three random points from the data set and initialise them as the centroids of the three classes.
- Calculate the Euclidean distance of every data point from the three centroids. For each data point, check which centroid is nearest and assign the point to that centroid's class.
- After all the data points are grouped into their respective classes, move each centroid to the mean of the points assigned to it.
- Repeat steps 2 and 3 until the centroids barely move. That is, the movement of the centroids should be negligible.
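
The steps above can be sketched compactly with NumPy before we build the fuller version in the next milestone (a hedged sketch: the seeded random initialisation, the `tol` check on centroid movement, and the empty-cluster guard are my own choices, not a standard recipe):

```python
import numpy as np

def kmeans(data, k, max_iter=100, tol=1e-4, seed=0):
    # Step 1: pick k distinct random points as the initial centroids
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest centroid (Euclidean distance)
        labels = np.argmin(
            np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2), axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        # Step 4: stop once every centroid's movement is negligible
        moved = np.linalg.norm(new_centroids - centroids, axis=1)
        centroids = new_centroids
        if np.all(moved < tol):
            break
    return centroids, labels
```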

Now, you might be wondering how to decide the value of K in the first step.

There is a process called the Elbow Method that you can use to decide the number of classes into which you want to group your data.

Here's a graph that might help you understand the elbow method.

In the picture below, you will notice that adding more clusters after 3 doesn't model the data much better. The first clusters add a lot of information, but at some point the marginal gain starts dropping.
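
To make the elbow concrete, here is a hedged sketch that computes the within-cluster sum of squares (scikit-learn calls it `inertia_`) for a range of K values, using scikit-learn's `KMeans` for brevity rather than our from-scratch version; `elbow_inertias` is my own helper name:

```python
import numpy as np
from sklearn.cluster import KMeans

def elbow_inertias(X, k_values):
    # within-cluster sum of squared distances for each candidate K;
    # plot these against K and look for the "elbow" where the curve flattens
    return [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in k_values]

# three well-separated blobs: inertia drops sharply up to K=3, then flattens
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.2, size=(30, 2)) for c in (0, 5, 10)])
inertias = elbow_inertias(X, range(1, 7))
```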

The advantage of K-Means is that it's pretty fast, since it has only linear complexity.

But K-Means has a couple of disadvantages. You have to select how many groups you want to create, whereas ideally we'd want the algorithm to figure that out for us, since the whole point of clustering is to gain insight from the data. K-Means also starts with a random choice of cluster centres, so it may yield different clustering results on different runs. Thus, the results may not be repeatable and can lack consistency; other clustering methods are more consistent.

To implement this in code, you first take the first n values of the data as the initial centroids, where n is the number of classes.

Next, we start moving each centroid towards the true centre of its group.

To figure out which class a data point belongs to, we calculate its Euclidean distance to each centroid; the nearest centroid decides the class.

To get a better understanding, let's see this in context.

Milestone: Unsupervised Machine Learning

We are going to use the wine data set to apply K-Means clustering.

Let us write the code, and along with it I'll explain what each part does.

`In [ 1 ]:`

```
# Importing required Packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
#Importing dataset
from sklearn.datasets import load_wine
%matplotlib inline
```

`In [ 2 ]:`

```
k = 3 # defining the number of classes
tol = 0.001 #tolerance - This represents the degree of movement that is allowed for the centroids
max_iter = 300 # maximum number of iterations
colors = 10*["g","r","c","b","k",'y'] # color values to plot the graph
```

`In [ 3 ]:`

```
# The fit function trains the centroids: it finds the final centroid positions.
def fit(data, k, max_iter, tol):
    centroids = {}  # dictionary to keep track of the centroid values
    for i in range(k):
        # Assign the initial centroids from the input data. Since k is 3,
        # the first 3 values of the data become the starting centroids.
        centroids[i] = data[i]
    for i in range(max_iter):
        classifications = {}  # tracks which points belong to which centroid
        for j in range(k):
            # one empty list per class, e.g. classifications = {0: [], 1: [], 2: []}
            classifications[j] = []
        for featureset in data:
            # Calculate the distance of the data point from all three centroids,
            # then assign the point to the centroid with the minimum distance.
            distances = [np.linalg.norm(featureset - centroids[centroid]) for centroid in centroids]
            classification = distances.index(min(distances))  # index of the nearest centroid
            classifications[classification].append(featureset)
        # Remember the current centroids before updating, since they are about to change
        prev_centroids = dict(centroids)
        # Update each centroid to the average of the data points assigned to its class
        for classification in classifications:
            centroids[classification] = np.average(classifications[classification], axis=0)
        optimized = True
        # Calculate by what degree each centroid has moved. Summing the absolute
        # percentage changes gives a single number to compare against the tolerance.
        # If any centroid moved more than the tolerance, keep iterating; once every
        # centroid's movement falls below the tolerance, the centroids have settled.
        for c in centroids:
            original_centroid = prev_centroids[c]
            current_centroid = centroids[c]
            if np.sum(np.abs((current_centroid - original_centroid) / original_centroid * 100.0)) > tol:
                optimized = False
        if optimized:
            break
    return centroids, classifications

# Predict which class a given data point belongs to
def predict(data, centroids):
    distances = [np.linalg.norm(data - centroids[centroid]) for centroid in centroids]
    classification = distances.index(min(distances))
    return classification
```

`In [ 4 ]:`

```
# Data initialization
wine = load_wine()
X = pd.DataFrame(wine.data, columns=wine.feature_names)
Y = pd.DataFrame(wine.target, columns=['target'])
```

`In [ 5 ]:`

```
# Normalization
X = (X-X.min()) / (X.max()-X.min())
# Data splitting for train and test
xtrain, xtest, ytrain, ytest = train_test_split(X,Y, test_size=0.3)
```

`In [ 6 ]:`

```
# choosing two features
f1 = 'hue'
f2 = 'proline'
basex = np.array(xtrain[[f1, f2]])
basey = np.array(ytrain['target'])
xtest = np.array(xtest[[f1, f2]])
ytest = np.array(ytest['target'])
```

`In [ 7 ]:`

```
x = pd.DataFrame(basex)
y = basey
```

`In [ 8 ]:`

```
centroids, classifications = fit(basex,k,max_iter,tol)
```

`In [ 9 ]:`

```
# Code to plot the graph below
for centroid in centroids:
    plt.scatter(centroids[centroid][0], centroids[centroid][1], marker='o', color='k', s=50)
for classification in classifications:
    color = colors[classification]
    for featureset in classifications[classification]:
        plt.scatter(featureset[0], featureset[1], marker='.', color=color, s=50)
for unknown in xtest:
    classification = predict(unknown, centroids)
    plt.scatter(unknown[0], unknown[1], marker='*', color=colors[classification], s=50)
```

`Out [ 9 ]:`