<![CDATA[Hemath Deunei]]>https://blog.hemath.comhttps://cdn.hashnode.com/res/hashnode/image/upload/v1682855767520/7j7sF2XB-.pngHemath Deuneihttps://blog.hemath.comRSS for NodeMon, 14 Oct 2024 12:37:46 GMT60<![CDATA[Let's break down - Neural Network]]>https://blog.hemath.com/lets-break-down-neural-networkhttps://blog.hemath.com/lets-break-down-neural-networkThu, 08 Dec 2022 11:19:24 GMT<![CDATA[<h3 id="heading-cost-functions"><strong>Cost Functions</strong></h3><p>The cost function measures how far away a particular solution is from an optimal solution to the problem at hand. The goal of every machine learning model is to minimize this very function by tuning the parameters and using the functions available in the solution space. In other words, a cost function is a measure of how well a network did with respect to its training sample. Being a function, it returns a single value as its measure.</p><h3 id="heading-feedforward-neural-network"><strong>Feedforward Neural Network</strong></h3><p>A feedforward neural network is basically a stack of layers of neurons connected to each other. It takes an input, which traverses through the hidden layers and finally reaches the output layer. Let 𝓪<sub>j</sub><sup>(i) </sup> be the output of the j<sup>th</sup> neuron in the i<sup>th</sup> layer. In that sense, 𝓪<sub>j</sub><sup>(1) </sup> is the output of the j<sup>th</sup> neuron in the input layer. 
So, the inputs to the subsequent layers will be:</p><p>a<sup>i</sup><sub>j</sub> = <strong><em>σ</em></strong>(<strong><em>Σ</em></strong><sub>k</sub>(<strong><em>w</em></strong><sup>i</sup><sub>jk</sub><strong><em>a</em></strong><sup>i-1</sup><sub>k</sub>) + b<sup>i</sup><sub>j</sub>)</p><p>Where:</p><p><strong><em>σ</em></strong> is the activation function,<br /><strong><em>w</em></strong><sup>i</sup><sub>jk</sub> is the weight from the <strong><em>k</em></strong><sup>th</sup> neuron in the <strong><em>(i-1)</em></strong><sup>th</sup> layer to the <strong><em>j</em></strong><sup>th</sup> neuron in the <strong><em>i</em></strong><sup>th</sup> layer,<br />b<sup>i</sup><sub>j</sub> is the bias of the <strong><em>j</em></strong><sup>th</sup> neuron in the <strong><em>i</em></strong><sup>th</sup> layer, and<br />a<sup>i</sup><sub>j </sub> represents the activation value of the <strong><em>j</em></strong><sup>th </sup> neuron in the <strong><em>i</em></strong><sup>th</sup> layer.<br />Sometimes, the input to the activation function is written as z<sup>i</sup><sub>j</sub>.</p><h3 id="heading-bias"><strong>Bias</strong></h3><p>A bias value allows you to shift the activation function to the left or right. It plays the same role that the constant b plays in the linear equation y=ax+b. Essentially, it helps shift the output of the activation function so that the prediction fits the data better. If the sigmoid function is used as the activation function, the bias adjusts the steepness of the curve.</p><p>Let's go back to cost functions. The cost function of a neural network generally depends on the weights, the biases, the inputs of the training samples and the desired output from the model. 
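</p>
<p>As a concrete (hypothetical) illustration of the layer computation a<sup>i</sup><sub>j</sub> = σ(Σ<sub>k</sub> w<sup>i</sup><sub>jk</sub> a<sup>i-1</sup><sub>k</sub> + b<sup>i</sup><sub>j</sub>), here is a minimal sketch in Python with NumPy, using the sigmoid as the activation function; the layer sizes and numbers are made up:</p>

```python
import numpy as np

def sigmoid(z):
    # Squashes the net input into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(a_prev, W, b):
    # a_prev: activations a^(i-1) of the previous layer, shape (k,)
    # W: weights w[j][k] from neuron k in layer i-1 to neuron j in layer i
    # b: biases of layer i, shape (j,)
    z = W @ a_prev + b          # net input z^i_j
    return sigmoid(z)           # activation a^i_j

# A toy layer: 3 inputs feeding 2 neurons (values are arbitrary).
a1 = np.array([1.0, 0.5, -0.5])
W = np.array([[0.2, -0.1, 0.4],
              [0.7, 0.3, -0.2]])
b = np.array([0.1, -0.3])
a2 = layer_forward(a1, W, b)
```

<p>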
We will look at cost functions for both regularized logistic regression and feedforward (classification) networks.</p><h3 id="heading-regularized-amp-feedforward-cost-function"><strong>Regularized & feedforward cost function</strong></h3><p>In the case of regularized logistic regression, where we have just one output, the cost function is defined as:</p><p><img src="https://dezyre.gumlet.io/files.dezyre.com/images/Tutorials/Neural+Network+Training-1.jpg?w=1000&dpr=1.0" alt class="image--center mx-auto" /></p><p>This cost function is derived from the maximum likelihood estimation concept. 𝜭 is the parameter which we need to find such that J(𝜭) is minimized. This is done using gradient descent. <strong>( x <sup>i</sup>, y <sup>i</sup>)</strong> s are the training inputs here.</p><p>For a feedforward neural network, our cost function will be a generalization of this, as it involves multiple outputs, k, as in the following function.</p><p><img src="https://dezyre.gumlet.io/files.dezyre.com/images/Tutorials/Neural+Network+Training-2.jpg?w=1000&dpr=1.0" alt class="image--center mx-auto" /></p><p>The first half of the sum is nothing but a summation over the k output units, of which we had just one in the case of logistic regression above. Because of the multiple outputs, all the expressions are now written in vectorised form. The second half of the sum is a triple-nested summation of the regularization term, also known as the weight-decay term. The λ here adjusts the relative importance of the two terms. We need to minimize this very cost function for the model to perform better. To do so, let's discuss the back-propagation algorithm.</p><h3 id="heading-backpropagation-learning"><strong>Backpropagation Learning</strong></h3><p>Back-propagation compares the actual value with the output of the network and checks how well the parameters perform. It propagates backwards through the previous layers and calculates the error associated with each unit in them, till it reaches the input layer, where there is no error. 
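</p>
<p>Before going deeper into back-propagation, the single-output cost above can be made concrete. This is a sketch with made-up data of the regularized logistic-regression cost J(𝜭): the cross-entropy data term plus the λ weight-decay term, with the bias parameter left unregularized by convention:</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y, lam):
    # J(theta): cross-entropy data term + L2 weight-decay term.
    # By convention the bias parameter theta[0] is not regularized.
    m = len(y)
    h = sigmoid(X @ theta)  # predicted probabilities
    data_term = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    decay_term = (lam / (2 * m)) * np.sum(theta[1:] ** 2)
    return data_term + decay_term

# Tiny made-up training set; the first column of X is the bias feature.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0])
```

<p>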
The errors measured at each unit are used to calculate partial derivatives, which in turn are used by the gradient descent algorithm to minimize the cost function. Gradient descent uses these values to adjust the 𝜭 values till it converges.</p><p>Let 𝛅<sub>j</sub><sup>l</sup> be the error for node j in layer l and a<sub>j</sub><sup>l </sup> be the computed activation value.<br />Then for the output layer l, the error 𝛅<sub>j</sub><sup>l</sup> will be: 𝛅<sub>j</sub><sup>l</sup> = a<sub>j</sub><sup>l </sup> - y<sub>j</sub><br />Where y<sub>j</sub> is the actual value observed in the training sample. In vector form, the same can be re-written as: 𝛅<sup>l</sup> = a<sup>l </sup> - y, and so on.<br />So, for example, once we have 𝛅<sup>4 </sup> calculated, then according to the back-propagation algorithm, the error in the previous layers can be calculated as:</p><p><img src="https://dezyre.gumlet.io/files.dezyre.com/images/Tutorials/Neural+Network+Training-3.jpg?w=1000&dpr=1.0" alt class="image--center mx-auto" /></p><p>Where,<br />𝜭<sup>(i)</sup> is the vector of parameters mapping layer i to layer i+1,<br />𝛅<sup>4 </sup> is the error vector,<br /><strong><em>g</em></strong> is the chosen activation function; for simplicity, we will stick to the sigmoid function in this case, and<br /><strong><em>Z</em></strong> <sup>i</sup> is nothing but the product of the activation values and the parameter values of the previous layer, <em>i-1</em>.</p><p>Here comes the stage where we see why choosing the sigmoid as the activation function makes sense: its partial derivative is simple to calculate. It can be proven that:</p><p><strong>g</strong>′(<em>z</em><sup>3</sup>)=<em>a</em><sup>3</sup>.*(1-<em>a</em><sup>3</sup>)</p><p><em>(</em> <strong><em>.*</em></strong> <em>is the element-wise multiplication between the two vectors)</em></p><p>Following this procedure, we get all the 𝛅 values for our calculation. 
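</p>
<p>One backward step of this procedure can be sketched as follows; the layer sizes and all the numbers are hypothetical, chosen only to show how 𝛅<sup>3</sup> is obtained from 𝛅<sup>4</sup> using the sigmoid derivative a.*(1-a):</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 4-layer setup: given the output-layer error delta4 = a4 - y,
# propagate it back one layer via delta3 = (Theta3.T @ delta4) * g'(z3),
# where g'(z3) = a3 * (1 - a3) for the sigmoid.
a3 = np.array([0.2, 0.9])          # activations of layer 3 (made-up values)
a4 = np.array([0.7])               # network output
y = np.array([1.0])                # desired output
Theta3 = np.array([[0.5, -0.4]])   # parameters mapping layer 3 to layer 4

delta4 = a4 - y                    # output-layer error
g_prime = a3 * (1 - a3)            # element-wise sigmoid derivative
delta3 = (Theta3.T @ delta4) * g_prime
```

<p>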
Now we can directly write:</p><p><img src="https://dezyre.gumlet.io/files.dezyre.com/images/Tutorials/Neural+Network+Training-4.jpg?w=1000&dpr=1.0" alt class="image--center mx-auto" /></p><p><em>Quick Note: To find the activation values, we actually need to perform forward propagation first.</em></p><h3 id="heading-random-weights-initialization"><strong>Random Weights Initialization</strong></h3><p>First of all, let us see why we need random numbers as the initial weights and why we shouldn't use 0s in their place. It is easy to see that initializing the weights with zero would practically "deactivate" them, since the products of the weights and the parent layer's outputs would be zero; consequently, in the subsequent learning steps, the input information would not be picked up and would be neglected completely. The learning would then rely only on the information supplied by the bias in the input layer. To initialize these weights, we should choose a decimal between 0 and 1 at random and then scale it with a constant factor.</p><ul><li><strong>Training Methods</strong></li></ul><p>Once the structure of a network is decided and the weights are initialized randomly, we train the network next. There are many methods of training a neural network. Often, training is referred to as learning; they are the same in essence. Training is the process by which our model learns from already available data, together with its results, so as to align itself to predict on fresh data. Neural networks have been a great area of research: a lot has been done and a lot is being done. Researchers have developed multiple techniques for training neural networks.</p><p>Essentially, these techniques are divided into two buckets: supervised and unsupervised. As the names suggest, the former covers those techniques in which there is some manual intervention or supervision in training the network or model, while in the latter, the network has to learn by itself. 
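</p>
<p>The initialization scheme from the Random Weights Initialization section above (draw a random decimal, then scale it by a constant factor) can be sketched as below; the factor 0.12 and the symmetric range are illustrative choices, not prescribed values:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(n_out, n_in, factor=0.12):
    # Draw decimals uniformly in [0, 1), then scale them into the small
    # symmetric range (-factor, factor); the factor is an arbitrary constant.
    return (rng.random((n_out, n_in)) * 2.0 - 1.0) * factor

# Weight matrix for a layer with 3 inputs and 2 neurons.
W = init_weights(2, 3)
```

<p>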
In other words, supervised training involves providing the expected output manually and evaluating the network's performance, or simply providing the network with both inputs and outputs. Let's look at these techniques in detail.</p><ul><li><strong>Supervised learning</strong></li></ul><p>As discussed, here both inputs and outputs are provided to the network. The network then processes the inputs and matches the computed outputs against the desired outputs. The error in the output is then back-propagated through the system to adjust the weights and biases. The same set of training data is fed into the network again and again, and the weights are refined and improved.</p><p>There are cases when this technique won't work: even after refining the weights again and again, we do not get the optimized solution. In that case, the modeller has to review the network architecture, the initial weights, the training functions, the number of layers and the number of nodes in them, and so on. This is where the art of building a neural network comes into the picture, which the reader can learn with experience. One more aspect of supervised training of a neural network is that the modeller should not overtrain the model. On overtraining, the network starts memorizing the data and tends to overfit.</p><p>Once the weights are trained and optimized, our neural network model is ready and can be put to use. For industrial purposes, these weights are then frozen and turned into some user interface or hardware for consumption and faster usage.</p><ul><li><strong>Unsupervised learning</strong></li></ul><p>Unsupervised learning is sometimes also known as adaptive learning or self-organization, as the network adapts itself without any human intervention. Only inputs are provided to the network in this technique. The system itself decides which features to use to group the inputs.</p><p>For example, robots are trained through unsupervised learning techniques. 
Unsupervised learning techniques are what make it possible for robots to keep learning on their own as they encounter new circumstances and new situations.</p><h3 id="heading-regression-involving-single-or-multiple-gaussian-targets"><strong>Regression involving single or multiple Gaussian targets</strong></h3><p>Linear regression is the simplest form of regression. We need to model our network to compute a linear combination of the inputs, keeping the weights and biases in context, to get an output. The output is then compared with the desired output to get the error. Normally, the principle of least squares is used to calculate the error in regression. The challenge in this task is finding the most optimized weights that fit our data well.</p><p><em>Quick fact: The simplest neural network actually performs least squares regression.</em></p><ul><li>How to train our network for this?</li></ul><p>The network takes an input with one or multiple features, takes the product with the corresponding weights and gives the sum as the output.</p><p><img src="https://dezyre.gumlet.io/files.dezyre.com/images/Tutorials/Neural+Network+Training-5.jpg?w=1000&dpr=1.0" alt class="image--center mx-auto" /></p><p><img src="https://dezyre.gumlet.io/files.dezyre.com/images/Tutorials/Neural+Network+Training-6.jpg?w=1000&dpr=1.0" alt class="image--center mx-auto" /></p><p>Here,<br />𝑥<sub>i</sub>s are the features of an input,<br />𝔀<sub>i</sub>s are the corresponding weights,<br />𝒉 is the activation function,<br />𝓨<sub>i </sub> is the output, and<br /><strong><em>L</em></strong> is the loss function, calculated by the principle of least squares over the network's computed outputs on the entire training data.</p><p>Gradient descent is used to minimize this very loss function, which refines and optimizes our system. 
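</p>
<p>Putting the pieces above together, here is a sketch of training a single linear neuron by gradient descent on the least-squares loss; the data, learning rate and iteration count are made up for illustration, and the activation 𝒉 is taken to be the identity:</p>

```python
import numpy as np

# Made-up data generated by y = 1 + 2x; the first column of X is the
# bias feature, so the neuron should learn w = (1, 2).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
w = np.zeros(2)

for _ in range(5000):
    y_hat = X @ w                      # linear neuron output
    grad = X.T @ (y_hat - y) / len(y)  # dL/dw: error scaled by the input features
    w -= 0.1 * grad                    # gradient-descent update
```

<p>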
For a particular weight corresponding to one connection, we find the gradient of the loss.</p><p><img src="https://dezyre.gumlet.io/files.dezyre.com/images/Tutorials/Neural+Network+Training-7.jpg?w=1000&dpr=1.0" alt class="image--center mx-auto" /></p><p>Since the example at hand is simple enough, the gradient for each weight is simply the error scaled by the corresponding input feature. So, the gradient is:</p><p><img src="https://dezyre.gumlet.io/files.dezyre.com/images/Tutorials/Neural+Network+Training-8.jpg?w=1000&dpr=1.0" alt class="image--center mx-auto" /></p><p>Thus, the weights are updated accordingly. Now that we have seen this case of a 1-layer neural network, it can be extended and generalized to a multi-layer neural network. The reader is advised to try it out themselves.</p><h3 id="heading-xor-logic-function-using-using-a-3-layered-neural-network"><strong>XOR Logic function using a 3-layered Neural Network</strong></h3><p>Boolean functions of two inputs are among the simplest of all functions, and building basic neural networks that learn such functions is one of the first topics discussed in accounts of neural computing. Of these functions, only two present any difficulty: XOR and its complement. A lot of problems in circuit design were solved with the advent of the XOR gate. XOR can be viewed as addition modulo 2. As a result, XOR gates are used to implement binary addition in computers. A half adder consists of an XOR gate and an AND gate. Other uses include subtractors, comparators, and controlled inverters.</p><p>Let's understand the XOR logic function to start with. XOR is also known as exclusive OR. In simple words: either A or B, but not both. 
This function takes two input arguments with values in {0,1} and returns one output in {0,1}, as specified in the following truth table:</p><table><tbody><tr><td><p><strong>A</strong></p></td><td><p><strong>B</strong></p></td><td><p><strong>A XOR B</strong></p></td></tr><tr><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td></tr><tr><td><p>0</p></td><td><p>1</p></td><td><p>1</p></td></tr><tr><td><p>1</p></td><td><p>0</p></td><td><p>1</p></td></tr><tr><td><p>1</p></td><td><p>1</p></td><td><p>0</p></td></tr></tbody></table><p>Sometimes the inputs are {-1,1} or {true, false} as well, but the essence of the logic is the same: either A or B, but not both. In other words, XOR computes the logical exclusive or, which yields true if and only if the two inputs have different truth values.</p><p><img src="https://dezyre.gumlet.io/files.dezyre.com/images/Tutorials/Neural+Network+Training-9.jpg?w=1000&dpr=1.0" alt class="image--center mx-auto" /></p><p>The task at hand is to implement this very function using neural networks. Basically, the network should take the inputs, A and B in our context, and output A XOR B accordingly.</p><p>Whatever solution we give should be efficient. We will try to implement this using a 3-layer neural network: one input, one hidden and one output layer. One of the most interesting features of a neural network to recall now is that the weighting of the inputs is learned during the training process. Yes, we are talking about weights here. Once random weights are initialized, all other adjustments are made taking the performance on the training inputs into account. What we essentially have to create here is a network which takes an input and gives an output. In case the output is not the desired one, we back-propagate the signal that there is an error. This should adjust the parameters to minimize the error. 
Let's understand this further.</p><p>First, we should initialize the weights to small random values, set the threshold of each node, and initialize the biases. Just to recall, a bias is a quantity that we add to the total input when calculating the activation at each node, and a node in a neural network becomes active when its activation crosses a certain threshold. Now that the thresholds are set, the activation function must be chosen wisely.</p><p><img src="https://dezyre.gumlet.io/files.dezyre.com/images/Tutorials/Neural+Network+Training-10.jpg?w=1000&dpr=1.0" alt class="image--center mx-auto" /></p><p>The activation function that we chose is 1 over 1 plus e raised to the power of the negative net input, i.e., the sigmoid. Here, the net input is the sum of the products of the inputs from the input layer and the corresponding connection weights. The next step is to train our neural network using the entries in the truth table.</p><p><img src="https://dezyre.gumlet.io/files.dezyre.com/images/Tutorials/Neural+Network+Training-11.jpg?w=1000&dpr=1.0" alt class="image--center mx-auto" /></p><p>The nodes in the hidden layer act like a two-layer perceptron. This is also called a 2-2-1 fully connected topology. In simple language, this is what we have to do:</p><ol><li><p>Initialize the network with random weights and biases.</p></li><li><p>Apply the inputs x1 and x2 from the truth table and run the network.</p></li><li><p>Compare the output of the network with the desired output mentioned in the truth table and calculate the error.</p></li><li><p>Adjust the weights of the connections to the output nodes and the biases. This is done by recursively computing the local gradient of each weight.</p></li><li><p>The error is then back-propagated to the hidden layer, and the weights and biases corresponding to this particular layer are adjusted.</p></li></ol><p>This cycle repeats till the average error across the 4 entries in the truth table approaches zero. 
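</p>
<p>The five steps above can be sketched end-to-end as follows; the seed, learning rate and iteration count are arbitrary illustrative choices, and plain gradient descent can occasionally stall in a local minimum, so a different seed may need more iterations:</p>

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 2-2-1 fully connected network trained by back-propagation on the XOR
# truth table. Hyperparameters are illustrative, not tuned.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.uniform(-1, 1, (2, 2)); b1 = np.zeros((1, 2))  # input -> hidden
W2 = rng.uniform(-1, 1, (2, 1)); b2 = np.zeros((1, 1))  # hidden -> output
lr = 0.5

for _ in range(20000):
    # Step 2: forward pass over all four truth-table rows at once.
    a1 = sigmoid(X @ W1 + b1)
    a2 = sigmoid(a1 @ W2 + b2)
    # Steps 3-5: output error, back-propagated hidden error (the sigmoid
    # derivative is a * (1 - a)), then gradient-descent updates.
    d2 = (a2 - y) * a2 * (1 - a2)
    d1 = (d2 @ W2.T) * a1 * (1 - a1)
    W2 -= lr * (a1.T @ d2); b2 -= lr * d2.sum(axis=0, keepdims=True)
    W1 -= lr * (X.T @ d1);  b1 -= lr * d1.sum(axis=0, keepdims=True)

mse = float(np.mean((a2 - y) ** 2))  # average error over the 4 entries
```

<p>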
The activation function chosen can never actually reach 0 or 1, because of the exponential term in the denominator. So, a workaround is applied to the activated value: if it is less than 0.1, we approximate it as 0, and if it is more than 0.9, we approximate it as 1.</p><p>The network stops when a local minimum is reached. If the system appears to be stuck, it has hit what is known as a 'local minimum'. The bias of the hidden node will eventually head towards zero, and as it approaches zero, the system will move out of the local minimum and finish. This is a result of a 'momentum term' that is used as part of the weight updates. We have seen how beautifully the back-propagation algorithm helps in implementing the XOR logic function using neural nets. This can be further generalized to any logical function whose truth table is well defined. The reader should try to get their hands dirty on another type of logic function and implement it using neural networks. But do remember, proposing an optimal solution, i.e., one which uses the minimum resources, is always preferable.</p><p>The implementation of XOR logic functions using neural nets actually paved the way to the popularity of the multi-layer perceptrons that we used in our explanation. And this led to many more interesting neural network and machine learning designs.</p>]]>
As cost function is a function, it returns just a value of measure.</p><h3 id="heading-feedforward-neural-network"><strong>Feedforward Neural Network</strong></h3><p>A feedforward neural network is basically a multi-layer (of neurons) connected with each other. It takes an input, traverses through its hidden layer and finally reaches the output layer. Let 𝓪<sub>j</sub><sup>(i) </sup> be the output of the j<sup>th</sup> neuron in the i<sup>th</sup> layer. In that sense, 𝓪<sub>j</sub><sup>(1) </sup> is the j<sup>th</sup> neuron in the input layer. So, the subsequent layers inputs will be:</p><p>a<sup>i</sup><sub>j</sub>\=<strong><em></em></strong>(<strong><em></em></strong><sub>k</sub>(<strong><em>w</em></strong><sup>i</sup><sub>jk</sub><strong><em>a</em></strong><sup>i1</sup><sub>k</sub>)+b<sup>i</sup><sub>j</sub>)</p><p>Where:</p><p><strong><em></em></strong> is the activation function,<br /><strong><em>w</em></strong><sup>i</sup><sub>jk</sub> is the weight from the <strong><em>k</em></strong><sup>th</sup> neuron in the <strong><em>(i 1)</em></strong><sup>th</sup> layer to the <strong><em>j</em></strong><sup>th</sup> neuron in the <strong><em>i</em></strong><sup>th</sup> layer,<br />b<sup>i</sup><sub>j</sub> is the bias of the <strong><em>j</em></strong><sup>th</sup> neuron in the <strong><em>i</em></strong><sup>th</sup> layer, and<br />a<sup>i</sup><sub>j </sub> represents the activation value of the <strong><em>j</em></strong><sup>th </sup> neuron in the <strong><em>i</em></strong><sup>th</sup> layer.<br />Sometimes, the input to the activation function, is written as z<sup>ij</sup></p><h3 id="heading-bias"><strong>Bias</strong></h3><p>A bias value allows you to shift the activation function to the left or right. It plays the same role which a coefficient b plays in the following linear equation: y=ax+b. Essentially, it helps in shift the output of the activation function so as to fit the prediction with data better. 
In case sigmoid function is used as an activation function, bias is used to adjust the steepness of the curve.</p><p>Lets go back to cost function. So the cost function of a neural network generally depends on weights, biases, inputs of the training samples and the desired output from the model. We will learn cost functions for both feedforward (classification) and regularized (logistic regression).</p><h3 id="heading-regularized-amp-feedforward-cost-function"><strong>Regularized & feedforward cost function</strong></h3><p>In case of normal regularized or logistic regression, where we just have one output, the cost function is defined as:</p><p><img src="https://dezyre.gumlet.io/files.dezyre.com/images/Tutorials/Neural+Network+Training-1.jpg?w=1000&dpr=1.0" alt class="image--center mx-auto" /></p><p>This cost function is derived from maximum likelihood estimation concept. 𝜭 is the parameter which we need to find such that J(𝜭) is minimized. The same is done using gradient descent. <strong>( x <sup>i</sup>, y <sup>i</sup>)</strong> s are the inputs here.</p><p>For feedforward neural network, our cost function will be the generalization of this as it involves in multiple outputs, k, in the following function.</p><p><img src="https://dezyre.gumlet.io/files.dezyre.com/images/Tutorials/Neural+Network+Training-2.jpg?w=1000&dpr=1.0" alt class="image--center mx-auto" /></p><p>The first half of the sum is nothing but summation over the k output units that we had just one in case of logistic regression above. Because of multiple outputs, all the expressions are written in vectorised form now. The second half of the sum is a triple nested summation of the regularized term, also known as weight decay term. The here, adjusts the importance of the terms. We need to minimize this very cost function for the model to perform better. 
For the same, lets discuss back-propagation algorithm.</p><h3 id="heading-backpropagation-learning"><strong>Backpropagation Learning</strong></h3><p>Back propagation basically compares the real value with the output of the network and checks the efficiency of the parameters. It back-propagates to the previous layers and calculates the error associated with each unit present in them, till it reaches the input layer, where there is no error. The errors measured at each unit is used to calculate partial derivatives which in turn is used by the gradient descent algorithm to minimize the cost function. Gradient descent uses these values to minimize and adjust the 𝜭 values till it converges.</p><p>Let 𝛅<sub>j</sub><sup>l</sup> be the error for node j in layer l and a<sub>j</sub><sup>l </sup> be the totally calculated activation value.<br />Then for any layer l, the error, 𝛅<sub>j</sub><sup>l</sup> will be: 𝛅<sub>j</sub><sup>l</sup> = a<sub>j</sub><sup>l </sup> - y<sub>j</sub><br />Where, y<sub>j</sub> is the actual value observed in the training sample. 
In terms of vector, the same can be re-written as: 𝛅<sup>l</sup> = a<sup>l </sup> - y and so on.<br />So, for example, we have 𝛅<sup>4 </sup> calculated, then according to backpropagation algorithm, the error in previous layers can be calculated as:</p><p><img src="https://dezyre.gumlet.io/files.dezyre.com/images/Tutorials/Neural+Network+Training-3.jpg?w=1000&dpr=1.0" alt class="image--center mx-auto" /></p><p>Where,<br />𝜭<sup>(i)</sup> is the vector of parameters for layer i to i+1,<br />𝛅<sup>4 </sup> is the error vector,<br /><strong><em>g</em></strong> is the activation function chosen, for simplicity, we will stick to sigmoid function in this case.<br /><strong><em>Z</em></strong> <sup>i</sup> are nothing but the product of activation values and parameter values of the previous layer, <em>i-1</em></p><p>Here comes the stage, when we will see as to why choosing sigmoid function as activation function makes sense as its partial derivative calculation is simple. It can be proven that:</p><p><strong>g</strong>(<em>z</em><sup>3</sup>)=<em>a</em><sup>3</sup>.*(1-<em>a</em><sup>3</sup>)</p><p><em>(</em> <strong><em>.*</em></strong> <em>is the element wise multiplication between the two vectors)</em></p><p>Following this procedure, we get all the 𝛅 values for our calculation. Now, straightforward, we can write:</p><p><img src="https://dezyre.gumlet.io/files.dezyre.com/images/Tutorials/Neural+Network+Training-4.jpg?w=1000&dpr=1.0" alt class="image--center mx-auto" /></p><p><em>Quick Note: To find the activation values, we actually need to perform forward propagation.</em></p><h3 id="heading-random-weights-initialization"><strong>Random Weights Initialization</strong></h3><p>First of all, let us see why we need random numbers as initial weights and why we shouldnt use 0s in place of them. 
It is anything but difficult to see instating weights with zero would for all intents and purposes "deactivate" the weights since weights by output of parent layer will be zero, subsequently, in the next learning steps, our information would not be perceived, the information would be neglected completely. So the learning would just have the information supplied by the bias in the input layer. What we should do to initialize these weights is, choose a decimal between 0 and 1 randomly and then scale it with a constant factor.</p><ul><li><strong>Training Methods</strong></li></ul><p>Once the structure of a network is decided and weights are initialized randomly, we train the network next. There are many methods of training a neural network. Often, training is substituted as learning, but they are the same in essence. Training is the process by which our model learns on the already available data with results to align itself to predict on fresh data. Neural Networks have been a great area of research and a lot have been done and a lot is being done. Researchers have developed multiple techniques of training neural networks.</p><p>Essentially, there are two buckets in which these techniques are divided. Supervised and Unsupervised. As the name suggests, the former deals with those techniques in which there is some manual intervention or supervision in training the network or the model, and in the latter, the network has to learn itself. In other words, Supervised training involves a system of providing the expected output manually and evaluating the networks performance or of simply providing the network with inputs and outputs. Lets look at these techniques in detail.</p><ul><li><strong>Supervised learning</strong></li></ul><p>As discussed, here, both inputs and outputs are provided to the network. The network then processes the inputs and matches the desired output with the computed outputs. 
The error in the output is then back-propagated in the system to adjust the weights and biases. The same set of training data is fed into the network again and again and the weights are refined and improved.</p><p>There is always a case when this technique wont work. Even after refining the weights again and again do not give us the optimized solution. In that case, the modeller has to review the network architecture, the initial weights, the training functions, the number of layers and the number of nodes in them, etc. This is where, the art of building a neural network comes into picture, which the reader can learn with experience. One more aspect of supervised training of neural network is that the modeller should not over train the model. On overtraining, the network starts learning the data and tends to over fit.</p><p>Once the weights are trained and optimized, we are ready with our neural network model and can be put to use now. For industrial purposes, these coordinates are then frozen and turned into some user interface or hardware to be put to consumption and faster usage.</p><ul><li><strong>Unsupervised learning</strong></li></ul><p>Unsupervised learning is also known as adaptive learning or self-organization sometimes as the network adapts itself without any human intervention. Only inputs are provided to the network in this technique. The system itself decides what features to use to group the inputs.</p><p>For example, Robots are trained through unsupervised learning techniques. Unsupervised learning techniques are the ones which are making it possible for the robots to persistently learn all alone as they experience new circumstances and new situations.</p><h3 id="heading-regression-involving-single-or-multiple-gaussian-targets"><strong>Regression involving single or multiple Gaussian targets</strong></h3><p>Linear Regression is the simplest form of regression. 
We need to model our network to compute the linear combination of the inputs keeping weights and biases in context to get an output. The output is then compared with the desired output to get the error. Normally, the principal of least squares is used to calculate the error in regression. The challenge in this task is finding the most optimized weights that fits well in our data.</p><p><em>Quick fact: The simplest neural network actually performs least squares regression.</em></p><ul><li>How to train our network for this?</li></ul><p>The network takes an input with one or multiple features, takes the product with corresponding weights and give its sum as the output.</p><p><img src="https://dezyre.gumlet.io/files.dezyre.com/images/Tutorials/Neural+Network+Training-5.jpg?w=1000&dpr=1.0" alt class="image--center mx-auto" /></p><p><img src="https://dezyre.gumlet.io/files.dezyre.com/images/Tutorials/Neural+Network+Training-6.jpg?w=1000&dpr=1.0" alt class="image--center mx-auto" /></p><p>Here,<br />𝑥<sub>i</sub>s are the features of an input,<br />𝔀<sub>i</sub>s are the corresponding weights,<br />𝒉 is the activation function,<br />𝓨<sub>i </sub> is the output, and<br /><strong><em>L</em></strong> is the loss function calculated by the principle of least squares of the networks calculated outputs on the entire training data.</p><p>Gradient descent is used to minimize this very loss function which refines and optimizes our system. For a particular weight corresponding to one connection, we find the gradient of the loss.</p><p><img src="https://dezyre.gumlet.io/files.dezyre.com/images/Tutorials/Neural+Network+Training-7.jpg?w=1000&dpr=1.0" alt class="image--center mx-auto" /></p><p>Since the example at hand is simple enough, the gradient for the weights are simply the corresponding input features. 
So, the gradient is:</p><p><img src="https://dezyre.gumlet.io/files.dezyre.com/images/Tutorials/Neural+Network+Training-8.jpg?w=1000&dpr=1.0" alt class="image--center mx-auto" /></p><p>Thus, the weights are updated accordingly. Now that we have seen this case of a 1-layer neural network, the approach can be extended and generalized to multi-layer neural networks. The reader is advised to try it out themselves.</p><h3 id="heading-xor-logic-function-using-using-a-3-layered-neural-network"><strong>XOR Logic Function Using a 3-Layered Neural Network</strong></h3><p>Boolean functions of two inputs are among the simplest of all functions, and building basic neural networks that learn such functions is one of the first topics discussed in accounts of neural computing. Of these functions, only two present any difficulty: XOR and its complement. A lot of problems in circuit design were solved with the advent of the XOR gate. XOR can be viewed as addition modulo 2; as a result, XOR gates are used to implement binary addition in computers. A half adder consists of an XOR gate and an AND gate. Other uses include subtractors, comparators, and controlled inverters.</p><p>Let's understand the XOR logic function to start with. XOR is also known as exclusive OR. In simple words: either A or B, but not both.
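As a quick aside, the half adder mentioned above is easy to express directly in Python (a plain bitwise illustration, independent of any neural network): `^` is XOR, i.e. addition modulo 2, and `&` is AND, which supplies the carry bit.

```python
def half_adder(a, b):
    """Return (sum_bit, carry_bit) for one-bit inputs a, b in {0, 1}."""
    return a ^ b, a & b  # XOR gives the sum bit, AND gives the carry bit

for a in (0, 1):
    for b in (0, 1):
        s, c = half_adder(a, b)
        print(f"{a} + {b} -> sum={s}, carry={c}")  # e.g. 1 + 1 -> sum=0, carry=1
```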
This function takes two input arguments with values in {0,1} and returns one output in {0,1}, as specified in the following truth table:</p><table><tbody><tr><td><p><strong>A</strong></p></td><td><p><strong>B</strong></p></td><td><p><strong>A XOR B</strong></p></td></tr><tr><td><p>0</p></td><td><p>0</p></td><td><p>0</p></td></tr><tr><td><p>0</p></td><td><p>1</p></td><td><p>1</p></td></tr><tr><td><p>1</p></td><td><p>0</p></td><td><p>1</p></td></tr><tr><td><p>1</p></td><td><p>1</p></td><td><p>0</p></td></tr></tbody></table><p>Sometimes the inputs are {-1,1} or {true, false} as well, but the essence of the logic is the same: either A or B, but not both. In other words, XOR computes the logical exclusive or, which yields true if and only if the two inputs have different truth values.</p><p><img src="https://dezyre.gumlet.io/files.dezyre.com/images/Tutorials/Neural+Network+Training-9.jpg?w=1000&dpr=1.0" alt class="image--center mx-auto" /></p><p>The task at hand is to implement this very function using neural networks: the network should take the inputs, A and B in our context, and output A XOR B accordingly.</p><p>Whatever solution we give should be efficient. We will implement this using a neural network with 3 layers: one input, one hidden and one output layer. One of the most interesting features of a neural network to recall now is that the weighting of the inputs is learned during the training process. Once random weights are initialized, all further adjustments are made based on performance on the training inputs. Essentially, we have to create a network which takes an input and gives an output. In case the output is not the desired one, we back-propagate a signal that there is an error, which is used to adjust the weights so as to minimize the error.
Let's understand this further.</p><p>First, we should initialize the weights to small random values, reset the threshold of each node and initialize a bias. Just to recall, the bias is a quantity that we add to the total input when calculating the activation at each node, and a node in a neural network becomes active when its activation crosses a certain threshold. Now that we have reset the threshold, the activation function must be chosen wisely.</p><p><img src="https://dezyre.gumlet.io/files.dezyre.com/images/Tutorials/Neural+Network+Training-10.jpg?w=1000&dpr=1.0" alt class="image--center mx-auto" /></p><p>The activation function that we chose is 1 over 1 plus e raised to the power of the negative net input, i.e., the sigmoid function. Here, the net input is the sum of the products of the inputs from the previous layer and the corresponding connection weights. The next step is to train our neural network using the entries in the truth table.</p><p><img src="https://dezyre.gumlet.io/files.dezyre.com/images/Tutorials/Neural+Network+Training-11.jpg?w=1000&dpr=1.0" alt class="image--center mx-auto" /></p><p>The nodes in the hidden layer act like a two-layer Perceptron. This is also called a 2-2-1 fully connected topology. In simple language, this is what we have to do:</p><ol><li><p>Initialize the network with random weights and biases.</p></li><li><p>Apply the inputs x1 and x2 from the truth table and run the network.</p></li><li><p>Compare the output of the network with the desired output mentioned in the truth table and calculate the error.</p></li><li><p>Adjust the weights of the connections to the output nodes and the biases. This is done by recursively computing the local gradient of each weight.</p></li><li><p>The error is then back-propagated to the hidden layer, and the weights and biases of that layer are adjusted in turn.</p></li></ol><p>This cycle repeats till the average error across the 4 entries in the truth table approaches zero.
The chosen activation function can never actually output 0 or 1 exactly, because of the exponential term <strong>e</strong> in the denominator. So, a workaround is used: if the activated value is less than 0.1, we approximate it as 0, and if it is more than 0.9, we approximate it as 1.</p><p>Training stops when a minimum of the error is reached. If the system appears to be stuck, it has hit what is known as a 'local minimum'. The bias of the hidden node will eventually head towards zero; as it approaches zero, the system moves away from the local minimum and finishes training. This is the result of a 'momentum term' that is used as part of the weight updates. We have seen how elegantly the back-propagation algorithm implements the XOR logic function using neural nets. This can be generalized to any logical function whose truth table is well defined. The reader should get their hands dirty with one other type of logic function and implement it using neural networks. But do remember: an optimal solution, i.e., one which uses the minimum resources, is always preferable.</p><p>The implementation of XOR logic functions using neural nets actually paved the way for the popularity of the multi-layer perceptrons that we used in our explanation.
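The whole training procedure above, including the momentum term, can be sketched in plain NumPy. This is an illustrative implementation rather than the article's exact network: it uses a slightly wider hidden layer (2-4-1 instead of 2-2-1) and hand-picked learning-rate and momentum values so that training converges reliably from a random start:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 2 inputs -> 4 hidden -> 1 output (wider than 2-2-1, for reliable convergence).
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))
vW1, vb1 = np.zeros_like(W1), np.zeros_like(b1)   # momentum "velocities"
vW2, vb2 = np.zeros_like(W2), np.zeros_like(b2)
lr, momentum = 0.5, 0.9

for _ in range(10000):
    h = sigmoid(X @ W1 + b1)               # hidden activations
    out = sigmoid(h @ W2 + b2)             # network output
    d_out = (out - y) * out * (1 - out)    # output-layer delta
    d_h = (d_out @ W2.T) * h * (1 - h)     # delta back-propagated to hidden layer
    # Momentum update: blend the new gradient step with the previous step.
    vW2 = momentum * vW2 - lr * (h.T @ d_out); W2 += vW2
    vb2 = momentum * vb2 - lr * d_out.sum(axis=0); b2 += vb2
    vW1 = momentum * vW1 - lr * (X.T @ d_h); W1 += vW1
    vb1 = momentum * vb1 - lr * d_h.sum(axis=0); b1 += vb1

# Thresholding workaround from the text: snap near-0 / near-1 outputs.
pred = (out > 0.5).astype(int).ravel()
print(pred)  # goal: [0 1 1 0]
```

The narrower 2-2-1 topology from the text works with the same code, but it is more sensitive to the random initialization and can land in the local minima discussed above; the momentum term is precisely what helps the weights roll out of such flat regions.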
This, in turn, led to many more interesting neural network and machine learning designs.</p>]]>https://cdn.hashnode.com/res/hashnode/image/upload/v1687174216398/7899c9a8-4a02-479c-821b-0a0caebfd9de.png<![CDATA[Linear Regression Model using English Premier League (EPL) Soccer Data]]>https://blog.hemath.com/linear-regression-model-using-english-premier-league-epl-soccer-datahttps://blog.hemath.com/linear-regression-model-using-english-premier-league-epl-soccer-dataMon, 05 Dec 2022 10:41:06 GMT<![CDATA[<h2 id="heading-overview"><strong>Overview</strong></h2><p>The English Premier League is one of the world's most-watched soccer leagues, with an estimated audience of 12 million people per game. Given the substantial financial stakes, all major EPL teams are interested in analytics and AI. Machine learning and artificial intelligence (AI) have become extremely popular in sports analytics. The sports entertainment sector and the relevant stakeholders extensively use sophisticated algorithms to improve earnings and reduce the business risk associated with selecting, or betting on, the wrong players.</p><p><img src="https://cdn.pixabay.com/photo/2016/04/15/20/28/football-1331838__340.jpg" alt="image" class="image--center mx-auto" /></p><p>Regression is one of the foundational techniques in Machine Learning. As one of the most well-understood algorithms, linear regression plays a vital role in solving real-life problems. In this project, we wish to use Linear Regression to predict the scores of EPL soccer players. That covers the business context.
Let's get into the project's technical details.</p><p>This project is part of the Linear Regression Beginner Project Series, and it consists of discussing and implementing the fundamentals of Linear Regression in Python on the EPL Soccer Player Dataset.</p><h2 id="heading-execution-instructions"><strong>Execution Instructions</strong></h2><h3 id="heading-option-1-running-on-your-computer-locally">Option 1: Running on your computer locally</h3><p>To run the notebook on your local system, set up a <a target="_blank" href="https://www.python.org/">Python</a> environment. Set up <a target="_blank" href="https://jupyter.org/install">Jupyter Notebook</a> with Python or by using the <a target="_blank" href="https://anaconda.org/anaconda/jupyter">Anaconda distribution</a>. <a target="_blank" href="https://github.com/HemathDeunei/1_LR_model_using_EPL_soccer_dat">Download the notebook</a> and open it in Jupyter to run the code on your local system.</p><p>The notebook can also be executed using <a target="_blank" href="https://code.visualstudio.com/">Visual Studio Code</a> or <a target="_blank" href="https://www.jetbrains.com/pycharm/">PyCharm</a>.</p><h3 id="heading-option-2-executing-with-colab">Option 2: Executing with Colab</h3><p>Colab, or "Collaboratory", allows you to write and execute Python in your browser, with access to GPUs free of charge and easy sharing.</p><p>You can run the code using <a target="_blank" href="https://colab.research.google.com/">Google Colab</a> by uploading the <a target="_blank" href="https://github.com/HemathDeunei/1_LR_model_using_EPL_soccer_dat">ipython notebook</a>.</p><h2 id="heading-approach"><strong>Approach</strong></h2><ul><li><p>Install Packages</p></li><li><p>Import Libraries</p></li><li><p>Data Reading from Different Sources</p></li><li><p>Exploratory Data Analysis</p></li><li><p>Correlation</p></li><li><p>Relationship between Cost and Score</p></li><li><p>Train - Test Split</p></li><li><p>Linear
Regression</p></li><li><p>Model Summary</p></li><li><p>Prediction of Test Data</p></li><li><p>Diagnostics and Remedies</p></li></ul><h2 id="heading-important-libraries"><strong>Important Libraries</strong></h2><ul><li><p><strong>pandas</strong>: pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool built on top of the Python programming language. Refer to <a target="_blank" href="https://pandas.pydata.org/">documentation</a> for more information.</p></li><li><p><strong>NumPy</strong>: The fundamental package for scientific computing with Python. Fast and versatile, the NumPy vectorization, indexing, and broadcasting concepts are the de-facto standards of array computing today. NumPy offers comprehensive mathematical functions, random number generators, linear algebra routines, Fourier transforms, and more. Refer to <a target="_blank" href="https://numpy.org/">documentation</a> for more information. pandas and NumPy are together used for most of the data analysis and manipulation in Python.</p></li><li><p><strong>Matplotlib</strong>: Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. Refer to <a target="_blank" href="https://matplotlib.org/">documentation</a> for more information.</p></li><li><p><strong>seaborn</strong>: Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. Refer to <a target="_blank" href="https://seaborn.pydata.org/">documentation</a> for more information.</p></li><li><p><strong>scikit-learn</strong>: Simple and efficient tools for predictive data analysis accessible to everybody and reusable in various contexts. It is built on NumPy, SciPy, and matplotlib to support machine learning in Python. 
Refer to <a target="_blank" href="https://scikit-learn.org/stable/">documentation</a> for more information.</p></li><li><p><strong>statsmodels</strong>: statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration. Refer to <a target="_blank" href="https://www.statsmodels.org/stable/index.html">documentation</a> for more information.</p></li><li><p><strong>SciPy</strong>: SciPy provides algorithms for optimization, integration, interpolation, eigenvalue problems, algebraic equations, differential equations, statistics, and many other classes of problems. Refer to <a target="_blank" href="https://scipy.org/">documentation</a> for more information.</p></li></ul><h2 id="heading-install-packages"><strong>Install Packages</strong></h2><pre><code class="lang-python"><span class="hljs-keyword">import</span> warnings
warnings.filterwarnings(<span class="hljs-string">'ignore'</span>)</code></pre><pre><code class="lang-python"><span class="hljs-keyword">import</span> sys
!{sys.executable} -m pip install numpy
!{sys.executable} -m pip install seaborn
!{sys.executable} -m pip install matplotlib
!{sys.executable} -m pip install statsmodels
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install scipy
!{sys.executable} -m pip install scikit_learn</code></pre><h2 id="heading-data-reading-from-different-sources"><strong>Data Reading from Different Sources</strong></h2><h4 id="heading-1-files"><strong>1. Files</strong></h4><p>In many cases, the data is stored in the local system. To read the data from the local system, specify the correct path and filename.</p><ul><li><strong>CSV format</strong></li></ul><p>Comma-separated values, also known as CSV, are a specific way to store data in a table structure format. The data used in this project is stored in a CSV file.
Click <a target="_blank" href="https://data.hemath.com/access/file_csv/1_Soccer_Data.csv">here</a> to download the data used in this project.</p><p>Use the following code to read data from a CSV file using pandas. Note the raw string (the r prefix), which prevents the backslashes in the Windows path from being treated as escape sequences.</p><pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
csv_file_path = <span class="hljs-string">r"C:\Users\Hemath\Desktop\1_Soccer_Data.csv"</span>
df = pd.read_csv(csv_file_path)</code></pre><p>With the appropriate csv_file_path, the pd.read_csv() function will read the data and store it in the df variable.</p><p>If you get <em>FileNotFoundError or No such file or directory</em>, try checking the path provided in the function. It's possible that Python is not able to find the file or directory at the given location.</p><ul><li><strong>Public URL</strong></li></ul><p>The pandas.read_csv() method also works if the data is available on any public URL.</p><pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
data_url = <span class="hljs-string">"https://data.hemath.com/access/file_csv/1_Soccer_Data.csv"</span>
df = pd.read_csv(data_url)</code></pre><h4 id="heading-2-database"><strong>2. Database</strong></h4><p>Most organizations store their data in databases such as <a target="_blank" href="https://www.mysql.com/">MySQL</a> or <a target="_blank" href="https://www.postgresql.org/">Postgres</a>.
The data can be accessed with credentials, which will typically be in the following format.</p><pre><code class="lang-python">host = <span class="hljs-string">"localhost"</span>
database = <span class="hljs-string">"db_name"</span>
user = <span class="hljs-string">"root"</span>
password = <span class="hljs-string">"password"</span></code></pre><p>MySQL is an open-source relational database management system.</p><p>In this project, we will demonstrate how to connect Python to a MySQL server to fetch the data. We will use the <a target="_blank" href="https://pypi.org/project/PyMySQL/">pymysql</a> library to connect to the MySQL server.</p><p>Convert the <a target="_blank" href="https://data.hemath.com/access/file_csv/1_Soccer_Data.csv">CSV</a> into SQL insert statements using this <a target="_blank" href="https://www.convertcsv.com/csv-to-sql.htm">link</a>. Create a local database using MySQL or Postgres and execute the SQL query.</p><p>Use this code to connect Python to MySQL and fetch the data.</p><pre><code class="lang-python"><span class="hljs-comment"># installing pymysql library</span>
<span class="hljs-keyword">import</span> sys
!{sys.executable} -m pip install pymysql

<span class="hljs-keyword">import</span> pymysql
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

connection = pymysql.connect(
    host = <span class="hljs-string">"localhost"</span>,
    user = <span class="hljs-string">"user"</span>,
    password = <span class="hljs-string">"mypassword"</span>,
    database = <span class="hljs-string">"db_name"</span>
)
df = pd.read_sql(<span class="hljs-string">"SELECT * FROM table_name"</span>, connection)
df.head()</code></pre><pre><code class="lang-python"><span class="hljs-comment"># import required packages</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> matplotlib.pyplot <span class="hljs-keyword">as</span> plt
<span class="hljs-keyword">from</span> sklearn.model_selection <span class="hljs-keyword">import</span> train_test_split
<span class="hljs-keyword">import</span> statsmodels.api <span class="hljs-keyword">as</span> sm
<span class="hljs-keyword">import</span> seaborn <span class="hljs-keyword">as</span> sns
<span class="hljs-keyword">from</span> scipy <span class="hljs-keyword">import</span> stats
<span class="hljs-keyword">import</span> scipy
<span class="hljs-keyword">from</span> matplotlib.pyplot <span class="hljs-keyword">import</span> figure</code></pre><pre><code class="lang-python"><span class="hljs-comment"># Load the data as a data frame by using URL</span>
soccer_data_url = <span class="hljs-string">"https://data.hemath.com/access/file_csv/1_Soccer_Data.csv"</span>
df = pd.read_csv(soccer_data_url)</code></pre><pre><code class="lang-python"><span class="hljs-comment"># view top 3 entries from the soccer data</span>
df.head(<span class="hljs-number">3</span>)</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685263008036/be03bb1d-0980-41c9-a3ef-36a3b6074055.png" alt class="image--center mx-auto" /></p><pre><code class="lang-python">df.columns</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685263041462/f71ef19d-270b-4860-96b9-14d2f0a60c43.png" alt class="image--center mx-auto" /></p><h2 id="heading-data-dictionary"><strong>Data Dictionary</strong></h2><ul><li><p>PlayerName: Player name</p></li><li><p>Club: Club of the player</p><ol><li><p>MUN: Manchester United F.C.</p></li><li><p>CHE: Chelsea F.C.</p></li><li><p>LIV: Liverpool F.C.</p></li></ol></li><li><p>DistanceCovered(InKms): Average distance in km covered by the player in each game</p></li><li><p>Goals: Average goals per match</p></li><li><p>MinutestoGoalRatio: Minutes-to-goal ratio</p></li><li><p>ShotsPerGame: Average shots taken per game</p></li><li><p>AgentCharges: Agent fees in h</p></li><li><p>BMI: Body-mass index</p></li><li><p>Cost: Cost of each player in hundred thousand
dollars</p></li><li><p>PreviousClubCost: Previous club cost in hundred thousand dollars</p></li><li><p>Height: Height of player in cm</p></li><li><p>Weight: Weight of player in kg</p></li><li><p>Score: Average score per match</p></li></ul><h2 id="heading-exploratory-data-analysis"><strong>Exploratory Data Analysis</strong></h2><p>Exploratory Data Analysis, commonly known as EDA, is a technique to analyze the data with visuals. It involves using statistics and visual techniques to identify particular trends in data.</p><p>It is used to understand data patterns, spot anomalies, check assumptions, etc. The main purpose of EDA is to help look into the data before making any hypothesis about it.</p><h3 id="heading-dataframe-information"><strong>Dataframe Information</strong></h3><p>The <a target="_blank" href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html">dataframe.info()</a> method prints information about the DataFrame, including the index dtype and columns, non-null values, and memory usage.</p><p>It can be used to get basic info, look for missing values, and get a sense of each variable's format.</p><pre><code class="lang-python">df.info()</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685265186187/fb153482-3ba0-49b4-8a4d-135446314c2c.png" alt class="image--center mx-auto" /></p><p>There are a total of 202 rows and 13 columns in the EPL Soccer Dataset.</p><p>Observe that there are no null values in the dataset.</p><p>Out of the 13 columns, 10 are float type and 1 is integer type. The remaining 2 have object dtype.</p><p>Learn about essential basic functionality for the pandas dataframe <a target="_blank" href="https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#basics-dtypes">here</a>.</p><h2 id="heading-basic-statistical-concepts"><strong>Basic Statistical Concepts</strong></h2><ul><li><strong>Mean</strong>: The mean is one of the measures of central
tendency. Simply put, the mean is the average of the values in the given set. The observed values are totaled and divided by the total number of observations to determine the mean. If \(x_i\) is the \(i^{th}\) observation, then the mean of all \(x_i\) ranging over \(1\leq i\leq n\), denoted by \(\bar x\), is given as</li></ul><p>$$\bar{x} = \sum_{i=1}^{n}\frac{x_i}{n}$$</p><ul><li><strong>Variance</strong>: Variance is a measure of variation. It is calculated by averaging the squared deviations from the mean. Variance indicates the degree of spread in your data set: the greater the spread of the data, the larger the variance relative to the mean. Here's the formula for the variance of a sample.</li></ul><p>$$S^2 = \frac{\sum_{i=1}^{n}(x_i-\bar x)^2}{n-1}$$</p><ul><li><strong>Standard Deviation</strong>: The standard deviation is a measure that shows how much variation (spread or dispersion) exists around the mean. The standard deviation represents a "typical" departure from the mean. It is a popular measure of variability since it is expressed in the data set's original units of measurement. Here's the formula for the standard deviation of a sample.</li></ul><p>$$S = \sqrt \frac{\sum_{i=1}^{n}(x_i-\bar x)^2}{n-1}$$</p><h2 id="heading-dataframe-description"><strong>Dataframe Description</strong></h2><p>To generate descriptive statistics, the <a target="_blank" href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html">pandas.dataframe.describe()</a> function is used.</p><p>Descriptive statistics include those that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values.</p><p>It is used to get a basic description of the data, looking at the spread of the different variables, along with abrupt changes between the minimum, 25th, 50th, 75th, and max for the different variables.</p><p>The quartiles provide an excellent insight into the range of a set of data.
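As a worked example on a small made-up sample, the formulas above can be computed step by step and checked against NumPy's built-ins (`ddof=1` selects the sample, n−1, versions):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up sample, n = 8

mean = x.sum() / len(x)                          # x̄ = Σxᵢ / n
var = ((x - mean) ** 2).sum() / (len(x) - 1)     # S² with the n−1 denominator
std = var ** 0.5                                 # S = √S²

assert np.isclose(var, x.var(ddof=1))            # ddof=1 → sample variance
assert np.isclose(std, x.std(ddof=1))

q1, median, q3 = np.percentile(x, [25, 50, 75])  # the quartiles discussed here
print(mean, round(var, 3), round(std, 3), q1, median, q3)
```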
By knowing the 25th, 50th, and 75th percentile points, you can easily establish where a data point sits in the range and which quartile it falls into.</p><ul><li><p>The 25th percentile is also referred to as the first, or lower, quartile. The 25th percentile is the value below which 25% of the data falls (and above which 75% falls).</p></li><li><p>The median is also known as the 50th percentile. The median divides the set of data in half: half of the data points are below the median, while the other half are above it.</p></li><li><p>The 75th percentile is often referred to as the third, or upper, quartile. The 75th percentile is the value above which 25% of the data falls (and below which 75% falls).</p></li></ul><h3 id="heading-descriptive-statistic-for-quantitative-variables"><strong>Descriptive statistic for quantitative variables</strong></h3><p>DataFrame.count: Count number of non-NA/null observations</p><p>DataFrame.max: Maximum of the values in the object</p><p>DataFrame.min: Minimum of the values in the object</p><p>DataFrame.mean: Mean of the values</p><p>DataFrame.std: Standard deviation of the observations</p><p>DataFrame.select_dtypes: Subset of a DataFrame including/excluding columns based on their dtype</p><pre><code class="lang-python"><span class="hljs-comment"># descriptive statistics</span>
df.describe()</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685265493825/f401ad0b-cf8a-4947-9b92-a5ee5f25638b.png" alt class="image--center mx-auto" /></p><h3 id="heading-did-you-know-i"><strong>Did you know - I</strong></h3><p>To get a summary of all the columns, you can pass include = 'all' to the describe function</p><pre><code class="lang-python">df.describe(include=<span class="hljs-string">'all'</span>)</code></pre><p>For object data (e.g.
strings or timestamps), the result's index will include count, unique, top, and freq. The top is the most common value; the freq is that value's frequency. Timestamps also include the first and last items.</p><h2 id="heading-correlation"><strong>Correlation</strong></h2><p>The correlation coefficient is used to measure the strength of the relationship between two variables. It indicates that as the value of one variable changes, the other variable changes in a specific direction with some magnitude. There are various ways to find the correlation between two variables, one of which is the Pearson correlation coefficient. It measures the linear relationship between two continuous variables.</p><p>Let's say \(x\) and \(y\) are two continuous variables; the Pearson correlation coefficient between them can be found by the following formula.</p><p>$$r = \frac{ \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) }{ \sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$</p><p>where \(x_i\) and \(y_i\) represent the \(i^{th}\) values of the variables.
The value of \(r\) ranges between \(-1\) and \(1\).</p><p>The strength of the relationship is measured by the absolute value of the coefficient, whereas the sign of the coefficient indicates the direction of the relationship.</p><h2 id="heading-graphs-of-different-correlation-coefficients"><strong>Graphs of Different Correlation Coefficients</strong></h2><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685269765291/bbb311ae-bebd-4acc-b1dd-bd8416179deb.png" alt class="image--center mx-auto" /></p><ol><li><p>𝑟=−1 indicates a perfect negative relationship between the variables</p></li><li><p>𝑟=0 indicates no linear relationship between the variables</p></li><li><p>𝑟=1 indicates a perfect positive relationship between the variables</p></li></ol><p>To find the correlation between variables from the soccer data, we will use the <a target="_blank" href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html">pandas.dataframe.corr()</a> method.</p><p>It computes the pairwise correlation between columns, excluding NA or NaN values if any. The default method used to calculate the correlation coefficient is the Pearson correlation.</p><pre><code class="lang-python">df.corr()</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685265812330/cebfafe4-3421-4891-97ee-135b87d62c23.png" alt class="image--center mx-auto" /></p><p>The correlation between DistanceCovered(InKms) and the target variable Score is \(-0.49\), indicating a negative correlation. The variable Cost is related to the target variable with a correlation coefficient of \(0.96\), which indicates a strong positive relationship.</p><h3 id="heading-did-you-know-ii"><strong>Did you know - II</strong></h3><p>The Pearson correlation coefficient can only measure a linear relationship between variables. A nonlinear (but monotonic) relationship will not be fully captured by the Pearson correlation coefficient. In such cases, Spearman's correlation coefficient is used.
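To illustrate with made-up data: for a perfectly monotonic but nonlinear relationship such as \(y = x^3\), the Pearson coefficient falls short of 1, while Spearman's coefficient, which is simply the Pearson coefficient computed on ranks, reports a perfect relationship:

```python
import numpy as np

def pearson(a, b):
    """Pearson r, following the formula above."""
    a, b = a - a.mean(), b - b.mean()
    return float((a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum()))

def ranks(v):
    """Rank transform (1 = smallest); assumes no ties."""
    return np.argsort(np.argsort(v)).astype(float) + 1.0

x = np.arange(1.0, 21.0)  # 1, 2, ..., 20
y = x ** 3                # nonlinear but strictly increasing in x

pearson_r = pearson(x, y)                 # < 1: the relationship is not linear
spearman_r = pearson(ranks(x), ranks(y))  # = 1: the ranks agree perfectly
print(round(pearson_r, 3), round(spearman_r, 3))
```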
It can be used to find nonlinear, monotonic relationships and for ordinal data.</p><blockquote><h3 id="heading-think-about-it-i"><strong>Think about it - I</strong></h3><p>Can the Pearson or Spearman correlation coefficient be used to find the correlation between categorical variables?</p></blockquote><h2 id="heading-correlation-does-not-imply-causation"><strong>Correlation does not imply Causation!!</strong></h2><p>Some studies show that people in the UK spend more money on shopping when it's cold, which shows a correlation between the two variables. Does this imply that cold weather causes people to spend more money? The answer is NO. One possible explanation is that cold weather coincides with Christmas and New Year sales, hence people shop more.</p><p>Correlation between two variables indicates an association between them, but it does not mean that a change in one variable is caused by the other.</p><h2 id="heading-relationship-between-cost-and-score"><strong>Relationship between Cost and Score</strong></h2><p>Score and Cost have a correlation of 0.96, making Cost a significant variable. Cost can be selected as the predictor variable for simple linear regression, since the scatter plot between the two demonstrates a linear relationship.</p><p>To see this relationship visually, let's plot the scatter plot for Cost and Score.</p><pre><code class="lang-python"><span class="hljs-comment"># Let's plot cost vs. score</span>
figure(figsize=(<span class="hljs-number">8</span>, <span class="hljs-number">6</span>), dpi=<span class="hljs-number">80</span>)
plt.scatter(df[<span class="hljs-string">'Cost'</span>], df[<span class="hljs-string">'Score'</span>])
<span class="hljs-comment"># label</span>
plt.xlabel(<span class="hljs-string">"Cost"</span>)
plt.ylabel(<span class="hljs-string">"Score"</span>)
plt.title(<span class="hljs-string">"Scatter plot between Cost and Score"</span>)
<span class="hljs-comment"># Strong linear association between cost and score, maybe some concern with model after a cost of 125 or so!</span></code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685266118270/53eb9f2d-bbd1-4494-b5cb-4ea21a69309e.png" alt class="image--center mx-auto" /></p><p>The correlation between Cost and Score is easily visible here.</p><p>The Pearson correlation and scatter plot demonstrate that as the cost increases, so does the score. But what can we do with this knowledge?</p><p>How can we know how much money should be spent to achieve a specific score? This is where Linear Regression comes in. It helps us model the linear relationship between two or more variables so that we may forecast outcomes using the model.</p><p>Let's figure out how.</p><h2 id="heading-train-test-split"><strong>Train - Test Split</strong></h2><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685266485582/5563bd5b-0fc1-46c5-b60e-3cb1a152125b.png" alt class="image--center mx-auto" /></p><p>In the train-test split method, the data points are divided into two datasets, train and test.
The train data is used to train the model, and the model is then used to predict on the test data to see how it performs on unseen data and whether it is overfitting or underfitting.</p><h3 id="heading-underfitting-and-overfitting"><strong>Underfitting and Overfitting</strong></h3><ul><li><p><strong>Underfitting</strong>: Underfitting occurs when a statistical model or machine learning algorithm fails to capture the underlying trend of the data, i.e., it performs poorly on both training and testing data. Its occurrence indicates that our model or method does not adequately suit the data. It frequently occurs when we select a model that is too simple while the data contains complicated non-linear patterns, or when there is insufficient data to fit the model. The obvious remedy is to build a more complex model or engineer additional features from the data.</p></li><li><p><strong>Overfitting</strong>: When a statistical model fails to produce correct predictions on testing data, it is said to be overfitted. When a model is too complex relative to the data, it begins to learn from the noise and incorrect entries in our data set. It usually occurs when we build a complex model on a simpler dataset. An overfitted model performs well on training data because it has memorized the patterns in the data, but it performs poorly on testing data. An under-fitted model, on the other hand, performs poorly on both datasets, because it is unable to capture the trends and patterns underlying the dataset during training.</p></li></ul><blockquote><h3 id="heading-think-about-it-ii"><strong>Think about it - II</strong></h3><p>Assume you have a dataset with both categorical and numerical variables.
When you create a linear regression model on it, you notice that it performs poorly on training data and even worse on testing data.</p><p>You conclude that the model is underfitting and that a complex model is needed, so you use polynomial regression with a high degree. Your model now performs extremely well on training data but significantly poorly on testing data. It has now overfitted.</p><p>What do you believe happened? Do you think you'd have to find a sweet spot between simple and complicated models? How can you do it?</p></blockquote><pre><code class="lang-python">from sklearn.model_selection import train_test_split

<span class="hljs-comment"># Assign x, y then do training testing split</span>
x = df[<span class="hljs-string">'Cost'</span>]
y = df[<span class="hljs-string">'Score'</span>]

<span class="hljs-comment"># Splitting with 75% training, 25% testing data</span>
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = <span class="hljs-number">0.75</span>, test_size = <span class="hljs-number">0.25</span>, random_state = <span class="hljs-number">100</span>)</code></pre><p>The data is first assigned to the input variable (x) and output variable (y), then the train_test_split function from sklearn is used to split it into a 75:25 ratio with a random state of 100. The random state is a seed used to randomly generate the indices for the train and test sets.</p><h2 id="heading-linear-regression"><strong>Linear Regression</strong></h2><p>Linear Regression is a statistical approach to modeling the linear relationship between predictor variables and the target variable.</p><p>These variables are known as the independent and dependent variables, respectively.</p><p>When there is one independent variable, it is known as <strong>simple linear regression</strong>. 
When there are more independent variables, it is called <strong>multiple linear regression</strong>.</p><p><strong>Simple Linear Regression</strong>: \( y = \beta_0+\beta_1x+\epsilon\)</p><p><strong>Multiple Linear Regression</strong>: \(y = \beta_0+\beta_1x_1+\dots+\beta_px_p+\epsilon\) where \(p\) is the number of features in the model</p><p>Linear regression serves two primary functions: understanding variable relationships and forecasting:</p><ul><li><p>The coefficients represent the estimated magnitude and direction (positive/negative) of each independent variable's relationship with the dependent variable.</p></li><li><p>A linear regression equation predicts the mean value of the dependent variable given the values of the independent variables. So, it enables us to forecast.</p></li></ul><p><strong>Example:</strong> Assume your father owns an ice cream shop. Sometimes there is too much ice cream in the store, and other times there isn't enough to sell. You notice that ice cream sales are much higher on hot days than on cold days. 
There appears to be some correlation between the temperature and the sale of ice cream.</p><p>Now you must determine the optimal number of ice creams to store in order to sell enough and have little left over at the end of the day.</p><p>How can you forecast the sale for the next few days?</p><p>Is there any way to predict the sale of the next day given the temperature of the last few days?</p><p>Yes, you can use simple linear regression to model the relationship between temperature and sales.</p><p>Now that we are clear on the why, let's go ahead to the "how" part of linear regression.</p><h2 id="heading-mathematics-behind-linear-regression"><strong>Mathematics behind Linear Regression</strong></h2><p>Here's the formula for simple linear regression.</p><p>$$y=\beta_0+\beta_1x+\epsilon$$</p><p>Let's understand each of the terms involved:</p><ul><li><p>y is the dependent variable, the value we want to predict for any given value of the independent variable (x).</p></li><li><p>\(\beta_0\) represents the intercept, or the predicted value of y when x is 0.</p></li><li><p>\(\beta_1\) is the regression coefficient, which tells us how much y is expected to change for a one-unit increase in x.</p></li><li><p>x is the independent or predictor variable that helps us predict y.</p></li><li><p>\(\epsilon\) is the error term, the variation in y that the linear relationship cannot capture.</p></li></ul><p>Linear regression determines the best fit line across your data by looking for the regression coefficient ( \(\beta_1\) ) that minimizes the model's total error ( \(\epsilon\) ).</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685266456833/373dc767-fa4f-4e7b-87dc-fa00a5530086.png" alt class="image--center mx-auto" /></p><p>Let's understand the regression line with the example graph above.</p><ul><li><p>The \(\beta_0\) parameter indicates the intercept or the constant value of y when x is 0.</p></li><li><p>The \(\beta_1\) parameter is the slope or steepness of the regression line.</p></li><li><p>The distance 
between the predicted value of y on the regression line and the corresponding true value of y is basically the error.</p></li></ul><h2 id="heading-errors-in-regression"><strong>Errors in Regression</strong></h2><p>The regression line regresses towards the mean to create the best fit, which essentially means that the total error is at its lowest. In the above plot, it is visible that the regression line is not able to exactly predict the true values. There is always going to be some space for errors.</p><p>Let's understand the various errors in Regression:</p><ol><li>The mean absolute error (MAE) is the most basic regression error statistic to grasp. We compute the residual for each data point individually, using the absolute value of each so that negative and positive residuals don't cancel out. The average of all these residuals is then calculated. MAE essentially describes the typical magnitude of the residuals.</li></ol><p>$$MAE = \frac{1}{n}\sum_{i=1}^{n}|y-\hat y|$$</p><ol><li>The mean square error (MSE) is identical to the mean absolute error (MAE) but squares each difference before aggregating them. Because we are squaring the differences, the MSE will nearly always be greater than the MAE, so we cannot directly compare the two; we can only compare the same error metric between competing models. The presence of outliers in our data makes the impact of the square term on the MSE very clear. In MAE, each residual contributes proportionally to the overall error, whereas in MSE, the error grows quadratically. As a result, outliers will produce a considerably bigger total error in the MSE than in the MAE, and the model will suffer more if it predicts values that are significantly different from the matching actual value. 
This means that in MSE, as opposed to MAE, substantial disparities between actual and predicted values are punished more severely.</li></ol><ul><li>If we wish to limit the influence of outliers, we should use MAE, because outlier residuals do not contribute as much to the overall error as in MSE. Ultimately, the decision between MSE and MAE is application-specific and depends on how large errors need to be handled.</li></ul><p>$$MSE= \frac{1}{n}\sum_{i=1}^{n}(y-\hat y)^2$$</p><ul><li>The root mean squared error (RMSE) is another error statistic you may come upon. It is the square root of the MSE, as the name implies. Because the MSE squares the residuals, its units differ from those of the original output. RMSE is frequently used to transform the error metric back into comparable units, making interpretation easier. Outliers have a comparable effect on the MSE and RMSE because both square the residual.</li></ul><p>$$RMSE= \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y-\hat y)^2}$$</p><ul><li>The mean absolute percentage error (MAPE) is the percentage counterpart of MAE. Just as MAE is the average magnitude of error produced by your model, MAPE is the average percentage deviation of the model's predictions from the actual values. MAPE, like MAE, has a clear meaning because percentages are easier for people to understand. Because of the use of absolute value, MAPE and MAE are both resistant to the effects of outliers.</li></ul><p>$$MAPE= \frac{100\%}{n}\ \sum_{i=1}^{n}\left| \frac{y-\hat y}{y} \right|$$</p><h2 id="heading-finding-the-best-fit-line"><strong>Finding the Best Fit Line</strong></h2><p>We can proceed with the Linear Regression model after confirming that the independent variable and the target variable are linearly correlated. Finding the coefficients of linear regression is the process of determining the values of the coefficients ( \(\beta_0\) and \(\beta_1\) ) in the equation of a linear regression model. 
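</p><p>As an aside, the four error metrics above can be sketched in a few lines of NumPy. The numbers below are toy values, purely illustrative, not from the article's dataset:</p><pre><code class="lang-python">import numpy as np

y = np.array([3.0, 5.0, 7.5, 10.0])      # toy actual values
y_hat = np.array([2.5, 5.5, 7.0, 11.0])  # toy predictions

residuals = y - y_hat
mae = np.mean(np.abs(residuals))             # mean absolute error
mse = np.mean(residuals ** 2)                # mean squared error
rmse = np.sqrt(mse)                          # root mean squared error
mape = 100 * np.mean(np.abs(residuals / y))  # mean absolute percentage error
print(mae, mse, rmse, mape)</code></pre><p>On these numbers MAE is 0.625 while RMSE is about 0.66: RMSE comes out larger because squaring weights the single larger residual more heavily, exactly the behavior described above.</p><p>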
The objective of finding the coefficients is to minimize the difference between the actual values of the target variable and the predicted values.</p><p>The Linear Regression model will determine the best fit line for the scatter of data points.</p><p>The equation of the regression line is :</p><p>$$y=\beta_0+\beta_1x$$</p><p>where \(\beta_0\) and \(\beta_1\) are regression coefficients.</p><p>\(\beta_1\): This is the slope of the line, which shows how steep the regression line is. The slope is calculated as the change in y divided by the change in x.</p><p>$$\beta_1 = \frac{\Delta y}{\Delta x}$$</p><p>\(\beta_0\): This is the intercept. It is the value of y when x is 0. When the straight line passes through the origin, the intercept is 0.</p><p>There are infinitely many possible values for the regression coefficients. How do you find the best fit line out of all the possible lines?</p><p>The best fit line should have the minimum errors.</p><h3 id="heading-cost-function"><strong>Cost Function</strong></h3><p>The cost function assesses how well a machine learning model performs. 
The cost function calculates the difference between predicted and actual values as a single real number.</p><p>The following is the distinction between the cost function and the loss function:</p><p>The loss function is the error for an individual data point, while the cost function is the average error over the n samples in the data.</p><h3 id="heading-residual-sum-of-squares-rss-or-sum-of-squared-errorssse"><strong>Residual Sum of Squares (RSS) or Sum of Squared Errors (SSE)</strong></h3><p>In ordinary least squares, the Residual Sum of Squares (RSS), also called the Sum of Squared Errors (SSE), is minimized to find the values of \(\beta_0\) and \(\beta_1\) that give the best fit of the predicted line.</p><p>$$MSE = \frac{1}{n} RSS = \frac{1}{n}\sum_{i=1}^{n}(y-\hat y)^2$$</p><p>Hence,</p><p>$$SSE = \sum_{i=1}^{n}(y-\hat y)^2$$</p><p>There are two main methods to find the coefficients of linear regression: least squares and gradient descent.</p><h3 id="heading-least-squares-estimators"><strong>Least Squares Estimators</strong></h3><p>One of the methods to optimize the Linear Regression equation for the minimum SSE is using Least Squares Estimators. These are the steps involved in finding the best fit line parameters:</p><ol><li><p>Differentiate the SSE with respect to \(\beta_0\) and \(\beta_1\).</p></li><li><p>Setting the partial derivatives equal to zero yields the <strong>normal equations</strong>, which can then be solved for the values of \(\beta_0\) and \(\beta_1\) that minimize the SSE.</p></li></ol><h3 id="heading-gradient-descent"><strong>Gradient Descent</strong></h3><p>Gradient descent is an optimization algorithm that iteratively adjusts the coefficients to minimize the cost function. The cost function measures the difference between the actual and predicted values. The gradient descent algorithm updates the coefficients using the gradient of the cost function. The gradient of the cost function gives us the direction of the steepest ascent, so we move the coefficients in the opposite direction. 
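</p><p>This update loop can be sketched in plain NumPy for simple linear regression. It is a minimal illustration on noise-free toy data, not the statsmodels fit used later in the article:</p><pre><code class="lang-python">import numpy as np

# Toy data following y = 2x + 1 exactly (illustrative)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

b0, b1 = 0.0, 0.0   # initial coefficients
lr = 0.05           # learning rate
n = len(x)

for _ in range(5000):
    y_hat = b0 + b1 * x
    # Gradients of the MSE cost with respect to b0 and b1
    grad_b0 = (-2.0 / n) * np.sum(y - y_hat)
    grad_b1 = (-2.0 / n) * np.sum((y - y_hat) * x)
    # Step in the direction opposite to the gradient
    b0 -= lr * grad_b0
    b1 -= lr * grad_b1

print(round(b0, 3), round(b1, 3))  # approaches the true intercept 1.0 and slope 2.0</code></pre><p>With enough iterations the coefficients converge to the intercept and slope that least squares would also recover for this toy data.</p><p>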
The process is repeated until the cost function reaches a minimum.</p><h3 id="heading-did-you-know-iii"><strong>Did you know - III</strong></h3><p>Scikit-learn's LinearRegression and Statsmodels' OLS (Ordinary Least Squares) are two popular options for linear regression in Python. While both can be used to perform linear regression, there are some differences between them:</p><ul><li><p>Model Fitting: Scikit-learn provides a simple API for model fitting. The fit method of the LinearRegression class takes in the input features and target variable, and returns the fitted model. On the other hand, Statsmodels provides a more detailed and statistically rigorous approach to model fitting with its OLS class, which gives users summary statistics and hypothesis testing.</p></li><li><p>Model Summary: Scikit-learn exposes only the fitted coefficients, while Statsmodels provides a detailed summary of the regression results, including R-squared, the F-statistic, p-values, and confidence intervals for the coefficients. This can be useful for hypothesis testing and model selection.</p></li><li><p>Model Evaluation: Scikit-learn provides several evaluation metrics for regression models, such as mean squared error, mean absolute error, and R-squared. Statsmodels provides similar metrics, but also offers the ability to run hypothesis tests on the coefficients, such as t-tests and F-tests, which are useful for model selection and inference.</p></li><li><p>Speed: Scikit-learn's LinearRegression is optimized for speed, making it a good choice for large datasets. Statsmodels' OLS is less optimized for speed, and can be slower for large datasets.</p></li></ul><p>In conclusion, when it comes to choosing between the two, it depends on the specific requirements of the project. If a simple, fast, and flexible linear regression model is needed, scikit-learn's LinearRegression is a good choice. 
If a more detailed statistical analysis of the regression results is needed, with the ability to perform hypothesis tests and detailed evaluation, then Statsmodels' OLS may be a better choice.</p><h2 id="heading-point-estimator-of-the-mean-response"><strong>Point Estimator of the Mean Response</strong></h2><p>Point estimators of the mean response in linear regression refer to estimates of the expected value of the response variable for a given predictor variable. These estimates are calculated using the estimated coefficients of the regression line, which are obtained through regression analysis.</p><p>In linear regression, the mean response is modeled as a linear combination of the predictor variables, where the coefficients represent the effect of each predictor on the response. Given a set of predictor variables, the point estimator for the mean response can be calculated by plugging the values of the predictors into the regression equation and solving for the expected value of the response.</p><p>For example, if we have a simple linear regression model with one predictor, \(x\) , and a response, \(y\), the point estimator for the mean response would be given by \(\hat{y} = \hat{\beta_0} + \hat{\beta_1}x\), where \(\hat{\beta_0}\) and \(\hat{\beta_1}\) are the estimated intercept and slope coefficients, respectively.</p><p>Point estimators are useful because they provide a quick and straightforward way to make predictions about the mean response for a given set of predictor values. However, it is important to keep in mind that point estimators are just estimates and are subject to sampling variability and other sources of error. 
Therefore, it is common to also provide confidence intervals or prediction intervals along with point estimators to account for uncertainty in the predictions.</p><h2 id="heading-point-estimator-of-the-variance"><strong>Point Estimator of the Variance</strong></h2><p>In linear regression, the variance of the error (also called residual variance) is a measure of the spread of the residuals around the fitted line. The residuals are the differences between the observed values and the values predicted by the regression model.</p><p>The point estimator of the variance of the error is the sum of squared residuals divided by the degrees of freedom (n-p-1), where n is the number of observations and p is the number of predictors in the model; in the regression setting, this quantity is referred to as the mean squared error (MSE). The formula for the point estimator of the variance of the error is:</p><p>$$\hat{\sigma}^2 = \frac{1}{n-p-1} \sum_{i=1}^n (y_i - \hat{y_i})^2$$</p><p>where \(\hat{y_i}\) is the predicted value for the i-th observation, and \(y_i\) is the observed value. The MSE is a measure of the overall fit of the regression model. The smaller the MSE, the better the fit of the model. The variance of the error provides information on the spread of the residuals, which can be used to determine the reliability of the regression model.</p><p>In summary, the point estimator of the variance of the error is an important quantity in linear regression as it provides information on the spread of the residuals, which can be used to evaluate the fit of the regression model.</p><h2 id="heading-sampling-distribution-for-regression-coefficients"><strong>Sampling Distribution for Regression Coefficients</strong></h2><p>In linear regression, the coefficients (also known as parameters) represent the relationship between the independent variable(s) and the dependent variable. 
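</p><p>The error-variance estimator from the previous section can be checked on toy numbers. This is a minimal sketch with illustrative values and a single predictor (p = 1), not the article's dataset:</p><pre><code class="lang-python">import numpy as np

# Toy observed and predicted values from a one-predictor model (p = 1)
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y_hat = np.array([2.2, 3.8, 6.1, 7.9, 10.0])

n, p = len(y), 1
sse = np.sum((y - y_hat) ** 2)   # sum of squared residuals
sigma2_hat = sse / (n - p - 1)   # point estimator of the error variance
print(sigma2_hat)</code></pre><p>Here the five residuals yield an SSE of 0.10 on 3 degrees of freedom, so the estimated error variance is about 0.033.</p><p>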
The coefficients can be estimated using different methods, such as the method of least squares, maximum likelihood estimation, or Bayesian inference.</p><p>The sampling distribution of the coefficients is an important aspect of linear regression analysis. It provides information about the variability and uncertainty of the estimates, and allows us to make inferences about the population parameters based on the sample estimates.</p><p>For example, consider a simple linear regression model with one independent variable (x) and one dependent variable (y):</p><p>$$y = \beta_0 + \beta_1 x + \epsilon$$</p><p>where \(\beta_0\) and \(\beta_1\) are the intercept and slope coefficients, respectively, and \(\epsilon\) is the error term.</p><h3 id="heading-t-distribution-and-hypothesis-testing"><strong>T-Distribution and Hypothesis Testing</strong></h3><p>A common way to make inferences about the coefficients is through hypothesis testing. This involves setting up a null hypothesis, \(H_0\), and an alternate hypothesis, \(H_1\). For example, the null hypothesis may be that \(\beta_1=0\), meaning that there is no relationship between the predictor variable and the response variable. The alternate hypothesis may be that \(\beta_1 \neq 0\), meaning that there is a relationship.</p><p>The point estimator of \(\beta_1\) is the sample regression coefficient, denoted by \(\hat{\beta_1}\). It is the value that minimizes the sum of squared differences between the observed y values and the predicted y values. The sampling distribution of \(\hat{\beta_1}\) is approximately normal, with mean \(\beta_1\) and variance \(\sigma^2_{\hat{\beta_1}}\), where \(\sigma^2_{\hat{\beta_1}} = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\).</p><p>The t-distribution is used to model the sampling distribution of the coefficients in linear regression. 
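</p><p>The whole slope test can be sketched end to end on simulated data. NumPy and SciPy are assumed to be available; the data below is generated for illustration, not taken from the article:</p><pre><code class="lang-python">import numpy as np
from scipy import stats

# Simulated data with a real linear trend (slope 2) plus noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 3.0 + 2.0 * x + rng.normal(0, 1.0, size=x.size)

n = x.size
x_bar, y_bar = x.mean(), y.mean()

# Least squares estimates of the slope and intercept
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

# Residual variance and standard error of the slope
resid = y - (b0 + b1 * x)
s2 = np.sum(resid ** 2) / (n - 2)
se_b1 = np.sqrt(s2 / np.sum((x - x_bar) ** 2))

# t-statistic and two-sided p-value for H0: beta_1 = 0
t_stat = (b1 - 0.0) / se_b1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(t_stat, p_value)</code></pre><p>Because the simulated slope is genuinely non-zero, the t-statistic is large and the p-value is tiny, so the null hypothesis is rejected.</p><p>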
The t-distribution takes into account the sample size and the degrees of freedom (df), which is the number of independent observations in the sample minus the number of parameters estimated. For example, in simple linear regression with one predictor, there are 2 parameters (the intercept and the slope) and n - 2 degrees of freedom.</p><p>The <strong>t-distribution</strong> is used to test the hypothesis that the population mean of a coefficient is equal to some value. For example, we may want to test the hypothesis that the population slope, \(\beta_1\), is equal to zero, which means that there is no relationship between the predictor and the response. The hypothesis test is performed by calculating the t-statistic and the corresponding p-value. The t-statistic is calculated as:</p><p>$$t = \frac{\hat{\beta_1} - \beta_{1}}{SE(\hat{\beta_1})}$$</p><p>where \(\hat{\beta_1}\) is the sample estimate of the slope, \(\beta_{1}\) is the hypothesized value of the population slope, and \(SE(\hat{\beta_1})\) is the standard error of the estimate of the slope. The standard error can be calculated as:</p><p>$$SE(\hat{\beta_1}) = \sqrt{\frac{s^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}$$</p><p>where \(s^2\) is the residual sum of squares divided by the degrees of freedom \(n-2\), matching the variance formula above.</p><p>The t-statistic is compared to the critical value of the t-distribution to determine if the null hypothesis should be rejected. A small p-value indicates that the data provides strong evidence against the null hypothesis, while a large p-value suggests that the data does not provide strong evidence against the null hypothesis.</p><p>The p-value is the probability of observing a t-statistic as extreme or more extreme than the calculated value, assuming the null hypothesis is true. If the p-value is less than a significance level (e.g. 
0.05), we reject the null hypothesis and conclude that there is evidence that the population slope is not equal to zero.</p><p>In conclusion, the t-distribution is used to model the sampling distribution of the coefficients in linear regression and to perform hypothesis testing. It allows us to make inferences about the population parameters and assess the strength of the relationship between the predictor and the response.</p><h3 id="heading-confidence-intervals"><strong>Confidence Intervals</strong></h3><p>Confidence intervals can also be calculated for the coefficients. A confidence interval is a range of values that is likely to contain the population parameter with a certain level of confidence. For example, a 95% confidence interval means that if we repeated the sampling process many times, 95% of the intervals calculated would contain the true population parameter.</p><p>The confidence interval for a coefficient is calculated as:</p><p>$$\hat{\beta_1} \pm t_{critical} \cdot SE(\hat{\beta_1})$$</p><p>where \(t_{critical}\) is the critical value of the t-distribution at a certain level of confidence.</p><p>In summary, sampling distributions of coefficients in linear regression help us make inferences about the population parameters. Through hypothesis testing and confidence intervals, we can determine if the coefficients are statistically significant and estimate their values with a certain level of confidence.</p><h2 id="heading-comparing-regression-models"><strong>Comparing Regression Models</strong></h2><p>We now have the best possible Linear Regression equation parameters. The question is, how do we assess our model? Is it possible to use SSE to say that our model has this much SSE and thus is good? What criteria would you use to compare one regression model to another?</p><p>One of the major drawbacks of SSE is that the SSE will change if the units of the actual y and predicted y change. 
As a result, we introduce a relative term called \(R^2\)</p><p>, which creates consistency among the models. But, before we get any further into \(R^2\), let's take a step back and look at TSS.</p><h3 id="heading-total-sum-of-squares-tss"><strong>Total Sum of Squares (TSS)</strong></h3><p>The total Sum of Squares is similar to SSE but instead of adding the actual values difference from the predicted value, in the TSS, we find the difference from the mean y.</p><p>$$TSS = \sum_{i=1}^{n}(y-\bar y)^2$$</p><p>TSS functions as a cost function for a model with no independent variables and only the intercept ( \(\bar y\) ). This indicates how good the model is in the absence of any independent variables.</p><p>SSE provides the model performance when an independent variable is added. The ratio \(\frac{SSE}{TSS}\) indicates how good the model is in comparison to the mean value without variance. The residual error with actual values (SSE) is smaller, while the residual error with the mean (TSS) is larger. 
Hence, the overall ratio is lower for a robust model.</p><p>Let's move to our case: we are going to model the relationship between Cost and Score using Ordinary Least Squares from the statsmodels library.</p><pre><code class="lang-python"><span class="hljs-comment"># statsmodels approach to regression</span>
<span class="hljs-comment"># fit the model</span>
lr = sm.OLS(y_train, x_train).fit()

<span class="hljs-comment"># Printing the parameters and summary</span>
lr.params
lr.summary()

<span class="hljs-comment"># force an intercept term</span>
x_train_with_intercept = sm.add_constant(x_train)
lr = sm.OLS(y_train, x_train_with_intercept).fit()
print(lr.summary())</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685278603308/54906975-b583-41c9-a108-78df60ca21fe.png" alt class="image--center mx-auto" /></p><h2 id="heading-model-summary"><strong>Model Summary</strong></h2><p>Now that we have successfully modeled, let's evaluate the results and summary of the model:</p><ul><li>\(R^2\): The \(R^2\) or the coefficient of determination is the proportion of the variance in the dependent variable that is explained by the independent variable(s). \(R^2\) is expressed between 0 and 1 for the level of variance explained. As we learned in the previous section, the ratio \(\frac{SSE}{TSS}\) should be low for a robust model; this ratio signifies the error or unexplained variance left by the independent variable(s). Mathematically, \(R^2\) or explained variance can be expressed as:</li></ul><p>$$R^2 = 1 - \frac{SSE}{TSS}$$</p><ul><li><p>We got an \(R^2\) of <strong>0.93</strong>, which is pretty good.</p></li><li><p>Adjusted \(R^2\): For linear models, adjusted \(R^2\) is a corrected goodness-of-fit statistic. It determines the proportion of variance in the target field explained by the input or inputs. \(R^2\) tends to overestimate the goodness-of-fit of the linear regression. It always grows as the number of independent variables in the model grows. 
This happens because adding variables always reduces (or at least never increases) the SSE, so the ratio \(\frac{SSE}{TSS}\) keeps shrinking and \(R^2\) looks high even when the model might not be appropriate for production data. Adjusted \(R^2\) accounts for this overestimation. Considering N as the total sample size and p as the number of independent variables, adjusted \(R^2\) can be expressed as:</p></li></ul><p>$$\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(N - 1)}{N - p - 1}$$</p><ul><li><p>F-Statistic: The F-statistic can be used for hypothesis testing about whether the slope is meaningful or not. It tests the significance of the regression coefficients in linear regression models and can be calculated as MSR/MSE, where MSR is the mean sum of squares due to regression and MSE is the mean sum of squares due to error. The null hypothesis is that the slope is 0, i.e., there is no relationship between the predictor and target variables. If the value of the F-statistic is greater than the critical value, we can reject the null hypothesis and conclude that there's a significant relationship between the predictor variables and the response variable.</p></li><li><p>Prob (F-Statistic): The p-value of the F-statistic is the probability of observing this result by random chance if the null hypothesis were true. Here it is very small, so it is highly unlikely that \(H_0 : \beta_1 = 0\) holds; in other words, it is highly likely that the slope is not zero and that the model captures a real relationship.</p></li></ul><h3 id="heading-did-you-know-iv"><strong>Did you know - IV</strong></h3><h3 id="heading-r2-can-take-negative-values">\(R^2\) <strong>can take negative values!</strong></h3><p>It compares the fit of the chosen model with that of a horizontal straight line (the null hypothesis). 
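</p><p>For instance, scoring a constant prediction far from the data with scikit-learn's r2_score (assuming sklearn is available; toy values for illustration) gives a value well below zero:</p><pre><code class="lang-python">import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_bad = np.full(4, 10.0)   # constant guess far away from the data

print(r2_score(y_true, y_bad))  # well below zero (-45.0 here)</code></pre><p>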
If the chosen model fits worse than a horizontal line, then \(R^2\) is negative, which essentially means the model is performing worse than simply predicting the mean of the data.</p><blockquote><h3 id="heading-think-about-it-iii"><strong>Think about it - III</strong></h3><p>What would happen if we did not include the constant while building the Linear Regression model?</p></blockquote><pre><code class="lang-python"><span class="hljs-comment"># Extract the B0, B1</span>
print(lr.params)
b0 = lr.params[<span class="hljs-number">0</span>]
b1 = lr.params[<span class="hljs-number">1</span>]

<span class="hljs-comment"># Plot the fitted line on training data</span>
figure(figsize=(<span class="hljs-number">8</span>, <span class="hljs-number">6</span>), dpi=<span class="hljs-number">80</span>)
plt.scatter(x_train, y_train)
plt.plot(x_train, b0 + b1*x_train, <span class="hljs-string">'r'</span>)
plt.xlabel(<span class="hljs-string">"Cost"</span>)
plt.ylabel(<span class="hljs-string">"Score"</span>)
plt.title(<span class="hljs-string">"Regression line through the Training Data"</span>)
plt.show()</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685279414878/bf1fc3ea-3b8c-4aae-b9ef-2ad400fd2d4f.png" alt class="image--center mx-auto" /></p><p>In this plot, we extract the values of the intercept \(\beta_0\) and slope \(\beta_1\) and plot the regression line over the scatter plot of the Cost and Score of the training data.</p><p>The regression line fits well, though it probably deviates a little after a cost of 125 or so. Let's see if we can improve it in the later sections when we diagnose and remedy, but first let's see how our model performs on the test data.</p><h2 id="heading-prediction-on-test-data"><strong>Prediction on Test Data</strong></h2><pre><code class="lang-python"><span class="hljs-comment"># Plot the fitted line on test data</span>
x_test_with_intercept = sm.add_constant(x_test)
y_test_fitted = lr.predict(x_test_with_intercept)
<span 
class="hljs-comment"># scatter plot on test data</span>
figure(figsize=(<span class="hljs-number">8</span>, <span class="hljs-number">6</span>), dpi=<span class="hljs-number">80</span>)
plt.scatter(x_test, y_test)
plt.plot(x_test, y_test_fitted, <span class="hljs-string">'r'</span>)
plt.xlabel(<span class="hljs-string">"Cost"</span>)
plt.ylabel(<span class="hljs-string">"Score"</span>)
plt.title(<span class="hljs-string">"Regression line through the Testing Data"</span>)
plt.show()</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685279561861/f204f7b7-0410-4ef2-8bee-fb764ab8d5aa.png" alt class="image--center mx-auto" /></p><p>Here we can see that the model has built a good regression fit, as the line passes through the middle of the points to achieve the minimum error.</p><p>Observe that all the data points in the test data lie in the range of the training data. This is called interpolation. What if we analyze a data point with a cost of, say, 560? That is extrapolation, and the model probably won't be robust to it.</p><h2 id="heading-assumptions-of-linear-regression"><strong>Assumptions of Linear Regression</strong></h2><p>Linear regression is a parametric model, which means it assumes a specific functional form for the relationship and a set of conditions on the data before it can be used to make predictions. 
These assumptions are:</p><ul><li><p><strong>The relationship between the independent and dependent variables is linear</strong>: the line of best fit through the data points is a straight line.</p></li><li><p><strong>Homoscedasticity</strong>: the variance of the residuals is constant across the values of the independent variable.</p></li><li><p><strong>Independence of observations:</strong> the observations in the dataset were collected using statistically valid sampling methods, and there are no hidden relationships among observations.</p></li><li><p><strong>Normality</strong>: the residuals follow a normal distribution.</p></li></ul><h2 id="heading-diagnostics-and-remedies"><strong>Diagnostics and Remedies</strong></h2><p>As we learned in the previous section, Linear Regression rests on some assumptions. Diagnostics and remedies involve evaluating whether the data satisfies these assumptions, whether Linear Regression is a good fit for the patterns in the data, and, more generally, the things we do in order to assess how well the model performs. 
The following are the things we look for in the data to diagnose Linear Regression as an unfit model:</p><ul><li><p><strong>Non-Linearity</strong>: The first thing to look for is non-linearity; for example, your data might look linear for a while and then show non-linearity, so that a parabola would fit better than a straight line.</p></li><li><p><strong>Heteroscedasticity</strong>: meaning non-constant variance; the variance in one region may not be the same as in another region.</p></li><li><p><strong>Non-independence</strong>: the errors are not independent and identically distributed.</p></li><li><p><strong>Outliers</strong>: Outliers can have a large impact on the model; for example, if there's a slow-growing regression line and there is an outlier up in the center, it will pull the regression line upwards, away from most of the data.</p></li><li><p><strong>Missing Features</strong>: predictor variables that could be useful but are not included in the model (less of a concern in simple linear regression, which uses a single predictor).</p></li></ul><p>How do we begin to assess all these things?</p><p><strong>Residual Analysis</strong></p><p>Residual analysis is used to study the residuals in the data and to understand what needs to be done to improve our model's performance. A residual is the error we get by subtracting the predicted value from the true value of the dependent variable.</p><p>$$R_i = y_i-\hat y_i$$</p><ul><li><p>First, plot the residuals versus the predictor; if the scatter plot shows a departure from linearity (e.g., a parabola), reevaluate the model or try a transformation to make the relationship linear.</p></li><li><p>This plot also shows indications of non-constant variance; if the data points scatter in the shape of a megaphone, we can claim the variance is not constant. We can use transformations to overcome heteroscedasticity, or we can use weighted least squares.</p></li><li><p>Another plot that can be used is a sequence plot, or residuals versus time order. 
We may want to search for a cyclical pattern or a linear trend, which indicates when linear regression would be useful and when it would not.</p></li><li><p>Box plot of residuals: if the box plot is symmetric and centered around zero, we are fine; if it is noticeably shifted to one side, the residuals depart from normality. We can also check this with a normal probability plot, which reveals whether the distribution is skewed to the right or to the left.</p></li><li><p>Next, check for outliers. Don't eliminate outliers unless you absolutely have to, such as when a data point is simply incorrect.</p></li></ul><pre><code class="lang-python">#DIAGNOSTICS
#CHECKLIST:
# NON-LINEARITY
# NON-CONSTANT VARIANCE
# DEVIATIONS FROM NORMALITY
# ERRORS NOT IID
# OUTLIERS
# MISSING PREDICTORS

#Build predictions on training data
predictions_y = lr.predict(x_train_with_intercept)

#Find residuals
r_i = (y_train - predictions_y)

#Residuals vs. predictor in training data
figure(figsize=(8, 6), dpi=80)
plt.title('Residuals vs. Cost')
plt.xlabel('Cost', fontsize=15)
plt.scatter(x_train, r_i)
plt.show()</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685279815215/b2e9d966-4f14-449f-9847-1aaf25db0cba.png" alt class="image--center mx-auto" /></p><pre><code class="lang-python">#Absolute residuals against predictor
abs_r_i = np.abs(y_train - predictions_y)
figure(figsize=(8, 6), dpi=80)
plt.title('Absolute Residuals vs. Cost')
plt.xlabel('Cost', fontsize=15)
plt.scatter(x_train, abs_r_i)
plt.show()</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685279894722/5c4886f9-e637-4aa6-9cd2-864a78a68c30.png" alt class="image--center mx-auto" /></p><pre><code class="lang-python">#Normality plot
figure(figsize=(8, 6), dpi=80)
scipy.stats.probplot(r_i, plot=plt)</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685279966102/fc31c122-86b8-47c3-b4fd-b714876ef3c7.png" alt class="image--center mx-auto" /></p><pre><code class="lang-python">#Tails might be a little heavy, but overall no clear reason to reject normality expectations

#Evaluate normality through a histogram of residuals
figure(figsize=(8, 6), dpi=80)
sns.distplot(r_i, bins=15)
plt.title('Error Terms', fontsize=15)
plt.xlabel('y_train - y_train_pred', fontsize=15)
plt.show()</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685280042922/033cc19f-70e1-479b-835f-60b98fb3d015.png" alt class="image--center mx-auto" /></p><pre><code class="lang-python">#Boxplot for outliers
figure(figsize=(8, 6), dpi=80)
plt.boxplot(r_i, boxprops=dict(color='red'))
plt.title('Residual Boxplot');</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685280096007/c9608650-70de-4b5b-bfcb-3defc9c2b995.png" alt class="image--center mx-auto" /></p><p>At the beginning of the section Diagnostics and Remedies, we saw the steps we can take to understand the patterns in the data and whether we need transformations to make the assumptions of linear regression hold. Here are the observations from the plots above:</p><ul><li><p>The residuals vs. cost plot shows a good scatter of residuals with no pattern up until a cost of roughly 125 to 150. Beyond that, there is some heteroscedasticity at the higher costs; we'll see how to tackle it.</p></li><li><p>The normal probability plot and the histogram show that the errors are more or less normal, or bell-shaped.</p></li><li><p>The residual boxplot shows no obvious outliers.</p></li></ul><h3 id="heading-transformations-to-avoid-non-constant-variance"><strong>Transformations to avoid non-constant variance</strong></h3><p>Non-constant variance can be a problem in linear regression, as constant variance of the errors is a key requirement for the ordinary least squares (OLS) method to be efficient and for its standard errors to be trustworthy. 
When this assumption is violated, the coefficient estimates remain unbiased but are no longer efficient, and the usual standard errors (and hence confidence intervals and p-values) can be misleading. To avoid non-constant variance, different data transformations can be applied.</p><ul><li><p>Log transformation: often used when the variance of the data increases with the mean. A log transformation stabilizes the variance by converting the data into logarithmic values.</p></li><li><p>Square root transformation: also used to stabilize the variance, by converting the data into square root values.</p></li><li><p>Box-Cox transformation: a statistical transformation that stabilizes the variance by transforming the data into values that are closer to a normal distribution. The Box-Cox transformation is more flexible and powerful than the log and square root transformations.</p></li><li><p>Yeo-Johnson transformation: a newer and more flexible version of the Box-Cox transformation that can handle both positive and negative data values.</p></li></ul><pre><code class="lang-python">#Demo of how to deal with non-constant variance through transformations
test_residuals = (y_test - y_test_fitted)

#Sanity-check that the lengths match
len(y_test)
len(y_test_fitted)
len(test_residuals)

#Residuals vs. predictor in test set
figure(figsize=(8, 6), dpi=80)
plt.title('Test Residuals vs. Cost')
plt.xlabel('Cost', fontsize=15)
plt.ylabel('Residuals', fontsize=15)
plt.scatter(x_test, test_residuals)
plt.show()
#Some evidence of non-constant variance</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685280221649/a330203b-1f4d-4204-b1ac-57a204d9f9f6.png" alt class="image--center mx-auto" /></p><p>We can see that the scatter of the data points increases as the cost increases. This is evidence of heteroscedasticity.</p><p>We'll try different transformations, such as square root, log, and Box-Cox, to see whether they stabilize the variance.</p><pre><code class="lang-python">#Try sqrt
sqrt_y = np.sqrt(y)
figure(figsize=(8, 6), dpi=80)
plt.scatter(x, sqrt_y, color='red');

#Try ln
ln_y = np.log(y)
plt.scatter(x, ln_y, color='blue');

#Try a Box-Cox transformation on all cost values
bc_y = list(stats.boxcox(y))
bc_y = bc_y[0]
plt.scatter(x, bc_y, color='orange');

#Overall, most satisfied with the sqrt transformation</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685280280936/e23140e8-639b-4566-ab48-1983a8d88298.png" alt class="image--center mx-auto" /></p><p>We can observe that the square root transformation, denoted by the red dots, gives the most linear scatter of data points. 
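</p><p>The notebook proceeds with the square root transformation. The Yeo-Johnson transformation listed earlier was not demonstrated, so here is a minimal, hedged sketch on synthetic skewed data; it is the one of the four that also accepts zero and negative values:</p>

```python
# Hedged sketch: Yeo-Johnson on synthetic right-skewed data that
# includes negative values, which Box-Cox cannot handle.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.lognormal(mean=1.0, sigma=0.6, size=500) - 1.0  # skewed, some negatives

yj_y, fitted_lambda = stats.yeojohnson(y)
print(f"estimated lambda: {fitted_lambda:.3f}")
print(f"skewness before: {stats.skew(y):.2f}, after: {stats.skew(yj_y):.2f}")
```

<p>The transformation estimates its lambda parameter by maximum likelihood and should pull the skewness of this sample much closer to zero.</p><p>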
Let's run the linear regression model on the transformed variable and analyze the change in results.</p><pre><code class="lang-python">#Run regression on this set
x_train, x_test, y_train, y_test = train_test_split(x, sqrt_y, train_size=0.75, test_size=0.25, random_state=100)

#Force an intercept term
x_train_with_intercept = sm.add_constant(x_train)
lr = sm.OLS(y_train, x_train_with_intercept).fit()
print(lr.summary())</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685280352331/9b77ca6c-1e5a-4389-ad27-945f3893def8.png" alt class="image--center mx-auto" /></p><p>We can see the change in \(R^2\) and Adjusted \(R^2\) after the transformation. The two are now almost identical, which suggests that \(R^2\) is no longer overestimating the variance explained by the predictor variable.</p><pre><code class="lang-python">#Extract B0 and B1
print(lr.params)
b0 = lr.params[0]
b1 = lr.params[1]

#Plot the fitted line on training data
figure(figsize=(8, 6), dpi=80)
plt.scatter(x_train, y_train)
plt.plot(x_train, b0 + b1 * x_train, 'r')
plt.xlabel("Cost")
plt.ylabel("Score")
plt.title("Regression line through the Training Data")
plt.show()</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685280504708/d54e1dac-e68c-4a1a-b384-4ba12750d949.png" alt class="image--center mx-auto" /></p><p>We extracted the linear regression coefficients and plotted the regression line on the Cost vs. Score scatter plot.</p><pre><code class="lang-python">#Plot the fitted line on test data
x_test_with_intercept = sm.add_constant(x_test)
y_test_fitted = lr.predict(x_test_with_intercept)
figure(figsize=(8, 6), dpi=80)
plt.scatter(x_test, y_test)
plt.plot(x_test, y_test_fitted, 'r')
plt.show()</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685280573937/b5e7116d-ed82-4a6e-a9e9-ce5acbba02f9.png" alt class="image--center mx-auto" /></p><pre><code class="lang-python">#Evaluate variance
#Diagnostics
test_residuals = (y_test - y_test_fitted)
len(y_test)
len(y_test_fitted)
len(test_residuals)

#Residuals vs. predictor
figure(figsize=(8, 6), dpi=80)
plt.title('Residuals vs. Cost')
plt.xlabel('Cost', fontsize=15)
plt.scatter(x_test, test_residuals)
plt.show()</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685280731077/bac23250-7323-4dec-a29f-e40c91c6be9e.png" alt class="image--center mx-auto" /></p><h2 id="heading-conclusion"><strong>Conclusion</strong></h2><p>This project is coming to a close, so let's go through what we learned and what we can do next. 
With a Simple Linear Regression problem statement, we learned the fundamentals of Linear Regression by predicting the scores of soccer players based on their cost.</p><p>We grasped the assumptions behind Linear Regression and learned how to diagnose violations of those assumptions and remedy them.</p><p>The next phase in the learning process should be to introduce multiple linear regression and the issues that come with developing a multiple linear regression model, such as multicollinearity.</p><p>In the next section, we will explore some real-world applications of Linear Regression to give you a taste of this fascinating concept.</p><h2 id="heading-linear-regression-in-real-life"><strong>Linear Regression in Real Life</strong></h2><p>There are many real-world applications of linear regression. Exploring some of the best of them will help us grasp the concept more clearly.</p><ol><li><p>Everything has a shelf life, and humans are no exception. Thanks to ongoing improvements in medical technology and diagnostic tools, we can store vast amounts of information about a person's medical history and estimate how long they will live. The term "life expectancy" describes the number of years one can expect to live. Insurance companies and public healthcare organizations frequently use this application to improve their productivity and achieve organizational goals.</p></li><li><p>A common method used by agricultural scientists to assess how fertilizer and water affect crop yields is linear regression. For instance, researchers may vary the water and fertilizer applications in various fields to observe the effects on crop yield. A multiple linear regression model can be used with crop yield as the target variable and fertilizer and water as the predictor variables.</p></li><li><p>For professional sports teams, analysts use linear regression to gauge the impact of various training schedules on player performance. 
For instance, data scientists in the NBA may examine how various weekly frequencies of yoga and weightlifting sessions affect a player's point total. With yoga and weightlifting sessions as the predictor variables and total points earned as the response variable, they could fit a multiple linear regression model.</p></li></ol>]]><![CDATA[<h2 id="heading-overview"><strong>Overview</strong></h2><p>The English Premier League is one of the world's most-watched soccer leagues, with an estimated audience of 12 million people per game. Given the substantial financial stakes, all major EPL teams are interested in analytics and AI. In sports analytics, machine learning and artificial intelligence (AI) have become extremely popular. The sports entertainment sector and the relevant stakeholders extensively use sophisticated algorithms to improve earnings and reduce the business risk associated with selecting, or betting on, the wrong players.</p><p><img src="https://cdn.pixabay.com/photo/2016/04/15/20/28/football-1331838__340.jpg" alt="image" class="image--center mx-auto" /></p><p>Regression is one of the foundational techniques in Machine Learning. As one of the most well-understood algorithms, linear regression plays a vital role in solving real-life problems. In this project, we will use Linear Regression to predict the scores of EPL soccer players. With the business implications clear, let's get into the project's technical details.</p><p>This project is part of the Linear Regression Beginner Project Series, and it consists of discussing and implementing the fundamentals of Linear Regression in Python on the EPL Soccer Player Dataset.</p><h2 id="heading-execution-instructions"><strong>Execution Instructions</strong></h2><h3 id="heading-option-1-running-on-your-computer-locally">Option 1: Running on your computer locally</h3><p>To run the notebook on your local system, set up a <a target="_blank" href="https://www.python.org/">Python</a> environment. 
Set up <a target="_blank" href="https://jupyter.org/install">Jupyter Notebook</a> with Python or by using the <a target="_blank" href="https://anaconda.org/anaconda/jupyter">Anaconda distribution</a>. <a target="_blank" href="https://github.com/HemathDeunei/1_LR_model_using_EPL_soccer_dat">Download the notebook</a> and open it in Jupyter to run the code on the local system.</p><p>The notebook can also be executed using <a target="_blank" href="https://code.visualstudio.com/">Visual Studio Code</a> or <a target="_blank" href="https://www.jetbrains.com/pycharm/">PyCharm</a>.</p><h3 id="heading-option-2-executing-with-colab">Option 2: Executing with Colab</h3><p>Colab, or "Colaboratory", allows you to write and execute Python in your browser, with access to GPUs free of charge and easy sharing.</p><p>You can run the code using <a target="_blank" href="https://colab.research.google.com/">Google Colab</a> by uploading the <a target="_blank" href="https://github.com/HemathDeunei/1_LR_model_using_EPL_soccer_dat">ipython notebook</a>.</p><h2 id="heading-approach"><strong>Approach</strong></h2><ul><li><p>Install Packages</p></li><li><p>Import Libraries</p></li><li><p>Data Reading from Different Sources</p></li><li><p>Exploratory Data Analysis</p></li><li><p>Correlation</p></li><li><p>Relationship between Cost and Score</p></li><li><p>Train - Test Split</p></li><li><p>Linear Regression</p></li><li><p>Model Summary</p></li><li><p>Prediction of Test Data</p></li><li><p>Diagnostics and Remedies</p></li></ul><h2 id="heading-important-libraries"><strong>Important Libraries</strong></h2><ul><li><p><strong>pandas</strong>: pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool built on top of the Python programming language. Refer to the <a target="_blank" href="https://pandas.pydata.org/">documentation</a> for more information.</p></li><li><p><strong>NumPy</strong>: The fundamental package for scientific computing with Python. 
Fast and versatile, the NumPy vectorization, indexing, and broadcasting concepts are the de-facto standards of array computing today. NumPy offers comprehensive mathematical functions, random number generators, linear algebra routines, Fourier transforms, and more. Refer to <a target="_blank" href="https://numpy.org/">documentation</a> for more information. pandas and NumPy are together used for most of the data analysis and manipulation in Python.</p></li><li><p><strong>Matplotlib</strong>: Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. Refer to <a target="_blank" href="https://matplotlib.org/">documentation</a> for more information.</p></li><li><p><strong>seaborn</strong>: Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. Refer to <a target="_blank" href="https://seaborn.pydata.org/">documentation</a> for more information.</p></li><li><p><strong>scikit-learn</strong>: Simple and efficient tools for predictive data analysis accessible to everybody and reusable in various contexts. It is built on NumPy, SciPy, and matplotlib to support machine learning in Python. Refer to <a target="_blank" href="https://scikit-learn.org/stable/">documentation</a> for more information.</p></li><li><p><strong>statsmodels</strong>: statsmodels is a Python module that provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests and statistical data exploration. Refer to <a target="_blank" href="https://www.statsmodels.org/stable/index.html">documentation</a> for more information.</p></li><li><p><strong>SciPy</strong>: SciPy provides algorithms for optimization, integration, interpolation, eigenvalue problems, algebraic equations, differential equations, statistics, and many other classes of problems. 
Refer to the <a target="_blank" href="https://scipy.org/">documentation</a> for more information.</p></li></ul><h2 id="heading-install-packages"><strong>Install Packages</strong></h2><pre><code class="lang-python">import warnings
warnings.filterwarnings('ignore')</code></pre><pre><code class="lang-python">import sys
!{sys.executable} -m pip install numpy
!{sys.executable} -m pip install seaborn
!{sys.executable} -m pip install matplotlib
!{sys.executable} -m pip install statsmodels
!{sys.executable} -m pip install pandas
!{sys.executable} -m pip install scipy
!{sys.executable} -m pip install scikit_learn</code></pre><h2 id="heading-data-reading-from-different-sources"><strong>Data Reading from Different Sources</strong></h2><h4 id="heading-1-files"><strong>1. Files</strong></h4><p>In many cases, the data is stored on the local system. To read the data from the local system, specify the correct path and filename.</p><ul><li><strong>CSV format</strong></li></ul><p>Comma-separated values, also known as CSV, is a specific way to store data in a table-structured format. The data used in this project is stored in a CSV file. Click <a target="_blank" href="https://data.hemath.com/access/file_csv/1_Soccer_Data.csv">here</a> to download the data used in this project.</p><p>Use the following code to read data from a CSV file using pandas.</p><pre><code class="lang-python">import pandas as pd

# Use a raw string so the backslashes in a Windows path are not
# interpreted as escape sequences
csv_file_path = r"C:\Users\Hemath\Desktop\1_Soccer_Data.csv"
df = pd.read_csv(csv_file_path)</code></pre><p>With an appropriate csv_file_path, the pd.read_csv() function will read the data and store it in the df variable.</p><p>If you get <em>FileNotFoundError or No such file or directory</em>, try checking the path provided in the function. 
It's possible that Python is not able to find the file or directory at the given location.</p><ul><li><strong>Public URL</strong></li></ul><p>The pd.read_csv() method also works if the data is available at any public URL.</p><pre><code class="lang-python">import pandas as pd

data_url = "https://data.hemath.com/access/file_csv/1_Soccer_Data.csv"
df = pd.read_csv(data_url)</code></pre><h4 id="heading-2-database"><strong>2. Database</strong></h4><p>Most organizations store their data in databases such as <a target="_blank" href="https://www.mysql.com/">MySQL</a> or <a target="_blank" href="https://www.postgresql.org/">Postgres</a>. The data can be accessed with credentials, which will be in the following format.</p><pre><code class="lang-python">host = "localhost"
database = "db_name"
user = "root"
password = "password"</code></pre><p>MySQL is an open-source relational database management system.</p><p>In this project, we will demonstrate how to connect Python to a MySQL server to fetch the data. We will use the <a target="_blank" href="https://pypi.org/project/PyMySQL/">pymysql</a> library to connect to the MySQL server.</p><p>Convert the <a target="_blank" href="https://data.hemath.com/access/file_csv/1_Soccer_Data.csv">CSV</a> into SQL insert statements using this <a target="_blank" href="https://www.convertcsv.com/csv-to-sql.htm">link</a>. 
Create a local database using MySQL or Postgres and execute the SQL query.</p><p>Use this code to connect Python to MySQL and fetch the data.</p><pre><code class="lang-python">#Install the pymysql library
import sys
!{sys.executable} -m pip install pymysql

import pymysql

connection = pymysql.connect(
    host = "localhost",
    user = "user",
    password = "mypassword",
    database = "db_name"
)
df = pd.read_sql("SELECT * FROM table_name", connection)
df.head()</code></pre><pre><code class="lang-python">#Import required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
import seaborn as sns
from scipy import stats
import scipy
from matplotlib.pyplot import figure</code></pre><pre><code class="lang-python"># Load the data as a data frame by using the URL
soccer_data_url = "https://data.hemath.com/access/file_csv/1_Soccer_Data.csv"
df = pd.read_csv(soccer_data_url)</code></pre><pre><code class="lang-python">#View the top 3 entries from the soccer data
df.head(3)</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685263008036/be03bb1d-0980-41c9-a3ef-36a3b6074055.png" alt class="image--center mx-auto" /></p><pre><code class="lang-python">df.columns</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685263041462/f71ef19d-270b-4860-96b9-14d2f0a60c43.png" alt class="image--center mx-auto" /></p><h2 id="heading-data-dictionary"><strong>Data Dictionary</strong></h2><ul><li><p>PlayerName: Player name</p></li><li><p>Club: Club of the player</p><ol><li><p>MUN: Manchester United F.C.</p></li><li><p>CHE: Chelsea F.C.</p></li><li><p>LIV: Liverpool F.C.</p></li></ol></li><li><p>DistanceCovered(InKms): Average distance in km covered by the player in each game</p></li><li><p>Goals: Average goals per match</p></li><li><p>MinutestoGoalRatio: Minutes</p></li><li><p>ShotsPerGame: Average shots taken per game</p></li><li><p>AgentCharges: Agent fees in h</p></li><li><p>BMI: Body-mass index</p></li><li><p>Cost: Cost of each player in hundred thousand dollars</p></li><li><p>PreviousClubCost: Previous club cost in hundred thousand dollars</p></li><li><p>Height: Height of player in cm</p></li><li><p>Weight: Weight of player in kg</p></li><li><p>Score: Average score per match</p></li></ul><h2 id="heading-exploratory-data-analysis"><strong>Exploratory Data Analysis</strong></h2><p>Exploratory Data Analysis, commonly known as EDA, is a technique to analyze data with visuals. It involves using statistics and visual techniques to identify particular trends in the data.</p><p>It is used to understand data patterns, spot anomalies, check assumptions, etc. 
The main purpose of EDA is to help us look into the data before making any hypothesis about it.</p><h3 id="heading-dataframe-information"><strong>Dataframe Information</strong></h3><p>The <a target="_blank" href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html">pandas.DataFrame.info()</a> method prints information about the DataFrame, including the index dtype and columns, non-null values, and memory usage.</p><p>It can be used to get basic info, look for missing values, and get a sense of each variable's format.</p><pre><code class="lang-python">df.info()</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685265186187/fb153482-3ba0-49b4-8a4d-135446314c2c.png" alt class="image--center mx-auto" /></p><p>There are a total of 202 rows and 13 columns in the EPL Soccer Dataset.</p><p>Observe that there are no null values in the dataset.</p><p>Out of the 13 columns, 10 are float type, 1 is integer type, and the remaining 2 are object dtype.</p><p>Learn about essential basic functionality for the pandas dataframe <a target="_blank" href="https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#basics-dtypes">here</a>.</p><h2 id="heading-basic-statistical-concepts"><strong>Basic Statistical Concepts</strong></h2><ul><li><strong>Mean</strong>: The mean is one of the measures of central tendency. Simply put, the mean is the average of the values in the given set. The observed values are totaled and divided by the total number of observations to determine the mean. If \(x_i\) is the \(i^{th}\) observation, then the mean of all \(x_i\) for \(1\leq i\leq n\), denoted by \(\bar x\), is given as</li></ul><p>$$\bar{x} = \sum_{i=1}^{n}\frac{x_i}{n}$$</p><ul><li><strong>Variance</strong>: Variance is a measure of variation, calculated by averaging the squared deviations from the mean. It indicates the degree of spread in your data set: the greater the spread of the data, the larger the variance in relation to the mean. Here's the formula for the variance of a sample.</li></ul><p>$$S^2 = \frac{\sum_{i=1}^{n}(x_i-\bar x)^2}{n-1}$$</p><ul><li><strong>Standard Deviation</strong>: The standard deviation is a measure of how much variation (spread, dispersion) exists around the mean. It represents a "typical" departure from the mean and is a popular measure of variability, since it is expressed in the data set's original units of measurement. Here's the formula for the standard deviation of a sample.</li></ul><p>$$S = \sqrt \frac{\sum_{i=1}^{n}(x_i-\bar x)^2}{n-1}$$</p><h2 id="heading-dataframe-description"><strong>Dataframe Description</strong></h2><p>To generate descriptive statistics, the <a target="_blank" href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html">pandas.DataFrame.describe()</a> function is used.</p><p>Descriptive statistics include those that summarize the central tendency, dispersion, and shape of a dataset's distribution, excluding NaN values.</p><p>It is used to get a basic description of the data, looking at the spread of the different variables, along with abrupt changes between the minimum, 25th, 50th, 75th, and max values of the different variables.</p><p>The quartiles provide an excellent insight into the range of a set of data. By knowing the 25th, 50th, and 75th percentile points, you can easily establish where a given data point sits in the range and which quartile it falls into.</p><ul><li><p>The 25th percentile is also referred to as the first, or lower, quartile. It is the value below which 25% of the data falls and above which 75% falls.</p></li><li><p>The median is also known as the 50th percentile. The median divides the set of data in half. 
Half of the data points are below the median, while the other half are above it.</p></li><li><p>The 75th percentile is often referred to as the third, or upper, quartile. It is the value above which 25% of the data falls and below which 75% falls.</p></li></ul><h3 id="heading-descriptive-statistic-for-quantitative-variables"><strong>Descriptive statistic for quantitative variables</strong></h3><p>DataFrame.count: Count number of non-NA/null observations</p><p>DataFrame.max: Maximum of the values in the object</p><p>DataFrame.min: Minimum of the values in the object</p><p>DataFrame.mean: Mean of the values</p><p>DataFrame.std: Standard deviation of the observations</p><p>DataFrame.select_dtypes: Subset of a DataFrame including/excluding columns based on their dtype</p><pre><code class="lang-python"># descriptive statistics
df.describe()</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685265493825/f401ad0b-cf8a-4947-9b92-a5ee5f25638b.png" alt class="image--center mx-auto" /></p><h3 id="heading-did-you-know-i"><strong>Did you know - I</strong></h3><p>To get a summary of all the columns, you can pass include='all' to the describe function.</p><pre><code class="lang-python">df.describe(include='all')</code></pre><p>For object data (e.g. strings or timestamps), the result's index will include count, unique, top, and freq. The top is the most common value, and freq is that value's frequency. Timestamps also include the first and last items.</p><h2 id="heading-correlation"><strong>Correlation</strong></h2><p>The correlation coefficient is used to measure the strength of the relationship between two variables. It indicates that as the value of one variable changes, the other variable changes in a specific direction with some magnitude. 
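</p><p>To make this concrete, here is a small sketch (toy numbers, not the soccer data) that computes the Pearson correlation coefficient, introduced next, directly from its definition and cross-checks it against NumPy's built-in function:</p>

```python
# Hedged sketch on toy data: Pearson's r from its definition,
# cross-checked against np.corrcoef.
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # hypothetical player costs
y = np.array([3.1, 5.9, 9.2, 11.8, 15.1])  # hypothetical player scores

def pearson_r(x, y):
    # r = sum((x - xbar)(y - ybar)) / sqrt(sum((x - xbar)^2) * sum((y - ybar)^2))
    xd, yd = x - x.mean(), y - y.mean()
    return (xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum())

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]
print(round(r_manual, 4), round(r_numpy, 4))
```

<p>Both values agree; since the toy y grows almost perfectly linearly with x, r comes out close to 1.</p><p>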
There are various ways to find the correlation between two variables, one of which is the Pearson correlation coefficient. It measures the linear relationship between two continuous variables.</p><p>Let's say \(x\) and \(y\) are two continuous variables; the Pearson correlation coefficient between them can be found by the following formula.</p><p>$$r = \frac{ \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) }{ \sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$</p><p>where \(x_i\) and \(y_i\) represent the \(i^{th}\) values of the variables. The value of \(r\) ranges between \(-1\) and \(1\).</p><p>The strength of the relationship is measured by the absolute value of the coefficient, whereas the sign of the coefficient indicates the direction of the relationship.</p><h2 id="heading-graphs-of-different-correlation-coefficients"><strong>Graphs of Different Correlation Coefficients</strong></h2><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685269765291/bbb311ae-bebd-4acc-b1dd-bd8416179deb.png" alt class="image--center mx-auto" /></p><ol><li><p>\(r=-1\) indicates a perfect negative relationship between the variables</p></li><li><p>\(r=0\) indicates no relationship between the variables</p></li><li><p>\(r=1\) indicates a perfect positive relationship between the variables</p></li></ol><p>To find the correlation between variables from the soccer data, we will use the <a target="_blank" href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html">pandas.DataFrame.corr()</a> method.</p><p>It computes the pairwise correlation between columns, excluding NA or NaN values if any.
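</p><p>As a quick numerical check of the formula above, the Pearson coefficient can be computed by hand and compared with NumPy's built-in version (a small sketch on made-up data):</p><pre><code class="lang-python">import numpy as np

# Toy data: y is roughly a linear function of x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Pearson r computed directly from the formula
r_manual = ((x - x.mean()) * (y - y.mean())).sum() / (
    np.sqrt(((x - x.mean()) ** 2).sum()) * np.sqrt(((y - y.mean()) ** 2).sum())
)

# Same value from NumPy's correlation matrix
r_numpy = np.corrcoef(x, y)[0, 1]

print(round(r_manual, 4), round(r_numpy, 4))</code></pre><p>Both computations agree, and for this nearly linear toy data the coefficient comes out close to \(1\).</p><p>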
The default method used to calculate the correlation coefficient is Pearson correlation.</p><pre><code class="lang-python">df.corr()</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685265812330/cebfafe4-3421-4891-97ee-135b87d62c23.png" alt class="image--center mx-auto" /></p><p>The correlation between DistanceCovered(InKms) and the target variable Score is \(-0.49\), which indicates a negative correlation. The variable Cost is related to the target variable with a correlation coefficient of \(0.96\), which indicates a strong positive relationship.</p><h3 id="heading-did-you-know-ii"><strong>Did you know - II</strong></h3><p>The Pearson correlation coefficient can only measure a linear relationship between variables. Data can exhibit a nonlinear relationship that the Pearson correlation coefficient fails to capture. In such cases, Spearman's correlation coefficient is used. It can detect nonlinear, monotonic relationships and also works for ordinal data.</p><blockquote><h3 id="heading-think-about-it-i"><strong>Think about it - I</strong></h3><p>Can the Pearson or Spearman correlation coefficient be used to find the correlation between categorical variables?</p></blockquote><h2 id="heading-correlation-does-not-imply-causation"><strong>Correlation does not imply Causation!!</strong></h2><p>Some studies show that people in the UK spend more money on shopping when it's cold, which shows a correlation between two variables. Does this imply cold weather causes people to spend more money? The answer is NO. One possible explanation is that cold weather coincides with Christmas and New Year sales, hence people shop more.</p><p>Correlation between two variables indicates an association between them, but it does not mean a change in one variable is caused by the other.</p><h2 id="heading-relationship-between-cost-and-score"><strong>Relationship between Cost and Score</strong></h2><p>Score and Cost have a correlation of \(0.96\), making Cost a significant variable.
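</p><p>As an aside, the Pearson-versus-Spearman distinction from the note above is easy to see numerically. In this sketch on made-up data, the relationship is perfectly monotonic but not linear, so Spearman's coefficient is a perfect \(1\) while Pearson's is not:</p><pre><code class="lang-python">import numpy as np
from scipy.stats import pearsonr, spearmanr

# Monotonic but nonlinear relationship: y = x ** 3
x = np.arange(1, 11, dtype=float)
y = x ** 3

pearson_r, _ = pearsonr(x, y)
spearman_r, _ = spearmanr(x, y)

# Spearman works on ranks, so it sees a perfect monotonic relationship;
# Pearson stays below 1 because the relationship is not linear.
print(round(pearson_r, 3), round(spearman_r, 3))</code></pre><p>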
Cost can be selected as the predictor variable for simple linear regression since the scatter plot between them will demonstrate a linear relationship.</p><p>To see this relationship visually, let's plot the scatter plot for Cost and Score.</p><pre><code class="lang-python"><span class="hljs-comment"># Let's plot cost vs. score</span>
figure(figsize=(<span class="hljs-number">8</span>, <span class="hljs-number">6</span>), dpi=<span class="hljs-number">80</span>)
plt.scatter(df[<span class="hljs-string">'Cost'</span>], df[<span class="hljs-string">'Score'</span>])

<span class="hljs-comment"># label</span>
plt.xlabel(<span class="hljs-string">"Cost"</span>)
plt.ylabel(<span class="hljs-string">"Score"</span>)
plt.title(<span class="hljs-string">"Scatter plot between Cost and Score"</span>)

<span class="hljs-comment"># Strong linear association between cost and score, maybe some concern with model after a cost of 125 or so!</span></code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685266118270/53eb9f2d-bbd1-4494-b5cb-4ea21a69309e.png" alt class="image--center mx-auto" /></p><p>The correlation between Cost and Score is easily visible here.</p><p>The Pearson correlation and scatter plot demonstrate that as the cost increases, so does the score. But what can we do with this knowledge?</p><p>How can we know how much money should be spent to achieve a specific score? This is where Linear Regression comes in. It assists us in modeling the linear relationship between two or more variables so that we may foresee the results using the model.</p><p>Let's figure out how.</p><h2 id="heading-train-test-split"><strong>Train - Test Split</strong></h2><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685266485582/5563bd5b-0fc1-46c5-b60e-3cb1a152125b.png" alt class="image--center mx-auto" /></p><p>The data points are divided into two datasets, train and test, in a train test split method.
The train data is used to train the model, and the model is then used to predict on the test data to see how the model performs on unseen data and whether it is overfitting or underfitting.</p><h3 id="heading-underfitting-and-overfitting"><strong>Underfitting and Overfitting</strong></h3><ul><li><p><strong>Underfitting</strong>: Underfitting occurs when a statistical model or machine learning algorithm fails to capture the underlying trend of the data, i.e., it performs poorly even on the training data. Its occurrence merely indicates that our model or method does not adequately suit the data. It frequently occurs when we select a simple model yet the data contains complicated non-linear patterns, or when there is insufficient data to fit the model. The obvious remedy is to build a more complex model or add more informative features to the data.</p></li><li><p><strong>Overfitting</strong>: When a statistical model fails to produce correct predictions on testing data, it is said to be overfitted. When a model is fitted too closely to the training data, it begins to learn from the noise and incorrect data entries in our data set. It usually occurs when we build a complex model on a simpler dataset. An overfitted model performs well on training data because it has memorized the patterns in the data, but it performs poorly on testing data. An under-fitted model, on the other hand, will perform poorly on both datasets because it is unable to capture the trends and patterns underlying the dataset during training.</p></li></ul><blockquote><h3 id="heading-think-about-it-ii"><strong>Think about it - II</strong></h3><p>Assume you have a dataset with both categorical and numerical variables.
When you create a linear regression model on it, you notice that it performs poorly on training data and even worse on testing data.</p><p>You conclude that the model is underfitting and that a complex model is needed, so you use polynomial regression with a high degree. Your model now performs extremely well on training data but significantly poorly on testing data. It has now overfitted.</p><p>What do you believe happened? Do you think you'd have to find a sweet spot between simple and complicated models? How can you do it?</p></blockquote><pre><code class="lang-python"><span class="hljs-comment"># Assign x, y then do training testing split</span>
x = df[<span class="hljs-string">'Cost'</span>]
y = df[<span class="hljs-string">'Score'</span>]

<span class="hljs-comment"># Splitting with 75% training, 25% testing data</span>
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = <span class="hljs-number">0.75</span>, test_size = <span class="hljs-number">0.25</span>, random_state = <span class="hljs-number">100</span>)</code></pre><p>The data is first assigned to the input variable (x) and output variable (y) accordingly; then the train_test_split function from sklearn is used to perform the split in a ratio of 75:25 with a random state of 100. The random state is a seed used to randomly generate the indices for the train and test sets.</p><h2 id="heading-linear-regression"><strong>Linear Regression</strong></h2><p>Linear Regression is a statistical approach to modeling the linear relationship between predictor variables and the target variable.</p><p>These variables are known as the independent and dependent variables, respectively.</p><p>When there is one independent variable, it is known as <strong>simple linear regression</strong>.
When there are more independent variables, it is called <strong>multiple linear regression</strong>.</p><p><strong>Simple Linear Regression</strong>: \( \hat y = \beta_0+\beta_1x+\epsilon\)</p><p><strong>Multiple Linear Regression</strong>: \(\hat y = \beta_0+\beta_1x_1+\dots+\beta_px_p+\epsilon\), where \(p\) is the number of features in the model</p><p>Linear regression serves two primary functions: understanding variable relationships and forecasting:</p><ul><li><p>The coefficients represent the estimated magnitude and direction (positive/negative) of each independent variable's relationship with the dependent variable.</p></li><li><p>A linear regression equation predicts the mean value of the dependent variable given the values of the independent variables. So, it enables us to forecast.</p></li></ul><p><strong>Example:</strong> Assume your father owns an ice cream shop. Sometimes there is too much ice cream in the store, and other times there isn't enough to sell. You notice that ice cream sales are much higher on hot days than on cold days.
There appears to be some correlation between the temperature and the sale of ice cream.</p><p>Now you must determine the optimal number of ice creams to store in order to sell enough and have little left over at the end of the day.</p><p>How can you forecast the sale for the next few days?</p><p>Is there any way to predict the sale of the next day given the temperature of the last few days?</p><p>Yes, you can use simple linear regression to model the relationship between temperature and sales.</p><p>Now that we are clear on the "why", let's go ahead to the "how" part of linear regression.</p><h2 id="heading-mathematics-behind-linear-regression"><strong>Mathematics behind Linear Regression</strong></h2><p>Here's the formula for simple linear regression.</p><p>$$y=\beta_0+\beta_1x+\epsilon$$</p><p>Let's understand each of the terms involved:</p><ul><li><p>y is the dependent variable, the value we want to predict for any given value of the independent variable (x).</p></li><li><p>\(\beta_0\) represents the intercept, or the predicted value of y when x is 0.</p></li><li><p>\(\beta_1\) is the regression coefficient, which tells us how much y is expected to change, on average, for a one-unit increase in x.</p></li><li><p>x is the independent or predictor variable that helps us predict y</p></li><li><p>\(\epsilon\) is the error term, the part of y that the estimated regression coefficients cannot account for.</p></li></ul><p>Linear regression determines the best fit line across your data by looking for the regression coefficient \(\beta_1\) that minimizes the model's total error \(\epsilon\).</p><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685266456833/373dc767-fa4f-4e7b-87dc-fa00a5530086.png" alt class="image--center mx-auto" /></p><p>Let's understand the regression line with the example graph above.</p><ul><li><p>The \(\beta_0\) parameter indicates the intercept or the constant value of y when x is 0.</p></li><li><p>The \(\beta_1\) parameter is the slope or steepness of the regression line.</p></li><li><p>The distance
between the predicted value of y on the regression line and the corresponding true value of y is basically the error.</p></li></ul><h2 id="heading-errors-in-regression"><strong>Errors in Regression</strong></h2><p>The regression line regresses towards the mean to create the best fit, which essentially means that the errors are at their lowest. In the above plot, it is visible that the regression line is not able to exactly predict the true values. There is always going to be some space for errors.</p><p>Let's understand the various errors in Regression:</p><ol><li>The mean absolute error (MAE) is the most basic regression error statistic to grasp. We compute the residual for each data point individually, using only the absolute value of each so that negative and positive residuals don't cancel out. The average of all these residuals is then calculated. MAE essentially describes the typical magnitude of the residuals.</li></ol><p>$$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i-\hat y_i|$$</p><ol><li>The mean square error (MSE) is identical to the mean absolute error (MAE) but squares the difference before aggregating all of them. The MSE will nearly always be greater than the MAE because we are squaring the difference. Because of this, we are unable to directly compare the MAE and MSE. We are limited to comparing the error metrics of our model to those of a rival model. The presence of outliers in our data makes the square term's impact on the MSE equation very clear. In MAE, each residual adds proportionally to the overall error, whereas in MSE, the error increases quadratically. As a result, outliers in our data will ultimately result in a considerably bigger total error in the MSE than they will in the MAE. Similarly, our model will suffer more if it predicts values that are significantly different from the matching actual value.
This means that in MSE, as opposed to MAE, substantial disparities between actual and predicted values are punished more severely.</li></ol><ul><li>If we wish to limit the influence of outliers, we should use MAE, because outlier residuals do not contribute as much to the overall error as in MSE. Ultimately, the decision between MSE and MAE is application-specific and depends on how large errors should be treated.</li></ul><p>$$MSE= \frac{1}{n}\sum_{i=1}^{n}(y_i-\hat y_i)^2$$</p><ul><li>The root mean squared error (RMSE) is another error statistic you may come upon. It is the square root of the MSE, as the name implies. Because the MSE is squared, its units differ from those of the original output. RMSE is frequently used to transform the error metric back into comparable units, making interpretation easier. Outliers have a comparable effect on the MSE and RMSE because they both square the residual.</li></ul><p>$$RMSE= \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat y_i)^2}$$</p><ul><li>The percentage counterpart of MAE is the mean absolute percentage error (MAPE). Just as MAE is the average magnitude of error created by your model, MAPE is the average percentage deviation of the model's predictions from the actual values. MAPE, like MAE, has a clear meaning because percentages are easier for people to understand. Because of the use of absolute value, MAPE and MAE are both resistant to the effects of outliers.</li></ul><p>$$MAPE= \frac{100\%}{n}\sum_{i=1}^{n}\left| \frac{y_i-\hat y_i}{y_i} \right|$$</p><h2 id="heading-finding-the-best-fit-line"><strong>Finding the Best Fit Line</strong></h2><p>We can proceed with the Linear Regression model after determining the correlation between the independent variable and the target variable and confirming that they are linearly related. Finding the coefficients of linear regression is the process of determining the values of the coefficients ( \(\beta_0\) and \(\beta_1\) ) in the equation of a linear regression model.
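</p><p>Before searching for those coefficients, it helps to have the error metrics defined in the previous section available in code. A minimal sketch using scikit-learn and NumPy on toy values (any small arrays would do):</p><pre><code class="lang-python">import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy actual vs. predicted values
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

mae = mean_absolute_error(y_true, y_pred)          # average |residual|
mse = mean_squared_error(y_true, y_pred)           # average squared residual
rmse = np.sqrt(mse)                                # back in the units of y
mape = 100 * np.mean(np.abs((y_true - y_pred) / y_true))  # percentage error

print(mae, mse, rmse, mape)</code></pre><p>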
The objective of finding the coefficients is to minimize the difference between the actual values of the target variable and the predicted values.</p><p>The Linear Regression model will determine the best fit line for the scatter of data points.</p><p>The equation of the regression line is:</p><p>$$y=\beta_0+\beta_1x$$</p><p>where \(\beta_0\) and \(\beta_1\) are the regression coefficients.</p><p>\(\beta_1\): This is basically the slope of the line, which shows how steep the regression line is. The slope is calculated as the change in y divided by the change in x.</p><p>$$\beta_1 = \frac{\Delta y}{\Delta x}$$</p><p>\(\beta_0\): This is the intercept. It is the value of y when x is 0. When the straight line passes through the origin, the intercept is 0.</p><p>There are infinitely many possibilities for the values of the regression coefficients. How do we find the best fit line out of all the possible lines?</p><p>The best fit line is the one with the minimum error.</p><h3 id="heading-cost-function"><strong>Cost Function</strong></h3><p>The cost function assesses how well a machine learning model performs.
The cost function calculates the difference between predicted and actual values as a single real number.</p><p>The following is the distinction between the cost function and the loss function:</p><p>The loss function is the error for an individual data point, while the cost function is the average error over the n samples in the data.</p><h3 id="heading-residual-sum-of-squares-rss-or-sum-of-squared-errorssse"><strong>Residual Sum of Squares (RSS) or Sum of Squared Errors(SSE)</strong></h3><p>In ordinary least squares, the Residual Sum of Squares (RSS), also called the Sum of Squared Errors (SSE), is minimized to find the values of \(\beta_0\) and \(\beta_1\) that give the best fit of the predicted line.</p><p>$$MSE = \frac{1}{n} RSS = \frac{1}{n}\sum_{i=1}^{n}(y_i-\hat y_i)^2$$</p><p>Hence,</p><p>$$SSE = \sum_{i=1}^{n}(y_i-\hat y_i)^2$$</p><p>There are two main methods to find the coefficients of linear regression: least squares and gradient descent.</p><h3 id="heading-least-squares-estimators"><strong>Least Squares Estimators</strong></h3><p>One of the methods to optimize the Linear Regression equation for the minimum SSE is using Least Squares Estimators. These are the steps involved in finding the best fit line parameters:</p><ol><li><p>Differentiate the SSE with respect to \(\beta_0\) and \(\beta_1\)</p></li><li><p>Setting the partial derivatives equal to zero yields the <strong>normal equations</strong>, which can then be manipulated to find the \(\beta_0\) and \(\beta_1\) for the minimum SSE.</p></li></ol><h3 id="heading-gradient-descent"><strong>Gradient Descent</strong></h3><p>Gradient descent is an optimization algorithm that iteratively adjusts the coefficients to minimize the cost function. The cost function measures the difference between the actual and predicted values. The gradient descent algorithm updates the coefficients using the gradient of the cost function. The gradient of the cost function points in the direction of steepest ascent, so we adjust the coefficients in the opposite direction, the direction of steepest descent.
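</p><p>The update rule just described can be sketched in a few lines of NumPy. This is a toy illustration, not a library implementation; the learning rate and iteration count are arbitrary choices:</p><pre><code class="lang-python">import numpy as np

# Toy data generated exactly from y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1

b0, b1 = 0.0, 0.0   # start from arbitrary coefficients
lr_rate = 0.05      # assumed learning rate
n = len(x)

for _ in range(5000):
    y_hat = b0 + b1 * x
    # Gradients of the MSE cost with respect to b0 and b1
    grad_b0 = (-2 / n) * np.sum(y - y_hat)
    grad_b1 = (-2 / n) * np.sum((y - y_hat) * x)
    # Step opposite to the gradient (the direction of steepest descent)
    b0 -= lr_rate * grad_b0
    b1 -= lr_rate * grad_b1

print(round(b0, 3), round(b1, 3))  # converges towards intercept 1 and slope 2</code></pre><p>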
The process is repeated until the cost function reaches a minimum.</p><h3 id="heading-did-you-know-iii"><strong>Did you know - III</strong></h3><p>Scikit-learn's LinearRegression and Statsmodels' OLS (Ordinary Least Squares) are two popular options for linear regression in Python. While both can be used to perform linear regression, there are some differences between them:</p><ul><li><p>Model Fitting: Scikit-learn provides a simple API for model fitting. The fit method of the LinearRegression class takes in the input features and target variable, and returns the fitted model. On the other hand, Statsmodels provides a more detailed and statistically rigorous approach to model fitting with its OLS class, which allows users to specify various model assumptions and obtain summary statistics and hypothesis tests.</p></li><li><p>Model Summary: Scikit-learn exposes only the fitted coefficients and intercept, while Statsmodels provides a detailed summary of the regression results, including R-squared, the F-statistic, p-values, and confidence intervals for the coefficients. This can be useful for hypothesis testing and model selection.</p></li><li><p>Model Evaluation: Scikit-learn provides several evaluation metrics for regression models, such as mean squared error, mean absolute error, and R-squared. Statsmodels provides similar metrics, but also offers the ability to run hypothesis tests on the coefficients, such as t-tests and F-tests, which are useful for model selection and inference.</p></li><li><p>Speed: Scikit-learn's LinearRegression is optimized for speed, making it a good choice for large datasets. Statsmodels' OLS is less optimized for speed, and can be slower for large datasets.</p></li></ul><p>In conclusion, when it comes to choosing between the two, it depends on the specific requirements of the project. If a simple, fast, and flexible linear regression model is needed, scikit-learn's LinearRegression is a good choice.
If a more detailed statistical analysis of the regression results is needed, with the ability to perform hypothesis tests and detailed evaluation, then Statsmodels' OLS may be a better choice.</p><h2 id="heading-point-estimator-of-the-mean-response"><strong>Point Estimator of the Mean Response</strong></h2><p>Point estimators of the mean response in linear regression refer to estimates of the expected value of the response variable for a given predictor variable. These estimates are calculated using the estimated coefficients of the regression line, which are obtained through regression analysis.</p><p>In linear regression, the mean response is modeled as a linear combination of the predictor variables, where the coefficients represent the effect of each predictor on the response. Given a set of predictor variables, the point estimator for the mean response can be calculated by plugging the values of the predictors into the regression equation and solving for the expected value of the response.</p><p>For example, if we have a simple linear regression model with one predictor, \(x\), and a response, \(y\), the point estimator for the mean response would be given by \(\hat{y} = \hat{\beta_0} + \hat{\beta_1}x\), where \(\hat{\beta_0}\) and \(\hat{\beta_1}\) are the estimated intercept and slope coefficients, respectively.</p><p>Point estimators are useful because they provide a quick and straightforward way to make predictions about the mean response for a given set of predictor values. However, it is important to keep in mind that point estimators are just estimates and are subject to sampling variability and other sources of error.
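</p><p>Computing such a point estimate is a one-liner once the coefficients are available. A minimal sketch (the coefficient values here are made up purely for illustration):</p><pre><code class="lang-python"># Assumed estimated coefficients from some fitted model
beta0_hat = 1.5   # hypothetical intercept estimate
beta1_hat = 0.8   # hypothetical slope estimate

def mean_response(x):
    """Point estimator of the mean response at predictor value x."""
    return beta0_hat + beta1_hat * x

print(mean_response(10))  # 1.5 + 0.8 * 10 = 9.5</code></pre><p>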
Therefore, it is common to also provide confidence intervals or prediction intervals along with point estimators to account for uncertainty in the predictions.</p><h2 id="heading-point-estimator-of-the-variance"><strong>Point Estimator of the Variance</strong></h2><p>In linear regression, the variance of the error (also called residual variance) is a measure of the spread of the residuals around the fitted line. The residuals are the differences between the observed values and the values predicted by the regression model.</p><p>The point estimator of the variance of the error is the sum of squared residuals divided by the degrees of freedom (n-p-1), where n is the number of observations and p is the number of predictors in the model; in the regression setting this quantity is called the mean squared error (MSE). The formula for the point estimator of the variance of the error is:</p><p>$$\hat{\sigma}^2 = \frac{1}{n-p-1} \sum_{i=1}^n (y_i - \hat{y_i})^2$$</p><p>where \(\hat{y_i}\) is the predicted value for the i-th observation, and \(y_i\) is the observed value. The MSE is a measure of the overall fit of the regression model. The smaller the MSE, the better the fit of the model. The variance of the error provides information on the spread of the residuals, which can be used to determine the reliability of the regression model.</p><p>In summary, the point estimator of the variance of the error is an important quantity in linear regression as it provides information on the spread of the residuals, which can be used to evaluate the fit of the regression model.</p><h2 id="heading-sampling-distribution-for-regression-coefficients"><strong>Sampling Distribution for Regression Coefficients</strong></h2><p>In linear regression, the coefficients (also known as parameters) represent the relationship between the independent variable(s) and the dependent variable.
The coefficients can be estimated using different methods, such as the method of least squares, maximum likelihood estimation, or Bayesian inference.</p><p>The sampling distribution of the coefficients is an important aspect of linear regression analysis. It provides information about the variability and uncertainty of the estimates, and allows us to make inferences about the population parameters based on the sample estimates.</p><p>For example, consider a simple linear regression model with one independent variable (x) and one dependent variable (y):</p><p>$$y = \beta_0 + \beta_1 x + \epsilon$$</p><p>where \(\beta_0\) and \(\beta_1\) are the intercept and slope coefficients, respectively, and \(\epsilon\) is the error term.</p><h3 id="heading-t-distribution-and-hypothesis-testing"><strong>T-Distribution and Hypothesis Testing</strong></h3><p>A common way to make inferences about the coefficients is through hypothesis testing. This involves setting up a null hypothesis, \(H_0\), and an alternate hypothesis, \(H_1\). For example, the null hypothesis may be that \(\beta_1=0\), meaning that there is no relationship between the predictor variable and the response variable. The alternate hypothesis may be that \(\beta_1 \neq 0\), meaning that there is a relationship.</p><p>The point estimator of \(\beta_1\) is the sample regression coefficient, denoted by \(\hat{\beta_1}\). It is the value that minimizes the sum of squared differences between the observed y values and the predicted y values. The sampling distribution of \(\hat{\beta_1}\) is approximately normal, with mean \(\beta_1\) and variance \(\sigma^2_{\hat{\beta_1}}\), where \(\sigma^2_{\hat{\beta_1}} = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\).</p><p>The t-distribution is used to model the sampling distribution of the coefficients in linear regression.
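</p><p>These sampling-distribution quantities can be obtained directly from SciPy: <code>scipy.stats.linregress</code> reports the slope estimate, its standard error, and the p-value for \(H_0: \beta_1 = 0\). A sketch on simulated data, where the true slope is 2 by construction:</p><pre><code class="lang-python">import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3.0 + 2.0 * x + rng.normal(0, 1.0, size=x.size)  # true slope is 2

res = linregress(x, y)
# res.stderr is SE(beta1_hat); res.pvalue tests H0: beta1 = 0
print(res.slope, res.stderr, res.pvalue)</code></pre><p>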
The t-distribution takes into account the sample size and the degrees of freedom (df), which is the number of independent observations in the sample minus the number of parameters estimated. For example, in simple linear regression with one predictor, there are 2 parameters (the intercept and the slope) and n - 2 degrees of freedom.</p><p>The <strong>t-distribution</strong> is used to test the hypothesis that the population value of a coefficient is equal to some value. For example, we may want to test the hypothesis that the population slope, \(\beta_1\), is equal to zero, which means that there is no relationship between the predictor and the response. The hypothesis test is performed by calculating the t-statistic and the corresponding p-value. The t-statistic is calculated as:</p><p>$$t = \frac{\hat{\beta_1} - \beta_{1}}{SE(\hat{\beta_1})}$$</p><p>where \(\hat{\beta_1}\) is the sample estimate of the slope, \(\beta_{1}\) is the hypothesized value of the population slope, and \(SE(\hat{\beta_1})\) is the standard error of the estimate of the slope. Mirroring the variance formula \(\sigma^2_{\hat{\beta_1}}\) with \(\sigma^2\) replaced by its estimate \(s^2\), the standard error can be calculated as:</p><p>$$SE(\hat{\beta_1}) = \sqrt{\frac{s^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}$$</p><p>where \(s^2\) is the residual sum of squares divided by the degrees of freedom \(n-2\), and \(n\) is the sample size.</p><p>The t-statistic is compared to the critical value of the t-distribution to determine if the null hypothesis should be rejected. A small p-value indicates that the data provides strong evidence against the null hypothesis, while a large p-value suggests that the data does not provide strong evidence against the null hypothesis.</p><p>The p-value is the probability of observing a t-statistic as extreme or more extreme than the calculated value, assuming the null hypothesis is true. If the p-value is less than a significance level (e.g.
0.05), we reject the null hypothesis and conclude that there is evidence that the population slope is not equal to zero.</p><p>In conclusion, the t-distribution is used to model the sampling distribution of the coefficients in linear regression and to perform hypothesis testing. It allows us to make inferences about the population parameters and assess the strength of the relationship between the predictor and the response.</p><h3 id="heading-confidence-intervals"><strong>Confidence Intervals</strong></h3><p>Confidence intervals can also be calculated for the coefficients. A confidence interval is a range of values that is likely to contain the population parameter with a certain level of confidence. For example, a 95% confidence interval means that if we repeated the sampling process many times, 95% of the intervals calculated would contain the true population parameter.</p><p>The confidence interval for a coefficient is calculated as:</p><p>$$\hat{\beta_1} \pm t_{critical} \times SE(\hat{\beta_1})$$</p><p>where \(t_{critical}\) is the critical value of the t-distribution at a certain level of confidence.</p><p>In summary, sampling distributions of coefficients in linear regression help us make inferences about the population parameters. Through hypothesis testing and confidence intervals, we can determine if the coefficients are statistically significant and estimate their values with a certain level of confidence.</p><h2 id="heading-comparing-regression-models"><strong>Comparing Regression Models</strong></h2><p>We now have the best possible Linear Regression equation parameters. The question is, how do we assess our model? Is it possible to use SSE to say that our model has this much SSE and thus is good? What criteria would you use to compare one regression model to another?</p><p>One of the major drawbacks of SSE is that the SSE will change if the units of the actual y and predicted y change.
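</p><p>This unit-dependence is easy to demonstrate numerically (a sketch on toy values):</p><pre><code class="lang-python">import numpy as np

y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 18.0, 33.0])

sse = np.sum((y_true - y_pred) ** 2)

# Express the same quantities in smaller units (e.g. grams instead of kilograms):
# the fit is unchanged, but the SSE blows up by a factor of 1000 ** 2
sse_scaled = np.sum((1000 * y_true - 1000 * y_pred) ** 2)

print(sse, sse_scaled)</code></pre><p>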
As a result, we introduce a relative term called \(R^2\), which creates consistency among the models. But, before we get any further into \(R^2\), let's take a step back and look at TSS.</p><h3 id="heading-total-sum-of-squares-tss"><strong>Total Sum of Squares (TSS)</strong></h3><p>The Total Sum of Squares is similar to SSE, but instead of summing the squared differences of the actual values from the predicted values, in the TSS we sum the squared differences from the mean of y.</p><p>$$TSS = \sum_{i=1}^{n}(y_i-\bar y)^2$$</p><p>TSS functions as the cost function for a model with no independent variables and only the intercept ( \(\bar y\) ). This indicates how good the model is in the absence of any independent variables.</p><p>SSE gives the model performance when an independent variable is added. The ratio \(\frac{SSE}{TSS}\) indicates how good the model is in comparison to simply predicting the mean. For a good model, the residual error against the actual values (SSE) is small, while the residual error against the mean (TSS) is larger.
Hence, the overall ratio is lower for a robust model.</p><p>Let's move to our case: we are going to model the relationship between Cost and Score using Ordinary Least Squares from the statsmodels library.</p><pre><code class="lang-python"><span class="hljs-comment"># statsmodels approach to regression</span>
<span class="hljs-comment"># fit the model</span>
lr = sm.OLS(y_train, x_train).fit()

<span class="hljs-comment"># Printing the parameters</span>
lr.params
lr.summary()

<span class="hljs-comment"># force intercept term</span>
x_train_with_intercept = sm.add_constant(x_train)
lr = sm.OLS(y_train, x_train_with_intercept).fit()
print(lr.summary())</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685278603308/54906975-b583-41c9-a108-78df60ca21fe.png" alt class="image--center mx-auto" /></p><h2 id="heading-model-summary"><strong>Model Summary</strong></h2><p>Now that we have successfully modeled, let's evaluate the results and summary of the model:</p><ul><li>\(R^2\): The \(R^2\), or the coefficient of determination, is the proportion of the variance in the dependent variable that is explained by the independent variable(s). \(R^2\) is expressed between 0 and 1 for the level of variance explained. As we learned in the previous section, the ratio \(\frac{SSE}{TSS}\) should be low for a robust model; this ratio signifies the error, or the variance left unexplained by the independent variable(s). Mathematically, \(R^2\), or the explained variance, can be expressed as:</li></ul><p>$$R^2 = 1 - \frac{SSE}{TSS}$$</p><ul><li><p>We got an \(R^2\) of <strong>0.93</strong> which is pretty good.</p></li><li><p>Adjusted \(R^2\): For linear models, adjusted \(R^2\) is a corrected goodness-of-fit statistic. It determines the proportion of variance in the target field explained by the input or inputs. \(R^2\) tends to overestimate the goodness-of-fit of the linear regression. It always grows as the number of independent variables in the model grows.
This happens because adding more independent variables always reduces the error term, even when the new variables carry little real information. Hence, the ratio \(\frac{SSE}{TSS}\) ends up lower than it should be, and \(R^2\) looks high even though the model might not be appropriate for production data. Adjusted \(R^2\) corrects for this overestimation. Considering N as the total sample size and p as the number of independent variables, adjusted \(R^2\) can be expressed as:</p></li></ul><p>$$\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(N - 1)}{N - p - 1}$$</p><ul><li><p>F-Statistic: The F-statistic is used to test the significance of the regression coefficients in a linear regression model, i.e., for hypothesis testing about whether the slope is meaningful. It is calculated as MSR/MSE, where MSR is the mean sum of squares due to regression and MSE is the mean sum of squares due to error. The null hypothesis is that the slope is 0, i.e., there is no relationship between the predictor and target variables. If the F-statistic is greater than the critical value, we can reject the null hypothesis and conclude that there's a significant relationship between the predictor variable(s) and the response variable.</p></li><li><p>Prob (F-Statistic): This is the p-value of the F-statistic, i.e., the probability of observing a result this extreme by random chance if the null hypothesis were true. Here it is very small, so it is highly unlikely that \(H_0 : \beta_1 = 0\) holds; in other words, it is highly likely that the slope is not zero and that the model is meaningful.</p></li></ul><h3 id="heading-did-you-know-iv"><strong>Did you know - IV</strong></h3><h3 id="heading-r2-can-take-negative-values">\(R^2\) <strong>can take negative values!</strong></h3><p>It compares the fit of the chosen model with that of a horizontal straight line (the null hypothesis). 
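</p><p>A minimal numeric sketch of that comparison (toy values; the "predictions" here are deliberately worse than just predicting the mean):</p>

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])         # toy observed values
bad_fit = np.array([9.0, 1.0, 12.0, 2.0])  # hypothetical predictions worse than the mean

sse = np.sum((y - bad_fit) ** 2)   # error around the (bad) predictions
tss = np.sum((y - y.mean()) ** 2)  # error around the mean-only baseline
r2 = 1 - sse / tss

print(r2)
```

<p>Here SSE (126.0) exceeds TSS (20.0), so \(1 - \frac{SSE}{TSS}\) comes out negative.</p><p>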
If the chosen model fits worse than a horizontal line, then \(R^2\) is negative, which essentially means the model makes no sense and is predicting randomly.</p><blockquote><h3 id="heading-think-about-it-iii"><strong>Think about it - III</strong></h3><p>What would happen if we did not include a constant while building the Linear Regression model?</p></blockquote><pre><code class="lang-python"><span class="hljs-comment"># Extract the B0, B1</span>
print(lr.params)
b0 = lr.params[<span class="hljs-number">0</span>]
b1 = lr.params[<span class="hljs-number">1</span>]
<span class="hljs-comment"># Plot the fitted line on training data</span>
figure(figsize=(<span class="hljs-number">8</span>, <span class="hljs-number">6</span>), dpi=<span class="hljs-number">80</span>)
plt.scatter(x_train, y_train)
plt.plot(x_train, b0 + b1*x_train, <span class="hljs-string">'r'</span>)
plt.xlabel(<span class="hljs-string">"Cost"</span>)
plt.ylabel(<span class="hljs-string">"Score"</span>)
plt.title(<span class="hljs-string">"Regression line through the Training Data"</span>)
plt.show()</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685279414878/bf1fc3ea-3b8c-4aae-b9ef-2ad400fd2d4f.png" alt class="image--center mx-auto" /></p><p>In this plot, we extract the values of the intercept \(\beta_0\) and the coefficient/slope \(\beta_1\) and draw the regression line over the scatter plot of Cost and Score for the training data.</p><p>The regression line fits well; it probably deviates a little after a cost of 125 or so. Let's see if we can improve it in the later sections when we diagnose and remedy, but first let's see how our model performs on the test data.</p><h2 id="heading-prediction-on-test-data"><strong>Prediction on Test Data</strong></h2><pre><code class="lang-python"><span class="hljs-comment"># Plot the fitted line on test data</span>
x_test_with_intercept = sm.add_constant(x_test)
y_test_fitted = lr.predict(x_test_with_intercept)
<span class="hljs-comment"># scatter plot on test data</span>
figure(figsize=(<span class="hljs-number">8</span>, <span class="hljs-number">6</span>), dpi=<span class="hljs-number">80</span>)
plt.scatter(x_test, y_test)
plt.plot(x_test, y_test_fitted, <span class="hljs-string">'r'</span>)
plt.xlabel(<span class="hljs-string">"Cost"</span>)
plt.ylabel(<span class="hljs-string">"Score"</span>)
plt.title(<span class="hljs-string">"Regression line through the Testing Data"</span>)
plt.show()</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685279561861/f204f7b7-0410-4ef2-8bee-fb764ab8d5aa.png" alt class="image--center mx-auto" /></p><p>Here we can see that the model has produced a good regression fit, as the line passes through the middle of the points to minimize the error.</p><p>Observe that all the data points in the test data lie within the range of the training data. This is called interpolation. What if we scored a data point with a cost of, say, 560? That is extrapolation, and the model probably won't be robust to it.</p><h2 id="heading-assumptions-of-linear-regression"><strong>Assumptions of Linear Regression</strong></h2><p>Linear regression is a parametric model, which means it requires the specification of some parameters before it can be used to make predictions. 
These parameters, or assumptions, are:</p><ul><li><p><strong>The relationship between the independent and dependent variables is linear</strong>: the line of best fit through the data points is a straight line.</p></li><li><p><strong>Homoscedasticity</strong>: the variance of the residuals is homogeneous (constant) across the values of the independent variable.</p></li><li><p><strong>Independence of observations:</strong> the observations in the dataset were collected using statistically valid sampling methods, and there are no hidden relationships among observations.</p></li><li><p><strong>Normality</strong>: the residuals (errors) follow a normal distribution.</p></li></ul><h2 id="heading-diagnostics-and-remedies"><strong>Diagnostics and Remedies</strong></h2><p>As we learned in the previous section, Linear Regression relies on some assumptions. This section evaluates whether the data satisfies those assumptions and whether Linear Regression is a good fit for the patterns in the data; in short, it covers the things we do to assess how well the model performs. 
The following are the things we look for in the data to diagnose Linear Regression as an unfit model:</p><ul><li><p><strong>Non-Linearity</strong>: The first thing to look for is non-linearity; for example, your data might look linear for a while and then turn non-linear, so that a parabola would fit better than a straight line.</p></li><li><p><strong>Heteroscedasticity</strong>: non-constant variance; the variance in one region may not be the same as in another region.</p></li><li><p><strong>Independence</strong>: the errors are not independent and identically distributed.</p></li><li><p><strong>Outliers</strong>: outliers can have a large impact on the model; for example, if there's a slow-growing regression line and there is an outlier up in the center, it will pull the regression line higher than most of the data.</p></li><li><p><strong>Missing Features</strong>: missing predictor variables. This is not a concern in simple linear regression, but in general it means losing out on variables that could be useful but are not included.</p></li></ul><p>How do we begin to assess all these things?</p><p><strong>Residual Analysis</strong></p><p>Residual analysis is used to study the residuals in the data and to understand what needs to be done to improve model performance. A residual is the error we get by subtracting the predicted value from the true value of the dependent variable.</p><p>$$R_i = y_i-\hat y_i$$</p><ul><li><p>First, plot the residuals versus the predictor; if the scatter plot shows a departure from linearity (e.g., a parabola), reevaluate the model or try a transformation to make the data linear.</p></li><li><p>This plot also shows indications of non-constant variance; if the data points scatter in the shape of a megaphone, we can claim the variance is not constant. We can use transformations to overcome heteroscedasticity, or we can use weighted least squares.</p></li><li><p>Another plot that can be used is a sequence plot, or residuals versus time order. 
We may want to search for a cyclical pattern or a linear trend, which indicates when linear regression would be useful and when it would not.</p></li><li><p>Box plot of residuals: if the box plot is nicely centered, we are fine; if it is biased to one side, we clearly lack some normality. We can also check this with a normal probability plot to see whether the distribution is skewed to the right or to the left.</p></li><li><p>Next, check for outliers; don't eliminate outliers unless you absolutely have to, such as when a data point is simply incorrect.</p></li></ul><pre><code class="lang-python"><span class="hljs-comment"># DIAGNOSTICS</span>
<span class="hljs-comment"># CHECKLIST:</span>
<span class="hljs-comment"># NON-LINEARITY</span>
<span class="hljs-comment"># NON-CONSTANT VARIANCE</span>
<span class="hljs-comment"># DEVIATIONS FROM NORMALITY</span>
<span class="hljs-comment"># ERRORS NOT IID</span>
<span class="hljs-comment"># OUTLIERS</span>
<span class="hljs-comment"># MISSING PREDICTORS</span>
<span class="hljs-comment"># Build predictions on training data</span>
predictions_y = lr.predict(x_train_with_intercept)
<span class="hljs-comment"># Find residuals</span>
r_i = (y_train - predictions_y)
<span class="hljs-comment"># Residuals vs. predictor in training data</span>
figure(figsize=(<span class="hljs-number">8</span>, <span class="hljs-number">6</span>), dpi=<span class="hljs-number">80</span>)
plt.title(<span class="hljs-string">'Residuals vs. Cost'</span>)
plt.xlabel(<span class="hljs-string">'Cost'</span>, fontsize=<span class="hljs-number">15</span>)
plt.scatter(x_train, r_i)
plt.show()</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685279815215/b2e9d966-4f14-449f-9847-1aaf25db0cba.png" alt class="image--center mx-auto" /></p><pre><code class="lang-python"><span class="hljs-comment"># Absolute residuals against predictor</span>
abs_r_i = np.abs(y_train - predictions_y)
figure(figsize=(<span class="hljs-number">8</span>, <span class="hljs-number">6</span>), dpi=<span class="hljs-number">80</span>)
plt.title(<span class="hljs-string">'Absolute Residuals vs. Cost'</span>)
plt.xlabel(<span class="hljs-string">'Cost'</span>, fontsize=<span class="hljs-number">15</span>)
plt.scatter(x_train, abs_r_i)
plt.show()</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685279894722/5c4886f9-e637-4aa6-9cd2-864a78a68c30.png" alt class="image--center mx-auto" /></p><pre><code class="lang-python"><span class="hljs-comment"># Normality plot</span>
figure(figsize=(<span class="hljs-number">8</span>, <span class="hljs-number">6</span>), dpi=<span class="hljs-number">80</span>)
scipy.stats.probplot(r_i, plot=plt)</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685279966102/fc31c122-86b8-47c3-b4fd-b714876ef3c7.png" alt class="image--center mx-auto" /></p><pre><code class="lang-python"><span class="hljs-comment"># Tails might be a little heavy, but overall no clear reason to reject normality expectations</span>
<span class="hljs-comment"># Evaluate normality through a histogram of residuals</span>
<span class="hljs-comment"># Plotting the histogram using the residual values</span>
fig = plt.figure()
figure(figsize=(<span class="hljs-number">8</span>, <span class="hljs-number">6</span>), dpi=<span class="hljs-number">80</span>)
sns.distplot(r_i, bins=<span class="hljs-number">15</span>)
plt.title(<span class="hljs-string">'Error Terms'</span>, fontsize=<span class="hljs-number">15</span>)
plt.xlabel(<span class="hljs-string">'y_train - y_train_pred'</span>, fontsize=<span class="hljs-number">15</span>)
plt.show()</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685280042922/033cc19f-70e1-479b-835f-60b98fb3d015.png" alt class="image--center mx-auto" /></p><pre><code class="lang-python"><span class="hljs-comment"># Boxplot for outliers</span>
figure(figsize=(<span class="hljs-number">8</span>, <span class="hljs-number">6</span>), dpi=<span class="hljs-number">80</span>)
plt.boxplot(r_i, boxprops=dict(color=<span class="hljs-string">'red'</span>))
plt.title(<span class="hljs-string">'Residual Boxplot'</span>);</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685280096007/c9608650-70de-4b5b-bfcb-3defc9c2b995.png" alt class="image--center mx-auto" /></p><p>At the beginning of the Diagnostics and Remedies section, we saw the steps we can take to understand the patterns in the data and decide whether we need transformations to make the assumptions of linear regression hold true. Here are the observations from the plots above:</p><ul><li><p>The residuals vs. cost plot shows a good scatter of residuals, and no pattern is observed up until a cost of 125 or 150. We can say we have some heteroscedasticity at the higher costs; we'll see how to tackle it.</p></li><li><p>The normality of the errors can be seen in the normal probability plot and the histogram. The distribution is more or less normal, or bell-shaped.</p></li><li><p>The residual boxplot shows no obvious outliers.</p></li></ul><h3 id="heading-transformations-to-avoid-non-constant-variance"><strong>Transformations to avoid non-constant variance</strong></h3><p>Non-constant variance can be a problem in linear regression, as the assumption of constant variance of the errors is a key requirement for the ordinary least squares (OLS) method to be unbiased and efficient. 
When this assumption is violated, the regression coefficients can be inefficient and/or the predictions can be biased. To avoid non-constant variance, several data transformations can be applied:</p><ul><li><p>Log transformation: often used when the variance of the data increases with the mean. A log transformation can stabilize the variance by converting the data into logarithmic values.</p></li><li><p>Square root transformation: also used to stabilize the variance, by converting the data into square-root values.</p></li><li><p>Box-Cox transformation: a statistical transformation that stabilizes the variance by transforming the data into values that are closer to a normal distribution. The Box-Cox transformation is more flexible and powerful than the log and square root transformations.</p></li><li><p>Yeo-Johnson transformation: a newer and more flexible version of the Box-Cox transformation that can handle both positive and negative data values.</p></li></ul><pre><code class="lang-python"><span class="hljs-comment"># Demo of how to deal with non-constant variance through transformations</span>
test_residuals = (y_test - y_test_fitted)
len(y_test)
len(y_test_fitted)
len(test_residuals)
<span class="hljs-comment"># Residuals vs. predictor in test set</span>
figure(figsize=(<span class="hljs-number">8</span>, <span class="hljs-number">6</span>), dpi=<span class="hljs-number">80</span>)
plt.title(<span class="hljs-string">'Test Residuals vs. Cost'</span>)
plt.xlabel(<span class="hljs-string">'Cost'</span>, fontsize=<span class="hljs-number">15</span>)
plt.ylabel(<span class="hljs-string">'Residuals'</span>, fontsize=<span class="hljs-number">15</span>)
plt.scatter(x_test, test_residuals)
plt.show()
<span class="hljs-comment"># Some evidence of non-constant variance</span></code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685280221649/a330203b-1f4d-4204-b1ac-57a204d9f9f6.png" alt class="image--center mx-auto" /></p><p>We can see that the scatter of the data points increases as the cost increases. This is evidence of heteroscedasticity.</p><p>We'll try different transformations, such as square root, log, and Box-Cox, to see if we can introduce linearity.</p><pre><code class="lang-python"><span class="hljs-comment"># Try sqrt</span>
sqrt_y = np.sqrt(y)
figure(figsize=(<span class="hljs-number">8</span>, <span class="hljs-number">6</span>), dpi=<span class="hljs-number">80</span>)
plt.scatter(x, sqrt_y, color=<span class="hljs-string">'red'</span>);
<span class="hljs-comment"># Try ln</span>
ln_y = np.log(y)
plt.scatter(x, ln_y, color=<span class="hljs-string">'blue'</span>);
<span class="hljs-comment"># Let's try a BC transformation</span>
<span class="hljs-comment"># Box Cox procedure on all cost</span>
bc_y = list(stats.boxcox(y))
bc_y = bc_y[<span class="hljs-number">0</span>]
plt.scatter(x, bc_y, color=<span class="hljs-string">'orange'</span>);
<span class="hljs-comment"># Overall, most satisfied with the sqrt transformation</span></code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685280280936/e23140e8-639b-4566-ab48-1983a8d88298.png" alt class="image--center mx-auto" /></p><p>We can observe that the square root transformation, denoted by the red dots, gives the most linear scatter of data points. 
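</p><p>The visual comparison can also be complemented numerically, for example by checking how much each transformation reduces skewness. A small sketch on synthetic right-skewed data (not the article's dataset):</p>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.gamma(shape=2.0, scale=10.0, size=500)  # positive, right-skewed toy data

sqrt_y = np.sqrt(y)
ln_y = np.log(y)
bc_y, lmbda = stats.boxcox(y)  # Box-Cox also estimates its lambda parameter

for name, v in [("raw", y), ("sqrt", sqrt_y), ("log", ln_y), ("box-cox", bc_y)]:
    print(name, float(stats.skew(v)))
```

<p>On data like this, the transformations pull the skewness toward 0, with Box-Cox (which tunes its exponent to the data) typically getting closest.</p><p>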
Let's try running the linear regression model on the transformed variable and analyze the change in results.</p><pre><code class="lang-python"><span class="hljs-comment"># Run regression on this set</span>
x_train, x_test, y_train, y_test = train_test_split(x, sqrt_y, train_size=<span class="hljs-number">0.75</span>, test_size=<span class="hljs-number">0.25</span>, random_state=<span class="hljs-number">100</span>)
<span class="hljs-comment"># force an intercept term</span>
x_train_with_intercept = sm.add_constant(x_train)
lr = sm.OLS(y_train, x_train_with_intercept).fit()
print(lr.summary())</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685280352331/9b77ca6c-1e5a-4389-ad27-945f3893def8.png" alt class="image--center mx-auto" /></p><p>We can see the change in \(R^2\) and adjusted \(R^2\) after the transformation. They are almost equal, which suggests that \(R^2\) is no longer overestimating the variance explained by the predictor variable.</p><pre><code class="lang-python"><span class="hljs-comment"># Extract the B0, B1</span>
print(lr.params)
b0 = lr.params[<span class="hljs-number">0</span>]
b1 = lr.params[<span class="hljs-number">1</span>]
<span class="hljs-comment"># Plot the fitted line on training data</span>
figure(figsize=(<span class="hljs-number">8</span>, <span class="hljs-number">6</span>), dpi=<span class="hljs-number">80</span>)
plt.scatter(x_train, y_train)
plt.plot(x_train, b0 + b1*x_train, <span class="hljs-string">'r'</span>)
plt.xlabel(<span class="hljs-string">"Cost"</span>)
plt.ylabel(<span class="hljs-string">"Score"</span>)
plt.title(<span class="hljs-string">"Regression line through the Training Data"</span>)
plt.show()</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685280504708/d54e1dac-e68c-4a1a-b384-4ba12750d949.png" alt class="image--center mx-auto" /></p><p>We extracted the linear regression coefficients and plotted the regression line on the Cost vs. Score scatter plot.</p><pre><code class="lang-python"><span class="hljs-comment"># Plot the fitted line on test data</span>
x_test_with_intercept = sm.add_constant(x_test)
y_test_fitted = lr.predict(x_test_with_intercept)
figure(figsize=(<span class="hljs-number">8</span>, <span class="hljs-number">6</span>), dpi=<span class="hljs-number">80</span>)
plt.scatter(x_test, y_test)
plt.plot(x_test, y_test_fitted, <span class="hljs-string">'r'</span>)
plt.show()</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685280573937/b5e7116d-ed82-4a6e-a9e9-ce5acbba02f9.png" alt class="image--center mx-auto" /></p><pre><code class="lang-python"><span class="hljs-comment"># Evaluate variance</span>
<span class="hljs-comment"># Diagnostics</span>
test_residuals = (y_test - y_test_fitted)
len(y_test)
len(y_test_fitted)
len(test_residuals)
<span class="hljs-comment"># Residuals vs. predictor</span>
figure(figsize=(<span class="hljs-number">8</span>, <span class="hljs-number">6</span>), dpi=<span class="hljs-number">80</span>)
plt.title(<span class="hljs-string">'Residuals vs. Cost'</span>)
plt.xlabel(<span class="hljs-string">'Cost'</span>, fontsize=<span class="hljs-number">15</span>)
plt.scatter(x_test, test_residuals)
plt.show()</code></pre><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1685280731077/bac23250-7323-4dec-a29f-e40c91c6be9e.png" alt class="image--center mx-auto" /></p><h2 id="heading-conclusion"><strong>Conclusion</strong></h2><p>This project is coming to a close, but let's go through what we learned and what we can do next. 
With a Simple Linear Regression problem statement, we learned the fundamentals of Linear Regression by predicting the scores of soccer players based on their cost.</p><p>We grasped the assumptions for Linear Regression and how to diagnose and correct data errors related to the assumptions.</p><p>The following phase in the learning process should be to introduce multiple linear regression and the issues that come with developing a multiple linear regression model, such as multicollinearity.</p><p>In the next section, we will explore some real-world applications of Linear Regression to give you a taste of this fascinating concept.</p><h2 id="heading-linear-regression-in-real-life"><strong>Linear Regression in Real Life</strong></h2><p>There are many real-world applications of linear regression. Exploring some of the best Linear Regression real-world applications will help us comprehend the concept more clearly.</p><ol><li><p>Humans are not an exception; everything has a shelf life. We can store vast amounts of information about a person's medical history and estimate how long they will live thanks to ongoing improvements in medical science technology and diagnostic tools. The term "life expectancy" describes the number of years one can expect to live. This application is frequently used by insurance companies and public healthcare organizations to increase their productivity and achieve organizational goals.</p></li><li><p>A common method used by agricultural scientists to assess how fertilizer and water affect crop yields is linear regression. For instance, researchers may vary the water and fertilizer applications in various fields to observe the effects on crop yield. A multiple linear regression model can be used with crop yield as the target variable and fertilizer and water as the predictor variables.</p></li><li><p>For professional sports teams, analysts use linear regression to gauge the impact of various training schedules on player performance. 
For instance, data scientists in the NBA may examine how various frequencies of yoga and weightlifting sessions each week affect a player's point total. With yoga and weightlifting sessions as the predictor variables and total points earned as the response variable, they could fit a multiple linear regression model.</p></li></ol>]]>https://cdn.hashnode.com/res/hashnode/image/upload/v1686636182879/cf15c201-910e-4d9c-9e9a-89ed69c0def2.png<![CDATA[Let me explain - How to connect EC2 instance to windows server?]]>https://blog.hemath.com/let-me-explain-how-to-connect-ec2-instance-to-windows-serverhttps://blog.hemath.com/let-me-explain-how-to-connect-ec2-instance-to-windows-serverTue, 22 Nov 2022 14:12:05 GMT<![CDATA[<h2 id="heading-what-is-ec2-instance">What is an EC2 instance?</h2><p>A virtual server in Amazon's Elastic Compute Cloud (EC2) for running applications on the Amazon Web Services (AWS) infrastructure is known as an Amazon EC2 instance. AWS is a comprehensive and ever-evolving cloud computing platform; EC2 is a service that allows business subscribers to run application programs in a computing environment. It can provide a virtually limitless number of virtual machines (VMs).</p><p>To meet the needs of users, Amazon offers a variety of instances with varying configurations of CPU, memory, storage, and networking resources. Each type is available in a variety of sizes to meet the needs of different workloads.</p><p>Instances are created from Amazon Machine Images (AMIs). The machine images function as templates. They are pre-installed with an operating system (OS) and other software that determines the user's operating environment. Users can choose an AMI from AWS, the user community, or the AWS Marketplace. Users can also create and share their own AMIs.</p><h2 id="heading-how-does-amazon-ec2-work">How does Amazon EC2 Work?</h2><p>It is not difficult to get started with Amazon EC2. 
You have a choice of pre-configured, templated Amazon Machine Images (AMIs) for a quick launch. If it's more convenient for you, you can create your own AMI with all of your libraries, data, programs, and relevant configuration settings. Amazon EC2 allows you to customize settings by managing security and network access. Because you can rapidly expand your VM environment to meet utilization spikes or dips, you have control over how many resources are being used at any given time. The service's elasticity facilitates the lower costs of a "pay-what-you-use" payment method.</p><h2 id="heading-whats-an-ami">What's An AMI?</h2><p>AMI is an abbreviation for Amazon Machine Image.</p><p>If you've worked with older computers or even physical servers, you're probably aware of the need to update or install the same stack on more than three devices. This is accomplished by using tools such as Norton Ghost or Acronis True Image to create disk snapshots, which let you capture the current state of a disk and load it onto other devices in less time than doing it manually.</p><p>An AMI works on the same principle, but it is made up of multiple snapshots. This means that if the instance has more than one disk, the AMI will include the additional disks.</p><p>There are also AMIs on the marketplace with preinstalled software, such as LAMP and LEMP stacks or optimized and hardened Redis, to help you launch your new instances. But be careful, my friend, because some of those marketplace AMIs are not free: they charge you a fee per hour to use them (plus the cost per hour of the instance). 
So, if you want to select an AMI from the AWS Marketplace, look at the pricing.</p><h2 id="heading-steps-to-connect-amazon-windows-ec2-instance">Steps To Connect Amazon Windows EC2 Instance</h2><ul><li>Step 1</li></ul><p>First, select the Windows instance from the EC2 dashboard's Running Instances section and click Connect</p><ul><li>Step 2</li></ul><p>Here, we must select the RDP (Remote desktop protocol) Client, then Download the RDP File and save it somewhere safe. We will also need a password to access the RDP file, so click Get Password.</p><ul><li>Step 3</li></ul><p>At this stage of the launch, we must upload the Key-pair (the key which we have created in the earlier step). Click Browse, then select the key, and finally click Decrypt Password. This gives us a workable password.</p><ul><li>Step 4</li></ul><p>After submitting the Key-Pair, the Password will be generated; copy and save it somewhere safe.</p><ul><li>Step 5</li></ul><p>Now, open the Remote Desktop File from your downloads to start the Windows instance. If your local computer is a Mac, you must first download "Microsoft Remote Desktop" from the App Store before you can open your RDP file.</p><ul><li>Step 6</li></ul><p>After you've opened the RDP file, click Connect to launch the Windows instance.</p><ul><li>Step 7</li></ul><p>Here we must provide the credentials for accessing the instance, so enter the password copied in step 4 and click OK.</p><ul><li>Step 8</li></ul><p>Click on Yes.</p><ul><li>Step 9</li></ul><p>So now that we have successfully connected to an Amazon Windows Instance, we can perform all of the operations and tasks that we would normally perform on the Windows operating system.</p>]]>https://cdn.hashnode.com/res/hashnode/image/upload/v1686838201419/5fe4eccf-accf-4ebc-a746-a2907d947adc.png<![CDATA[Get to know - What is IAM user and its features]]>https://blog.hemath.com/get-to-know-what-is-iam-user-and-its-featureshttps://blog.hemath.com/get-to-know-what-is-iam-user-and-its-featuresSat, 19 Nov 2022 13:43:03 GMT<![CDATA[<h2 id="heading-what-is-iam-user">What is IAM user?</h2><p>AWS IAM is at the heart of AWS security because it allows you to control access by creating users and groups, assigning specific permissions and policies to specific users, managing Root Access Keys, configuring MFA Multi-Factor authentication for added security, and much more. 
And, to top it all off, IAM is completely free to use!</p><h2 id="heading-aws-identity-and-access-management">AWS Identity And Access Management</h2><p>IAM is a preventative security measure.</p><p>It gives you the ability to create and manage AWS users and groups, and to use permissions to grant and deny access to AWS resources.</p><p>IAM is concerned with four concepts: users, groups, roles, and policies.</p><p>It provides centralized, fine-grained control over API access to resources, along with a management console.</p><p>You can control which operations a user or role can perform on AWS resources by specifying permissions.</p><p>The IAM service controls access to the AWS Management Console, the AWS API, and the AWS Command-Line Interface (CLI).</p><h2 id="heading-aws-iam-key-features">AWS IAM Key Features</h2><p>We should consider IAM the first step toward securing all of your AWS services and resources.</p><h3 id="heading-confirmation">Authentication</h3><p>AWS IAM enables you to create and manage identities such as users, groups, and roles, allowing you to set up authentication for resources, people, services, and applications within your AWS account.</p><h3 id="heading-approval">Authorization</h3><p>In IAM, access management, or authorization, is made up of two critical components: policies and permissions.</p><h3 id="heading-fine-grained-consents">Fine-grained permissions</h3><p>Consider this: you need to give the business team in your organization access to billing data, the engineering team full access to the EC2 service, and the marketing team access to specific S3 buckets. You can define and tune these permissions using IAM to meet each team's needs.</p><h3 id="heading-common-admittance-to-aws-accounts">Shared access to AWS accounts</h3><p>Most organizations have multiple AWS accounts and must occasionally delegate access between them.
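</p><p>Cross-account delegation is normally implemented with an IAM role whose trust policy names the other account; a minimal sketch of such a trust-policy document (the account ID is hypothetical):</p>

```python
import json

# Hypothetical ID of the account whose principals may assume the role.
TRUSTED_ACCOUNT = "111122223333"

# Trust policy: principals in the trusted account can call sts:AssumeRole
# to obtain temporary credentials for this role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{TRUSTED_ACCOUNT}:root"},
            "Action": "sts:AssumeRole",
        }
    ],
}

print(json.dumps(trust_policy, indent=2))
```

<p>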
IAM allows you to do this without sharing your credentials, and AWS recently released Control Tower to further streamline multi-account setups.</p><h3 id="heading-aws-organizations">AWS Organizations</h3><p>You can use AWS Organizations to organize accounts into groups and assign permission boundaries for fine-grained control over multiple AWS accounts.</p><h3 id="heading-personality-federation">Identity Federation</h3><p>In many cases, your organization needs to federate access from other identity providers, such as Okta, G Suite, or Active Directory. Identity Federation, a component of IAM, allows you to do this.</p><h2 id="heading-iam-users">IAM users</h2><p>IAM users can be individuals, systems, or applications that require access to AWS services.</p><p>A user account is made up of a unique name and security credentials such as a password, access key, and/or multi-factor authentication (MFA).</p><p>IAM users only need passwords when they access the AWS Management Console.</p><h2 id="heading-iam-policies">IAM groups</h2><p>IAM Groups are a way to assign permissions to your organization's logical and functional units.</p><p>IAM Groups help with operational efficiency, bulk permissions management (scalable), and easy permission changes as individuals change teams (portable).</p><p>A group can have many users, and a user can be a member of multiple groups.</p><p>Groups cannot be nested; they can only contain users, not other groups.</p><h2 id="heading-iam-roles">IAM Roles</h2><p>An IAM role, like a user, is an AWS identity with permission policies governing what the identity can and cannot do in AWS.</p><p>For specific access to services, you can authorize roles to be assumed by humans, Amazon EC2 instances, custom code, or other AWS services.</p><p>Roles do not have standard long-term credentials associated with them, such as a password or access keys; instead, when you assume a role, you receive temporary security credentials for your role
session.</p><h2 id="heading-aws-iam-access-analyzer">AWS IAM Access Analyzer</h2><p>Do yourself a favor and start using IAM Access Analyzer for organizational security if you have two or more AWS accounts. Access Analyzer displays all AWS resources that are accessible from outside your AWS organization.</p><p>IAM Access Analyzer continuously monitors resource policies for changes, removing the need for infrequent manual checks to catch issues as policies are added or updated.</p><p>Using Access Analyzer, you can generate a comprehensive report of every AWS resource that is publicly accessible.</p><p>Access Analyzer is part of Amazon's provable security initiative, which pursues the highest levels of security using automated reasoning technology.</p>]]>https://cdn.hashnode.com/res/hashnode/image/upload/v1686836550448/be16be5f-7a15-44e1-867a-c3f7bbf7555e.png<![CDATA[Let me explain - Auto scaling and its components]]>https://blog.hemath.com/let-me-explain-auto-scaling-and-its-componentshttps://blog.hemath.com/let-me-explain-auto-scaling-and-its-componentsWed, 16 Nov 2022 13:22:25 GMT<![CDATA[<p>AWS AutoScaling is an advanced cloud computing feature that provides automatic resource management based on server load. A server cluster's resources typically scale up and down dynamically via mechanisms such as a load balancer, AutoScaling groups, Amazon Machine Images (AMIs), EC2 instances, and snapshots. The AWS AutoScaling feature assists businesses in managing their peak-time load. Furthermore, it optimizes performance and cost based on on-demand needs. AWS allows you to configure a threshold for CPU utilization or any other resource utilization metric; once that threshold is reached, AWS automatically provisions additional resources to scale up.
Similarly, if the load falls below the threshold, it automatically scales down to the default configuration level.</p><h2 id="heading-how-does-autoscaling-work-in-aws">How does Autoscaling work in AWS?</h2><p>Multiple entities are involved in the autoscaling process in AWS; the load balancer and the AMI are the two main components. To begin, you must create an AMI of your current server: in simpler terms, an image of your current configuration that contains all system settings as well as the current website. This is possible in AWS's AMI section. If we follow our above scenario and configure autoscaling, your system will be ready for future traffic.</p><p>When traffic begins to increase, the AWS autoscaling service will automatically launch another instance with the same configuration as your current server using your server's AMI.</p><p>The next step is to divide or route our traffic equally among the newly launched instances; the load balancer in AWS will handle this. A load balancer divides traffic based on the load on a specific system; it uses internal processes to determine where to route traffic.</p><p>A new instance is created solely based on a set of rules defined by the user configuring autoscaling. The rules can be as simple as CPU utilization; for example, you can configure autoscaling to launch a new instance when your CPU utilization reaches 70-80%.
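</p><p>In Amazon EC2 Auto Scaling, a rule like "add capacity around 70% CPU" is most simply expressed as a target-tracking policy. A sketch of the request body for the put_scaling_policy API (the group and policy names are hypothetical):</p>

```python
import json

# Target-tracking scaling policy: Auto Scaling adds or removes instances to
# keep the group's average CPU utilization near the target value.
policy_request = {
    "AutoScalingGroupName": "web-asg",   # hypothetical group name
    "PolicyName": "cpu-target-70",       # hypothetical policy name
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingConfiguration": {
        "PredefinedMetricSpecification": {
            # Average CPU utilization across the group's instances.
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 70.0,
    },
}

print(json.dumps(policy_request, indent=2))
```

<p>With boto3, this dict would be passed as keyword arguments to the Auto Scaling client's put_scaling_policy call.</p><p>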
Of course, there can be rules for scaling down.</p><h2 id="heading-autoscaling-components-in-aws">Autoscaling Components in AWS</h2><p>There are numerous components involved in the autoscaling process, some of which we have already mentioned, such as AMIs and load balancers, as well as others.</p><p>Components involved in Autoscaling:</p><ul><li><p>AMI (Amazon Machine Image)</p></li><li><p>Load Balancer</p></li><li><p>Snapshot</p></li><li><p>EC2 Instance</p></li><li><p>Autoscaling groups</p></li></ul><p>There may be additional components, but most of the components that can be scaled are included in Autoscaling.</p><h2 id="heading-1-ami">1. AMI</h2><p>An AMI is a downloadable executable image of your EC2 instance that you can use to launch new instances. To scale your resources, your new server must have all of your websites configured and ready to go. In AWS, you can accomplish this through AMIs, which are identical executable images of a system that you can use to create new instances, and AWS uses the same images during autoscaling to launch new instances.</p><h2 id="heading-2-load-balancer">2. Load Balancer</h2><p>Creating an instance is only one part of autoscaling; you must also divide your traffic among the new instances, which is handled by the Load Balancer. A load balancer can automatically identify traffic across the systems to which it is connected and redirect requests either based on rules or, in the traditional manner, to the instance with the least load. Load balancing is the process of distributing traffic among instances. Load balancers are used to improve an application's reliability and efficiency in handling concurrent users.</p><p>A load balancer is extremely important in autoscaling. Load balancers are typically classified into two types:</p><ul><li><p><strong>Classic Load Balancer</strong></p><p> A traditional load balancer takes a very simple approach: it simply distributes traffic evenly among all instances.
It's very simple, and nobody uses a traditional load balancer anymore. It could be a good choice for a simple static HTML page website, but in today's scenarios, there are hybrid apps or multi-component, high-computation applications that have numerous components dedicated to a specific task.</p></li><li><p><strong>Application Load Balancer</strong></p><p> The most common type of load balancer, in which traffic is redirected based on simple or complex rules built on "path", "host", or other user-defined conditions.</p></li></ul><p>Consider the following scenario: a document processing application.</p><p>Assume you have a monolithic or microservice architecture application, where the path "/document" is specific to a document processing service, and another path "/reports" simply shows the reports of the documents that have been processed and statistics about processed data. We can have an autoscaling group for one server that handles document processing and another that only displays reports.</p><p>In the application load balancer, you can configure a rule based on a path that redirects to the autoscale group for server 1 if the path matches "/document," or to the autoscale group for server 2 if the path matches "/reports." Internally, one group can have multiple instances, and the load will be distributed equally among the instances in the classical form.</p><h2 id="heading-3-snapshot">3. Snapshot</h2><p>A snapshot is a copy of the data on your instance's storage, essentially an image of your disk. The primary distinction between a snapshot and an AMI is that an AMI is an executable image that can be used to create a new instance, whereas a snapshot is simply a copy of the data in your instance. Snapshots of an EC2 instance are incremental: each new snapshot copies only the blocks that have changed since the previous one.</p><h2 id="heading-4-ec2-elastic-compute-cloud-instance">4.
EC2 (Elastic Compute Cloud) Instance</h2><p>An EC2 instance is a virtual server in Amazon's Elastic Compute Cloud, used to deploy your applications on Amazon Web Services (AWS) infrastructure. The EC2 service allows you to connect to a virtual server via SSH with an authenticated key and install the various components of your application alongside your application.</p><h2 id="heading-5-autoscaling-group">5. Autoscaling group</h2><p>It is a collection of EC2 instances that serves as the foundation of Amazon EC2 AutoScaling. When you create an AutoScaling group, you must specify the subnets and the number of instances you want to start with.</p>]]>https://cdn.hashnode.com/res/hashnode/image/upload/v1686835315159/46427ca5-0be0-436c-9319-35e093895d0c.png<![CDATA[This versus That - Microsoft Azure Machine Learning vs AWS Sagemaker]]>https://blog.hemath.com/this-versus-that-microsoft-azure-machine-learning-vs-aws-sagemakerhttps://blog.hemath.com/this-versus-that-microsoft-azure-machine-learning-vs-aws-sagemakerSun, 13 Nov 2022 12:40:34 GMT<![CDATA[<h2 id="heading-what-is-microsoft-azure-machine-learning"><strong>What is Microsoft Azure Machine Learning?</strong></h2><p>Use an enterprise-grade service for the end-to-end machine learning lifecycle. Source: <a target="_blank" href="https://azure.microsoft.com/en-us/products/machine-learning/#product-overview">https://azure.microsoft.com/en-us/products/machine-learning/#product-overview</a></p><h2 id="heading-what-is-aws-sagemaker"><strong>What is AWS Sagemaker?</strong></h2><p>Build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows.
(Source: <a target="_blank" href="https://aws.amazon.com/sagemaker/">https://aws.amazon.com/sagemaker/</a>)</p><h2 id="heading-microsoft-azure-machine-learning-vs-aws-sagemaker-comparison"><strong>Microsoft Azure Machine Learning vs AWS Sagemaker Comparison</strong></h2><div class="hn-table"><table><thead><tr><td><strong>Characteristic</strong></td><td><strong>Microsoft Azure Machine Learning</strong></td><td><strong>AWS Sagemaker</strong></td></tr></thead><tbody><tr><td><strong>Use Cases</strong></td><td>Predictive maintenance and asset management; Customer churn analysis and retention; Sentiment analysis and social media monitoring; Healthcare diagnosis and treatment prediction; Fraud detection and credit scoring</td><td>Fraud detection; Image and video recognition; Natural language processing; Predictive maintenance; Financial forecasting; Health monitoring and diagnostics</td></tr><tr><td><strong>When not to use</strong></td><td>For simple data analysis or reporting tasks; For small-scale projects with limited data</td><td>When you have limited data processing requirements; When you have limited machine learning requirements; When you have strict budget constraints</td></tr><tr><td><strong>Type of data processing</strong></td><td>Azure Machine Learning supports various types of data processing, including preprocessing, feature engineering, and training data. The tool provides a range of data preprocessing and feature engineering techniques, including data cleaning, scaling, and normalization.</td><td>AWS Sagemaker supports a wide range of data processing capabilities, including data ingestion, data transformation, and feature engineering.
It also supports real-time and batch processing of large-scale datasets.</td></tr><tr><td><strong>Data ingestion</strong></td><td>Azure Machine Learning supports data ingestion from various sources, including Azure Blob Storage, Azure Data Lake Storage, Azure SQL Database, and other public and private data sources.</td><td>AWS Sagemaker supports data ingestion from various sources, such as Amazon S3, Amazon Redshift, and Amazon Aurora. It also supports data streaming from Amazon Kinesis and Apache Kafka.</td></tr><tr><td><strong>Data transformation</strong></td><td>Azure Machine Learning provides various data transformation capabilities, including data cleaning, normalization, and feature engineering. The tool also supports a range of feature extraction techniques such as Principal Component Analysis (PCA) and Non-negative Matrix Factorization (NMF).</td><td>AWS Sagemaker provides built-in data transformation tools such as AWS Glue, Apache Spark, and AWS Lambda to transform raw data into machine learning-ready datasets.</td></tr><tr><td><strong>Machine learning support</strong></td><td>Azure Machine Learning supports a wide range of machine learning algorithms, including supervised and unsupervised learning algorithms such as regression, classification, and clustering. It also supports deep learning models such as neural networks and reinforcement learning.</td><td>AWS Sagemaker supports a wide range of machine learning algorithms and frameworks, including TensorFlow, PyTorch, and Apache MXNet.
It also provides built-in algorithms for popular use cases, such as image classification and anomaly detection.</td></tr><tr><td><strong>Query language</strong></td><td>Azure Machine Learning provides a range of APIs and SDKs that support popular programming languages such as Python and R.</td><td>AWS Sagemaker supports multiple query languages, including SQL, Apache Spark, and Presto, to query and analyze data.</td></tr><tr><td><strong>Deployment model</strong></td><td>Azure Machine Learning provides deployment options for both cloud and on-premises environments. It also supports containerization for deploying machine learning models in production.</td><td>AWS Sagemaker provides flexible deployment options, including hosting models on Amazon Elastic Compute Cloud (EC2) instances or as serverless functions using AWS Lambda.</td></tr><tr><td><strong>Integration with other services</strong></td><td>Azure Machine Learning integrates with various other Azure services, including Azure Data Factory, Azure Databricks, and Azure DevOps. It also integrates with third-party tools such as Jupyter Notebook and Visual Studio.</td><td>AWS Sagemaker integrates with a wide range of AWS services, such as Amazon S3, Amazon Redshift, and Amazon Kinesis. It also integrates with popular third-party services, such as Apache Spark and TensorFlow.</td></tr><tr><td><strong>Security</strong></td><td>Azure Machine Learning supports various security features such as role-based access control, data encryption, and network isolation.</td><td>AWS Sagemaker provides strong security measures, including encryption, identity and access management, and compliance with industry standards such as HIPAA and GDPR.</td></tr><tr><td><strong>Pricing model</strong></td><td>Azure Machine Learning offers a pay-as-you-go pricing model based on usage. 
It also provides pre-configured virtual machine images with machine learning tools for a fixed monthly fee.</td><td>AWS Sagemaker offers a pay-as-you-go pricing model, with charges based on the usage of various services and resources.</td></tr><tr><td><strong>Scalability</strong></td><td>Azure Machine Learning can scale to handle large datasets and high-performance computing requirements.</td><td>AWS Sagemaker can scale up or down depending on the size of the data and the complexity of the model. It can also handle large-scale distributed training and inference.</td></tr><tr><td><strong>Performance</strong></td><td>Azure Machine Learning provides high-performance computing capabilities for building and training machine learning models.</td><td>AWS Sagemaker provides high-performance computing capabilities to handle complex machine-learning tasks.</td></tr><tr><td><strong>Availability</strong></td><td>Azure Machine Learning provides high availability and reliability with built-in redundancy and failover mechanisms.</td><td>AWS Sagemaker offers high availability and reliability, with built-in fault tolerance and automatic scaling.</td></tr><tr><td><strong>Reliability</strong></td><td>Azure Machine Learning is a reliable tool that is backed by Microsoft's strong commitment to providing reliable cloud-based services.</td><td>Reliability is one of the key features of AWS SageMaker. 
AWS ensures high availability, durability, and fault tolerance of the SageMaker service by using various techniques and technologies.</td></tr><tr><td><strong>Monitoring and management</strong></td><td>Azure Machine Learning provides various monitoring and management capabilities, including model versioning, model deployment, and model performance monitoring.</td><td>AWS Sagemaker provides tools for monitoring and managing machine learning models, such as Amazon CloudWatch and AWS Step Functions.</td></tr><tr><td><strong>Developer tools & integration</strong></td><td>Azure Machine Learning provides various developer tools and integrations, including Azure Machine Learning Studio, Azure Machine Learning Workbench, and Azure Machine Learning CLI.</td><td>AWS Sagemaker provides a wide range of developer tools and integrations, such as Jupyter notebooks, AWS SDKs, and popular IDEs such as PyCharm and Visual Studio Code.</td></tr><tr><td><strong>Analyzing GitHub Metrics</strong></td><td>As of 2022, Azure Machine Learning has over 3.6k stars, 2.3k forks, 45 contributors, and 1259 commits.</td><td>As of 2022, AWS Sagemaker has over 7.8k stars, 5.8k forks, 439 contributors, and 2565 commits.</td></tr><tr><td><strong>Decoding Pricing</strong></td><td>Azure Machine Learning offers a pay-as-you-go service.</td><td>You only pay for what you use with Amazon SageMaker. ML model development, training, and deployment are billed by the second, with no minimum costs or up-front requirements. The cost of Amazon SageMaker is broken into costs for hosting instances, on-demand ML instances, and ML storage.</td></tr></tbody></table></div><h2 id="heading-faqs"><strong>FAQs</strong></h2><h3 id="heading-1-is-azure-good-for-machine-learning"><strong>1. Is Azure good for machine learning?</strong></h3><p>Yes, Azure Machine Learning is one of the best tools for predictive analytics.</p><h3 id="heading-2-why-use-azure-machine-learning"><strong>2.
Why use Azure machine learning?</strong></h3><p>Azure Machine Learning is user-friendly and provides a range of flexible tools. It also offers a wide range of datasets and algorithms for making more accurate predictions. The tool simplifies the process of importing training data and fine-tuning the outcomes.</p><h3 id="heading-3-what-is-amazon-sagemaker-used-for"><strong>3. What is Amazon SageMaker used for?</strong></h3><p>Amazon SageMaker is a fully managed service containing modules that can be used independently or together to build, manage, and deploy your ML models.</p><h3 id="heading-4-is-sagemaker-part-of-aws"><strong>4. Is SageMaker part of AWS?</strong></h3><p>Yes. SageMaker is a service of the AWS public cloud, and it contains tools for creating, training, and deploying machine learning (ML) models for predictive analytics applications.</p><h3 id="heading-5-is-sagemaker-saas-or-paas"><strong>5. Is SageMaker SaaS or PaaS?</strong></h3><p>Amazon SageMaker is best described as PaaS: it provides a managed platform for building, training, and deploying your own ML models, rather than a finished end-user application.</p>]]><![CDATA[<h2 id="heading-what-is-microsoft-azure-machine-learning"><strong>What is Microsoft Azure Machine Learning?</strong></h2><p>Use an enterprise-grade service for the end-to-end machine learning lifecycle. Source: <a target="_blank" href="https://azure.microsoft.com/en-us/products/machine-learning/#product-overview">https://azure.microsoft.com/en-us/products/machine-learning/#product-overview</a></p><h2 id="heading-what-is-aws-sagemaker"><strong>What is AWS Sagemaker?</strong></h2><p>Build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows. 
(Source: <a target="_blank" href="https://aws.amazon.com/sagemaker/">https://aws.amazon.com/sagemaker/</a>)</p><h2 id="heading-microsoft-azure-machine-learning-vs-aws-sagemaker-comparison"><strong>Microsoft Azure Machine Learning vs AWS Sagemaker Comparison</strong></h2><div class="hn-table"><table><thead><tr><td><strong>Characteristic</strong></td><td><strong>Microsoft Azure Machine Learning</strong></td><td><strong>AWS Sagemaker</strong></td></tr></thead><tbody><tr><td><strong>Use Cases</strong></td><td>Predictive maintenance and asset management; customer churn analysis and retention; sentiment analysis and social media monitoring; healthcare diagnosis and treatment prediction; fraud detection and credit scoring</td><td>Fraud detection; image and video recognition; natural language processing; predictive maintenance; financial forecasting; health monitoring and diagnostics</td></tr><tr><td><strong>When not to use</strong></td><td>For simple data analysis or reporting tasks; for small-scale projects with limited data</td><td>When you have limited data processing requirements; when you have limited machine learning requirements; when you have strict budget constraints</td></tr><tr><td><strong>Type of data processing</strong></td><td>Azure Machine Learning supports various types of data processing, including preprocessing, feature engineering, and training-data preparation. The tool provides a range of data preprocessing and feature engineering techniques, including data cleaning, scaling, and normalization.</td><td>AWS Sagemaker supports a wide range of data processing capabilities, including data ingestion, data transformation, and feature engineering. 
It also supports real-time and batch processing of large-scale datasets.</td></tr><tr><td><strong>Data ingestion</strong></td><td>Azure Machine Learning supports data ingestion from various sources, including Azure Blob Storage, Azure Data Lake Storage, Azure SQL Database, and other public and private data sources.</td><td>AWS Sagemaker supports data ingestion from various sources, such as Amazon S3, Amazon Redshift, and Amazon Aurora. It also supports data streaming from Amazon Kinesis and Apache Kafka.</td></tr><tr><td><strong>Data transformation</strong></td><td>Azure Machine Learning provides various data transformation capabilities, including data cleaning, normalization, and feature engineering. The tool also supports a range of feature extraction techniques such as Principal Component Analysis (PCA) and Non-negative Matrix Factorization (NMF).</td><td>AWS Sagemaker provides built-in data transformation tools such as Amazon Glue, Apache Spark, and AWS Lambda to transform raw data into machine learning-ready datasets.</td></tr><tr><td><strong>Machine learning support</strong></td><td>Azure Machine Learning supports a wide range of machine learning algorithms, including supervised and unsupervised learning algorithms such as regression, classification, and clustering. It also supports deep learning models such as neural networks and reinforcement learning.</td><td>AWS Sagemaker supports a wide range of machine learning algorithms and frameworks, including TensorFlow, PyTorch, and Apache MXNet. 
It also provides built-in algorithms for popular use cases, such as image classification and anomaly detection.</td></tr><tr><td><strong>Query language</strong></td><td>Azure Machine Learning provides a range of APIs and SDKs that support popular programming languages such as Python and R.</td><td>AWS Sagemaker supports multiple query languages, including SQL, Apache Spark, and Presto, to query and analyze data.</td></tr><tr><td><strong>Deployment model</strong></td><td>Azure Machine Learning provides deployment options for both cloud and on-premises environments. It also supports containerization for deploying machine learning models in production.</td><td>AWS Sagemaker provides flexible deployment options, including hosting models on Amazon Elastic Compute Cloud (EC2) instances or as serverless functions using AWS Lambda.</td></tr><tr><td><strong>Integration with other services</strong></td><td>Azure Machine Learning integrates with various other Azure services, including Azure Data Factory, Azure Databricks, and Azure DevOps. It also integrates with third-party tools such as Jupyter Notebook and Visual Studio.</td><td>AWS Sagemaker integrates with a wide range of AWS services, such as Amazon S3, Amazon Redshift, and Amazon Kinesis. It also integrates with popular third-party services, such as Apache Spark and TensorFlow.</td></tr><tr><td><strong>Security</strong></td><td>Azure Machine Learning supports various security features such as role-based access control, data encryption, and network isolation.</td><td>AWS Sagemaker provides strong security measures, including encryption, identity and access management, and compliance with industry standards such as HIPAA and GDPR.</td></tr><tr><td><strong>Pricing model</strong></td><td>Azure Machine Learning offers a pay-as-you-go pricing model based on usage. 
It also provides pre-configured virtual machine images with machine learning tools for a fixed monthly fee.</td><td>AWS Sagemaker offers a pay-as-you-go pricing model, with charges based on the usage of various services and resources.</td></tr><tr><td><strong>Scalability</strong></td><td>Azure Machine Learning can scale to handle large datasets and high-performance computing requirements.</td><td>AWS Sagemaker can scale up or down depending on the size of the data and the complexity of the model. It can also handle large-scale distributed training and inference.</td></tr><tr><td><strong>Performance</strong></td><td>Azure Machine Learning provides high-performance computing capabilities for building and training machine learning models.</td><td>AWS Sagemaker provides high-performance computing capabilities to handle complex machine-learning tasks.</td></tr><tr><td><strong>Availability</strong></td><td>Azure Machine Learning provides high availability and reliability with built-in redundancy and failover mechanisms.</td><td>AWS Sagemaker offers high availability and reliability, with built-in fault tolerance and automatic scaling.</td></tr><tr><td><strong>Reliability</strong></td><td>Azure Machine Learning is a reliable tool that is backed by Microsoft's strong commitment to providing reliable cloud-based services.</td><td>Reliability is one of the key features of AWS SageMaker. 
AWS ensures high availability, durability, and fault tolerance of the SageMaker service by using various techniques and technologies.</td></tr><tr><td><strong>Monitoring and management</strong></td><td>Azure Machine Learning provides various monitoring and management capabilities, including model versioning, model deployment, and model performance monitoring.</td><td>AWS Sagemaker provides tools for monitoring and managing machine learning models, such as Amazon CloudWatch and AWS Step Functions.</td></tr><tr><td><strong>Developer tools & integration</strong></td><td>Azure Machine Learning provides various developer tools and integrations, including Azure Machine Learning Studio, Azure Machine Learning Workbench, and Azure Machine Learning CLI.</td><td>AWS Sagemaker provides a wide range of developer tools and integrations, such as Jupyter notebooks, AWS SDKs, and popular IDEs such as PyCharm and Visual Studio Code.</td></tr><tr><td><strong>Analyzing GitHub Metrics</strong></td><td>As of 2022, Azure Machine Learning has over 3.6k stars, 2.3k forks, 45 contributors, and 1,259 commits.</td><td>As of 2022, AWS Sagemaker has over 7.8k stars, 5.8k forks, 439 contributors, and 2,565 commits.</td></tr><tr><td><strong>Decoding Pricing</strong></td><td>Azure Machine Learning offers a pay-as-you-go service.</td><td>You only pay for what you use with Amazon SageMaker. ML model development, training, and deployment are billed by the second, with no minimum costs or up-front requirements. The cost of Amazon SageMaker is broken into costs for hosting instances, on-demand ML instances, and ML storage.</td></tr></tbody></table></div><h2 id="heading-faqs"><strong>FAQs</strong></h2><h3 id="heading-1-is-azure-good-for-machine-learning"><strong>1. Is Azure good for machine learning?</strong></h3><p>Yes. Azure Machine Learning is one of the best tools for performing predictive analysis.</p><h3 id="heading-2-why-use-azure-machine-learning"><strong>2. 
Why use Azure machine learning?</strong></h3><p>Azure Machine Learning is user-friendly and provides a range of flexible tools. It also offers a wide range of datasets and algorithms for making more accurate predictions. The tool simplifies the process of importing training data and fine-tuning the outcomes.</p><h3 id="heading-3-what-is-amazon-sagemaker-used-for"><strong>3. What is Amazon SageMaker used for?</strong></h3><p>Amazon SageMaker is a fully managed service containing modules that can be used independently or together to build, manage, and deploy your ML models.</p><h3 id="heading-4-is-sagemaker-part-of-aws"><strong>4. Is SageMaker part of AWS?</strong></h3><p>Yes. SageMaker is a service of the AWS public cloud, and it contains tools for creating, training, and deploying machine learning (ML) models for predictive analytics applications.</p><h3 id="heading-5-is-sagemaker-saas-or-paas"><strong>5. Is SageMaker SaaS or PaaS?</strong></h3><p>Amazon SageMaker is best described as PaaS: it provides a managed platform for building, training, and deploying your own ML models, rather than a finished end-user application.</p>]]>https://cdn.hashnode.com/res/hashnode/image/upload/v1686832800584/e23aeda1-1d0e-4003-ac29-8b0795a4ffaf.png<![CDATA[Let me explain - Different types of storages in S3]]>https://blog.hemath.com/let-me-explain-different-types-of-storages-in-s3https://blog.hemath.com/let-me-explain-different-types-of-storages-in-s3Thu, 10 Nov 2022 17:07:16 GMT<![CDATA[<h2 id="heading-what-is-simple-storage-service-in-aws">What is Simple Storage Service in AWS</h2><p>Amazon Simple Storage Service (S3) is employed for the storage of data in the form of objects. S3 is not like any other file storage device or service. In addition, Amazon S3 offers industry-leading scalability, data availability, security, and performance. The data that the user uploads to S3 is stored as objects and assigned an ID. Furthermore, data is stored in buckets, and individual objects can be up to 5 terabytes (TB) in size. 
This service is primarily intended for online backup and archiving of data and applications on Amazon Web Services (AWS).</p><h2 id="heading-amazon-s3-storage-classes">Amazon S3 Storage Classes:</h2><h3 id="heading-by-inspecting-the-data-this-storage-preserves-its-originality-storage-classes-are-classified-as-follows">Storage classes are classified as follows:</h3><ul><li><p>Amazon S3 Standard</p></li><li><p>Amazon S3 Intelligent-Tiering</p></li><li><p>Amazon S3 Standard-Infrequent Access</p></li><li><p>Amazon S3 One Zone-Infrequent Access</p></li><li><p>Amazon S3 Glacier Instant Retrieval</p></li><li><p>Amazon S3 Glacier Flexible Retrieval</p></li><li><p>Amazon S3 Glacier Deep Archive</p></li></ul><h2 id="heading-amazon-s3-standard">Amazon S3 Standard</h2><p>It is used for general storage and provides high durability, availability, and performance for frequently accessed data. Cloud applications, dynamic websites, content distribution, mobile and gaming applications, and big data analytics are all appropriate use cases for S3 Standard.</p><h3 id="heading-characteristics-of-s3-standard">Characteristics of S3 Standard</h3><ol><li><p>It is designed for 99.99% availability.</p></li><li><p>It offers low-latency retrieval of objects.</p></li><li><p>It is resilient to events that can affect an entire Availability Zone.</p></li><li><p>S3 Standard has a durability of 99.999999999% (eleven nines).</p></li></ol><h2 id="heading-amazon-s3-intelligent-tiering">Amazon S3 Intelligent-Tiering</h2><p>It is the first cloud storage class that reduces the user's storage costs automatically. 
It offers cost-effective access based on access frequency, without compromising performance or adding operational overhead. Amazon S3 Intelligent-Tiering automatically optimizes costs at a granular object level and has no retrieval fees.</p><h3 id="heading-characteristics-of-s3-intelligent-tiering">Characteristics of S3 Intelligent-Tiering</h3><ol><li><p>Requires no monitoring; objects are moved between access tiers automatically.</p></li><li><p>There are no minimum storage requirements or recovery fees to use the service.</p></li><li><p>S3 Intelligent-Tiering has a durability of 99.999999999%.</p></li></ol><h2 id="heading-amazon-s3-standard-infrequent-access">Amazon S3 Standard-Infrequent Access</h2><p>S3 Standard-IA is for data that is accessed less frequently but requires rapid access when needed. It offers the high durability, high throughput, and low latency of S3 Standard. It is ideal for storing backup and recovery data for an extended period of time. 
It retrieves data in milliseconds, with the same performance as S3 Standard.</p><h3 id="heading-characteristics-of-s3-glacier-instant-retrieval">Characteristics of S3 Glacier Instant Retrieval</h3><ol><li><p>The data is recovered in milliseconds.</p></li><li><p>The minimum billable object size is 128 KB.</p></li><li><p>S3 Glacier Instant Retrieval has a 99.9% availability rate.</p></li><li><p>The durability is 99.999999999%.</p></li></ol><h2 id="heading-amazon-s3-one-zone-infrequent-access">Amazon S3 One Zone-Infrequent Access</h2><p>S3 One Zone-IA, in contrast to other S3 Storage Classes that store data in a minimum of three Availability Zones, stores data in a single Availability Zone and costs 20% less than S3 Standard-IA. It's an excellent choice for storing secondary backup copies of on-premises data or data that can be easily recreated. S3 One Zone-IA offers the same high durability, throughput, and latency as S3 Standard.</p><h3 id="heading-characteristics-of-s3-one-zone-infrequent-access">Characteristics of S3 One Zone-Infrequent Access</h3><ol><li><p>SSL (Secure Sockets Layer) is supported for data transfer and encryption.</p></li><li><p>Data can be lost if the Availability Zone is destroyed.</p></li><li><p>In S3 One Zone-Infrequent Access, availability is 99.5%.</p></li><li><p>The durability is 99.999999999%.</p></li></ol><h2 id="heading-amazon-s3-glacier-flexible-retrieval">Amazon S3 Glacier Flexible Retrieval</h2><p>When compared to S3 Glacier Instant Retrieval, it offers less expensive storage. It is an appropriate solution for backing up data so that it can be easily recovered a few times per year. 
Data access takes minutes to a few hours, depending on the retrieval option.</p><h3 id="heading-characteristics-of-s3-glacier-flexible-retrieval">Characteristics of S3 Glacier Flexible Retrieval</h3><ol><li><p>Free bulk retrievals.</p></li><li><p>Data access may be hampered if AZs are destroyed.</p></li><li><p>S3 Glacier Flexible Retrieval is best for backup and disaster recovery use cases when retrieving large data sets.</p></li><li><p>S3 Glacier Flexible Retrieval has a 99.99% availability rate.</p></li><li><p>Durability is 99.999999999%.</p></li></ol><h2 id="heading-amazon-s3-glacier-deep-archive">Amazon S3 Glacier Deep Archive</h2><p>The Glacier Deep Archive storage class is intended to provide long-term, secure storage for large amounts of data at a price competitive with low-cost off-premises tape archival services. Accessibility is efficient: data can be restored within 12 hours. Object replication is another feature of S3 Glacier Deep Archive.</p><h3 id="heading-characteristics-of-s3-glacier-deep-archive">Characteristics of S3 Glacier Deep Archive</h3><ol><li><p>Secure, long-term storage at the lowest cost.</p></li><li><p>Recovery completes within 12 hours.</p></li><li><p>S3 Glacier Deep Archive availability is 99.99%.</p></li><li><p>Durability is 99.999999999%.</p></li></ol>]]><![CDATA[<h2 id="heading-what-is-simple-storage-service-in-aws">What is Simple Storage Service in AWS</h2><p>Amazon Simple Storage Service (S3) is employed for the storage of data in the form of objects. S3 is not like any other file storage device or service. In addition, Amazon S3 offers industry-leading scalability, data availability, security, and performance. 
The data that the user uploads to S3 is stored as objects and assigned an ID. Furthermore, data is stored in buckets, and individual objects can be up to 5 terabytes (TB) in size. This service is primarily intended for online backup and archiving of data and applications on Amazon Web Services (AWS).</p><h2 id="heading-amazon-s3-storage-classes">Amazon S3 Storage Classes:</h2><h3 id="heading-by-inspecting-the-data-this-storage-preserves-its-originality-storage-classes-are-classified-as-follows">Storage classes are classified as follows:</h3><ul><li><p>Amazon S3 Standard</p></li><li><p>Amazon S3 Intelligent-Tiering</p></li><li><p>Amazon S3 Standard-Infrequent Access</p></li><li><p>Amazon S3 One Zone-Infrequent Access</p></li><li><p>Amazon S3 Glacier Instant Retrieval</p></li><li><p>Amazon S3 Glacier Flexible Retrieval</p></li><li><p>Amazon S3 Glacier Deep Archive</p></li></ul><h2 id="heading-amazon-s3-standard">Amazon S3 Standard</h2><p>It is used for general storage and provides high durability, availability, and performance for frequently accessed data. Cloud applications, dynamic websites, content distribution, mobile and gaming applications, and big data analytics are all appropriate use cases for S3 Standard.</p><h3 id="heading-characteristics-of-s3-standard">Characteristics of S3 Standard</h3><ol><li><p>It is designed for 99.99% availability.</p></li><li><p>It offers low-latency retrieval of objects.</p></li><li><p>It is resilient to events that can affect an entire Availability Zone.</p></li><li><p>S3 Standard has a durability of 99.999999999% (eleven nines).</p></li></ol><h2 id="heading-amazon-s3-intelligent-tiering">Amazon S3 Intelligent-Tiering</h2><p>It is the first cloud storage class that reduces the user's storage costs automatically. It offers cost-effective access based on access frequency, without compromising performance or adding operational overhead. Amazon S3 Intelligent-Tiering automatically optimizes costs at a granular object level and has no retrieval fees.</p><h3 id="heading-characteristics-of-s3-intelligent-tiering">Characteristics of S3 Intelligent-Tiering</h3><ol><li><p>Requires no monitoring; objects are moved between access tiers automatically.</p></li><li><p>There are no minimum storage requirements or recovery fees to use the service.</p></li><li><p>S3 Intelligent-Tiering has a durability of 99.999999999%.</p></li></ol><h2 id="heading-amazon-s3-standard-infrequent-access">Amazon S3 Standard-Infrequent Access</h2><p>S3 Standard-IA is for data that is accessed less frequently but requires rapid access when needed. It offers the high durability, high throughput, and low latency of S3 Standard. It is ideal for storing backup and recovery data for an extended period of time. 
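</p><p>The storage classes described in this post differ mainly in access frequency, retrieval latency, and cost. As a rough sketch, the choice can be expressed as a small decision function; the returned strings are the real <code>StorageClass</code> values the S3 PutObject API accepts, but the numeric thresholds are illustrative assumptions, not AWS guidance:</p>

```python
# Hedged sketch: map an object's expected access pattern to an S3
# StorageClass value. The returned strings are the actual constants the
# S3 PutObject API accepts (e.g. boto3's put_object(StorageClass=...));
# the numeric thresholds are illustrative assumptions only.

def pick_storage_class(accesses_per_month: float,
                       needs_millisecond_access: bool = True) -> str:
    if accesses_per_month >= 1:            # frequently accessed data
        return "STANDARD"
    if accesses_per_month >= 0.25:         # roughly quarterly access
        return "STANDARD_IA"
    if needs_millisecond_access:           # rare access, instant retrieval
        return "GLACIER_IR"
    if accesses_per_month > 1 / 12:        # a few times per year
        return "GLACIER"                   # Glacier Flexible Retrieval
    return "DEEP_ARCHIVE"                  # archive, restore within ~12 hours
```

<p>For data with unknown or changing access patterns, <code>INTELLIGENT_TIERING</code> sidesteps this decision entirely by moving objects between tiers automatically.</p><p>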
It serves as a repository for disaster recovery files.</p><h3 id="heading-characteristics-of-s3-standard-infrequent-access">Characteristics of S3 Standard-Infrequent Access</h3><ol><li><p>High performance, with the same low latency and throughput as S3 Standard.</p></li><li><p>It is extremely durable, with data stored redundantly across multiple AZs.</p></li><li><p>The durability is 99.999999999%.</p></li></ol><h2 id="heading-amazon-s3-glacier-instant-retrieval">Amazon S3 Glacier Instant Retrieval</h2><p>It is an archive storage class that provides the lowest-cost storage for data archiving and is structured to give you the best performance and flexibility. The S3 Glacier Instant Retrieval service provides the quickest access to archive storage. It retrieves data in milliseconds, with the same performance as S3 Standard.</p><h3 id="heading-characteristics-of-s3-glacier-instant-retrieval">Characteristics of S3 Glacier Instant Retrieval</h3><ol><li><p>The data is recovered in milliseconds.</p></li><li><p>The minimum billable object size is 128 KB.</p></li><li><p>S3 Glacier Instant Retrieval has a 99.9% availability rate.</p></li><li><p>The durability is 99.999999999%.</p></li></ol><h2 id="heading-amazon-s3-one-zone-infrequent-access">Amazon S3 One Zone-Infrequent Access</h2><p>S3 One Zone-IA, in contrast to other S3 Storage Classes that store data in a minimum of three Availability Zones, stores data in a single Availability Zone and costs 20% less than S3 Standard-IA. It's an excellent choice for storing secondary backup copies of on-premises data or data that can be easily recreated. 
S3 One Zone-IA offers the same high durability, throughput, and latency as S3 Standard.</p><h3 id="heading-characteristics-of-s3-one-zone-infrequent-access">Characteristics of S3 One Zone-Infrequent Access</h3><ol><li><p>SSL (Secure Sockets Layer) is supported for data transfer and encryption.</p></li><li><p>Data can be lost if the Availability Zone is destroyed.</p></li><li><p>In S3 One Zone-Infrequent Access, availability is 99.5%.</p></li><li><p>The durability is 99.999999999%.</p></li></ol><h2 id="heading-amazon-s3-glacier-flexible-retrieval">Amazon S3 Glacier Flexible Retrieval</h2><p>When compared to S3 Glacier Instant Retrieval, it offers less expensive storage. It is an appropriate solution for backing up data so that it can be easily recovered a few times per year. Data access takes minutes to a few hours, depending on the retrieval option.</p><h3 id="heading-characteristics-of-s3-glacier-flexible-retrieval">Characteristics of S3 Glacier Flexible Retrieval</h3><ol><li><p>Free bulk retrievals.</p></li><li><p>Data access may be hampered if AZs are destroyed.</p></li><li><p>S3 Glacier Flexible Retrieval is best for backup and disaster recovery use cases when retrieving large data sets.</p></li><li><p>S3 Glacier Flexible Retrieval has a 99.99% availability rate.</p></li><li><p>Durability is 99.999999999%.</p></li></ol><h2 id="heading-amazon-s3-glacier-deep-archive">Amazon S3 Glacier Deep Archive</h2><p>The Glacier Deep Archive storage class is intended to provide long-term, secure storage for large amounts of data at a price competitive with low-cost off-premises tape archival services. Accessibility is efficient: data can be restored within 12 hours. 
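</p><p>Every class above quotes the same 99.999999999% durability, that is, eleven nines. A quick back-of-the-envelope sketch of what that figure implies for expected annual object loss (the probability comes from the durability figure quoted above; the bucket size is an arbitrary example):</p>

```python
# What eleven nines of durability implies: with an annual object-loss
# probability of about 1e-11, even very large buckets expect essentially
# zero losses per year.

DURABILITY = 0.99999999999                 # eleven nines, as quoted for S3
annual_loss_probability = 1 - DURABILITY   # ~1e-11

def expected_objects_lost_per_year(object_count: int) -> float:
    """Expected number of objects lost in a year, assuming independent losses."""
    return object_count * annual_loss_probability

# With 10,000,000 stored objects, the expectation works out to roughly
# 0.0001 objects per year: about one object lost every 10,000 years.
losses = expected_objects_lost_per_year(10_000_000)
```

<p>Availability (the 99.99%, 99.9%, or 99.5% figures above) is a separate measure: it bounds downtime, not data loss.</p><p>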
Object replication is another feature of S3 Glacier Deep Archive.</p><h3 id="heading-characteristics-of-s3-glacier-deep-archive">Characteristics of S3 Glacier Deep Archive</h3><ol><li><p>Storage that is more secure.</p></li><li><p>Recovery time is shorter and requires less time.</p></li><li><p>S3 glacier deep archive availability is 99.99%.</p></li><li><p>Durability is of 99.999999999%.</p></li></ol>]]>https://cdn.hashnode.com/res/hashnode/image/upload/v1686675127715/e94ec8f0-4801-4fb5-baf5-08b270971561.png<![CDATA[Get to know - AWS Load Balancer]]>https://blog.hemath.com/get-to-know-aws-load-balancerhttps://blog.hemath.com/get-to-know-aws-load-balancerMon, 07 Nov 2022 16:22:11 GMT<![CDATA[<h2 id="heading-what-is-a-load-balancer">What is a load balancer</h2><p>AWS load balancers accept incoming client application traffic and distribute it across multiple registered targets, such as EC2 instances in different availability zones. The AWS application load balancer feature enables developers to route and configure incoming traffic between end-users and applications in the AWS public cloud.</p><p>The AWS elastic load balancer, which serves as a single point of contact for clients, only routes to healthy instances and identifies unhealthy instances. When the target becomes operational, the AWS load balancer algorithm resumes traffic routing to it. In cloud environments with multiple web services, load balancing is critical.</p><p>AWS Elastic Load Balancing (ELB) automatically distributes incoming application traffic across multiple targets in one or more availability zones, such as containers, EC2 instances, and IP addresses. This improves the fault tolerance and availability of user applications by distributing and balancing how frontend traffic reaches backend servers. 
AWS load balancing also checks the health of registered targets and routes traffic accordingly.</p><h2 id="heading-aws-load-balancer-types">AWS Load Balancer Types</h2><h3 id="heading-there-are-four-types-of-aws-load-balancers-supported">There are four types of AWS load balancers supported:</h3><ul><li><p>AWS Classic Load Balancer</p></li><li><p>AWS Network Load Balancer (NLB)</p></li><li><p>AWS Application Load Balancer (ALB)</p></li><li><p>AWS Gateway Load Balancer (GLB)</p></li></ul><ul><li>A. Classic Load Balancer:</li></ul><p>Initially, the traditional type of load balancer was used. It distributes traffic among instances and lacks the intelligence to support host-based or path-based routing. In some situations, it reduces efficiency and performance. It works at both the connection and request levels. The classic load balancer sits between the transport (TCP/SSL) and application layers (HTTP/HTTPS).</p><ul><li>B. Application Load Balancer:</li></ul><p>This type of Load Balancer is used when decisions about HTTP and HTTPS traffic routing must be made. It supports both path-based and host-based routing. This load balancer operates at the OSI Model's Application layer. Dynamic host port mapping is also supported by the load balancer.</p><ul><li>C. Network Load Balancer:</li></ul><p>This type of load balancer operates at the OSI model's transport layer (TCP/SSL). It can handle millions of requests per second. It is primarily used to balance TCP traffic.</p><ul><li>D. Gateway Load Balancer:</li></ul><p>Gateway Load Balancers enable you to deploy, scale, and manage virtual appliances such as firewalls. Gateway Load Balancers combine a transparent network gateway with traffic distribution.</p><p>By acting as a single point of contact for clients, the AWS load balancer improves application availability. As needs change, users can seamlessly add and remove instances from the AWS load balancer without disrupting the overall request flow to the application. 
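</p><p>The health-check routing described above can be sketched as a round-robin over only the healthy targets. The <code>Target</code> class below is a hypothetical stand-in for a registered EC2 instance, not an AWS API object; real ELB health checks are configured per target group:</p>

```python
from dataclasses import dataclass
from itertools import cycle

# Minimal sketch of ELB-style routing: requests are spread round-robin
# across registered targets, but only targets passing health checks
# receive traffic. Target is a hypothetical stand-in for a registered
# EC2 instance.

@dataclass
class Target:
    name: str
    healthy: bool = True

def route(requests: int, targets: list[Target]) -> dict[str, int]:
    """Return how many requests each healthy target would receive."""
    healthy = [t for t in targets if t.healthy]
    if not healthy:
        raise RuntimeError("no healthy targets registered")
    counts = {t.name: 0 for t in healthy}
    rotation = cycle(healthy)
    for _ in range(requests):
        counts[next(rotation).name] += 1
    return counts
```

<p>With targets <code>i-a</code> and <code>i-c</code> healthy and <code>i-b</code> failing its health check, ten requests split five and five between the healthy pair; once <code>i-b</code> recovers, flipping its flag puts it back into the rotation, mirroring how the load balancer resumes routing to a recovered target.</p><p>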
As a result, AWS elastic load balancing scales as application traffic fluctuates and can automatically scale to most workloads.</p><p>Users configure the load balancer with one or more listeners. A listener checks the configured port and protocol for connection requests from clients and forwards them to registered instances using the configured port number and protocol. The AWS load balancer sends requests only to healthy instances thanks to health checks.</p><p>By default, the AWS load balancer distributes traffic evenly across enabled availability zones. Maintain instances in roughly equal numbers across availability zones to improve fault tolerance. Cross-zone load balancing is also an option. This kind of elastic load balancing ensures that traffic is distributed evenly across all registered instances.</p><p>When an availability zone is enabled, a load balancer node is created within the availability zone. Targets do not receive traffic if the availability zone is not enabled, even if they are registered.</p><p>Furthermore, the classic AWS load balancer algorithm performs best with at least one registered target in each enabled availability zone, but enabling multiple availability zones for all load balancers is recommended. To ensure continuous traffic routing, AWS application load balancers require the activation of at least two availability zones.</p><h2 id="heading-limitations-of-aws-load-balancer">Limitations of AWS Load Balancer</h2><p>Although AWS load balancers perform well in basic functions, they face a few significant challenges.</p><h2 id="heading-aws-load-balancer-latency">AWS Load Balancer Latency</h2><p>AWS load balancer latency is among the system's most notable limitations. With a classic load balancer, several things can cause high latency, starting with faulty configuration. 
Beyond that, the high latency trouble spots are basically the same for the AWS application load balancer, especially relating to backend instances:</p><ul><li><p>Incorrect configuration</p></li><li><p>Issues with network connectivity</p></li><li><p>And, as to backend instances:</p></li><li><p>Excessive CPU utilization</p></li><li><p>High memory (RAM) utilization</p></li><li><p>Incorrect web server configuration</p></li><li><p>Problems caused by web application dependencies, such as Amazon S3 buckets or external databases, running on backend instances</p></li></ul>]]><![CDATA[<h2 id="heading-what-is-a-load-balancer">What is a load balancer</h2><p>AWS load balancers accept incoming client application traffic and distribute it across multiple registered targets, such as EC2 instances in different availability zones. The AWS application load balancer feature enables developers to route and configure incoming traffic between end-users and applications in the AWS public cloud.</p><p>The AWS elastic load balancer, which serves as a single point of contact for clients, only routes to healthy instances and identifies unhealthy instances. When the target becomes operational, the AWS load balancer algorithm resumes traffic routing to it. In cloud environments with multiple web services, load balancing is critical.</p><p>AWS Elastic Load Balancing (ELB) automatically distributes incoming application traffic across multiple targets in one or more availability zones, such as containers, EC2 instances, and IP addresses. This improves the fault tolerance and availability of user applications by distributing and balancing how frontend traffic reaches backend servers. 
AWS load balancing also checks the health of registered targets and routes traffic accordingly.</p><h2 id="heading-aws-load-balancer-types">AWS Load Balancer Types</h2><h3 id="heading-there-are-four-types-of-aws-load-balancers-supported">There are four types of AWS load balancers supported:</h3><ul><li><p>AWS Classic Load Balancer</p></li><li><p>AWS Network Load Balancer (NLB)</p></li><li><p>AWS Application Load Balancer (ALB)</p></li><li><p>AWS Gateway Load Balancer (GLB)</p></li></ul><ul><li>A. Classic Load Balancer:</li></ul><p>Initially, the traditional type of load balancer was used. It distributes traffic among instances and lacks the intelligence to support host-based or path-based routing. In some situations, it reduces efficiency and performance. It works at both the connection and request levels. The classic load balancer sits between the transport (TCP/SSL) and application layers (HTTP/HTTPS).</p><ul><li>B. Application Load Balancer:</li></ul><p>This type of Load Balancer is used when decisions about HTTP and HTTPS traffic routing must be made. It supports both path-based and host-based routing. This load balancer operates at the OSI Model's Application layer. Dynamic host port mapping is also supported by the load balancer.</p><ul><li>C. Network Load Balancer:</li></ul><p>This type of load balancer operates at the OSI model's transport layer (TCP/SSL). It can handle millions of requests per second. It is primarily used to balance TCP traffic.</p><ul><li>D. Gateway Load Balancer:</li></ul><p>Gateway Load Balancers enable you to deploy, scale, and manage virtual appliances such as firewalls. Gateway Load Balancers combine a transparent network gateway with traffic distribution.</p><p>By acting as a single point of contact for clients, the AWS load balancer improves application availability. As needs change, users can seamlessly add and remove instances from the AWS load balancer without disrupting the overall request flow to the application. 
As a result, AWS elastic load balancing scales as application traffic fluctuates and can automatically scale to most workloads</p><p>Users configure the load balancer with one or more listeners. A listener checks the configured port and protocol for connection requests from clients and forwards them to registered instances using the configured port number and protocol. The AWS load balancer sends requests only to healthy instances thanks to health checks.</p><p>By default, the AWS load balancer distributes traffic evenly across enabled availability zones. Maintain instances in roughly equal numbers across availability zones to improve fault tolerance. Cross-zone load balancing is also an option. This kind of elastic load balancing ensures that traffic is distributed evenly across all registered instances</p><p>When an availability zone is enabled, a load balancer node is created within the availability zone. Targets do not receive traffic if the availability zone is not enabled, even if they are registered.</p><p>Furthermore, the classic AWS load balancer algorithm performs best with at least one registered target in each enabled availability zone, but enabling multiple availability zones for all load balancers is recommended. To ensure continuous traffic routing, AWS application load balancers require the activation of at least two availability zones.</p><h2 id="heading-limitations-of-aws-load-balancer">Limitations of AWS Load Balancer</h2><p>Although AWS load balancers perform well in basic functions, they face a few significant challenges.</p><h2 id="heading-aws-load-balancer-latency">AWS Load Balancer Latency</h2><p>AWS load balancer latency is among the systems most notable limitations. With a classic load balancer, several things can cause high latency, starting with faulty configuration. 
Beyond that, the high latency trouble spots are basically the same for the AWS application load balancer, especially relating to backend instances:</p><ul><li><p>Incorrect configuration</p></li><li><p>Issues with network connectivity</p></li><li><p>And as to backend instances</p></li><li><p>Excessive CPU utilization</p></li><li><p>High memory (RAM) utilization</p></li><li><p>Incorrect web server configuration Problems caused by web application dependencies such as Amazon S3 buckets or external databases running on backend instances</p></li></ul>]]>https://cdn.hashnode.com/res/hashnode/image/upload/v1686673002266/b819f7de-fc7e-4cab-8c98-018b3f57e3d5.png<![CDATA[This versus That - Tableau vs Power BI]]>https://blog.hemath.com/this-versus-that-tableau-vs-power-bihttps://blog.hemath.com/this-versus-that-tableau-vs-power-biSun, 06 Nov 2022 15:57:48 GMT<![CDATA[<h2 id="heading-what-is-tableau"><strong>What is Tableau?</strong></h2><p>The world's leading analytics platform - Successful business forecasts, decisions, and strategies are driven by data. Source: <a target="_blank" href="https://www.tableau.com/">https://www.tableau.com/</a></p><h2 id="heading-what-is-power-bi"><strong>What is Power BI?</strong></h2><p>Do more with less using an end-to-end BI platform to create a single source of truth, uncover more powerful insights, and translate them into impact. 
Source: <a target="_blank" href="https://powerbi.microsoft.com/en-us/">https://powerbi.microsoft.com/en-us/</a></p><h2 id="heading-tableau-vs-power-bi-comparison"><strong>Tableau vs Power BI Comparison</strong></h2><div class="hn-table"><table><thead><tr><td><strong>Characteristic</strong></td><td><strong>Tableau</strong></td><td><strong>Power BI</strong></td></tr></thead><tbody><tr><td><strong>Use Cases</strong></td><td>Sales and marketing analytics<br />Financial analysis and reporting<br />Operations and supply chain management<br />Human resources analytics<br />Healthcare data analysis<br />Government and public sector data analysis<br />Education data analysis</td><td>Financial analysis and reporting<br />Sales and marketing analysis<br />Supply chain management and logistics analysis<br />Human resources and employee performance analysis<br />Healthcare data analysis<br />Real-time monitoring of operational data</td></tr><tr><td><strong>When not to use</strong></td><td>When you need to perform complex data processing and transformation<br />When you need to perform real-time data analysis and streaming<br />When you need to perform predictive analytics and machine learning at scale</td><td>When you have very limited data analysis and visualization needs<br />When you need to perform complex statistical analysis that requires specialized software</td></tr><tr><td><strong>Type of data processing</strong></td><td>Tableau provides both ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) data processing methods. Users can connect to various data sources, transform the data using drag and drop features, and visualize it through interactive dashboards and reports.</td><td>Power BI offers interactive data processing capabilities that allow users to connect to various data sources, transform data, and create interactive reports and dashboards.</td></tr><tr><td><strong>Data ingestion</strong></td><td>Tableau supports a wide range of data sources, including databases, cloud-based data warehouses, spreadsheets, and more. 
Users can connect to these sources and extract data using Tableau's connector library.</td><td>Power BI allows users to import data from various sources, such as Excel, CSV, and SQL Server. It also supports direct connections to cloud-based data sources like Azure SQL Database, Azure Data Lake Storage, and many more.</td></tr><tr><td><strong>Data transformation</strong></td><td>Tableau provides a variety of data transformation tools, such as data blending, pivoting, splitting, and aggregating. Users can perform these operations using drag and drop features, without writing any code.</td><td>Power BI offers a wide range of data transformation capabilities, including data shaping, cleansing, and modeling. Users can use the built-in Power Query Editor to transform and shape data before building visualizations.</td></tr><tr><td><strong>Machine learning support</strong></td><td>Tableau offers machine learning integration through its Tableau Prep product. Users can use machine learning models to clean, transform, and prepare their data for analysis.</td><td>Power BI offers machine learning support through the integration of Azure Machine Learning. This allows users to train machine learning models and use them for predictive analytics within Power BI.</td></tr><tr><td><strong>Query language</strong></td><td>Tableau provides a proprietary visual query language that allows users to perform complex data analysis and aggregation without writing SQL queries.</td><td>Power BI supports a variety of query languages, including SQL, DAX, and M.</td></tr><tr><td><strong>Deployment model</strong></td><td>Tableau offers both on-premises and cloud-based deployment options. 
Users can deploy Tableau Server on their own infrastructure or use Tableau Online, a cloud-based service.</td><td>Power BI offers both cloud-based and on-premises deployment models, giving users the flexibility to choose the deployment model that best suits their needs.</td></tr><tr><td><strong>Integration with other services</strong></td><td>Tableau integrates with a wide range of other services, such as Salesforce, AWS, Azure, Google Cloud, and more. Users can connect to these services and extract data for analysis.</td><td>Power BI offers integration with various Microsoft services, including Azure, Dynamics 365, and Office 365. It also supports integration with third-party services like Salesforce and Google Analytics.</td></tr><tr><td><strong>Security</strong></td><td>Tableau provides a variety of security features, such as user authentication, data encryption, and access control. Users can also configure Tableau to comply with various data privacy regulations, such as GDPR and HIPAA.</td><td>Power BI provides robust security features, including role-based access control, row-level security, and data encryption.</td></tr><tr><td><strong>Pricing model</strong></td><td>Tableau offers a variety of pricing models, including perpetual licenses, subscription licenses, and pay-as-you-go licenses. Pricing varies based on the features and deployment options chosen.</td><td>Power BI offers a range of pricing options, including a free version, a per-user monthly subscription, and an enterprise version with a capacity-based pricing model.</td></tr><tr><td><strong>Scalability</strong></td><td>Tableau can scale to handle large amounts of data and users. Users can add more servers and resources as needed to support their growing needs.</td><td>Power BI offers scalable solutions for small and large businesses. 
Users can start with a free version and upgrade to a paid version as their needs grow.</td></tr><tr><td><strong>Performance</strong></td><td>Tableau provides fast and responsive data analysis and visualization, thanks to its in-memory data engine and parallel processing capabilities.</td><td>Power BI provides high-performance data analysis and visualization capabilities, allowing users to work with large data sets quickly and easily.</td></tr><tr><td><strong>Availability</strong></td><td>Tableau offers high availability and fault tolerance, thanks to its distributed architecture and automatic failover capabilities.</td><td>Power BI offers high availability and uptime through its cloud-based deployment model.</td></tr><tr><td><strong>Reliability</strong></td><td>Tableau provides reliable and consistent results, thanks to its data validation and error handling capabilities.</td><td>Power BI offers reliable solutions with its enterprise-level SLAs and disaster recovery capabilities.</td></tr><tr><td><strong>Monitoring and management</strong></td><td>Tableau provides a variety of monitoring and management tools, such as performance metrics, alerting, and logging. Users can also monitor and manage their Tableau deployments using third-party tools.</td><td>Power BI provides monitoring and management capabilities through the Power BI service admin portal, allowing administrators to manage users, monitor usage, and configure settings.</td></tr><tr><td><strong>Developer tools & integration</strong></td><td>Tableau provides a variety of developer tools and APIs, such as the Tableau Extensions API, the Tableau Web Data Connector API, and the Tableau JavaScript API. 
Developers can use these tools to build custom integrations and extensions for Tableau.</td><td>Power BI provides developer tools and integration with Visual Studio and other development environments, allowing developers to build custom visualizations and extend Power BI functionality.</td></tr><tr><td><strong>Decoding Pricing</strong></td><td>Tableau has a complex pricing structure with several options for its products and editions such as Tableau Desktop, Tableau Server, Tableau Creator, Tableau Online, etc.</td><td>Power BI offers a variety of pricing options based on user needs: Power BI Free, Power BI Pro, Power BI Premium, and Power BI Embedded.</td></tr></tbody></table></div><h2 id="heading-faqs">FAQs</h2><h3 id="heading-1-what-is-tableau-used-for"><strong>1. What is Tableau used for?</strong></h3><p>Tableau is a data visualization tool used for data analysis, data exploration, and generating meaningful insights in a clear and intuitive way.</p><h3 id="heading-2-is-sql-used-in-tableau"><strong>2. Is SQL used in Tableau?</strong></h3><p>Yes, SQL is used in Tableau to connect to relational databases and retrieve data, and users can also write their own SQL queries within Tableau to manipulate and transform data.</p><h3 id="heading-3-is-tableau-better-than-excel"><strong>3. Is Tableau better than Excel?</strong></h3><p>Tableau is far better than Excel when it comes to data visualization. It has powerful data visualization capabilities, allowing users to create interactive and dynamic visualizations.</p><h3 id="heading-4-what-is-power-bi-used-for"><strong>4. What is Power BI used for?</strong></h3><p>Power BI is a business analytics service by Microsoft used for gathering, analyzing, and visualizing data to help businesses make data-driven decisions, integrate data from different sources, collaborate, and gain insights.</p><h3 id="heading-5-is-power-bi-vs-excel"><strong>5. 
How does Power BI compare to Excel?</strong></h3><p>Excel is a powerful tool for data analysis, but it is limited in terms of advanced data modeling and visualization capabilities, while Power BI is designed specifically for business analytics and provides more advanced capabilities for data modeling, visualization, and analysis.</p><h3 id="heading-6-what-is-power-bi-full-form"><strong>6. What is the full form of Power BI?</strong></h3><p>The full form of Power BI is "Power Business Intelligence".</p>]]>https://cdn.hashnode.com/res/hashnode/image/upload/v1686638183751/f527a8cf-31eb-4815-8192-ca4743b9837e.png<![CDATA[Let's break down - Pandas for beginners]]>https://blog.hemath.com/lets-break-down-pandas-for-beginnershttps://blog.hemath.com/lets-break-down-pandas-for-beginnersFri, 04 Nov 2022 03:31:33 GMT<![CDATA[<p>In this blog, we will start with the basics of the famous Python library called Pandas and gradually advance to more complex topics. Let us begin this tutorial on Pandas with a brief introduction to what the library is all about.</p><h2 id="heading-whats-pandas-in-python-for"><strong>What's Pandas in Python for?</strong></h2><p>Data Science involves collecting, storing, and aggregating data, followed by its cleaning, exploration, and analysis. There is a heavy emphasis on the cleaning of data before it can be further processed. As a result, care is taken to perform a thorough exploratory data analysis to generate a dataset with the utmost quality. Python offers the Pandas library, with built-in features that support data pre-processing throughout the data analysis lifecycle. A clean dataset is an excellent starting point for hypothesis testing and can also be used for further modeling and application of data analysis and machine learning algorithms.</p><p>Developed by Wes McKinney, Pandas is a high-level data manipulation library built on the Python programming language. 
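</p><p>The pre-processing emphasis described above typically looks like this in Pandas. A minimal sketch, with a small dataset invented purely for illustration:</p><pre><code class="lang-python">import numpy as np
import pandas as pd

# Invented sample data: one duplicated row and missing scores
raw = pd.DataFrame({
    "name": ["Roy", "Jason", "Jason", "Sancho"],
    "score": [88.0, np.nan, np.nan, 92.0],
})

# Drop the repeated row, fill missing scores with the column mean,
# and renumber the index
clean = (
    raw.drop_duplicates()
       .fillna({"score": raw["score"].mean()})
       .reset_index(drop=True)
)</code></pre><p>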
Python Pandas is a quick, powerful, versatile, easy-to-use open-source data analysis and manipulation tool. It is built on the NumPy package, and the DataFrame is its primary data structure.</p><h2 id="heading-python-pandas-series"><strong>Python Pandas - Series</strong></h2><p>A Series is quite similar to a <a target="_blank" href="https://blog.hemath.com/lets-break-down-numpy-for-beginners">NumPy</a> array (it is built on top of the NumPy array object). What distinguishes a Series from a NumPy array is that a Series can carry axis labels, so it can be indexed by a label instead of just a numeric position. It can also hold any arbitrary Python object rather than just numeric data.</p><h3 id="heading-create-a-pandas-series"><strong>Create a Pandas Series</strong></h3><p>We can convert a NumPy array, dictionary, or list to a Series:</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span>
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

my_labels = [<span class="hljs-string">'x'</span>,<span class="hljs-string">'y'</span>,<span class="hljs-string">'z'</span>]
demo_list = [<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>]
demo_array = np.array([<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>])
demo_dict = {<span class="hljs-string">'x'</span>:<span class="hljs-number">1</span>,<span class="hljs-string">'y'</span>:<span class="hljs-number">2</span>,<span class="hljs-string">'z'</span>:<span class="hljs-number">3</span>}

<span class="hljs-comment">#Using a NumPy array</span>
pd.Series(data = demo_array,index = my_labels)</code></pre><h3 id="heading-what-is-the-main-difference-between-a-pandas-series-and-a-single-column-dataframe-in-python"><strong>What is the main difference between a Pandas Series and a 
single-column DataFrame in Python?</strong></h3><p>A Pandas Series is one-dimensional, while a DataFrame is two-dimensional. A single-column DataFrame therefore carries a label for its column, whereas a Series has only an optional name attribute. In fact, every column of a DataFrame can be extracted as a Series.</p><p>There are a few intriguing points to consider -</p><p>1. Indexes and columns in Pandas DataFrames and Series make data access and retrieval simple. They are also mutable.</p><p>2. In a DataFrame, a column is essentially a Series. Series operations are used when you simply wish to manipulate a single column of data. They are commonly used in graph plotting.</p><p>3. DataFrames are often used to represent data in a tabular format. They simplify the analysis, extraction, and alteration of two-dimensional data.</p><h3 id="heading-how-to-convert-a-pandas-series-to-a-list-in-python"><strong>How to convert a Pandas Series to a list in Python?</strong></h3><p>To convert a Series to a list, use Pandas tolist(). The Series is initially of the type pandas.core.series.Series. 
It is transformed to a list data type by using the tolist() method.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing library</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-comment">#Creating series</span>
demo_series = pd.Series([<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>,<span class="hljs-number">4</span>,<span class="hljs-number">5</span>])

<span class="hljs-comment">#Converting series to list</span>
demo_list = demo_series.tolist()

print(<span class="hljs-string">"Data type before converting = "</span>,type(demo_series))
print(<span class="hljs-string">"Data type after converting = "</span>,type(demo_list))</code></pre><h3 id="heading-how-to-convert-a-list-to-a-pandas-series-in-python"><strong>How to convert a list to a Pandas Series in Python?</strong></h3><p>We can convert a list directly to a Pandas Series by simply passing the list object to pd.Series().</p><pre><code class="lang-python"><span class="hljs-comment">#Importing library</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-comment">#Creating list</span>
demo_list = [<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>]

<span class="hljs-comment">#Converting list to series</span>
demo_series = pd.Series(demo_list)

print(<span class="hljs-string">"Data type before converting = "</span>,type(demo_list))
print(<span class="hljs-string">"Data type after converting = "</span>,type(demo_series))</code></pre><h3 id="heading-how-to-convert-pandas-series-to-dataframe-in-python"><strong>How to convert a Pandas Series to a DataFrame in Python?</strong></h3><p>To convert a Series to a DataFrame, use Pandas to_frame(). The Series is initially of the type pandas.core.series.Series. 
It is transformed into a DataFrame data type by using the to_frame() method.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing library</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

<span class="hljs-comment">#Creating series</span>
demo_series = pd.Series([<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>,<span class="hljs-number">4</span>,<span class="hljs-number">5</span>])

<span class="hljs-comment">#Converting series to dataframe</span>
demo_dataframe = demo_series.to_frame()

print(<span class="hljs-string">"Data type before converting = "</span>,type(demo_series))
print(<span class="hljs-string">"Data type after converting = "</span>,type(demo_dataframe))</code></pre><h3 id="heading-how-to-sort-a-pandas-series-in-python"><strong>How to sort a Pandas Series in Python?</strong></h3><p>The Series.sort_values() function sorts a Series object in ascending or descending order. 
The function also gives you the option of choosing the sorting algorithm used (via its kind parameter).</p><pre><code class="lang-python"><span class="hljs-comment">#Importing library</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-comment">#Creating series</span>
demo_series = pd.Series([<span class="hljs-number">23</span>,<span class="hljs-number">10</span>,<span class="hljs-number">5</span>,<span class="hljs-number">16</span>,<span class="hljs-number">30</span>])
print(<span class="hljs-string">"Original Series =>\n"</span>,demo_series)
<span class="hljs-comment">#Sorting series in ascending order</span>
asc_series = demo_series.sort_values()
<span class="hljs-comment">#Sorting series in descending order</span>
dsc_series = demo_series.sort_values(ascending=<span class="hljs-literal">False</span>)
<span class="hljs-comment">#To make changes in original series</span>
demo_series.sort_values(inplace=<span class="hljs-literal">True</span>)
print(<span class="hljs-string">"Sorted Series in ascending order =>\n"</span>,asc_series)
print(<span class="hljs-string">"Sorted Series in descending order =>\n"</span>,dsc_series)</code></pre><h2 id="heading-python-pandas-dataframes"><strong>Python Pandas - Dataframes</strong></h2><p>DataFrames are Pandas' workhorses, inspired by the data frame structure of the R programming language. A DataFrame can be thought of as a collection of Series objects that share the same index: a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labelled axes (rows and columns), in which data is organized in rows and columns.</p><p>Key features -</p><p>1. It can consist of columns with different data types.</p><p>2. We can perform arithmetic operations on rows and columns.</p><p>3. 
It is mutable.</p><h3 id="heading-how-to-create-pandas-dataframe-in-python"><strong>How to create Pandas Dataframe in Python?</strong></h3><p>We can create a Pandas Dataframe from a dictionary or list:</p><pre><code class="lang-python"><span class="hljs-comment">#Example 1</span>
<span class="hljs-comment">#Importing libraries</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-comment">#Creating one dimensional list</span>
demo_list = [<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>]
<span class="hljs-comment">#Creating Dataframe from list</span>
pd.DataFrame(demo_list)</code></pre><pre><code class="lang-python"><span class="hljs-comment">#Example 2</span>
<span class="hljs-comment">#Importing libraries</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-comment">#Creating two dimensional list</span>
demo_list = [[<span class="hljs-string">'Roy'</span>,<span class="hljs-number">1</span>],[<span class="hljs-string">'Jason'</span>,<span class="hljs-number">2</span>],[<span class="hljs-string">'Sancho'</span>,<span class="hljs-number">3</span>]]
<span class="hljs-comment">#Creating dataframe from list</span>
demo_dataframe = pd.DataFrame(demo_list,columns=[<span class="hljs-string">'Name'</span>,<span class="hljs-string">'Roll No'</span>])</code></pre><pre><code class="lang-python"><span class="hljs-comment">#Example 3 - Using a dictionary</span>
<span class="hljs-comment">#Importing libraries</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-comment">#Creating dictionary</span>
demo_dict = {<span class="hljs-string">"Name"</span>:[<span class="hljs-string">"Roy"</span>,<span class="hljs-string">"Jason"</span>,<span class="hljs-string">"Sancho"</span>],<span class="hljs-string">"Roll No"</span>:[<span class="hljs-number">1</span>,<span
class="hljs-number">2</span>,<span class="hljs-number">3</span>]}
<span class="hljs-comment">#Creating dataframe from dictionary</span>
demo_dataframe = pd.DataFrame(demo_dict)</code></pre><h3 id="heading-how-to-add-a-column-to-pandas-dataframe-in-python"><strong>How to add a column to Pandas Dataframe in Python?</strong></h3><p>Let's assume we have a "Students" Dataframe having two columns, "Name" and "Roll No". Now we want to add a new column, "Marks", to the pre-existing Students Dataframe.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-comment">#Creating dictionary</span>
demo_dict = {<span class="hljs-string">"Name"</span>:[<span class="hljs-string">"Roy"</span>,<span class="hljs-string">"Jason"</span>,<span class="hljs-string">"Sancho"</span>],<span class="hljs-string">"Roll No"</span>:[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>]}
<span class="hljs-comment">#Creating dataframe from dictionary</span>
demo_dataframe = pd.DataFrame(demo_dict)
print(<span class="hljs-string">"Original dataframe =>\n"</span>,demo_dataframe)
<span class="hljs-comment">#Adding new column "Marks"</span>
demo_dataframe[<span class="hljs-string">"Marks"</span>] = [<span class="hljs-number">90</span>,<span class="hljs-number">87</span>,<span class="hljs-number">95</span>]
<span class="hljs-comment">#Dataframe after adding a new column.</span>
demo_dataframe</code></pre><h3 id="heading-how-to-append-a-pandas-dataframe-to-another-dataframe-in-python"><strong>How to append a Pandas Dataframe to another Dataframe in Python?</strong></h3><p>The append() function adds rows from another Dataframe to the end of the current Dataframe and returns a new Dataframe object. 
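</p><p>Note: <code>DataFrame.append()</code> was deprecated in pandas 1.4 and removed in pandas 2.0; <code>pd.concat()</code> is its replacement. A minimal sketch (the frame names here are illustrative, not from the example below):</p>

```python
#Importing library
import pandas as pd

#Two dataframes with the same columns (illustrative data)
demo_df1 = pd.DataFrame({'Name': ['A', 'B'], 'Roll No': [1, 2]})
demo_df2 = pd.DataFrame({'Name': ['C', 'D'], 'Roll No': [3, 4]})

#pd.concat stacks the rows of demo_df2 under demo_df1
#(the modern replacement for the deprecated append())
result = pd.concat([demo_df1, demo_df2], ignore_index=True)
print(result)
```

<p>Passing <code>ignore_index=True</code> renumbers the rows of the result from 0. </p><p>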
Columns not present in the original Dataframes are created as new columns, and the new cells are filled with a NaN value.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-comment">#Creating dataframe1</span>
demo_dataframe1 = pd.DataFrame({<span class="hljs-string">'Name'</span>: [<span class="hljs-string">'A'</span>, <span class="hljs-string">'B'</span>, <span class="hljs-string">'C'</span>, <span class="hljs-string">'D'</span>], <span class="hljs-string">'Roll No'</span>: [<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>,<span class="hljs-number">4</span>]})
<span class="hljs-comment">#Creating dataframe2</span>
demo_dataframe2 = pd.DataFrame({<span class="hljs-string">'Name'</span>: [<span class="hljs-string">'E'</span>, <span class="hljs-string">'F'</span>, <span class="hljs-string">'G'</span>, <span class="hljs-string">'H'</span>], <span class="hljs-string">'Roll No'</span>: [<span class="hljs-number">5</span>,<span class="hljs-number">6</span>,<span class="hljs-number">7</span>,<span class="hljs-number">8</span>]})
<span class="hljs-comment">#Appending dataframe2 to dataframe1</span>
demo_dataframe1.append(demo_dataframe2)</code></pre><h3 id="heading-how-to-sort-pandas-dataframe-in-python"><strong>How to sort Pandas Dataframe in Python?</strong></h3><p>The Dataframe.sort_values() function is used to sort a Dataframe object in ascending or descending order according to a set of criteria. The function also gives you the option of using a sorting algorithm of your choice.</p><p>Let's assume we have a "Students" Dataframe. 
Now we want to sort this data frame according to the "Marks" column.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd<span class="hljs-comment">#Creating dictionary</span>demo_dict = {<span class="hljs-string">"Name"</span>:[<span class="hljs-string">"Roy"</span>,<span class="hljs-string">"Jason"</span>,<span class="hljs-string">"Sancho"</span>],<span class="hljs-string">"Roll No"</span>:[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>],<span class="hljs-string">"Marks"</span>:[<span class="hljs-number">90</span>,<span class="hljs-number">87</span>,<span class="hljs-number">95</span>]}<span class="hljs-comment">#Creating dataframe from dictionary</span>demo_dataframe = pd.DataFrame(demo_dict)print(<span class="hljs-string">"Original dataframe =>\n"</span>,demo_dataframe)<span class="hljs-comment">#Sorting on the basis of the "Marks" column in ascending order.</span>demo_dataframe = demo_dataframe.sort_values(by=<span class="hljs-string">"Marks"</span>)print(<span class="hljs-string">"\n Sorted dataframe in ascending order of Marks =>\n"</span>,demo_dataframe)<span class="hljs-comment">#Sorting on the basis of the "Marks" column in descending order.</span>demo_dataframe = demo_dataframe.sort_values(by=<span class="hljs-string">"Marks"</span>,ascending=<span class="hljs-literal">False</span>)print(<span class="hljs-string">"\n Sorted dataframe in descending order of Marks =>\n"</span>,demo_dataframe)</code></pre><h3 id="heading-how-to-export-pandas-dataframe-to-csv-in-python"><strong>How to export Pandas Dataframe to CSV in Python?</strong></h3><p>We can use the to_csv() function to export Pandas Dataframe to CSV.</p><p>Let's assume we have a "Students" Dataframe. 
Now we want to export this Dataframe to CSV.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd<span class="hljs-comment">#Creating dictionary</span>demo_dict = {<span class="hljs-string">"Name"</span>:[<span class="hljs-string">"Roy"</span>,<span class="hljs-string">"Jason"</span>,<span class="hljs-string">"Sancho"</span>],<span class="hljs-string">"Roll No"</span>:[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>],<span class="hljs-string">"Marks"</span>:[<span class="hljs-number">90</span>,<span class="hljs-number">87</span>,<span class="hljs-number">95</span>]}<span class="hljs-comment">#Creating dataframe from dictionary</span>demo_dataframe = pd.DataFrame(demo_dict)<span class="hljs-comment">#Exporting dataframe to "Student.csv" file.</span>demo_dataframe.to_csv(<span class="hljs-string">"Student.csv"</span>)</code></pre><h3 id="heading-what-is-pandas-dataframe-index"><strong>What is Pandas Dataframe index?</strong></h3><p>Indexing, also called subset selection, involves picking specific rows and columns of data from a DataFrame- you can either select all rows and a few columns, all columns and a few rows, or a few rows and columns as needed.</p><p>Let's assume we have a "Students" Dataframe. 
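</p><p>Alongside the positional <code>iloc</code> selections shown in this section, Pandas also supports label-based selection with <code>loc</code>. A small sketch, assuming a hypothetical frame indexed by name:</p>

```python
#Importing library
import pandas as pd

#Hypothetical "Students" frame indexed by name
demo_dataframe = pd.DataFrame(
    {"Roll No": [1, 2, 3], "Marks": [90, 87, 95]},
    index=["Roy", "Jason", "Sancho"],
)

#Selecting a single row by its index label
print(demo_dataframe.loc["Jason"])

#Selecting specific rows and a column by label
print(demo_dataframe.loc[["Roy", "Sancho"], "Marks"])
```

<p><code>loc</code> selects by index/column labels, while <code>iloc</code> selects by integer positions. </p><p>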
Now Let's play with the index to get some part of the data from a Dataframe.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd<span class="hljs-comment">#Creating dictionary</span>demo_dict = {<span class="hljs-string">"Name"</span>:[<span class="hljs-string">"Roy"</span>,<span class="hljs-string">"Jason"</span>,<span class="hljs-string">"Sancho"</span>],<span class="hljs-string">"Roll No"</span>:[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>],<span class="hljs-string">"Marks"</span>:[<span class="hljs-number">90</span>,<span class="hljs-number">87</span>,<span class="hljs-number">95</span>]}<span class="hljs-comment">#Creating dataframe from dictionary</span>demo_dataframe = pd.DataFrame(demo_dict)demo_dataframe<span class="hljs-comment">#Selecting only the "Name" column</span>demo_dataframe[<span class="hljs-string">'Name'</span>]<span class="hljs-comment">#Selecting only 2nd row</span>demo_dataframe.iloc[<span class="hljs-number">1</span>,:]<span class="hljs-comment">#Selecting top 2 rows</span>demo_dataframe.iloc[<span class="hljs-number">0</span>:<span class="hljs-number">2</span>,:]</code></pre><h3 id="heading-how-to-read-csv-files-using-pandas-in-python"><strong>How to read CSV files using Pandas in Python?</strong></h3><p>We can use the read_csv() function to read CSV files in Pandas.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd<span class="hljs-comment">#Reading data using read_csv() function</span>data = pd.read_csv(<span class="hljs-string">"https://data.hemath.com/access/file_csv/euro_vision.csv"</span>)</code></pre><h2 id="heading-python-pandas-aggregations"><strong>Python Pandas- Aggregations</strong></h2><h3 
id="heading-how-to-use-pandas-to-perform-aggregations-such-as-sum-min-max-on-dataframe-columns-in-python"><strong>How to use Pandas to perform aggregations such as sum, min, max on Dataframe columns in Python?</strong></h3><p>Python Pandas provides built-in functions to perform aggregations on Dataframe columns. To find the minimum and maximum elements, we can use the min() and max() functions, respectively. The sum() function can be used to find the sum of elements.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd<span class="hljs-comment">#Reading data using read_csv() function</span>data = pd.read_csv(<span class="hljs-string">"https://data.hemath.com/access/file_csv/euro_vision.csv"</span>)<span class="hljs-comment">#Displaying data</span>data<span class="hljs-comment">#Sum aggregation function</span>data[<span class="hljs-string">'Points'</span>].sum()<span class="hljs-comment">#Min aggregation function</span>data[<span class="hljs-string">'Points'</span>].min()<span class="hljs-comment">#Max aggregation function</span>data[<span class="hljs-string">'Points'</span>].max()</code></pre><h3 id="heading-how-to-apply-the-pandas-group-by-aggregation-in-python"><strong>How to apply the Pandas group by aggregation in Python?</strong></h3><p>In Pandas, the groupby operation combines or splits the data frame object by applying some function and combining the results obtained. 
The groupby function is used to group large amounts of data and perform computations on the groups created.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd<span class="hljs-comment">#Reading data using read_csv() function</span>data = pd.read_csv(<span class="hljs-string">"https://data.hemath.com/access/file_csv/euro_vision.csv"</span>)<span class="hljs-comment">#Displaying data</span>data<span class="hljs-comment">#Find out total points scored by each category of gender.</span>data[[<span class="hljs-string">'Artist.gender'</span>,<span class="hljs-string">'Points'</span>]].groupby(<span class="hljs-string">'Artist.gender'</span>).sum()<span class="hljs-comment">#Find out the minimum point scored by each category of gender.</span>data[[<span class="hljs-string">'Artist.gender'</span>,<span class="hljs-string">'Points'</span>]].groupby(<span class="hljs-string">'Artist.gender'</span>).min()<span class="hljs-comment">#Find out the maximum point scored by each category of gender.</span>data[[<span class="hljs-string">'Artist.gender'</span>,<span class="hljs-string">'Points'</span>]].groupby(<span class="hljs-string">'Artist.gender'</span>).max()</code></pre><h3 id="heading-how-to-use-pandas-to-apply-groupby-with-two-different-aggregationssum-and-mean-on-dataframe-columns-in-python"><strong>How to use Pandas to apply groupby with two different aggregations(sum and mean) on Dataframe columns in Python?</strong></h3><p>Here, we demonstrate the use of grouping the dataset to perform further aggregations based on the grouping. 
</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-comment">#Reading data using read_csv() function</span>
data = pd.read_csv(<span class="hljs-string">"https://data.hemath.com/access/file_csv/euro_vision.csv"</span>)
<span class="hljs-comment">#Displaying data</span>
data
<span class="hljs-comment">#Find out mean points scored as well as total points scored by each category of gender.</span>
data.groupby(<span class="hljs-string">"Artist.gender"</span>).agg({<span class="hljs-string">"Points"</span>: [<span class="hljs-string">"mean"</span>, <span class="hljs-string">"sum"</span>]})</code></pre><h3 id="heading-how-to-apply-pandas-median-aggregation-in-python"><strong>How to apply Pandas median aggregation in Python?</strong></h3><p>The Pandas library has the built-in function median() that can find the median value in a particular column.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-comment">#Reading data using read_csv() function</span>
data = pd.read_csv(<span class="hljs-string">"https://data.hemath.com/access/file_csv/euro_vision.csv"</span>)
<span class="hljs-comment">#Displaying data</span>
data
<span class="hljs-comment">#Median aggregation function</span>
data[<span class="hljs-string">'Points'</span>].median()</code></pre><h2 id="heading-python-pandas-missing-data"><strong>Python Pandas - Missing Data</strong></h2><h3 id="heading-how-to-find-missing-values-in-pandas-dataframe"><strong>How to find missing values in Pandas Dataframe?</strong></h3><p>Once data is gathered, it is often found that there are several missing values that can interfere with the analysis. 
It is essential first to identify the missing values so that they can be handled accordingly.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd<span class="hljs-comment">#Reading data using read_csv() function</span>data = pd.read_csv(<span class="hljs-string">"https://data.hemath.com/access/file_csv/euro_vision.csv"</span>)<span class="hljs-comment">#Displaying data</span>data<span class="hljs-comment">#Checking for missing values</span>data.isna().sum()</code></pre><h3 id="heading-how-to-remove-missing-data-using-pandas-in-python"><strong>How to remove missing data using Pandas in Python?</strong></h3><p>Null values appear as NaN in Dataframe when a CSV file contains null values. The dropna() method in Pandas allows the user to evaluate and drop Null Rows/Columns in various methods.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd<span class="hljs-comment">#Reading data using read_csv() function</span>data = pd.read_csv(<span class="hljs-string">"https://data.hemath.com/access/file_csv/euro_vision.csv"</span>)<span class="hljs-comment">#Displaying data</span>data<span class="hljs-comment">#Checking for missing values</span>data.isna().sum()</code></pre><p>There are so many missing values present in the data. We have 422 missing values in both 'Artist.gender' and 'Group.Solo' columns. There are 367 missing values in the '<a target="_blank" href="http://Semi.Final">Semi.Final</a>.Number' column. And so on. 
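</p><p>Beyond the raw counts from <code>isna().sum()</code>, the per-column <em>fraction</em> of missing values is often more telling; a minimal sketch on toy data (the real dataset would be the CSV loaded above):</p>

```python
#Importing libraries
import pandas as pd
import numpy as np

#Toy dataframe with some missing values
data = pd.DataFrame({"A": [1, np.nan, 3, np.nan], "B": [1, 2, 3, 4]})

#isna().mean() gives the fraction of missing values per column
print(data.isna().mean())
```

<p>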
Let's see how we can remove all missing values.</p><pre><code class="lang-python"><span class="hljs-comment">#Remove missing values.</span>data.dropna(inplace=<span class="hljs-literal">True</span>)<span class="hljs-comment">#Checking for missing values</span>data.isna().sum()</code></pre><h3 id="heading-how-to-iterate-over-rows-to-find-missing-data-in-pandas-dataframe"><strong>How to iterate over rows to find missing data in Pandas Dataframe?</strong></h3><p>You can use the iterrows() function to find missing values by iterating over the rows in a Python Pandas data frame.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd<span class="hljs-comment">#Reading data using read_csv() function</span>data = pd.read_csv(<span class="hljs-string">"https://data.hemath.com/access/file_csv/euro_vision.csv"</span>)<span class="hljs-comment">#Displaying data</span>data<span class="hljs-comment">#Let's try to find out which rows contain missing data in the 'Artist.gender' column.</span><span class="hljs-keyword">for</span> i,d <span class="hljs-keyword">in</span> data.iterrows():<span class="hljs-keyword">if</span> pd.isna(d[<span class="hljs-string">'Artist.gender'</span>]):print(<span class="hljs-string">"Row number => "</span>,i,<span class="hljs-string">"Data => "</span>,d[<span class="hljs-string">'Artist.gender'</span>])</code></pre><h3 id="heading-how-to-drop-rows-which-contain-missing-data-in-pandas-dataframe"><strong>How to drop rows which contain missing data in Pandas Dataframe?</strong></h3><p>Null values appear as NaN in Data Frame when a CSV file contains null values. The dropna() method in Pandas allows the user to evaluate and drop Null Rows/Columns in various methods. 
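</p><p>dropna() also accepts parameters that control exactly which rows are dropped; a short sketch of the common ones, on toy data:</p>

```python
#Importing libraries
import pandas as pd
import numpy as np

#Toy dataframe with scattered missing values
df = pd.DataFrame({"A": [1, np.nan, 3],
                   "B": [np.nan, np.nan, 3],
                   "C": [1, 2, 3]})

#Drop rows where every value is missing (none here)
print(df.dropna(how="all"))

#Keep only rows with at least 2 non-missing values
print(df.dropna(thresh=2))

#Only consider column 'A' when deciding which rows to drop
print(df.dropna(subset=["A"]))
```

<p>Here how="all" drops only rows where every value is missing, thresh keeps rows with at least that many non-missing values, and subset restricts the check to the named columns. </p><p>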
The dropna() method removes all the rows which contain at least one missing data.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd<span class="hljs-comment">#Reading data using read_csv() function</span>data = pd.read_csv(<span class="hljs-string">"https://data.hemath.com/access/file_csv/euro_vision.csv"</span>)<span class="hljs-comment">#Displaying data</span>data<span class="hljs-comment">#Checking for missing values</span>data.isna().sum()</code></pre><p>There are so many missing values present in data. We have 422 missing values in 'Artist.gender' and 'Group.Solo' columns. There are 367 missing values in '<a target="_blank" href="http://Semi.Final">Semi.Final</a>.Number' column. And so on. Let's see how we can remove all missing values.</p><pre><code class="lang-python"><span class="hljs-comment">#Remove missing values.</span>data.dropna(inplace=<span class="hljs-literal">True</span>)<span class="hljs-comment">#Checking for missing values</span>data.isna().sum()</code></pre><h3 id="heading-how-to-transform-missing-data-in-pandas-dataframe"><strong>How to transform missing data in Pandas Dataframe?</strong></h3><p>The fillna() method is used to replace missing data with some value. For numerical columns, we can replace missing values either with mean or median. 
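</p><p>For instance, median imputation (often preferred over the mean when a column has outliers) can be sketched as follows on a toy column:</p>

```python
#Importing libraries
import pandas as pd
import numpy as np

#Toy numerical column with an outlier and a missing entry
s = pd.Series([1.0, 2.0, np.nan, 100.0])

#Median of the non-missing values
median = s.median()

#Replace the missing entry with the median
filled = s.fillna(median)
print(filled)
```

<p>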
For categorical columns, we can replace missing values with mode.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd<span class="hljs-comment">#Reading data using read_csv() function</span>data = pd.read_csv(<span class="hljs-string">"https://data.hemath.com/access/file_csv/euro_vision.csv"</span>)<span class="hljs-comment">#Displaying data</span>data<span class="hljs-comment">#Checking for missing values</span>data.isna().sum()</code></pre><p>As we can see, we have 422 missing values in Artist.gender column, which is a categorical variable. We will replace all the missing values with the mode of this column.</p><pre><code class="lang-python"><span class="hljs-comment">#Finding out the Mode of 'Artist.gender' column.</span>mode = data[<span class="hljs-string">'Artist.gender'</span>].mode()[<span class="hljs-number">0</span>]<span class="hljs-comment">#Replacing all the missing value with Mode of 'Artist.gender' column.</span>data[<span class="hljs-string">'Artist.gender'</span>].fillna(value=mode,inplace=<span class="hljs-literal">True</span>)<span class="hljs-comment">#Checking for missing values</span>data.isna().sum()</code></pre><p>Now let's see one numerical column. We have 344 missing values in the Happiness column, which is a numerical variable. 
We will replace all the missing values with the mean of this column.</p><pre><code class="lang-python"><span class="hljs-comment">#Finding out the Mean of the 'Happiness' column.</span>mean = data[<span class="hljs-string">'Happiness'</span>].mean()<span class="hljs-comment">#Replacing all the missing values with the mean of the 'Happiness' column.</span>data[<span class="hljs-string">'Happiness'</span>].fillna(value=mean,inplace=<span class="hljs-literal">True</span>)<span class="hljs-comment">#Checking for missing values</span>data.isna().sum()</code></pre><h3 id="heading-how-to-replace-the-mean-of-the-column-with-missing-data-in-the-pandas-dataframe"><strong>How to replace the mean of the column with missing data in the Pandas Dataframe?</strong></h3><p>The fillna() method is used to replace missing data with some value. For numerical columns, we can replace missing values either with mean or median. For the categorical column, we can replace missing values with mode.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd<span class="hljs-comment">#Reading data using read_csv() function</span>data = pd.read_csv(<span class="hljs-string">"https://data.hemath.com/access/file_csv/euro_vision.csv"</span>)<span class="hljs-comment">#Displaying data</span>data<span class="hljs-comment">#Checking for missing values</span>data.isna().sum()</code></pre><p>As we can see, we have 344 missing values in the Happiness column, which is a numerical variable. 
We will replace all the missing values with the Mean of this column.</p><pre><code class="lang-python"><span class="hljs-comment">#Finding out the Mean of 'Happiness' column.</span>mean = data[<span class="hljs-string">'Happiness'</span>].mean()<span class="hljs-comment">#Replacing all the missing values with the mean of the 'Happiness' column.</span>data[<span class="hljs-string">'Happiness'</span>].fillna(value=mean,inplace=<span class="hljs-literal">True</span>)<span class="hljs-comment">#Checking for missing values</span>data.isna().sum()</code></pre><h3 id="heading-how-to-impute-values-if-there-are-missing-values-in-the-particular-column-in-pandas"><strong>How to impute values if there are missing values in the particular column in Pandas?</strong></h3><p>The fillna() method is used to replace missing data with some value. For numerical columns, we can replace missing values either with mean or median. For the categorical column, we can replace missing values with mode.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd<span class="hljs-comment">#Reading data using read_csv() function</span>data = pd.read_csv(<span class="hljs-string">"https://data.hemath.com/access/file_csv/euro_vision.csv"</span>)<span class="hljs-comment">#Displaying data</span>data<span class="hljs-comment">#Checking for missing values</span>data.isna().sum()</code></pre><p>As we can see, we have 422 missing values in the Artist.gender column, which is a categorical variable. 
We will replace all the missing values with the mode of this column.</p><pre><code class="lang-python"><span class="hljs-comment">#Finding out the Mode of 'Artist.gender' column.</span>mode = data[<span class="hljs-string">'Artist.gender'</span>].mode()[<span class="hljs-number">0</span>]<span class="hljs-comment">#Replacing all the missing values with Mode of 'Artist.gender' column.</span>data[<span class="hljs-string">'Artist.gender'</span>].fillna(value=mode,inplace=<span class="hljs-literal">True</span>)<span class="hljs-comment">#Checking for missing values</span>data.isna().sum()</code></pre><p>Now let's see one numerical column. We have 344 missing values in the Happiness column, which is a numerical variable. We will replace all the missing values with the Mean of this column.</p><pre><code class="lang-python"><span class="hljs-comment">#Finding out the mean of the 'Happiness' column.</span>mean = data[<span class="hljs-string">'Happiness'</span>].mean()<span class="hljs-comment">#Replacing all the missing values with the mean of the 'Happiness' column.</span>data[<span class="hljs-string">'Happiness'</span>].fillna(value=mean,inplace=<span class="hljs-literal">True</span>)<span class="hljs-comment">#Checking for missing values</span>data.isna().sum()</code></pre><h2 id="heading-python-pandas-reindexing"><strong>Python Pandas - Reindexing</strong></h2><h3 id="heading-how-to-reindex-a-dataframe-in-pandas"><strong>How to reindex a Dataframe in Pandas?</strong></h3><p>A DataFrame's row and column labels are changed when it is reindexed. 
The term "reindex" refers to the process of aligning data to a specific set of labels along a single axis.</p><p>It rearranges the data to correspond to a new set of labels.</p><p>It adds missing value (NA) markers to label positions if there is no data for the label.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
index = [<span class="hljs-number">1</span>, <span class="hljs-number">2</span>, <span class="hljs-number">3</span>]
<span class="hljs-comment">#Creating dictionary</span>
demo_dict = {<span class="hljs-string">"Name"</span>:[<span class="hljs-string">"Roy"</span>,<span class="hljs-string">"Jason"</span>,<span class="hljs-string">"Sancho"</span>],<span class="hljs-string">"Marks"</span>:[<span class="hljs-number">90</span>,<span class="hljs-number">95</span>,<span class="hljs-number">90</span>]}
<span class="hljs-comment">#Creating dataframe from dictionary</span>
demo_dataframe = pd.DataFrame(demo_dict,index=index)
<span class="hljs-comment">#Reindexing</span>
new_index = [<span class="hljs-number">2</span>, <span class="hljs-number">1</span>, <span class="hljs-number">3</span>]
demo_dataframe.reindex(new_index)</code></pre><h3 id="heading-how-to-reset-the-index-using-concatenation-in-pandas"><strong>How to reset the index using concatenation in Pandas?</strong></h3><p>We can reset the index using the concat() function as well. 
Example-</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pddemo_df1 = pd.DataFrame([[<span class="hljs-number">1</span>,<span class="hljs-string">'Roy'</span>,<span class="hljs-number">90</span>],[<span class="hljs-number">2</span>,<span class="hljs-string">'James'</span>,<span class="hljs-number">95</span>]],columns=[<span class="hljs-string">'Roll no'</span>, <span class="hljs-string">'Name'</span>, <span class="hljs-string">'Marks'</span>])demo_df2 = pd.DataFrame([[<span class="hljs-number">3</span>,<span class="hljs-string">'Cris'</span>,<span class="hljs-number">98</span>],[<span class="hljs-number">4</span>,<span class="hljs-string">'Jeff'</span>,<span class="hljs-number">80</span>]],columns=[<span class="hljs-string">'Roll no'</span>, <span class="hljs-string">'Name'</span>, <span class="hljs-string">'Marks'</span>])<span class="hljs-comment">#reset index while concatenation</span>final_df = pd.concat([demo_df1, demo_df2], ignore_index=<span class="hljs-literal">True</span>)<span class="hljs-comment">#print dataframe</span>print(final_df)</code></pre><h3 id="heading-how-to-reset-index-after-sorting-in-pandas-dataframe"><strong>How to reset index after sorting in Pandas Dataframe?</strong></h3><p>The reset_index() function reset the index of the data frame and use the default one.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd<span class="hljs-comment">#Reading data using read_csv() function</span>data = pd.read_csv(<span class="hljs-string">"https://data.hemath.com/access/file_csv/euro_vision.csv"</span>)<span class="hljs-comment">#Displaying data</span>data<span class="hljs-comment">#Let's try to sort our data on the basis of the 'Place' column.</span>data = data.sort_values(by=<span 
class="hljs-string">"Place"</span>)data.head()<span class="hljs-comment">#Resetting index</span><span class="hljs-comment">#Also add, 'drop=True' indicates that you wish to drop the existing index rather than adding it as a new column to your dataframe.</span>data = data.reset_index(drop=<span class="hljs-literal">True</span>)</code></pre><h2 id="heading-python-pandas-categorical-data"><strong>Python Pandas - Categorical Data</strong></h2><h3 id="heading-how-to-perform-binning-on-categorical-data-in-pandas-dataframe"><strong>How to perform binning on categorical data in Pandas Dataframe?</strong></h3><p>When working with continuous numeric data, it's common to divide it into different buckets for additional analysis. The cut function is used to convert data to a set of discrete buckets.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd<span class="hljs-comment">#Reading data using read_csv() function</span>data = pd.read_csv(<span class="hljs-string">"https://data.hemath.com/access/file_csv/euro_vision.csv"</span>)<span class="hljs-comment">#Displaying data</span>data<span class="hljs-comment">#Let's remove all the rows which contain missing data</span>data.dropna(inplace=<span class="hljs-literal">True</span>)</code></pre><p>As we can see, 'Happiness' column contains continuous numeric data. Let's try to divide it into different buckets/bins using the cut function. 
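</p><p>Before applying it to the dataset, here is how cut behaves on a small hypothetical Series with explicit bin edges (the ages and bins below are invented purely for illustration):</p>

```python
import pandas as pd

# Hypothetical ages, purely for illustration
ages = pd.Series([5, 15, 25, 35, 60])

# Three explicit bins: (0, 18], (18, 40], (40, 100]
groups = pd.cut(ages, bins=[0, 18, 40, 100], labels=["child", "adult", "senior"])
print(groups.tolist())
```

<p>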
We are creating a new column, 'Happiness_bins', where the Happiness data is divided into Low, Medium and High categories.</p><pre><code class="lang-python"><span class="hljs-comment">#Performing binning</span>
data[<span class="hljs-string">'Happiness_bins'</span>] = pd.cut(data[<span class="hljs-string">'Happiness'</span>],<span class="hljs-number">3</span>,labels=[<span class="hljs-string">'Low'</span>,<span class="hljs-string">'Medium'</span>,<span class="hljs-string">'High'</span>])
data.head()</code></pre><h3 id="heading-how-to-list-categorical-variables-in-the-data-in-pandas-dataframe"><strong>How to list categorical variables in the data in Pandas Dataframe?</strong></h3><p>Here, we place all the columns that are not numerical into the list of categorical columns. The final list of categorical variables contains the names of all the non-numerical columns.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-comment">#Reading data using read_csv() function</span>
data = pd.read_csv(<span class="hljs-string">"https://data.hemath.com/access/file_csv/euro_vision.csv"</span>)
<span class="hljs-comment">#Displaying data</span>
data
<span class="hljs-comment">#List of all columns</span>
all_columns = list(data.columns)
<span class="hljs-comment">#List of only numeric columns</span>
numerical_columns = list(data._get_numeric_data().columns)
<span class="hljs-comment">#Columns that are not in numerical_columns are categorical columns</span>
<span class="hljs-comment">#Creating an empty list for categorical columns.</span>
categorical_columns = list()
<span class="hljs-keyword">for</span> column <span class="hljs-keyword">in</span> all_columns:
    <span class="hljs-comment">#Checking if the column is not in the list of numerical_columns.</span>
    <span class="hljs-keyword">if</span> column <span class="hljs-keyword">not</span> <span class="hljs-keyword">in</span> numerical_columns:
        <span class="hljs-comment">#Appending it to the categorical_columns list</span>
        categorical_columns.append(column)
<span class="hljs-comment">#List of only categorical columns</span>
categorical_columns</code></pre><h3 id="heading-how-to-convert-categorical-data-into-dummy-variables-in-pandas-dataframe"><strong>How to convert categorical data into dummy variables in Pandas Dataframe?</strong></h3><p>The get_dummies() function converts categorical data into dummy or indicator variables.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-comment">#Reading data using read_csv() function</span>
data = pd.read_csv(<span class="hljs-string">"https://data.hemath.com/access/file_csv/euro_vision.csv"</span>)
<span class="hljs-comment">#Displaying data</span>
data
pd.get_dummies(data[<span class="hljs-string">'Artist.gender'</span>])</code></pre><h3 id="heading-how-to-convert-categorical-data-to-numeric-using-catcodeshttpcatcodes-in-pandas-dataframe"><strong>How to convert categorical data to numeric using cat.codes in Pandas Dataframe?</strong></h3><p>The cat.codes accessor converts categorical data into numeric codes. We can only apply cat.codes to columns that have the data type 'category'.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-comment">#Reading data using read_csv() function</span>
data = pd.read_csv(<span class="hljs-string">"https://data.hemath.com/access/file_csv/euro_vision.csv"</span>)
<span class="hljs-comment">#Displaying data</span>
data
<span class="hljs-comment">#Removing all rows having missing values.</span>
data.dropna(inplace=<span class="hljs-literal">True</span>)
<span class="hljs-comment">#Converting data type of 'Artist.gender' column into category</span>
data[<span class="hljs-string">'Artist.gender'</span>] = data[<span class="hljs-string">'Artist.gender'</span>].astype(<span class="hljs-string">'category'</span>)
<span class="hljs-comment">#Converting categorical data to numeric using cat.codes</span>
data[<span class="hljs-string">'Artist.gender_codes'</span>] = data[<span class="hljs-string">'Artist.gender'</span>].cat.codes
data.head()</code></pre><h3 id="heading-how-to-plot-a-countplot-for-categorical-data-in-a-pandas-dataframe"><strong>How to plot a countplot for categorical data in a Pandas Dataframe?</strong></h3><p>The value_counts() function returns a Series containing the counts of unique values. The resulting object is sorted in descending order, so the first element is the most common value.
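</p><p>A quick sketch on a small hypothetical Series (the values below are made up) shows both behaviours side by side:</p>

```python
import numpy as np
import pandas as pd

s = pd.Series(["M", "F", "M", np.nan])

# NaN is excluded by default
counts = s.value_counts()
print(counts)

# Pass dropna=False to count NaN as a category too
counts_with_na = s.value_counts(dropna=False)
print(counts_with_na)
```

<p>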
By default, NA values are excluded.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-comment">#Reading data using read_csv() function</span>
data = pd.read_csv(<span class="hljs-string">"https://data.hemath.com/access/file_csv/euro_vision.csv"</span>)
<span class="hljs-comment">#Displaying data</span>
data</code></pre><p>Let's try to plot a countplot for the 'Artist.gender' column.</p><pre><code class="lang-python"><span class="hljs-comment">#Printing the value count of each category</span>
print(data[<span class="hljs-string">'Artist.gender'</span>].value_counts())
<span class="hljs-comment">#Plotting countplot</span>
data[<span class="hljs-string">'Artist.gender'</span>].value_counts().plot(kind=<span class="hljs-string">"bar"</span>)</code></pre><h2 id="heading-data-visualization-with-pandas"><strong>Data visualization with Pandas</strong></h2><p>Data visualization is an essential part of exploratory data analysis. When it comes to delivering an overview or summary of data, it is more effective than numbers alone. Data visualizations help us comprehend the underlying structure of a dataset and investigate the correlations between variables. Let's see how to visualize data with Pandas.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-comment">#Reading data using read_csv() function</span>
data = pd.read_csv(<span class="hljs-string">"https://data.hemath.com/access/file_csv/euro_vision.csv"</span>)
<span class="hljs-comment">#Displaying data</span>
data
<span class="hljs-comment">#Scatter plot</span>
data.plot(x=<span class="hljs-string">'Place'</span>, y=<span class="hljs-string">'Points'</span>, kind=<span class="hljs-string">'scatter'</span>, figsize=(<span class="hljs-number">10</span>,<span class="hljs-number">6</span>), title=<span class="hljs-string">'Place x Points'</span>)
<span class="hljs-comment">#Histogram</span>
data[<span class="hljs-string">'Points'</span>].plot(kind=<span class="hljs-string">'hist'</span>, figsize=(<span class="hljs-number">10</span>,<span class="hljs-number">6</span>), title=<span class="hljs-string">'Distribution of Points'</span>)
<span class="hljs-comment">#Box plot</span>
data.boxplot(column=<span class="hljs-string">'Points'</span>, by=<span class="hljs-string">'Artist.gender'</span>, figsize=(<span class="hljs-number">10</span>,<span class="hljs-number">6</span>))
<span class="hljs-comment">#Bar plot</span>
data[<span class="hljs-string">'Artist.gender'</span>].value_counts().plot(kind=<span class="hljs-string">'bar'</span>)</code></pre>]]><![CDATA[<p>In this blog, we will start with the basics of the famous Python library called Pandas and gradually advance to complex and advanced topics. Let us begin this tutorial on Pandas with a brief introduction to what the library is all about.</p><h2 id="heading-whats-pandas-in-python-for"><strong>What's Pandas in Python for?</strong></h2><p>Data Science involves collecting, storing, and aggregating data, followed by its cleaning, exploration, and analysis.
There is a heavy emphasis on the cleaning of data before it can be processed further. As a result, care is taken to perform a thorough exploratory data analysis to generate a dataset of the utmost quality. Python offers the Pandas library, built with features that support data pre-processing throughout the data analysis lifecycle. A clean dataset is an excellent starting point for hypothesis testing and can also be used for further modeling and the application of data analysis and machine learning algorithms.</p><p>Developed by Wes McKinney, Pandas is a high-level data manipulation library built on the Python programming language. Python Pandas is a quick, powerful, versatile, easy-to-use open-source data analysis and manipulation tool. It is built on top of the NumPy package, and the dataframe is its primary data structure.</p><h2 id="heading-python-pandas-series"><strong>Python Pandas - Series</strong></h2><p>A Series is quite similar to a <a target="_blank" href="https://blog.hemath.com/lets-break-down-numpy-for-beginners">NumPy</a> array (it is built on top of the NumPy array object). What distinguishes a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a numeric position. It can also hold any arbitrary Python object rather than just numeric data.</p><h3 id="heading-create-a-pandas-series"><strong>Create a Pandas Series</strong></h3><p>We can convert a NumPy array, dictionary, or list to a Series:</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span>
<span class="hljs-keyword">import</span> numpy <span class="hljs-keyword">as</span> np
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
my_labels = [<span class="hljs-string">'x'</span>,<span class="hljs-string">'y'</span>,<span class="hljs-string">'z'</span>]
demo_list = [<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>]
demo_array = np.array([<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>])
demo_dict = {<span class="hljs-string">'x'</span>:<span class="hljs-number">1</span>,<span class="hljs-string">'y'</span>:<span class="hljs-number">2</span>,<span class="hljs-string">'z'</span>:<span class="hljs-number">3</span>}
<span class="hljs-comment">#Using a NumPy array</span>
pd.Series(data = demo_array,index = my_labels)</code></pre><h3 id="heading-what-is-the-main-difference-between-a-pandas-series-and-a-single-column-dataframe-in-python"><strong>What is the main difference between a Pandas series and a single-column Dataframe in Python?</strong></h3><p>A Pandas Series has only one dimension, while a DataFrame has two. As a result, a single-column DataFrame can have a name for its single column, whereas a Series cannot. In fact, every column of a DataFrame can be turned into a Series.</p><p>There are a few interesting points to consider -</p><p>1. Indexes and columns in Pandas Dataframes and Series make data access and retrieval simple. They are also mutable.</p><p>2. A column in a Dataframe is essentially a Series. Series operations are used when you simply wish to manipulate a single column of data.
They are commonly used in graph plotting.</p><p>3. Dataframes are often used to represent data in a tabular format. They simplify the analysis, extraction, and alteration of two-dimensional data.</p><h3 id="heading-how-to-convert-a-pandas-series-to-a-list-in-python"><strong>How to convert a Pandas Series to a list in Python?</strong></h3><p>To convert a Series to a list, use Pandas tolist(). The Series is initially of the type pandas.core.series.Series. It is transformed into a list by calling the tolist() method.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing library</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-comment">#Creating series</span>
demo_series = pd.Series([<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>,<span class="hljs-number">4</span>,<span class="hljs-number">5</span>])
<span class="hljs-comment">#Converting series to list</span>
demo_list = demo_series.tolist()
print(<span class="hljs-string">"Data type before converting = "</span>,type(demo_series))
print(<span class="hljs-string">"Data type after converting = "</span>,type(demo_list))</code></pre><h3 id="heading-how-to-convert-a-list-to-a-pandas-series-in-python"><strong>How to convert a list to a Pandas Series in Python?</strong></h3><p>We can directly convert a list to a Pandas Series by just passing the list object to Series.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing library</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-comment">#Creating list</span>
demo_list = [<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>]
<span class="hljs-comment">#Converting list to series</span>
demo_series = pd.Series(demo_list)
print(<span class="hljs-string">"Data type before converting = "</span>,type(demo_list))
print(<span class="hljs-string">"Data type after converting = "</span>,type(demo_series))</code></pre><h3 id="heading-how-to-convert-pandas-series-to-dataframe-in-python"><strong>How to convert Pandas series to Dataframe in Python?</strong></h3><p>To convert a Series to a Dataframe, use Pandas to_frame(). The Series is initially of the type pandas.core.series.Series. It is transformed into a Dataframe by calling the to_frame() method.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing library</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-comment">#Creating series</span>
demo_series = pd.Series([<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>,<span class="hljs-number">4</span>,<span class="hljs-number">5</span>])
<span class="hljs-comment">#Converting series to dataframe</span>
demo_dataframe = demo_series.to_frame()
print(<span class="hljs-string">"Data type before converting = "</span>,type(demo_series))
print(<span class="hljs-string">"Data type after converting = "</span>,type(demo_dataframe))</code></pre><h3 id="heading-how-to-sort-a-pandas-series-in-python"><strong>How to sort a Pandas series in Python?</strong></h3><p>The Series.sort_values() function is used to sort a Series object in ascending or descending order according to a set of criteria.
The function also gives you the option of choosing the underlying sorting algorithm.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing library</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-comment">#Creating series</span>
demo_series = pd.Series([<span class="hljs-number">23</span>,<span class="hljs-number">10</span>,<span class="hljs-number">5</span>,<span class="hljs-number">16</span>,<span class="hljs-number">30</span>])
print(<span class="hljs-string">"Original Series =>\n"</span>,demo_series)
<span class="hljs-comment">#Sorting series in ascending order</span>
asc_series = demo_series.sort_values()
<span class="hljs-comment">#Sorting series in descending order</span>
dsc_series = demo_series.sort_values(ascending=<span class="hljs-literal">False</span>)
<span class="hljs-comment">#To make changes in original series</span>
demo_series.sort_values(inplace=<span class="hljs-literal">True</span>)
print(<span class="hljs-string">"Sorted Series in ascending order =>\n"</span>,asc_series)
print(<span class="hljs-string">"Sorted Series in descending order =>\n"</span>,dsc_series)</code></pre><h2 id="heading-python-pandas-dataframes"><strong>Python Pandas - Dataframes</strong></h2><p>DataFrames are Pandas' workhorses, inspired by the data frames of the R programming language. A DataFrame can be thought of as a collection of Series objects that share the same index. A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labelled axes (rows and columns), in which data is organized in rows and columns in a tabular format.</p><p>Key features -</p><p>1. It can consist of columns with different data types.</p><p>2. We can perform arithmetic operations on rows and columns.</p><p>3. It is mutable.</p><h3 id="heading-how-to-create-pandas-dataframe-in-python"><strong>How to create Pandas Dataframe in Python?</strong></h3><p>We can create a Pandas Dataframe from a dictionary or list:</p><pre><code class="lang-python"><span class="hljs-comment">#Example 1</span>
<span class="hljs-comment">#Importing libraries</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-comment">#Creating one dimensional list</span>
demo_list = [<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>]
<span class="hljs-comment">#Creating Dataframe from list</span>
pd.DataFrame(demo_list)</code></pre><pre><code class="lang-python"><span class="hljs-comment">#Example 2</span>
<span class="hljs-comment">#Importing libraries</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-comment">#Creating two dimensional list</span>
demo_list = [[<span class="hljs-string">'Roy'</span>,<span class="hljs-number">1</span>],[<span class="hljs-string">'Jason'</span>,<span class="hljs-number">2</span>],[<span class="hljs-string">'Sancho'</span>,<span class="hljs-number">3</span>]]
<span class="hljs-comment">#Creating dataframe from list</span>
demo_dataframe = pd.DataFrame(demo_list,columns=[<span class="hljs-string">'Name'</span>,<span class="hljs-string">'Roll No'</span>])</code></pre><pre><code class="lang-python"><span class="hljs-comment">#Example 3 - Using a dictionary</span>
<span class="hljs-comment">#Importing libraries</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-comment">#Creating dictionary</span>
demo_dict = {<span class="hljs-string">"Name"</span>:[<span class="hljs-string">"Roy"</span>,<span class="hljs-string">"Jason"</span>,<span class="hljs-string">"Sancho"</span>],<span class="hljs-string">"Roll No"</span>:[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>]}
<span class="hljs-comment">#Creating dataframe from dictionary</span>
demo_dataframe = pd.DataFrame(demo_dict)</code></pre><h3 id="heading-how-to-add-a-column-to-pandas-dataframe-in-python"><strong>How to add a column to Pandas Dataframe in Python?</strong></h3><p>Let's assume we have a "Students" Dataframe with two columns, "Name" and "Roll No". Now we want to add a new column, "Marks", to the pre-existing Students Dataframe.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-comment">#Creating dictionary</span>
demo_dict = {<span class="hljs-string">"Name"</span>:[<span class="hljs-string">"Roy"</span>,<span class="hljs-string">"Jason"</span>,<span class="hljs-string">"Sancho"</span>],<span class="hljs-string">"Roll No"</span>:[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>]}
<span class="hljs-comment">#Creating dataframe from dictionary</span>
demo_dataframe = pd.DataFrame(demo_dict)
print(<span class="hljs-string">"Original dataframe =>\n"</span>,demo_dataframe)
<span class="hljs-comment">#Adding new column "Marks"</span>
demo_dataframe[<span class="hljs-string">"Marks"</span>] = [<span class="hljs-number">90</span>,<span class="hljs-number">87</span>,<span class="hljs-number">95</span>]
<span class="hljs-comment">#Dataframe after adding a new column.</span>
demo_dataframe</code></pre><h3 id="heading-how-to-append-a-pandas-dataframe-to-another-dataframe-in-python"><strong>How to append a Pandas Dataframe to another Dataframe in Python?</strong></h3><p>The append() function adds rows from another Dataframe to the end of the current Dataframe and returns a new Dataframe object. Columns not present in the original DataFrames are created as new columns, and the new cells are filled with a NaN value. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat is the recommended replacement.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-comment">#Creating dataframe1</span>
demo_dataframe1 = pd.DataFrame({<span class="hljs-string">'Name'</span>: [<span class="hljs-string">'A'</span>, <span class="hljs-string">'B'</span>, <span class="hljs-string">'C'</span>, <span class="hljs-string">'D'</span>], <span class="hljs-string">'Roll No'</span>: [<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>,<span class="hljs-number">4</span>]})
<span class="hljs-comment">#Creating dataframe2</span>
demo_dataframe2 = pd.DataFrame({<span class="hljs-string">'Name'</span>: [<span class="hljs-string">'E'</span>, <span class="hljs-string">'F'</span>, <span class="hljs-string">'G'</span>, <span class="hljs-string">'H'</span>], <span class="hljs-string">'Roll No'</span>: [<span class="hljs-number">5</span>,<span class="hljs-number">6</span>,<span class="hljs-number">7</span>,<span class="hljs-number">8</span>]})
<span class="hljs-comment">#Appending dataframe2 to dataframe1</span>
demo_dataframe1.append(demo_dataframe2)</code></pre><h3 id="heading-how-to-sort-pandas-dataframe-in-python"><strong>How to sort Pandas Dataframe in Python?</strong></h3><p>The Dataframe.sort_values() function is used to sort a Dataframe object in ascending or descending order according to a set of criteria. The function also gives you the option of choosing the sorting algorithm.</p><p>Let's assume we have a "Students" Dataframe.
Now we want to sort this data frame according to the "Marks" column.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-comment">#Creating dictionary</span>
demo_dict = {<span class="hljs-string">"Name"</span>:[<span class="hljs-string">"Roy"</span>,<span class="hljs-string">"Jason"</span>,<span class="hljs-string">"Sancho"</span>],<span class="hljs-string">"Roll No"</span>:[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>],<span class="hljs-string">"Marks"</span>:[<span class="hljs-number">90</span>,<span class="hljs-number">87</span>,<span class="hljs-number">95</span>]}
<span class="hljs-comment">#Creating dataframe from dictionary</span>
demo_dataframe = pd.DataFrame(demo_dict)
print(<span class="hljs-string">"Original dataframe =>\n"</span>,demo_dataframe)
<span class="hljs-comment">#Sorting on the basis of the "Marks" column in ascending order.</span>
demo_dataframe = demo_dataframe.sort_values(by=<span class="hljs-string">"Marks"</span>)
print(<span class="hljs-string">"\n Sorted dataframe in ascending order of Marks =>\n"</span>,demo_dataframe)
<span class="hljs-comment">#Sorting on the basis of the "Marks" column in descending order.</span>
demo_dataframe = demo_dataframe.sort_values(by=<span class="hljs-string">"Marks"</span>,ascending=<span class="hljs-literal">False</span>)
print(<span class="hljs-string">"\n Sorted dataframe in descending order of Marks =>\n"</span>,demo_dataframe)</code></pre><h3 id="heading-how-to-export-pandas-dataframe-to-csv-in-python"><strong>How to export Pandas Dataframe to CSV in Python?</strong></h3><p>We can use the to_csv() function to export a Pandas Dataframe to CSV.</p><p>Let's assume we have a "Students" Dataframe. Now we want to export this Dataframe to CSV.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-comment">#Creating dictionary</span>
demo_dict = {<span class="hljs-string">"Name"</span>:[<span class="hljs-string">"Roy"</span>,<span class="hljs-string">"Jason"</span>,<span class="hljs-string">"Sancho"</span>],<span class="hljs-string">"Roll No"</span>:[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>],<span class="hljs-string">"Marks"</span>:[<span class="hljs-number">90</span>,<span class="hljs-number">87</span>,<span class="hljs-number">95</span>]}
<span class="hljs-comment">#Creating dataframe from dictionary</span>
demo_dataframe = pd.DataFrame(demo_dict)
<span class="hljs-comment">#Exporting dataframe to "Student.csv" file.</span>
demo_dataframe.to_csv(<span class="hljs-string">"Student.csv"</span>)</code></pre><h3 id="heading-what-is-pandas-dataframe-index"><strong>What is Pandas Dataframe index?</strong></h3><p>Indexing, also called subset selection, involves picking specific rows and columns of data from a DataFrame: you can select all rows and a few columns, all columns and a few rows, or a few rows and columns, as needed.</p><p>Let's assume we have a "Students" Dataframe.
Now let's play with the index to get some part of the data from a Dataframe.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-comment">#Creating dictionary</span>
demo_dict = {<span class="hljs-string">"Name"</span>:[<span class="hljs-string">"Roy"</span>,<span class="hljs-string">"Jason"</span>,<span class="hljs-string">"Sancho"</span>],<span class="hljs-string">"Roll No"</span>:[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>],<span class="hljs-string">"Marks"</span>:[<span class="hljs-number">90</span>,<span class="hljs-number">87</span>,<span class="hljs-number">95</span>]}
<span class="hljs-comment">#Creating dataframe from dictionary</span>
demo_dataframe = pd.DataFrame(demo_dict)
demo_dataframe
<span class="hljs-comment">#Selecting only the "Name" column</span>
demo_dataframe[<span class="hljs-string">'Name'</span>]
<span class="hljs-comment">#Selecting only 2nd row</span>
demo_dataframe.iloc[<span class="hljs-number">1</span>,:]
<span class="hljs-comment">#Selecting top 2 rows</span>
demo_dataframe.iloc[<span class="hljs-number">0</span>:<span class="hljs-number">2</span>,:]</code></pre><h3 id="heading-how-to-read-csv-files-using-pandas-in-python"><strong>How to read CSV files using Pandas in Python?</strong></h3><p>We can use the read_csv() function to read CSV files in Pandas.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-comment">#Reading data using read_csv() function</span>
data = pd.read_csv(<span class="hljs-string">"https://data.hemath.com/access/file_csv/euro_vision.csv"</span>)</code></pre><h2 id="heading-python-pandas-aggregations"><strong>Python Pandas - Aggregations</strong></h2><h3 id="heading-how-to-use-pandas-to-perform-aggregations-such-as-sum-min-max-on-dataframe-columns-in-python"><strong>How to use Pandas to perform aggregations such as sum, min, max on Dataframe columns in Python?</strong></h3><p>Python Pandas provides built-in functions to perform aggregations on Dataframe columns. To find the minimum and maximum elements, we can use the min() and max() functions, respectively. The sum() function can be used to find the sum of elements.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span>
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-comment">#Reading data using read_csv() function</span>
data = pd.read_csv(<span class="hljs-string">"https://data.hemath.com/access/file_csv/euro_vision.csv"</span>)
<span class="hljs-comment">#Displaying data</span>
data
<span class="hljs-comment">#Sum aggregation function</span>
data[<span class="hljs-string">'Points'</span>].sum()
<span class="hljs-comment">#Min aggregation function</span>
data[<span class="hljs-string">'Points'</span>].min()
<span class="hljs-comment">#Max aggregation function</span>
data[<span class="hljs-string">'Points'</span>].max()</code></pre><h3 id="heading-how-to-apply-the-pandas-group-by-aggregation-in-python"><strong>How to apply the Pandas group by aggregation in Python?</strong></h3><p>In Pandas, the groupby operation splits the data frame into groups based on some criteria, applies a function to each group, and combines the results.
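</p><p>This split-apply-combine idea can be seen on a tiny hypothetical frame (the data below is invented purely for illustration):</p>

```python
import pandas as pd

df = pd.DataFrame({"gender": ["F", "M", "F", "M"],
                   "points": [10, 20, 30, 40]})

# Split by gender, apply sum to each group, combine into one Series
totals = df.groupby("gender")["points"].sum()
print(totals)
```

<p>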
The groupby function is used to group large amounts of data and perform computations on the groups created.</p><pre><code class="lang-python"><span class="hljs-comment">#Importing libraries</span><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd<span class="hljs-comment">#Reading data using read_csv() function</span>data = pd.read_csv(<span class="hljs-string">"https://data.hemath.com/access/file_csv/euro_vision.csv"</span>)<span class="hljs-comment">#Displaying data</span>data<span class="hljs-comment">#Find out total points scored by each category of gender.</span>data[[<span class="hljs-string">'Artist.gender'</span>,<span class="hljs-string">'Points'</span>]].groupby(<span class="hljs-string">'Artist.gender'</span>).sum()<span class="hljs-comment">#Find out the minimum point scored by each category of gender.</span>data[[<span class="hljs-string">'Artist.gender'</span>,<span class="hljs-string">'Points'</span>]].groupby(<span class="hljs-string">'Artist.gender'</span>).min()<span class="hljs-comment">#Find out the maximum point scored by each category of gender.</span>data[[<span class="hljs-string">'Artist.gender'</span>,<span class="hljs-string">'Points'</span>]].groupby(<span class="hljs-string">'Artist.gender'</span>).max()</code></pre><h3 id="heading-how-to-use-pandas-to-apply-groupby-with-two-different-aggregationssum-and-mean-on-dataframe-columns-in-python"><strong>How to use Pandas to apply groupby with two different aggregations(sum and mean) on Dataframe columns in Python?</strong></h3><p>Here, we demonstrate the use of grouping the dataset to perform further aggregations based on the grouping. 
</p><pre><code class="lang-python">#Importing libraries
import pandas as pd

#Reading data using the read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")

#Displaying data
data

#Mean points and total points scored by each gender category
#(string aggregation names are used so NumPy does not need to be imported)
data.groupby("Artist.gender").agg({"Points": ["mean", "sum"]})</code></pre><h3 id="heading-how-to-apply-pandas-median-aggregation-in-python"><strong>How to apply Pandas median aggregation in Python?</strong></h3><p>The Pandas library has the built-in function median() that finds the median value of a particular column.</p><pre><code class="lang-python">#Importing libraries
import pandas as pd

#Reading data using the read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")

#Displaying data
data

#Median aggregation
data['Points'].median()</code></pre><h2 id="heading-python-pandas-missing-data"><strong>Python Pandas - Missing Data</strong></h2><h3 id="heading-how-to-find-missing-values-in-pandas-dataframe"><strong>How to find missing values in Pandas Dataframe?</strong></h3><p>Once data is gathered, it often contains missing values that can interfere with the analysis.
It is essential to first identify the missing values so that they can be handled accordingly.</p><pre><code class="lang-python">#Importing libraries
import pandas as pd

#Reading data using the read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")

#Displaying data
data

#Checking for missing values
data.isna().sum()</code></pre><h3 id="heading-how-to-remove-missing-data-using-pandas-in-python"><strong>How to remove missing data using Pandas in Python?</strong></h3><p>When a CSV file contains null values, they appear as NaN in the Dataframe. The dropna() method in Pandas lets the user evaluate and drop null rows/columns in various ways.</p><pre><code class="lang-python">#Importing libraries
import pandas as pd

#Reading data using the read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")

#Displaying data
data

#Checking for missing values
data.isna().sum()</code></pre><p>There are many missing values present in the data: 422 in both the 'Artist.gender' and 'Group.Solo' columns, 367 in the 'Semi.Final.Number' column, and so on.
Let's see how we can remove all the missing values.</p><pre><code class="lang-python">#Removing missing values
data.dropna(inplace=True)

#Checking for missing values
data.isna().sum()</code></pre><h3 id="heading-how-to-iterate-over-rows-to-find-missing-data-in-pandas-dataframe"><strong>How to iterate over rows to find missing data in Pandas Dataframe?</strong></h3><p>You can find missing values by iterating over the rows of a Pandas Dataframe with the iterrows() function.</p><pre><code class="lang-python">#Importing libraries
import pandas as pd

#Reading data using the read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")

#Displaying data
data

#Find the rows that contain missing data in the 'Artist.gender' column
for i, d in data.iterrows():
    if pd.isna(d['Artist.gender']):
        print("Row number => ", i, "Data => ", d['Artist.gender'])</code></pre><h3 id="heading-how-to-drop-rows-which-contain-missing-data-in-pandas-dataframe"><strong>How to drop rows which contain missing data in Pandas Dataframe?</strong></h3><p>When a CSV file contains null values, they appear as NaN in the Dataframe. The dropna() method in Pandas lets the user evaluate and drop null rows/columns in various ways.
The dropna() method removes every row that contains at least one missing value.</p><pre><code class="lang-python">#Importing libraries
import pandas as pd

#Reading data using the read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")

#Displaying data
data

#Checking for missing values
data.isna().sum()</code></pre><p>There are many missing values present in the data: 422 in both the 'Artist.gender' and 'Group.Solo' columns, 367 in the 'Semi.Final.Number' column, and so on. Let's see how we can remove all the missing values.</p><pre><code class="lang-python">#Removing missing values
data.dropna(inplace=True)

#Checking for missing values
data.isna().sum()</code></pre><h3 id="heading-how-to-transform-missing-data-in-pandas-dataframe"><strong>How to transform missing data in Pandas Dataframe?</strong></h3><p>The fillna() method is used to replace missing data with some value. For numerical columns, we can replace missing values with either the mean or the median.
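Before applying this to the dataset, here is a minimal, self-contained sketch of mean imputation with fillna() (on a synthetic series, so it runs without the CSV):

```python
import pandas as pd

# A hypothetical numeric column with one missing value
s = pd.Series([1.0, 2.0, None, 3.0])

# s.mean() ignores the NaN (mean of 1, 2, 3 is 2.0),
# and fillna() substitutes it into the missing slot
filled = s.fillna(s.mean())
print(filled.tolist())
```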
For categorical columns, we can replace missing values with the mode.</p><pre><code class="lang-python">#Importing libraries
import pandas as pd

#Reading data using the read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")

#Displaying data
data

#Checking for missing values
data.isna().sum()</code></pre><p>As we can see, there are 422 missing values in the 'Artist.gender' column, which is a categorical variable. We will replace all the missing values with the mode of this column.</p><pre><code class="lang-python">#Finding the mode of the 'Artist.gender' column
mode = data['Artist.gender'].mode()[0]

#Replacing all the missing values with the mode of the 'Artist.gender' column
data['Artist.gender'].fillna(value=mode, inplace=True)

#Checking for missing values
data.isna().sum()</code></pre><p>Now let's look at a numerical column. There are 344 missing values in the 'Happiness' column, which is a numerical variable.
We will replace all the missing values with the mean of this column.</p><pre><code class="lang-python">#Finding the mean of the 'Happiness' column
mean = data['Happiness'].mean()

#Replacing all the missing values with the mean of the 'Happiness' column
data['Happiness'].fillna(value=mean, inplace=True)

#Checking for missing values
data.isna().sum()</code></pre><h3 id="heading-how-to-replace-the-mean-of-the-column-with-missing-data-in-the-pandas-dataframe"><strong>How to replace the mean of the column with missing data in the Pandas Dataframe?</strong></h3><p>The fillna() method is used to replace missing data with some value. For numerical columns, we can replace missing values with either the mean or the median. For categorical columns, we can replace missing values with the mode.</p><pre><code class="lang-python">#Importing libraries
import pandas as pd

#Reading data using the read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")

#Displaying data
data

#Checking for missing values
data.isna().sum()</code></pre><p>As we can see, there are 344 missing values in the 'Happiness' column, which is a numerical variable.
We will replace all the missing values with the mean of this column.</p><pre><code class="lang-python">#Finding the mean of the 'Happiness' column
mean = data['Happiness'].mean()

#Replacing all the missing values with the mean of the 'Happiness' column
data['Happiness'].fillna(value=mean, inplace=True)

#Checking for missing values
data.isna().sum()</code></pre><h3 id="heading-how-to-impute-values-if-there-are-missing-values-in-the-particular-column-in-pandas"><strong>How to impute values if there are missing values in the particular column in Pandas?</strong></h3><p>The fillna() method is used to replace missing data with some value. For numerical columns, we can replace missing values with either the mean or the median. For categorical columns, we can replace missing values with the mode.</p><pre><code class="lang-python">#Importing libraries
import pandas as pd

#Reading data using the read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")

#Displaying data
data

#Checking for missing values
data.isna().sum()</code></pre><p>As we can see, there are 422 missing values in the 'Artist.gender' column, which is a categorical variable.
We will replace all the missing values with the mode of this column.</p><pre><code class="lang-python">#Finding the mode of the 'Artist.gender' column
mode = data['Artist.gender'].mode()[0]

#Replacing all the missing values with the mode of the 'Artist.gender' column
data['Artist.gender'].fillna(value=mode, inplace=True)

#Checking for missing values
data.isna().sum()</code></pre><p>Now let's look at a numerical column. There are 344 missing values in the 'Happiness' column, which is a numerical variable. We will replace all the missing values with the mean of this column.</p><pre><code class="lang-python">#Finding the mean of the 'Happiness' column
mean = data['Happiness'].mean()

#Replacing all the missing values with the mean of the 'Happiness' column
data['Happiness'].fillna(value=mean, inplace=True)

#Checking for missing values
data.isna().sum()</code></pre><h2 id="heading-python-pandas-reindexing"><strong>Python Pandas - Reindexing</strong></h2><h3 id="heading-how-to-reindex-a-dataframe-in-pandas"><strong>How to reindex a Dataframe in Pandas?</strong></h3><p>A Dataframe's row and column labels are changed when it is reindexed.
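A minimal self-contained sketch of the NA-insertion behaviour described below, using a small synthetic series rather than the dataset above:

```python
import pandas as pd

# A hypothetical series indexed by the labels 'a' and 'b'
s = pd.Series([10, 20], index=["a", "b"])

# Reindexing rearranges the data to match the new labels;
# the label 'c' has no data, so NaN is inserted at that position
r = s.reindex(["b", "a", "c"])
print(r)
```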
The term "reindex" refers to the process of aligning data to a given set of labels along a particular axis.</p><p>It rearranges the data to correspond to the new set of labels.</p><p>It adds missing-value (NA) markers at label positions where there is no data for the label.</p><pre><code class="lang-python">#Importing libraries
import pandas as pd

index = [1, 2, 3]

#Creating a dictionary
demo_dict = {"Name": ["Roy", "Jason", "Sancho"], "Marks": [90, 95, 90]}

#Creating a dataframe from the dictionary
demo_dataframe = pd.DataFrame(demo_dict, index=index)

#Reindexing
new_index = [2, 1, 3]
demo_dataframe.reindex(new_index)</code></pre><h3 id="heading-how-to-reset-the-index-using-concatenation-in-pandas"><strong>How to reset the index using concatenation in Pandas?</strong></h3><p>We can also reset the index with the concat() function.
For example:</p><pre><code class="lang-python">#Importing libraries
import pandas as pd

demo_df1 = pd.DataFrame([[1, 'Roy', 90], [2, 'James', 95]], columns=['Roll no', 'Name', 'Marks'])
demo_df2 = pd.DataFrame([[3, 'Cris', 98], [4, 'Jeff', 80]], columns=['Roll no', 'Name', 'Marks'])

#Resetting the index while concatenating
final_df = pd.concat([demo_df1, demo_df2], ignore_index=True)

#Printing the dataframe
print(final_df)</code></pre><h3 id="heading-how-to-reset-index-after-sorting-in-pandas-dataframe"><strong>How to reset index after sorting in Pandas Dataframe?</strong></h3><p>The reset_index() function resets the index of the Dataframe to the default one.</p><pre><code class="lang-python">#Importing libraries
import pandas as pd

#Reading data using the read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")

#Displaying data
data

#Sorting the data by the 'Place' column
data = data.sort_values(by="Place")
data.head()

#Resetting the index
#'drop=True' drops the existing index instead of adding it as a new column
data = data.reset_index(drop=True)</code></pre><h2 id="heading-python-pandas-categorical-data"><strong>Python Pandas - Categorical Data</strong></h2><h3 id="heading-how-to-perform-binning-on-categorical-data-in-pandas-dataframe"><strong>How to perform binning on categorical data in Pandas Dataframe?</strong></h3><p>When working with continuous numeric data, it is common to divide it into buckets for further analysis. The cut function converts the data into a set of discrete buckets.</p><pre><code class="lang-python">#Importing libraries
import pandas as pd

#Reading data using the read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")

#Displaying data
data

#Removing all the rows that contain missing data
data.dropna(inplace=True)</code></pre><p>As we can see, the 'Happiness' column contains continuous numeric data. Let's try to divide it into buckets/bins using the cut function.
We will create a new column, 'Happiness_bins', where the Happiness data is divided into Low, Medium and High categories.</p><pre><code class="lang-python">#Performing binning
data['Happiness_bins'] = pd.cut(data['Happiness'], 3, labels=['Low', 'Medium', 'High'])
data.head()</code></pre><h3 id="heading-how-to-list-categorical-variables-in-the-data-in-pandas-dataframe"><strong>How to list categorical variables in the data in Pandas Dataframe?</strong></h3><p>Here, we place every column that is not a numerical column into the list of categorical columns. The final list of categorical variables contains the names of all the non-numerical columns.</p><pre><code class="lang-python">#Importing libraries
import pandas as pd

#Reading data using the read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")

#Displaying data
data

#List of all columns
all_columns = list(data.columns)

#List of only the numeric columns
numerical_columns = list(data._get_numeric_data().columns)

#Columns that are not numerical columns are categorical columns
#Creating an empty list for the categorical columns
categorical_columns = list()
for column in all_columns:
    #Checking whether the column is absent from the list of numerical columns
    if column not in numerical_columns:
        #Appending it to the list of categorical columns
        categorical_columns.append(column)

#List of only the categorical columns
categorical_columns</code></pre><h3 id="heading-how-to-convert-categorical-data-into-dummy-variables-in-pandas-dataframe"><strong>How to convert categorical data into dummy variables in Pandas Dataframe?</strong></h3><p>The get_dummies() function converts categorical data into dummy or indicator variables.</p><pre><code class="lang-python">#Importing libraries
import pandas as pd

#Reading data using the read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")

#Displaying data
data

#Converting the 'Artist.gender' column into dummy variables
pd.get_dummies(data['Artist.gender'])</code></pre><h3 id="heading-how-to-convert-categorical-data-to-numeric-using-catcodeshttpcatcodes-in-pandas-dataframe"><strong>How to convert categorical data to numeric using cat.codes in Pandas Dataframe?</strong></h3><p>The cat.codes accessor converts categorical data into numeric codes.
We can only apply cat.codes to columns whose data type is 'category'.</p><pre><code class="lang-python">#Importing libraries
import pandas as pd

#Reading data using the read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")

#Displaying data
data

#Removing all the rows that contain missing values
data.dropna(inplace=True)

#Converting the data type of the 'Artist.gender' column into category
data['Artist.gender'] = data['Artist.gender'].astype('category')

#Converting the categorical data to numeric using cat.codes
data['Artist.gender_codes'] = data['Artist.gender'].cat.codes
data.head()</code></pre><h3 id="heading-how-to-plot-a-countplot-for-categorical-data-in-a-pandas-dataframe"><strong>How to plot a countplot for categorical data in a Pandas Dataframe?</strong></h3><p>The value_counts() function returns a Series with the counts of the unique values. The resulting object is sorted in descending order, so the first element is the most frequent value.
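A minimal self-contained sketch of that behaviour (synthetic data, so it runs without the CSV):

```python
import pandas as pd

# A hypothetical categorical column with one missing entry
s = pd.Series(["M", "F", "M", "M", None])

# Counts of unique values, most frequent first; NA is excluded by default
counts = s.value_counts()
print(counts)
```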
By default, NA values are excluded.</p><pre><code class="lang-python">#Importing libraries
import pandas as pd

#Reading data using the read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")

#Displaying data
data</code></pre><p>Let's try to plot a countplot for the 'Artist.gender' column.</p><pre><code class="lang-python">#Printing the count of each category
print(data['Artist.gender'].value_counts())

#Plotting the countplot
data['Artist.gender'].value_counts().plot(kind="bar")</code></pre><h2 id="heading-data-visualization-with-pandas"><strong>Data visualization with Pandas</strong></h2><p>Data visualization is an essential part of exploratory data analysis. It is more effective than raw numbers at delivering an overview or summary of data. Visualizations help us understand the underlying structure of a dataset and investigate the correlations between its variables.
Let's see how to visualize data with Pandas.</p><pre><code class="lang-python">#Importing libraries
import pandas as pd

#Reading data using the read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")

#Displaying data
data

#Scatter plot
data.plot(x='Place', y='Points', kind='scatter', figsize=(10, 6), title='Place x Points')

#Histogram
data['Points'].plot(kind='hist', figsize=(10, 6), title='Distribution of Points')

#Box plot
data.boxplot(column='Points', by='Artist.gender', figsize=(10, 6))

#Bar plot
data['Artist.gender'].value_counts().plot(kind='bar')</code></pre>]]>https://cdn.hashnode.com/res/hashnode/image/upload/v1686586995326/c3910db9-563d-48ea-9a17-b93b5d897859.png<![CDATA[Get to know - What is Amazon Route53?]]>https://blog.hemath.com/get-to-know-what-is-amazon-route53https://blog.hemath.com/get-to-know-what-is-amazon-route53Thu, 03 Nov 2022 16:15:19 GMT<![CDATA[<p>Amazon Route 53 is a widely used, highly available and scalable cloud Domain Name System (DNS) web service.
Amazon Route 53 is designed to give developers and businesses an extremely reliable and cost-effective way to route end users to Internet applications by translating names like <a target="_blank" href="http://www.example.com">www.example.com</a> into numeric IP addresses like 192.0.2.1, which computers use to connect to each other.</p><p>Amazon Route 53 is fully compliant with IPv6 as well. Amazon Route 53 links user requests to AWS infrastructure, such as Amazon EC2 instances, Elastic Load Balancing load balancers, or Amazon S3 buckets, and may also be used to route users to infrastructure that is not hosted by AWS. Users can also use Amazon Route 53 to set up DNS health checks, then utilize Route 53 Application Recovery Controller to continually monitor their applications' capacity to recover from failures and to regulate application recovery.</p><p>Amazon Route 53 Traffic Flow enables users to manage traffic worldwide through a range of routing types, such as Latency Based Routing, Geo DNS, Geoproximity, and Weighted Round Robin, all of which may be combined with DNS Failover to provide a variety of low-latency, fault-tolerant designs. Users can quickly configure how their end users are routed to their application's endpoints with Amazon Route 53 Traffic Flow's intuitive visual editor, whether the endpoints are in a single AWS region or dispersed across the world.</p><p>Amazon Route 53 also provides Domain Name Registration, which allows users to buy and administer domain names like <a target="_blank" href="http://example.com">example.com</a>; Amazon Route 53 will automatically create DNS settings for their domains.</p><h3 id="heading-benefits-of-amazon-route-53">Benefits of Amazon Route 53</h3><ul><li><p>Amazon Route 53 is built on Amazon Web Services' highly available and dependable infrastructure, and its DNS servers are distributed, which helps provide a consistent ability to route users' end customers to their application.
Amazon Route 53 Traffic Flow and routing management, for example, can help users increase dependability by rerouting their customers to an alternate destination if the user's original application endpoint becomes unavailable.</p></li><li><p>Amazon Route 53 is intended to provide the reliability demanded by critical applications and is therefore highly available and reliable. Amazon Route 53 Traffic Flow directs traffic based on a variety of factors, including endpoint health, geographic location, and latency. Users may set up multiple traffic policies and choose which ones to use at any given moment. Users may build and change traffic policies via the Route 53 interface, AWS SDKs, or the Route 53 API using the simple visual editor. The versioning function in Traffic Flow keeps track of changes to users' traffic policies, allowing users to quickly roll back to a prior version through the interface or API, which provides flexibility. Amazon Route 53 is also intended to complement other AWS technologies and offerings.</p></li><li><p>Amazon Route 53 may be used to map domain names to Amazon EC2 instances, Amazon S3 buckets, Amazon CloudFront distributions, and other AWS services. Users can also fine-tune who may change their DNS data by combining the AWS Identity and Access Management (IAM) service with Amazon Route 53.
Using a feature called an Alias record, users may utilize Amazon Route 53 to link their zone apex (<a target="_blank" href="http://example.com">example.com</a> versus <a target="_blank" href="http://www.example.com">www.example.com</a>) to their Elastic Load Balancing instance, Amazon CloudFront distribution, AWS Elastic Beanstalk environment, API Gateway, VPC endpoint, or Amazon S3 website bucket.</p></li></ul><h3 id="heading-system-requirements">System Requirements</h3><ul><li>Any Operating System (Mac, Windows, Linux)</li></ul><h3 id="heading-use-cases-of-amazon-route-53">Use cases of Amazon Route 53</h3><ul><li>It provides Traffic Flow</li></ul><p>Amazon Route 53 provides easy-to-use and cost-effective global traffic management that routes end users to the best endpoint for their application based on proximity, latency, health, and other considerations. For this reason, Amazon Route 53 is widely used across the world.</p><ul><li>It provides Latency Based Routing</li></ul><p>Amazon Route 53 provides Latency Based Routing (LBR) as part of AWS's highly reliable, cost-effective DNS service. Latency Based Routing, one of Amazon Route 53's most requested features, helps users improve their application's performance for a global audience. Amazon Route 53 Latency-Based Routing works by routing the user's customers to the AWS endpoint (e.g. EC2 instances, Elastic IPs or ELBs) that provides the fastest experience, based on actual performance measurements of the different AWS regions where the user's application is running.</p><ul><li>It provides Domain Name Registration</li></ul><p>Amazon Route 53 enables users to purchase a new domain name or transfer the management of their existing domain name to Route 53. When users purchase new domains via Route 53, the service will automatically configure a Hosted Zone for each domain. Amazon Route 53 offers privacy protection for users' WHOIS records at no additional charge.
In addition, users benefit from AWS's consolidated billing to manage their domain name expenses alongside all of their other AWS resources. Amazon Route 53 offers a selection of more than 150 top-level domains (TLDs), including the major generic TLDs.</p><ul><li>It provides Private DNS for Amazon VPC</li></ul><p>Amazon Route 53 provides private DNS for Amazon VPC(Virtual Private Cloud) which helps in managing custom domain names for users' internal AWS resources without further exposing the DNS data to the public Internet.</p>]]><![CDATA[<p>Amazon Route 53 is widely used and defined as a service that is highly available and scalable cloud Domain Name System (DNS) web service. Amazon Route 53 is designed to give developers and businesses an extremely reliable and cost-effective way to route end users to Internet applications by translating names like <a target="_blank" href="http://www.example.com">www.example.com</a> into a numeric IP address like 192.0.2.1 which computers use to connect to each other.</p><p>Amazon Route 53 is a fully compliant service with IPv6 as well. Amazon Route 53 links user requests to AWS infrastructure, such as the Amazon EC2 instances, Elastic Load Balancing load balancers, or Amazon S3 buckets, and may also be used to route users to infrastructure that is not hosted by AWS. Users can also use Amazon Route 53 to establish the DNS health checks, then utilize Route 53 Application Recovery Controller to continually monitor their applications' capacity to recover from failures and regulate application recovery.</p><p>Amazon Route 53 Traffic Flow enables users to manage traffic worldwide by utilizing a range of routing types, such as Latency Based Routing, Geo DNS, Geoproximity, and Weighted Round Robinall of which may be used with DNS Failover to provide a variety of low-latency, fault-tolerant designs. 
Users can quickly configure how their end users are routed to their application's endpoints with Amazon Route 53 Traffic Flow's intuitive visual editor, whether they are in a single AWS region or dispersed across the world.</p><p>Amazon Route 53 also provides Domain Name Registration, which allows users to buy and administer domain names like <a target="_blank" href="http://example.com">example.com</a>, and Amazon Route 53 will automatically create DNS settings for their domains.</p><h3 id="heading-benefits-of-amazon-route-53">Benefits of Amazon Route 53</h3><ul><li><p>Amazon Route 53 is built on Amazon Web Services' highly available and dependable infrastructure, and its DNS servers are distributed, which helps provide a consistent ability to route users' end customers to their application. Amazon Route 53 Traffic Flow and routing management, for example, can help users increase dependability by rerouting their customers to an alternate destination if the user's original application endpoint becomes unavailable.</p></li><li><p>Amazon Route 53 is intended to provide the reliability demanded by critical applications and thus is highly available and reliable. Amazon Route 53 Traffic Flow directs traffic based on a variety of factors, including endpoint health, geographic location, and latency. Users may set up various traffic policies and choose which ones to use at any given moment. Users may build and change traffic policies via the Route 53 console, AWS SDKs, or the Route 53 API using the simple visual editor. The versioning function in Traffic Flow keeps track of changes to users' traffic policies, allowing users to quickly roll back to a prior version through the console or API, which provides flexibility. 
Amazon Route 53 is intended to complement other AWS technologies and offerings.</p></li><li><p>Amazon Route 53 may be used to map domain names to Amazon EC2 instances, Amazon S3 buckets, Amazon CloudFront distributions, and other AWS services. Users can also fine-tune who may change their DNS data by combining the AWS Identity and Access Management (IAM) service with Amazon Route 53. Using a feature called an Alias record, users may utilize Amazon Route 53 to link their zone apex (<a target="_blank" href="http://example.com">example.com</a> versus <a target="_blank" href="http://www.example.com">www.example.com</a>) to their Elastic Load Balancing instance, Amazon CloudFront distribution, AWS Elastic Beanstalk environment, API Gateway, VPC endpoint, or Amazon S3 website bucket.</p></li></ul><h3 id="heading-system-requirements">System Requirements</h3><ul><li>Any operating system (Mac, Windows, Linux)</li></ul><h3 id="heading-use-cases-of-amazon-route-53">Use cases of Amazon Route 53</h3><ul><li>It provides Traffic Flow</li></ul><p>Amazon Route 53 provides easy-to-use and cost-effective global traffic management that routes end users to the best endpoint for their application based on proximity, latency, health, and other considerations. This is why Amazon Route 53 is widely used across the world.</p><ul><li>It provides Latency Based Routing</li></ul><p>Amazon Route 53 provides Latency Based Routing (LBR) as part of AWS's highly reliable, cost-effective DNS service. Latency Based Routing, one of Amazon Route 53's most requested features, helps users improve their application's performance for a global audience. Amazon Route 53 Latency-Based Routing works by routing the user's customers to the AWS endpoint (e.g., 
EC2 instances, Elastic IPs, or ELBs) that provides the fastest experience, based on actual performance measurements of the different AWS regions where the user's application is running.</p><ul><li>It provides Domain Registration</li></ul><p>Amazon Route 53 enables users to purchase a new domain name or transfer the management of their existing domain name to Route 53. When users purchase new domains via Route 53, the service automatically configures a Hosted Zone for each domain. Amazon Route 53 offers privacy protection for users' WHOIS records at no additional charge. In addition, users benefit from AWS's consolidated billing to manage their domain name expenses alongside all of their other AWS resources. Amazon Route 53 offers a selection of more than 150 top-level domains (TLDs), including the major generic TLDs.</p><ul><li>It provides Private DNS for Amazon VPC</li></ul><p>Amazon Route 53 provides private DNS for Amazon VPC (Virtual Private Cloud), which helps manage custom domain names for users' internal AWS resources without exposing the DNS data to the public Internet.</p>]]>https://cdn.hashnode.com/res/hashnode/image/upload/v1686585863876/1bbafa61-47ab-4899-9870-5a3c440e9fa1.png<![CDATA[Let's break down - Numpy for beginners]]>https://blog.hemath.com/lets-break-down-numpy-for-beginnershttps://blog.hemath.com/lets-break-down-numpy-for-beginnersWed, 02 Nov 2022 09:58:37 GMT<![CDATA[<p>In this blog, we will start with the basics of the famous Python library called NumPy and gradually advance to more complex and advanced topics. Let us begin this tutorial on NumPy with a brief introduction to what the library is all about.</p><h3 id="heading-introduction-to-numpy"><strong>Introduction to NumPy</strong></h3><p>NumPy (Numerical Python) is a scientific computation library that helps us work with various derived data types such as arrays, matrices, 3D matrices and much more. 
You might be wondering why one would need NumPy when these capabilities are already available in vanilla Python. Here are a few reasons to work with NumPy.</p><blockquote><p><strong><em>Note:</em></strong> <em>It's completely fine if you do not understand the code snippets shown below. The point here is to show the advantage of using NumPy over standard Python. Don't worry; we will be going through everything one step at a time :)</em></p></blockquote><p><strong>1. NumPy consumes Less Memory</strong></p><p>NumPy typically consumes several times less memory for storing numeric data than normal Python does. The following code cell compares the memory consumed for storing a list of 100 integers by NumPy and by normal Python. You can see the difference yourself!</p><pre><code class="lang-python">import numpy as np
import sys

python_list = list(range(100))
numpy_array = np.array(list(range(100)))

# Size of one small int object multiplied by the list length
sizeof_python_list = sys.getsizeof(1) * len(python_list)
sizeof_numpy_array = numpy_array.itemsize * numpy_array.size

print(f'Numpy array consumes {sizeof_python_list/sizeof_numpy_array} times less memory than python lists')
</code></pre><p><strong>2. NumPy Computations are Faster!</strong></p><p>NumPy computations are generally faster than normal Python computations. The main reason is that NumPy leverages the power of the C language: most NumPy functions are implemented in C under the hood, making them much faster than normal Python lists. 
Let us experiment with the speed on our own below:</p><pre><code class="lang-python">import numpy as np
import time

python_list = list(range(100000))
numpy_array = np.array(list(range(100000)))

start_py = time.time()
# Multiply every element by 75 (a list comprehension, for a fair element-wise comparison)
result_python_list = [x * 75 for x in python_list]
end_py = time.time()
print(f'Python list takes {round(end_py - start_py, 4)} seconds for computation')

start_np = time.time()
result_numpy_array = numpy_array * 75  # vectorized element-wise multiplication
end_np = time.time()
print(f'Numpy arrays take {end_np - start_np} seconds for computation')
</code></pre><p><strong>3. NumPy is Powerful</strong></p><p>NumPy has a ton of built-in functions that can come to our aid. It also makes it easier to work with higher-dimensional data such as multi-dimensional matrices.</p><p>And now, with no further ado, let's begin with the basics of NumPy!</p><h2 id="heading-setting-up-the-numpy-environment"><strong>Setting up the NumPy Environment</strong></h2><h3 id="heading-installing-numpy"><strong>Installing Numpy</strong></h3><p>We can install NumPy in our Python environment with either of the following commands:</p><p><code>pip install numpy</code> or <code>conda install numpy</code></p><h2 id="heading-creating-numpy-arrays"><strong>Creating NumPy Arrays</strong></h2><h3 id="heading-creating-a-1d-numpy-array"><strong>Creating a 1D Numpy array</strong></h3><p>An array in NumPy can be created in various ways:</p><ul><li><p>Create an empty array and append values to the array later.</p></li><li><p>Create a NumPy array with values. 
(You can still append values to it as and when needed.)</p></li><li><p>Create a Python list and convert that Python list to a NumPy array.</p></li></ul><pre><code class="lang-python"># Importing the NumPy package
import numpy as np

# Creating an empty array
array1 = np.array([])
print(array1, type(array1))

# Creating an array with values
array2 = np.array([1, 2, 3, 4, 5, 6])
print(array2, type(array2))

# Converting a python list to a numpy array
py_list = [1, 2, 3, 4, 5, 6]
array3 = np.array(py_list)
print(array3, type(array3))
</code></pre><h3 id="heading-creating-2d-and-3d-numpy-arrays"><strong>Creating 2D and 3D NumPy Arrays</strong></h3><p>You can also create NumPy arrays with more than one dimension, such as 2D and 3D arrays.</p><pre><code class="lang-python"># creating a 2D array
array2d = np.array([[1, 2], [3, 4], [5, 6]])
print(array2d, type(array2d))

# creating a 3D array
array3d = np.array([[[1, 2], [3, 4], [5, 6]],
                    [[10, 20], [30, 40], [50, 60]]])
print(array3d, type(array3d))
</code></pre><h3 id="heading-creating-a-numpy-array-from-pandas-dataframes"><strong>Creating a NumPy array from Pandas Dataframes</strong></h3><p><strong>Pandas DataFrames</strong> can also be converted to a NumPy array easily, as shown below.</p><pre><code class="lang-python">import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30], 'c': [100, 200, 300]})
arr_df = np.array(df)
print(arr_df, type(arr_df))
</code></pre><p><strong>Converting a NumPy Array to a Python List</strong></p><p>Just like you created a NumPy array from a Python list, you can also conveniently convert a NumPy array back to a Python list.</p><pre><code class="lang-python"># numpy array to python list
np_arr = np.array([1, 2, 3, 4, 5, 6])
py_list = list(np_arr)
print(py_list, type(py_list))
</code></pre><h3 id="heading-creating-special-numpy-arrays"><strong>Creating Special NumPy Arrays</strong></h3><p>Besides everything that we saw about creating NumPy arrays, we can also create specific special arrays easily using NumPy's built-in functions. 
Let us see how to make the following.</p><ol><li><p>A NumPy array with random values</p></li><li><p>A NumPy array full of zeros</p></li><li><p>A NumPy array full of ones</p></li><li><p>A NumPy array with values lying within a specified range</p></li><li><p>An identity matrix</p></li></ol><h4 id="heading-numpy-array-with-random-values"><strong>NumPy Array with Random Values</strong></h4><p>Let's say we want to create a 2 x 2 NumPy array with random integer values in the range 10 (inclusive) to 20 (exclusive). Here's how we can do it with the <strong>np.random.randint()</strong> function. The values change every time we run the function.</p><pre><code class="lang-python">rand_array = np.random.randint(low=10, high=20, size=(2, 2))
print(rand_array, type(rand_array))
</code></pre><p>In the above function, we have specified a single low and high value. What if we want each column to have a different range of values? 
That is also possible, and the code snippet below demonstrates how to do it!</p><pre><code class="lang-python"># Gives an array whose first column values range between 10 - 20
# and whose second column values range between 100 - 200
rand_array = np.random.randint(low=[10, 100], high=[20, 200], size=(2, 2))
print(rand_array, type(rand_array))
</code></pre><p>We can also create arrays with values drawn from a uniform distribution over [0, 1) using the <strong>np.random.rand()</strong> function. (If you want values from a standard normal distribution instead, use <strong>np.random.randn()</strong>.)</p><pre><code class="lang-python">rand_uniform_arr = np.random.rand(2, 2)
print(rand_uniform_arr, type(rand_uniform_arr))
</code></pre><h4 id="heading-numpy-arrays-with-only-zerosones"><strong>NumPy Arrays with only Zeros/Ones</strong></h4><p>Let us now create NumPy arrays of zeros and ones using the <strong>np.zeros()</strong> and <strong>np.ones()</strong> functions.</p><pre><code class="lang-python"># creating a 1D array of zeros with length 5
arr_zeros_1d = np.zeros(5)
print("1D array of zeros \n", arr_zeros_1d, type(arr_zeros_1d))

# creating a 2 x 2 matrix of zeros
arr_zeros = np.zeros((2, 2))
print("2D matrix of zeros \n", arr_zeros, type(arr_zeros))

# creating a 1D array of ones with length 5
arr_ones_1d = np.ones(5)
print("1D array of ones \n", arr_ones_1d, type(arr_ones_1d))

# creating a 2 x 2 matrix of ones
arr_ones = np.ones((2, 2))
print("2D matrix of ones \n", arr_ones, type(arr_ones))
</code></pre><h4 id="heading-numpy-arrays-with-values-between-a-specified-range"><strong>NumPy Arrays with values between a specified range</strong></h4><p>With the <strong>np.arange()</strong> function in NumPy, we can create arrays with a range of values. This function takes three arguments, namely start, stop and step. The start and stop specify the lower and upper limit respectively. The step parameter refers to the spacing between the values; its default value is 1. If the step parameter is negative, the values are generated in reverse order. <em>When the step is negative, the start value should be greater than the stop value. Otherwise, an empty array will be returned.</em></p><pre><code class="lang-python">arr_range_1 = np.arange(start=1, stop=12, step=1)
print('Array with start = 1; stop = 12; step = 1 \n', arr_range_1)

arr_range_2 = np.arange(start=1, stop=12, step=2)
print('Array with start = 1; stop = 12; step = 2 \n', arr_range_2)

arr_range_3 = np.arange(start=12, stop=1, step=-1)
print('Array with start = 12; stop = 1; step = -1 \n', arr_range_3)
</code></pre><h4 id="heading-creating-an-identity-matrix-in-numpy"><strong>Creating an Identity matrix in NumPy</strong></h4><p>Identity matrices can also be created in NumPy using the <strong>np.identity()</strong> function. 
Identity matrices are square matrices with their main diagonal elements equal to 1 and the remaining elements equal to 0.</p><pre><code class="lang-python">identity_arr = np.identity(4)
print("4 x 4 identity matrix \n", identity_arr)
</code></pre><h2 id="heading-manipulating-numpy-arrays"><strong>Manipulating NumPy Arrays</strong></h2><h3 id="heading-adding-elements-to-a-numpy-array"><strong>Adding elements to a NumPy array</strong></h3><p>So far we saw how to create a NumPy array. Let us now understand how to add elements to a NumPy array.</p><pre><code class="lang-python"># creating a numpy array
org_array = np.array([1, 2, 3, 4, 5, 6])
print(org_array, type(org_array))

# appending values to that array
appended_array = np.append(org_array, [10, 20, 30])
print(appended_array, type(appended_array))
</code></pre><p>We can also append values to a two-dimensional array, and here is how we do it using the <strong>np.append()</strong> function.</p><pre><code class="lang-python"># creating a 2D array
org_array2d = np.array([[1, 2], [3, 4], [5, 6]])
print('Before appending \n', org_array2d, type(org_array2d))

# appending values to that array with axis = 0
appended_array2d = np.append(org_array2d, [[10, 20]], axis=0)
print('After appending with axis = 0\n', appended_array2d, type(appended_array2d))

# appending values to that array without specifying the axis parameter
appended_array2d = np.append(org_array2d, [[10, 20]])
print('After appending without axis\n', appended_array2d, type(appended_array2d))
</code></pre><p>In the np.append() function, the axis value is set to None by default. This means that both the original array and the values to be appended are flattened to one dimension before the new values are appended. When we set axis = 0, the values are appended row-wise instead.</p><h3 id="heading-removing-elements-from-a-numpy-array"><strong>Removing elements from a NumPy array</strong></h3><p>We can remove any desired element from a NumPy array using the <strong>np.delete()</strong> function, which basically takes the array and the index positions to be deleted as its parameters. 
Indexing in NumPy is the same as indexing Python lists.</p><pre><code class="lang-python">rand_array = np.random.randint(low=50, size=10)
print("Array before deleting \n", rand_array)

new_array = np.delete(rand_array, obj=[3, 4, 5])
print("Array after deleting \n", new_array)
</code></pre><p>We can notice that the elements at index positions 3, 4 and 5 are deleted (indexing in Python starts from 0).</p><h3 id="heading-reshaping-a-numpy-array"><strong>Reshaping a NumPy array</strong></h3><p>We can use the <strong>shape</strong> attribute to determine the dimensions of any NumPy array.</p><pre><code class="lang-python">array_np_1 = np.array([[1, 2], [3, 4], [5, 6]])
print("The shape is :", array_np_1.shape)

array_np_2 = np.array([1, 3, 5, 7, 9, 11])
print("The shape is :", array_np_2.shape)
</code></pre><p>It is also possible to change the shape of an array at any given point using the <strong>np.reshape()</strong> function.</p><pre><code class="lang-python">rand_array = np.random.randint(low=10, high=20, size=(3, 2))
print("The original array is \n", rand_array)
print('The shape of the original array is :', rand_array.shape)

reshaped_array = np.reshape(rand_array, newshape=(2, 3))
print("The reshaped array is \n", reshaped_array)
print('The shape of the array after reshaping is :', reshaped_array.shape)
</code></pre><p>In the above example, we changed the dimensions of a 3 x 2 matrix to 2 x 3. But a lot more can be done using the reshape function. We can even change a 1D array to a 2D array, change a 2D array to 1D, and much more!</p><pre><code class="lang-python"># 2D to 1D
rand_array = np.random.randint(low=10, high=20, size=(3, 2))
print("The original array is \n", rand_array)
print('The shape of the original array is :', rand_array.shape)

reshaped_array = np.reshape(rand_array, newshape=(6,))
print("The reshaped array is \n", reshaped_array)
print('The shape of the array after reshaping is :', reshaped_array.shape)
</code></pre><pre><code class="lang-python"># 1D to 2D
rand_array = np.random.randint(low=10, high=20, size=10)
print("The original array is \n", rand_array)
print('The shape of the original array is :', rand_array.shape)

reshaped_array = np.reshape(rand_array, newshape=(5, 2))
print("The reshaped array is \n", reshaped_array)
print('The shape of the array after reshaping is :', reshaped_array.shape)
</code></pre><pre><code class="lang-python"># 3D to 2D
rand_array = np.random.randint(low=10, high=20, size=(2, 3, 4))
print("The original array is \n", rand_array)
print('The shape of the original array is :', rand_array.shape)

reshaped_array = np.reshape(rand_array, newshape=(6, 4))
print("The reshaped array is \n", reshaped_array)
print('The shape of the array after reshaping is :', reshaped_array.shape)
</code></pre><p>We have now seen a lot of combinations using reshape. In all the above examples, every time we provide a new shape, we have made sure that it is compatible with the original shape. For example, suppose the original array has the dimensions 4 x 2, which means there are 8 elements in the array. When we try to reshape this array, we can reshape it into any of the following combinations:</p><ul><li><p>2 x 4</p></li><li><p>1 x 8</p></li><li><p>8 x 1</p></li><li><p>2 x 2 x 2</p></li></ul><p>No other combinations are possible. So, when we provide the new dimensions to the <strong>np.reshape()</strong> function, we pass them as tuples such as (2,4), (1,8), (8,1) or (2,2,2). NumPy also allows us to provide one of the new shape parameters as -1. This marks that dimension as unknown, and NumPy will figure it out by itself using the number of elements in the array. 
Let us understand it with some examples.</p><pre><code class="lang-python">rand_array = np.random.randint(low=10, high=50, size=(4, 2))
print("The original array is \n", rand_array)
print('The shape of the original array is :', rand_array.shape)

reshaped_array = np.reshape(rand_array, newshape=(-1, 4))
print("The reshaped array is \n", reshaped_array)
print('The shape of the array after reshaping is :', reshaped_array.shape)
</code></pre><p>Notice that we provided (-1,4) as the newshape value to the <strong>np.reshape()</strong> function, but it automatically figured out that the unknown dimension is 2. Let us continue exploring.</p><pre><code class="lang-python"># reshape using (-1,1)
reshaped_array = np.reshape(rand_array, newshape=(-1, 1))
print("The reshaped array is \n", reshaped_array)
print('The shape of the array after reshaping is :', reshaped_array.shape)
</code></pre><pre><code class="lang-python"># reshape using (1,-1)
reshaped_array = np.reshape(rand_array, newshape=(1, -1))
print("The reshaped array is \n", reshaped_array)
print('The shape of the array after reshaping is :', reshaped_array.shape)
</code></pre><pre><code class="lang-python"># reshape using (-1,-1) ---> throws an error
reshaped_array = np.reshape(rand_array, newshape=(-1, -1))
print("The reshaped array is \n", reshaped_array)
print('The shape of the array after reshaping is :', reshaped_array.shape)
</code></pre><p>You must by now have understood why we can't specify more than one dimension value as -1. The NumPy package can identify the unknown dimension only if the other dimensions are explicitly mentioned. This is obvious because, if we have an original array of dimensions 4 x 3 and we want the new reshaped array to have 2 columns, then the number of rows can easily be figured out by calculating (4 x 3) / 2, which equals 6. But when we don't know the number of columns, it is impossible to calculate the unknown dimension value.</p><h3 id="heading-sorting-numpy-arrays"><strong>Sorting NumPy arrays</strong></h3><p>We can easily sort a NumPy array using the <strong>np.sort()</strong> function.</p><pre><code class="lang-python">arr = np.random.randint(low=20, size=12)
print("The actual array is \n", arr)
print("The sorted array is \n", np.sort(arr))
</code></pre><p>We can also sort 2D arrays both row-wise and column-wise using the axis parameter.</p><pre><code class="lang-python">arr = np.random.randint(low=10, high=30, size=(4, 3))
print("The actual array is \n", arr)
print("The column wise sorted array is \n", np.sort(arr, axis=0))
print("The row wise sorted array is \n", np.sort(arr, axis=1))
</code></pre><h3 id="heading-flattening-numpy-arrays"><strong>Flattening NumPy arrays</strong></h3><p>Flattening an array is nothing but collapsing a higher-dimensional array down to one dimension. There are two functions for this in NumPy: <strong>np.ndarray.flatten()</strong> and <strong>np.matrix.flatten()</strong>. The former works for all higher-dimensional arrays, while the latter is meant specifically for matrices.</p><pre><code class="lang-python">arr_2d = np.random.randint(low=10, high=30, size=(4, 3))
print("The actual array is :\n", arr_2d)
print('Array after flattening using method 1 :\n', np.ndarray.flatten(arr_2d))
print('Array after flattening using method 2 :\n', np.matrix.flatten(arr_2d))
</code></pre><h3 id="heading-rotating-numpy-arrays"><strong>Rotating NumPy arrays</strong></h3><p>The <strong>np.rot90()</strong> function can be used to rotate a NumPy array by 90 degrees, as explained with the example below.</p><p>The parameter <strong>k</strong> specifies how many times the matrix has to be rotated. The <strong>axes</strong> parameter specifies the plane in which the matrix is rotated. 
Considering a 2D array, if the axes value is specified as (0,1), then the array is rotated in the anti-clockwise direction and when the axes value is specified as (1,0) then the array is rotated in the clockwise direction.</p><pre><code class="lang-python">arr = np.array([[<span class="hljs-number">10</span>,<span class="hljs-number">20</span>],[<span class="hljs-number">30</span>,<span class="hljs-number">40</span>]])print(<span class="hljs-string">'The original array is \n'</span>, arr)print(<span class="hljs-string">'The array after rotation it by 90 degrees once in the anti-clock wise direction is \n'</span>, np.rot90(arr,k=<span class="hljs-number">1</span>, axes=(<span class="hljs-number">0</span>,<span class="hljs-number">1</span>)))print(<span class="hljs-string">'The array after rotation it by 90 degrees once in the clock wise direction is \n'</span>, np.rot90(arr,k=<span class="hljs-number">1</span>, axes=(<span class="hljs-number">1</span>,<span class="hljs-number">0</span>)))print(<span class="hljs-string">'The array after rotation it by 90 degrees twice in the clock wise direction is \n'</span>, np.rot90(arr,k=<span class="hljs-number">2</span>, axes=(<span class="hljs-number">1</span>,<span class="hljs-number">0</span>)))print(<span class="hljs-string">'The array after rotation it by 90 degrees thrice in the anti-clock wise direction is \n'</span>, np.rot90(arr,k=<span class="hljs-number">3</span>, axes=(<span class="hljs-number">0</span>,<span class="hljs-number">1</span>)))</code></pre><h2 id="heading-matrix-operations-in-numpy"><strong>Matrix Operations in Numpy</strong></h2><p>If you are familiar with matrices in mathematics, then you would probably know about various matrix operations such as</p><ul><li><p>Matrix addition</p></li><li><p>Matrix subtraction</p></li><li><p>Matrix multiplication</p></li><li><p>Matrix vector multiplication</p></li><li><p>Matrix division</p></li><li><p>Matrix transpose</p></li><li><p>Matrix 
inverse</p></li><li><p>Matrix Power</p></li><li><p>Determining Diagonal Elements of a matrix</p></li><li><p>Evaluating Upper and Lower triangle elements of a matrix</p></li></ul><p>We will see how to implement all the above operations using NumPy, one by one.</p><h3 id="heading-matrix-addition-and-subtraction"><strong>Matrix Addition and Subtraction</strong></h3><p>Matrix addition and subtraction can be done simply with the <strong>+</strong> and <strong>-</strong> operators, provided the dimensions of both matrices are the same.</p><pre><code class="lang-python">mat1 = np.random.randint(low=1, high=10, size=(3,3))
print("Matrix 1 \n", mat1)
mat2 = np.random.randint(low=10, high=20, size=(3,3))
print("Matrix 2 \n", mat2)
print("Matrix 1 + Matrix 2 \n", mat1 + mat2)
print("Matrix 1 - Matrix 2 \n", mat1 - mat2)</code></pre><p>We can also use the <strong>np.add()</strong> function to add two matrices and <strong>np.subtract()</strong> to subtract two matrices.</p><pre><code class="lang-python">mat1 = np.random.randint(low=1, high=10, size=(3,3))
print("Matrix 1 \n", mat1)
mat2 = np.random.randint(low=10, high=20, size=(3,3))
print("Matrix 2 \n", mat2)
print(<span class="hljs-string">"Matrix 1 + Matrix 2 
\n"</span>, np.add(mat1, mat2))print(<span class="hljs-string">"Matrix 1 - Matrix 2 \n"</span>, np.subtract(mat1,mat2))</code></pre><h3 id="heading-matrix-multiplication"><strong>Matrix Multiplication</strong></h3><p>The <strong>np.matmul()</strong> function multiplies two matrices in the conventional manner.</p><pre><code class="lang-python">mat1 = np.random.randint(low = <span class="hljs-number">1</span>, high= <span class="hljs-number">10</span>, size = (<span class="hljs-number">3</span>,<span class="hljs-number">2</span>))print(<span class="hljs-string">"Matrix 1 \n"</span>, mat1)mat2 = np.random.randint(low = <span class="hljs-number">10</span>, high= <span class="hljs-number">20</span>, size = (<span class="hljs-number">2</span>,<span class="hljs-number">3</span>))print(<span class="hljs-string">"Matrix 2 \n"</span>, mat2)print(<span class="hljs-string">"Matrix 1 x Matrix 2 (Matrix multiplication) \n"</span>, np.matmul(mat1, mat2))</code></pre><p>If the two matrices are of the same dimension, then the <strong>*</strong> operator does element wise multiplication</p><pre><code class="lang-python">mat1 = np.random.randint(low = <span class="hljs-number">1</span>, high= <span class="hljs-number">10</span>, size = (<span class="hljs-number">2</span>,<span class="hljs-number">2</span>))print(<span class="hljs-string">"Matrix 1 \n"</span>, mat1)mat2 = np.random.randint(low = <span class="hljs-number">10</span>, high= <span class="hljs-number">20</span>, size = (<span class="hljs-number">2</span>,<span class="hljs-number">2</span>))print(<span class="hljs-string">"Matrix 2 \n"</span>, mat2)print(<span class="hljs-string">"Matrix 1 x Matrix 2 (Element wise multiplication) \n"</span>, mat1*mat2)</code></pre><h3 id="heading-matrix-division"><strong>Matrix Division</strong></h3><p>The <strong>np.divide()</strong> function helps us to divide two matrices. 
Division of two matrices can also be performed using the <strong>/</strong> operator.</p><pre><code class="lang-python">mat1 = np.random.randint(low=1, high=10, size=(2,2))
print("Matrix 1 \n", mat1)
mat2 = np.random.randint(low=10, high=20, size=(2,2))
print("Matrix 2 \n", mat2)
print("Matrix 1 / Matrix 2 (using / operator) \n", mat1/mat2)
print("Matrix 1 / Matrix 2 (using np.divide function) \n", np.divide(mat1, mat2))</code></pre><p>Numpy is smart enough to broadcast the elements of the smaller matrix over the larger matrix whenever that is possible. Let us understand that with the example below.</p><pre><code class="lang-python">mat = np.random.randint(low=1, high=10, size=(2,2))
print("Matrix\n", mat)
vec = np.random.randint(low=10, high=20, size=(2,1))
print("Vector\n", vec)
print("Matrix / Vector (using / operator) \n", mat/vec)
print("Matrix / vector (using np.divide function) \n", np.divide(mat, vec))</code></pre><p>In the above example, mat is a 2 x 2 matrix while vec is a 2 x 1 vector. It can be seen from the results that each column of the matrix has been divided element-wise by the vector.
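</p><p>Broadcasting only succeeds when the shapes are compatible; for instance, a 2 x 2 matrix cannot be divided by a 3 x 1 vector. A minimal sketch of the failure case:</p>

```python
import numpy as np

mat = np.ones((2, 2))
bad_vec = np.ones((3, 1))  # 3 rows cannot be matched against 2 rows
try:
    result = mat / bad_vec
except ValueError as err:
    result = None
    print('Broadcasting failed:', err)
```

<p>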
Or in other terms, the vector has been broadcast over the matrix during division.</p><h3 id="heading-matrix-transpose"><strong>Matrix Transpose</strong></h3><p>Transposing reverses the axes of a matrix: the row elements become the column elements and the column elements become the row elements. The transpose of a matrix can be obtained using the <strong>np.transpose()</strong> function.</p><pre><code class="lang-python">mat = np.random.randint(low=1, high=10, size=(2,2))
print("Original Matrix\n", mat)
print("Transposed Matrix\n", np.transpose(mat))</code></pre><h3 id="heading-matrix-inverse"><strong>Matrix inverse</strong></h3><p>The <strong>np.linalg.inv()</strong> function provides the inverse of a matrix.</p><p><em>Calculating the inverse of a matrix is somewhat involved and requires a firm grasp of the basics of matrices. This tutorial mainly focuses on how we can leverage the NumPy package to perform these kinds of scientific computations rather than getting into their mathematical aspects, which is why we won't be going through the mathematical background.</em></p><pre><code class="lang-python">mat = np.array([[1,2,3],[0,1,4],[5,6,0]])
print("Original matrix \n", mat)
print("The inverse of the matrix is\n", np.linalg.inv(mat))</code></pre><h3 id="heading-matrix-power"><strong>Matrix Power</strong></h3><p>Matrix powers do not work the same way as powers of ordinary numbers.
If A is a 2 x 2 matrix, then A^2 is A times A, where "times" means matrix multiplication and not element-wise multiplication. Such matrix powers can be computed using the <strong>np.linalg.matrix_power()</strong> function in NumPy. In some cases, we might just need the individual squares of all the elements in the matrix; in such cases, we can use the <strong>**</strong> operator.</p><pre><code class="lang-python">mat = np.random.randint(low=1, high=10, size=(2,2))
print('The original matrix is \n', mat)
mat_square = np.linalg.matrix_power(mat, 2)
print('The square of the matrix is \n', mat_square)
print('Square of all the elements in the matrix is \n', mat**2)</code></pre><h3 id="heading-extracting-diagonal-of-a-matrix-in-numpy"><strong>Extracting Diagonal of a matrix in NumPy</strong></h3><p>The main diagonal of a square matrix consists of the elements running from the top left to the bottom right. These diagonal elements can be extracted from the matrix using the <strong>np.diag()</strong> function. The parameter k denotes the diagonal that is required. When k=0, it returns the main diagonal elements; when k=1, it returns the diagonal elements one step above the main diagonal; and when k=-1, it returns the diagonal elements one step below the main diagonal. The value of k is 0 by default.
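</p><p>Incidentally, <strong>np.diag()</strong> also works in the other direction: given a 1D array, it builds a square matrix with that array along the main diagonal. A small sketch:</p>

```python
import numpy as np

# 1D input -> square matrix with those values on the main diagonal
d = np.diag(np.array([1, 2, 3]))
print(d.shape)  # → (3, 3)
# extracting the diagonal of that matrix returns the original values
print(np.diag(d))  # → [1 2 3]
```

<p>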
Below is an example.</p><pre><code class="lang-python">mat = np.array([[1,2,3],[0,1,4],[5,6,0]])
print("Original matrix \n", mat)
print("Main Diagonal elements of the matrix are: \n", np.diag(mat))
print("Elements of the matrix one step above the main diagonal are: \n", np.diag(mat, k=1))
print("Elements of the matrix one step below the main diagonal are: \n", np.diag(mat, k=-1))</code></pre><h3 id="heading-evaluating-upper-and-lower-triangular-matrix-in-numpy"><strong>Evaluating Upper and lower triangular matrix in NumPy</strong></h3><p>Generally, in a square matrix, the elements above the main diagonal form the upper triangle and the elements below the main diagonal form the lower triangle. These upper and lower triangular elements can be easily extracted using the <strong>np.triu()</strong> and <strong>np.tril()</strong> functions respectively.
Just like we specified the k parameter for the <strong>np.diag()</strong> function, we can also specify the k parameter here to return a matrix with the elements above/below the specified diagonal set to 0.</p><pre><code class="lang-python">mat = np.array([[1,2,3],[0,1,4],[5,6,0]])
print("Original matrix \n", mat)
print('The upper triangular matrix \n', np.triu(mat))
print('The lower triangular matrix \n', np.tril(mat))
print('The lower triangular matrix with k=1 \n', np.tril(mat, k=1))
print('The upper triangular matrix with k=-1 \n', np.triu(mat, k=-1))</code></pre><h2 id="heading-indexing-numpy-arrays"><strong>Indexing Numpy arrays</strong></h2><p>Indexing is the most crucial part when it comes to array manipulations. Just like list indexing in Python, indexing in NumPy also begins with 0. The NumPy package has really powerful indexing methods. There are various kinds of indexing in NumPy, but in this tutorial, we will categorise the indexing methods in the following manner.</p><ul><li><p>Basic indexing</p></li><li><p>Indexing using slicing operator</p></li><li><p>Indexing 2D arrays</p></li><li><p>Indexing 3D arrays</p></li><li><p>Advanced indexing using integer arrays</p></li><li><p>Advanced indexing using Boolean conditions</p></li></ul><h3 id="heading-basic-indexing"><strong>Basic Indexing</strong></h3><p>Let us start with the basics of indexing.
Let us create an array using the <strong>np.arange()</strong> function.</p><pre><code class="lang-python">arr = np.arange(0, 150, 10)
print('The original array is \n', arr)
# Output:
# The original array is
# [  0  10  20  30  40  50  60  70  80  90 100 110 120 130 140]</code></pre><p>Let us try to grab the element at index 5 of the array. We can easily do that in the following manner.</p><pre><code class="lang-python">print('The element at index 5 of the array is :', arr[5])</code></pre><h3 id="heading-indexing-using-slicing-operator"><strong>Indexing using slicing operator</strong></h3><p>We can also obtain values within a range of indices using the slicing technique. Indexing using slicing works in the <strong>array[start:stop:step]</strong> manner. The start and stop specify the lower and upper limits of the index range, and the step specifies the spacing between each index.
Let us understand it with some examples.</p><pre><code class="lang-python">print('The elements of the array from 4th index to 10th index: \n', arr[4:10])
print('The elements of the array from 2nd index to 12th index in steps of 2: \n', arr[2:12:2])
print('The elements of the array from 14th index to 6th index in steps of -2:\n', arr[14:6:-2])
print('The elements of the array from 3rd index to the end of the array in steps of 2:\n', arr[3::2])
# returns the array in a reversed manner
print('All the elements of the array in steps of -1: \n', arr[::-1])</code></pre><h3 id="heading-indexing-2d-arrays"><strong>Indexing 2D arrays</strong></h3><p>Indexing a two-dimensional array is always done in an <strong>array[row, col]</strong> manner. All the indexing techniques that we used previously while indexing 1D arrays, like slicing and indexing using index lists, can also be used here.
Let us see a few examples to understand it better.</p><pre><code class="lang-python">arr = np.arange(0, 250, 10).reshape(5,5)
print('The original array is \n', arr)</code></pre><pre><code class="lang-python">print('The element in 2nd row and 3rd column is:', arr[2, 3])
print('The element in 3rd row and 1st column is:', arr[3, 1])
print('All the elements in 3rd row are:', arr[3, :])
print('All the elements in 2nd column are:', arr[:, 2])
print('Elements in 2nd column with row indices ranging between 1 and 3 are:', arr[1:3, 2])
print('Elements in 4th row with column indices ranging between 0 and 3 are:', arr[4, 0:3])
print('Elements with row indices ranging between 1 and 3 and column indices ranging between 2 and 4 are: \n', arr[1:3, 2:4])</code></pre><h3 id="heading-indexing-3d-arrays"><strong>Indexing 3D arrays</strong></h3><p>Imagine 3D arrays as different matrices stacked one on top of the other.
So while indexing 3D arrays, we don't just mention the row and column index; we also mention in which matrix we should be looking for the specified row and column indices. 3D arrays are indexed in an <strong>array[matrix, row, col]</strong> manner. Let us see a few examples.</p><pre><code class="lang-python">arr = np.arange(45).reshape(3,3,5)
print('The original array is \n', arr)</code></pre><pre><code class="lang-python">print('The 0th 2D matrix: \n', arr[0])
print('The 1st 2D matrix: \n', arr[1])
print('The 2nd row of the 1st 2D matrix: \n', arr[1,2,:])
print('The 1st row of the 2nd 2D matrix: \n', arr[2,1,:])
print('The 0th column of the 1st 2D matrix: \n', arr[1,:,0])
print('The 3rd column of the 0th 2D matrix: \n', arr[0,:,3])
print('The element present in the 2nd row and 4th column of the 1st 2D matrix:', arr[1,2,4])
print('The element present in the 0th row and 3rd column of the 0th 2D matrix:', arr[0,0,3])</code></pre><h3 id="heading-advanced-indexing-using-integer-arrays"><strong>Advanced indexing using integer 
arrays</strong></h3><p>As we discussed earlier, NumPy has really powerful and sophisticated indexing methods, and indexing using integer arrays is one of them. Let us consider the following array.</p><pre><code class="lang-python">arr = np.arange(10)
print('The original array is \n', arr)</code></pre><p>If we want only the 3rd, 5th and 9th elements, we can easily extract those using integer array indexing. To do that, we first have to create an integer array of indices.</p><pre><code class="lang-python">index_arr_1 = np.array([3,5,9])
print('The index array is \n', index_arr_1)
# Output:
# The index array is
# [3 5 9]</code></pre><p>Now we can easily pass this index array to our original array as follows.</p><pre><code class="lang-python">print('The 3rd, 5th and 9th elements of the array are \n', arr[index_arr_1])</code></pre><p>We can also repeat an index more than once using index arrays!</p><pre><code class="lang-python">index_arr_2 = np.array([2,3,2,3,2,3])
print('The array returned after indexing is \n', arr[index_arr_2])</code></pre><h3 id="heading-advanced-indexing-using-boolean-conditions"><strong>Advanced indexing using boolean conditions</strong></h3><p>We can also specify a logical condition to extract elements from the array. It returns the elements of the array for which the specified condition is true.
The following example explains how to extract elements that are greater than 5 from an array.</p><pre><code class="lang-python">arr = np.arange(12)
print('The original array is \n', arr)
print('The elements that are greater than 5 are \n', arr[arr&gt;5])
print('The elements that are lesser than 5 are \n', arr[arr&lt;5])
print('The elements that are equal to 5 \n', arr[arr==5])
print('The elements that are even \n', arr[arr%2==0])</code></pre><h2 id="heading-saving-numpy-arrays"><strong>Saving NumPy arrays</strong></h2><p>We can easily save any NumPy array as a .npy file using the <strong>np.save()</strong> function, and here is how to do it!</p><pre><code class="lang-python">arr_to_save = np.arange(1,10).reshape(3,3)
np.save(file='array.npy', arr=arr_to_save)</code></pre>]]><![CDATA[<p>In this blog, we will start with the basics of the famous Python library called NumPy and gradually advance to more complex and advanced topics. Let us begin this tutorial on NumPy with a brief introduction of what the library is all about.</p><h3 id="heading-introduction-to-numpy"><strong>Introduction to NumPy</strong></h3><p>Numpy (Numerical Python) is a scientific computation library that helps us work with various derived data types such as arrays, matrices, 3D matrices and much more. You might wonder why one needs NumPy when these provisions are already available in vanilla Python.
Here are a few reasons to work with NumPy.</p><blockquote><p><strong><em>Note:</em></strong> <em>It's completely fine if you do not understand the code snippets that are shown below. The point here is to show the advantage of using NumPy over standard Python. Don't worry; we will be going through everything one step at a time :)</em></p></blockquote><p><strong>1. NumPy consumes Less Memory</strong></p><p>NumPy consumes approximately 6 to 7 times less memory for storing the data than normal Python does. The following code cell compares the memory consumed for storing a list of numbers from 0 to 100 by NumPy and normal Python. You can see the difference yourself!</p><pre><code class="lang-python">import numpy as np
import sys
python_list = list(range(100))
numpy_array = np.array(list(range(100)))
sizeof_python_list = sys.getsizeof(1) * len(python_list)
sizeof_numpy_array = numpy_array.itemsize * numpy_array.size
print(f'Numpy array consumes {sizeof_python_list/sizeof_numpy_array} times less memory than python lists')</code></pre><p><strong>2. NumPy Computations are Faster!</strong></p><p>NumPy computations are generally faster than normal Python computations. The main reason for this is that NumPy leverages the power of the C language. Most of the NumPy functions are implemented in C in the backend, making them much faster than normal Python lists.
Let us experiment with the speed on our own below:</p><pre><code class="lang-python">import numpy as np
import time
python_list = list(range(100000))
numpy_array = np.array(list(range(100000)))
start_py = time.time()
# element-wise multiplication with a list comprehension
# (python_list * 75 would repeat the list, not multiply its elements)
result_python_list = [x * 75 for x in python_list]
end_py = time.time()
print(f'Python list takes {round(end_py-start_py, 4)} seconds for computation')
start_np = time.time()
result_numpy_array = numpy_array * 75
end_np = time.time()
print(f'Numpy arrays take {end_np-start_np} seconds for computation')</code></pre><p><strong>3. Numpy is Powerful</strong></p><p>Numpy has a ton of built-in functions that can come in handy. It also makes it easier to work with higher-dimensional data such as multi-dimensional matrices and so on.</p><p>And now, with no further ado, let's begin with the basics of NumPy!</p><h2 id="heading-setting-up-the-numpy-environment"><strong>Setting up the NumPy Environment</strong></h2><h3 id="heading-installing-numpy"><strong>Installing Numpy</strong></h3><p>We can install NumPy in our Python environment with either of the following commands</p><p><code>pip install numpy</code> or <code>conda install numpy</code></p><h2 id="heading-creating-numpy-arrays"><strong>Creating NumPy Arrays</strong></h2><h3 id="heading-creating-a-1d-numpy-array"><strong>Creating a 1D Numpy array</strong></h3><p>An array can be created in NumPy in various ways.</p><ul><li><p>Create an empty array and append values to the array later.</p></li><li><p>Create a NumPy array with values.
(You can still append values to it as and when needed.)</p></li><li><p>Create a Python list and convert that Python list to a NumPy array.</p></li></ul><pre><code class="lang-python"># Importing the NumPy package
import numpy as np
# Creating an empty array
array1 = np.array([])
print(array1, type(array1))
# Creating an array with values
array2 = np.array([1,2,3,4,5,6])
print(array2, type(array2))
# Convert a python list to a numpy array
py_list = list([1,2,3,4,5,6])
array3 = np.array(py_list)
print(array3, type(array3))</code></pre><h3 id="heading-creating-2d-and-3d-numpy-arrays"><strong>Creating 2D and 3D NumPy Arrays</strong></h3><p>You can also create NumPy arrays with more than one dimension, such as 2D and 3D arrays.</p><pre><code class="lang-python"># creating 2D array
array2d = np.array([[1,2],[3,4],[5,6]])
print(array2d, type(array2d))
# creating 3D array
array3d = np.array([[[1,2],[3,4],[5,6]],[[<span
class="hljs-number">10</span>,<span class="hljs-number">20</span>],[<span class="hljs-number">30</span>,<span class="hljs-number">40</span>],[<span class="hljs-number">50</span>,<span class="hljs-number">60</span>]]])print(array3d, type(array3d))</code></pre><h3 id="heading-creating-a-numpy-array-from-pandas-dataframes"><strong>Creating a NumPy array from Pandas Dataframes</strong></h3><p><strong>Pandas Dataframes</strong> can also be converted to a NumPy array easily as shown below</p><pre><code class="lang-python"><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pddf = pd.DataFrame({<span class="hljs-string">'a'</span>:[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>],<span class="hljs-string">'b'</span>:[<span class="hljs-number">10</span>,<span class="hljs-number">20</span>,<span class="hljs-number">30</span>],<span class="hljs-string">'c'</span>:[<span class="hljs-number">100</span>,<span class="hljs-number">200</span>,<span class="hljs-number">300</span>]})arr_df = np.array(df)print(arr_df, type(arr_df))</code></pre><p><strong>Converting a NumPy Array to a Python List</strong></p><p>Just like you created a numpy array from a python list, you can also conveniently convert a numpy array to a python list.</p><pre><code class="lang-python"><span class="hljs-comment"># numpy array to python list</span>np_arr = np.array([<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>,<span class="hljs-number">4</span>,<span class="hljs-number">5</span>,<span class="hljs-number">6</span>])py_list = list(np_arr)print(py_list, type(py_list))</code></pre><h3 id="heading-creating-special-numpy-arrays"><strong>Creating Special NumPy Arrays</strong></h3><p>Besides everything that we saw about creating NumPy arrays, we can also create specific unique arrays easily using the built-in functions of NumPy. 
Let us see how to make the following.</p><ol><li><p>A NumPy array with random values</p></li><li><p>A NumPy array full of zeros</p></li><li><p>A NumPy array full of ones</p></li><li><p>A NumPy array with values lying within a specified range</p></li><li><p>An Identity matrix</p></li></ol><h4 id="heading-numpy-array-with-random-values"><strong>NumPy Array with Random Values</strong></h4><p>Let's say we want to create a 2 x 2 NumPy array with random integer values lying in the range 10 to 20. Here's how we can do it with the <strong>np.random.randint()</strong> function. These values change every time we run the function.</p><pre><code class="lang-python">rand_array = np.random.randint(low=10, high=20, size=(2,2))
print(rand_array, type(rand_array))</code></pre><p>In the above function, we have specified a single low and high value. What if we want each column to have a different range of values?
That is also possible, and the code snippet below demonstrates how to do it!</p><pre><code class="lang-python"># Gives an array with first column values ranging between 10 - 20 and second column values ranging between 100 - 200
rand_array = np.random.randint(low=[10,100], high=[20,200], size=(2,2))
print(rand_array, type(rand_array))</code></pre><p>We can also create arrays with values drawn from a uniform distribution over [0, 1) using the <strong>np.random.rand()</strong> function.</p><pre><code class="lang-python">rand_uniform_arr = np.random.rand(2,2)
print(rand_uniform_arr, type(rand_uniform_arr))</code></pre><h4 id="heading-numpy-arrays-with-only-zerosones"><strong>NumPy Arrays with only Zeros/ones</strong></h4><p>Let us now create NumPy arrays of zeros and ones using the <strong>np.zeros()</strong> and <strong>np.ones()</strong> functions.</p><pre><code class="lang-python"># creating a 1D array of zeros with length 5
arr_zeros_1d = np.zeros(5)
print("1D array of zeros \n", arr_zeros_1d, type(arr_zeros_1d))
# creating a 2 x 2 matrix of zeros
arr_zeros = np.zeros((2,2))
print("2D matrix of zeros \n", arr_zeros, type(arr_zeros))
# creating a 1D array of ones with length 5
arr_ones_1d = np.ones(5)
print("1D array of ones \n", arr_ones_1d, type(arr_ones_1d))
# creating a 2 x 2 matrix of ones
arr_ones = np.ones((<span
class="hljs-number">2</span>,<span class="hljs-number">2</span>))
print(<span class="hljs-string">"2D matrix of ones \n"</span>,arr_ones, type(arr_ones))</code></pre><h4 id="heading-numpy-arrays-with-values-between-a-specified-range"><strong>NumPy Arrays with values between a specified range</strong></h4><p>With the <strong>np.arange()</strong> function in NumPy, we can create arrays with a range of values. This function takes three arguments: start, stop and step. The start and stop specify the lower and upper limit respectively (the stop value itself is excluded). The step parameter refers to the spacing between the values. The default value for the step is 1. If the step parameter is negative, the values are generated in decreasing order. <em>When the step is negative, the start value should be greater than the stop value. Otherwise, an empty array will be returned.</em></p><pre><code class="lang-python">arr_range_1 = np.arange(start=<span class="hljs-number">1</span>, stop=<span class="hljs-number">12</span>, step=<span class="hljs-number">1</span>)
print(<span class="hljs-string">'Array with start = 1; stop = 12; step = 1 \n'</span>, arr_range_1)
arr_range_2 = np.arange(start=<span class="hljs-number">1</span>, stop=<span class="hljs-number">12</span>, step=<span class="hljs-number">2</span>)
print(<span class="hljs-string">'Array with start = 1; stop = 12; step = 2 \n'</span>, arr_range_2)
arr_range_3 = np.arange(start=<span class="hljs-number">12</span>, stop=<span class="hljs-number">1</span>, step= <span class="hljs-number">-1</span>)
print(<span class="hljs-string">'Array with start = 12; stop = 1; step = -1 \n'</span>, arr_range_3)</code></pre><h4 id="heading-creating-an-identity-matrix-in-numpy"><strong>Creating an Identity matrix in NumPy</strong></h4><p>Identity matrices can also be created in NumPy using the <strong>np.identity()</strong> function. 
Identity matrices are square matrices whose main diagonal elements are 1 and whose remaining elements are 0.</p><pre><code class="lang-python">identity_Arr = np.identity(<span class="hljs-number">4</span>)
print(<span class="hljs-string">"4 x 4 identity matrix \n"</span>, identity_Arr)</code></pre><h2 id="heading-manipulating-numpy-arrays"><strong>Manipulating NumPy Arrays</strong></h2><h3 id="heading-adding-elements-to-a-numpy-array"><strong>Adding elements to a NumPy array</strong></h3><p>So far we saw how to create a NumPy array. Let us now understand how to add elements to a NumPy array.</p><pre><code class="lang-python"><span class="hljs-comment"># creating a numpy array</span>
org_array = np.array([<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>,<span class="hljs-number">4</span>,<span class="hljs-number">5</span>,<span class="hljs-number">6</span>])
print(org_array, type(org_array))
<span class="hljs-comment"># appending values to that array</span>
appended_array = np.append(org_array, [<span class="hljs-number">10</span>,<span class="hljs-number">20</span>,<span class="hljs-number">30</span>])
print(appended_array, type(appended_array))</code></pre><p>We can also append values to a two-dimensional array, and here's how we do it using the <strong>np.append()</strong> function.</p><pre><code class="lang-python"><span class="hljs-comment"># creating 2D array</span>
org_array2d = np.array([[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>],[<span class="hljs-number">3</span>,<span class="hljs-number">4</span>],[<span class="hljs-number">5</span>,<span class="hljs-number">6</span>]])
print(<span class="hljs-string">'Before appending \n'</span>,org_array2d, type(org_array2d))
<span class="hljs-comment"># appending values to that array with axis = 0</span>
appended_array2d = np.append(org_array2d, [[<span class="hljs-number">10</span>,<span class="hljs-number">20</span>]], axis=<span 
class="hljs-number">0</span>)
print(<span class="hljs-string">'After appending with axis = 0\n'</span>,appended_array2d, type(appended_array2d))
<span class="hljs-comment"># appending values to that array without specifying axis parameter</span>
appended_array2d = np.append(org_array2d, [[<span class="hljs-number">10</span>,<span class="hljs-number">20</span>]])
print(<span class="hljs-string">'After appending without axis\n'</span>,appended_array2d, type(appended_array2d))</code></pre><p>In the np.append() function, the axis value is set to None by default. It means that both the original array and the values to be appended will be flattened to one dimension, and then the new values will be appended. When we set axis = 0, the values are appended row-wise.</p><h3 id="heading-removing-elements-from-a-numpy-array"><strong>Removing elements from a NumPy array</strong></h3><p>We can remove any desired element from a NumPy array using the <strong>np.delete()</strong> function. The <strong>np.delete()</strong> function takes the array and the index positions to be deleted as parameters. 
Indexing in NumPy works the same way as indexing Python lists.</p><pre><code class="lang-python">rand_array = np.random.randint(low = <span class="hljs-number">50</span>, size = <span class="hljs-number">10</span>)
print(<span class="hljs-string">"Array before deleting \n"</span>, rand_array)
new_array = np.delete(rand_array, obj = [<span class="hljs-number">3</span>,<span class="hljs-number">4</span>,<span class="hljs-number">5</span>])
print(<span class="hljs-string">"Array After deleting \n"</span>, new_array)</code></pre><p>We can notice that the elements at indices 3, 4 and 5 are deleted (indexing in Python starts from 0).</p><h3 id="heading-reshaping-a-numpy-array"><strong>Reshaping a NumPy array</strong></h3><p>We can use the <strong>shape</strong> attribute to determine the dimensions of any NumPy array.</p><pre><code class="lang-python">array_np_1 = np.array([[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>],[<span class="hljs-number">3</span>,<span class="hljs-number">4</span>],[<span class="hljs-number">5</span>,<span class="hljs-number">6</span>]])
print(<span class="hljs-string">"The shape is :"</span>,array_np_1.shape)
array_np_2 = np.array([<span class="hljs-number">1</span>,<span class="hljs-number">3</span>,<span class="hljs-number">5</span>,<span class="hljs-number">7</span>,<span class="hljs-number">9</span>,<span class="hljs-number">11</span>])
print(<span class="hljs-string">"The shape is :"</span>,array_np_2.shape)</code></pre><p>It is also possible to change the shape of an array at any given point using the <strong>np.reshape()</strong> function in NumPy.</p><pre><code class="lang-python">rand_array = np.random.randint(low=<span class="hljs-number">10</span>, high=<span class="hljs-number">20</span>, size=(<span class="hljs-number">3</span>,<span class="hljs-number">2</span>))
print(<span class="hljs-string">"The original array is \n"</span>,rand_array)
print(<span class="hljs-string">'The shape of the original array is :'</span>, 
rand_array.shape)
reshaped_array = np.reshape(rand_array, newshape =(<span class="hljs-number">2</span>, <span class="hljs-number">3</span>))
print(<span class="hljs-string">"The reshaped array is \n"</span>,reshaped_array)
print(<span class="hljs-string">'The shape of the array after reshaping is :'</span>, reshaped_array.shape)</code></pre><p>In the above example, we changed the dimensions of a 3 x 2 matrix to 2 x 3. But a lot more can be done using the reshape function. We can even change a 1D array to a 2D array or change a 2D array to 1D and much more!</p><pre><code class="lang-python"><span class="hljs-comment">#2D to 1D</span>
rand_array = np.random.randint(low=<span class="hljs-number">10</span>, high=<span class="hljs-number">20</span>, size=(<span class="hljs-number">3</span>,<span class="hljs-number">2</span>))
print(<span class="hljs-string">"The original array is \n"</span>,rand_array)
print(<span class="hljs-string">'The shape of the original array is :'</span>, rand_array.shape)
reshaped_array = np.reshape(rand_array, newshape = (<span class="hljs-number">6</span>,))
print(<span class="hljs-string">"The reshaped array is \n"</span>,reshaped_array)
print(<span class="hljs-string">'The shape of the array after reshaping is :'</span>, reshaped_array.shape)</code></pre><pre><code class="lang-python"><span class="hljs-comment">#1D to 2D</span>
rand_array = np.random.randint(low=<span class="hljs-number">10</span>, high=<span class="hljs-number">20</span>, size=<span class="hljs-number">10</span>)
print(<span class="hljs-string">"The original array is \n"</span>,rand_array)
print(<span class="hljs-string">'The shape of the original array is :'</span>, rand_array.shape)
reshaped_array = np.reshape(rand_array, newshape = (<span class="hljs-number">5</span>,<span class="hljs-number">2</span>))
print(<span class="hljs-string">"The reshaped array is \n"</span>,reshaped_array)
print(<span class="hljs-string">'The shape of the array after reshaping is :'</span>, 
reshaped_array.shape)</code></pre><pre><code class="lang-python"><span class="hljs-comment">#3D to 2D</span>
rand_array = np.random.randint(low=<span class="hljs-number">10</span>, high=<span class="hljs-number">20</span>, size=(<span class="hljs-number">2</span>,<span class="hljs-number">3</span>,<span class="hljs-number">4</span>))
print(<span class="hljs-string">"The original array is \n"</span>,rand_array)
print(<span class="hljs-string">'The shape of the original array is :'</span>, rand_array.shape)
reshaped_array = np.reshape(rand_array, newshape = (<span class="hljs-number">6</span>,<span class="hljs-number">4</span>))
print(<span class="hljs-string">"The reshaped array is \n"</span>,reshaped_array)
print(<span class="hljs-string">'The shape of the array after reshaping is :'</span>, reshaped_array.shape)</code></pre><p>We have now seen a lot of combinations using reshape. In all the above examples, every time we provided a new shape, we made sure that it is compatible with the original shape. For example, let us say that the original array is of the dimension 4 x 2. It means that there are 8 elements in the array. When we try to reshape this array, we can reshape it into any of the following combinations</p><ul><li><p>2 x 4</p></li><li><p>1 x 8</p></li><li><p>8 x 1</p></li><li><p>2 x 2 x 2</p></li></ul><p>No other combinations are possible, because the total number of elements must stay the same. So, when we provide the new dimensions to the <strong>np.reshape()</strong> function, we offer the above dimensions in the form of tuples such as (2,4), (1,8), (8,1) and (2,2,2). NumPy also gives us the provision to set one of the new shape parameters to -1. It implies that it is an unknown dimension, and NumPy will figure it out by itself using the number of elements in the array. 
Let us understand it with some examples.</p><pre><code class="lang-python">rand_array = np.random.randint(low=<span class="hljs-number">10</span>, high=<span class="hljs-number">50</span>, size=(<span class="hljs-number">4</span>,<span class="hljs-number">2</span>))
print(<span class="hljs-string">"The original array is \n"</span>,rand_array)
print(<span class="hljs-string">'The shape of the original array is :'</span>, rand_array.shape)
reshaped_array = np.reshape(rand_array, newshape = (<span class="hljs-number">-1</span>,<span class="hljs-number">4</span>))
print(<span class="hljs-string">"The reshaped array is \n"</span>,reshaped_array)
print(<span class="hljs-string">'The shape of the array after reshaping is :'</span>, reshaped_array.shape)</code></pre><p>Notice that we provided (-1,4) as the newshape value to the <strong>np.reshape()</strong> function, but NumPy automatically figured out that the unknown dimension is 2. Let us continue exploring.</p><pre><code class="lang-python"><span class="hljs-comment"># reshape using (-1,1)</span>
reshaped_array = np.reshape(rand_array, newshape = (<span class="hljs-number">-1</span>,<span class="hljs-number">1</span>))
print(<span class="hljs-string">"The reshaped array is \n"</span>,reshaped_array)
print(<span class="hljs-string">'The shape of the array after reshaping is :'</span>, reshaped_array.shape)</code></pre><pre><code class="lang-python"><span class="hljs-comment"># reshape using (1,-1)</span>
reshaped_array = np.reshape(rand_array, newshape = (<span class="hljs-number">1</span>,<span class="hljs-number">-1</span>))
print(<span class="hljs-string">"The reshaped array is \n"</span>,reshaped_array)
print(<span class="hljs-string">'The shape of the array after reshaping is :'</span>, reshaped_array.shape)</code></pre><pre><code class="lang-python"><span class="hljs-comment"># reshape using (-1,-1) ---> throws an error</span>
reshaped_array = np.reshape(rand_array, newshape = (<span class="hljs-number">-1</span>,<span 
class="hljs-number">-1</span>))
print(<span class="hljs-string">"The reshaped array is \n"</span>,reshaped_array)
print(<span class="hljs-string">'The shape of the array after reshaping is :'</span>, reshaped_array.shape)</code></pre><p>You must have by now understood why we can't specify more than one dimension value as -1. NumPy can identify the unknown dimension only if the other dimensions are explicitly mentioned, and this is obvious because, if we have an original array of dimension 4 x 3 and we want the new reshaped array to have 2 columns, then the number of rows can be easily figured out by calculating (4 x 3) / 2, which is equal to 6. But when we don't know the number of columns, it is impossible to calculate the unknown dimension value. </p><h3 id="heading-sorting-numpy-arrays"><strong>Sorting NumPy arrays</strong></h3><p>We can easily sort a NumPy array using the <strong>np.sort()</strong> function.</p><pre><code class="lang-python">arr = np.random.randint(low=<span class="hljs-number">20</span>, size = <span class="hljs-number">12</span>)
print(<span class="hljs-string">"The actual array is \n"</span>, arr)
print(<span class="hljs-string">"The sorted array is \n"</span>,np.sort(arr))</code></pre><p>We can also sort 2D arrays both row-wise and column-wise using the axis parameter.</p><pre><code class="lang-python">arr = np.random.randint(low=<span class="hljs-number">10</span>, high=<span class="hljs-number">30</span>, size = (<span class="hljs-number">4</span>,<span class="hljs-number">3</span>))
print(<span class="hljs-string">"The actual array is \n"</span>, arr)
print(<span class="hljs-string">"The column wise sorted array is \n"</span>,np.sort(arr, axis=<span class="hljs-number">0</span>))
print(<span class="hljs-string">"The row wise sorted array is \n"</span>,np.sort(arr, axis=<span class="hljs-number">1</span>))</code></pre><h3 id="heading-flattening-numpy-arrays"><strong>Flattening NumPy arrays</strong></h3><p>Flattening an array is nothing 
but collapsing higher-dimensional arrays into one dimension. There are two functions to do this in NumPy: <strong>np.ndarray.flatten()</strong> and <strong>np.matrix.flatten()</strong>. The former is meant for all higher-dimensional arrays while the latter is meant specifically for matrices.</p><pre><code class="lang-python">arr_2d = np.random.randint(low=<span class="hljs-number">10</span>, high=<span class="hljs-number">30</span>, size = (<span class="hljs-number">4</span>,<span class="hljs-number">3</span>))
print(<span class="hljs-string">"The actual array is :\n"</span>, arr_2d)
print(<span class="hljs-string">'Array after flattening using method 1 :\n'</span>, np.ndarray.flatten(arr_2d))
print(<span class="hljs-string">'Array after flattening using method 2 :\n'</span>, np.matrix.flatten(arr_2d))</code></pre><h3 id="heading-rotating-numpy-arrays"><strong>Rotating NumPy arrays</strong></h3><p>The <strong>np.rot90()</strong> function can be used to rotate a NumPy array by 90 degrees. This is explained with the example below.</p><p>The parameter <strong>k</strong> specifies how many times the matrix has to be rotated. The <strong>axes</strong> parameter specifies the plane in which the matrix has to be rotated. 
Considering a 2D array, if the axes value is specified as (0,1), then the array is rotated in the anti-clockwise direction, and when the axes value is specified as (1,0), the array is rotated in the clockwise direction.</p><pre><code class="lang-python">arr = np.array([[<span class="hljs-number">10</span>,<span class="hljs-number">20</span>],[<span class="hljs-number">30</span>,<span class="hljs-number">40</span>]])
print(<span class="hljs-string">'The original array is \n'</span>, arr)
print(<span class="hljs-string">'The array after rotating it by 90 degrees once in the anti-clockwise direction is \n'</span>, np.rot90(arr,k=<span class="hljs-number">1</span>, axes=(<span class="hljs-number">0</span>,<span class="hljs-number">1</span>)))
print(<span class="hljs-string">'The array after rotating it by 90 degrees once in the clockwise direction is \n'</span>, np.rot90(arr,k=<span class="hljs-number">1</span>, axes=(<span class="hljs-number">1</span>,<span class="hljs-number">0</span>)))
print(<span class="hljs-string">'The array after rotating it by 90 degrees twice in the clockwise direction is \n'</span>, np.rot90(arr,k=<span class="hljs-number">2</span>, axes=(<span class="hljs-number">1</span>,<span class="hljs-number">0</span>)))
print(<span class="hljs-string">'The array after rotating it by 90 degrees thrice in the anti-clockwise direction is \n'</span>, np.rot90(arr,k=<span class="hljs-number">3</span>, axes=(<span class="hljs-number">0</span>,<span class="hljs-number">1</span>)))</code></pre><h2 id="heading-matrix-operations-in-numpy"><strong>Matrix Operations in NumPy</strong></h2><p>If you are familiar with matrices in mathematics, then you would probably know about various matrix operations such as</p><ul><li><p>Matrix addition</p></li><li><p>Matrix subtraction</p></li><li><p>Matrix multiplication</p></li><li><p>Matrix vector multiplication</p></li><li><p>Matrix division</p></li><li><p>Matrix transpose</p></li><li><p>Matrix 
inverse</p></li><li><p>Matrix Power</p></li><li><p>Determining Diagonal Elements of a matrix</p></li><li><p>Evaluating Upper and Lower triangle elements of a matrix</p></li></ul><p>We will see how to implement all the above operations using NumPy, one by one.</p><h3 id="heading-matrix-addition-and-subtraction"><strong>Matrix Addition and Subtraction</strong></h3><p>Matrix addition and subtraction can be easily done just by using the <strong>+</strong> and <strong>-</strong> operators directly, provided the dimensions of both matrices are the same.</p><pre><code class="lang-python">mat1 = np.random.randint(low = <span class="hljs-number">1</span>, high= <span class="hljs-number">10</span>, size = (<span class="hljs-number">3</span>,<span class="hljs-number">3</span>))
print(<span class="hljs-string">"Matrix 1 \n"</span>, mat1)
mat2 = np.random.randint(low = <span class="hljs-number">10</span>, high= <span class="hljs-number">20</span>, size = (<span class="hljs-number">3</span>,<span class="hljs-number">3</span>))
print(<span class="hljs-string">"Matrix 2 \n"</span>, mat2)
print(<span class="hljs-string">"Matrix 1 + Matrix 2 \n"</span>, mat1 + mat2)
print(<span class="hljs-string">"Matrix 1 - Matrix 2 \n"</span>, mat1 - mat2)</code></pre><p>We can also use the <strong>np.add()</strong> function to add two matrices and <strong>np.subtract()</strong> to subtract two matrices.</p><pre><code class="lang-python">mat1 = np.random.randint(low = <span class="hljs-number">1</span>, high= <span class="hljs-number">10</span>, size = (<span class="hljs-number">3</span>,<span class="hljs-number">3</span>))
print(<span class="hljs-string">"Matrix 1 \n"</span>, mat1)
mat2 = np.random.randint(low = <span class="hljs-number">10</span>, high= <span class="hljs-number">20</span>, size = (<span class="hljs-number">3</span>,<span class="hljs-number">3</span>))
print(<span class="hljs-string">"Matrix 2 \n"</span>, mat2)
print(<span class="hljs-string">"Matrix 1 + Matrix 2 
\n"</span>, np.add(mat1, mat2))
print(<span class="hljs-string">"Matrix 1 - Matrix 2 \n"</span>, np.subtract(mat1,mat2))</code></pre><h3 id="heading-matrix-multiplication"><strong>Matrix Multiplication</strong></h3><p>The <strong>np.matmul()</strong> function multiplies two matrices in the conventional manner (the number of columns of the first matrix must equal the number of rows of the second).</p><pre><code class="lang-python">mat1 = np.random.randint(low = <span class="hljs-number">1</span>, high= <span class="hljs-number">10</span>, size = (<span class="hljs-number">3</span>,<span class="hljs-number">2</span>))
print(<span class="hljs-string">"Matrix 1 \n"</span>, mat1)
mat2 = np.random.randint(low = <span class="hljs-number">10</span>, high= <span class="hljs-number">20</span>, size = (<span class="hljs-number">2</span>,<span class="hljs-number">3</span>))
print(<span class="hljs-string">"Matrix 2 \n"</span>, mat2)
print(<span class="hljs-string">"Matrix 1 x Matrix 2 (Matrix multiplication) \n"</span>, np.matmul(mat1, mat2))</code></pre><p>If the two matrices are of the same dimension, then the <strong>*</strong> operator does element-wise multiplication.</p><pre><code class="lang-python">mat1 = np.random.randint(low = <span class="hljs-number">1</span>, high= <span class="hljs-number">10</span>, size = (<span class="hljs-number">2</span>,<span class="hljs-number">2</span>))
print(<span class="hljs-string">"Matrix 1 \n"</span>, mat1)
mat2 = np.random.randint(low = <span class="hljs-number">10</span>, high= <span class="hljs-number">20</span>, size = (<span class="hljs-number">2</span>,<span class="hljs-number">2</span>))
print(<span class="hljs-string">"Matrix 2 \n"</span>, mat2)
print(<span class="hljs-string">"Matrix 1 x Matrix 2 (Element wise multiplication) \n"</span>, mat1*mat2)</code></pre><h3 id="heading-matrix-division"><strong>Matrix Division</strong></h3><p>The <strong>np.divide()</strong> function performs element-wise division of two matrices. 
Division of two matrices can also be performed using the <strong>/</strong> operator.</p><pre><code class="lang-python">mat1 = np.random.randint(low = <span class="hljs-number">1</span>, high= <span class="hljs-number">10</span>, size = (<span class="hljs-number">2</span>,<span class="hljs-number">2</span>))
print(<span class="hljs-string">"Matrix 1 \n"</span>, mat1)
mat2 = np.random.randint(low = <span class="hljs-number">10</span>, high= <span class="hljs-number">20</span>, size = (<span class="hljs-number">2</span>,<span class="hljs-number">2</span>))
print(<span class="hljs-string">"Matrix 2 \n"</span>, mat2)
print(<span class="hljs-string">"Matrix 1 / Matrix 2 (using / operator) \n"</span>, mat1/mat2)
print(<span class="hljs-string">"Matrix 1 / Matrix 2 (using np.divide function) \n"</span>, np.divide(mat1,mat2))</code></pre><p>NumPy is smart enough to broadcast the elements of the smaller matrix over the larger matrix if that is possible. Let us understand that with the example below.</p><pre><code class="lang-python">mat = np.random.randint(low = <span class="hljs-number">1</span>, high= <span class="hljs-number">10</span>, size = (<span class="hljs-number">2</span>,<span class="hljs-number">2</span>))
print(<span class="hljs-string">"Matrix\n"</span>, mat)
vec = np.random.randint(low = <span class="hljs-number">10</span>, high= <span class="hljs-number">20</span>, size = (<span class="hljs-number">2</span>,<span class="hljs-number">1</span>))
print(<span class="hljs-string">"Vector\n"</span>, vec)
print(<span class="hljs-string">"Matrix / Vector (using / operator) \n"</span>, mat/vec)
print(<span class="hljs-string">"Matrix / vector (using np.divide function) \n"</span>, np.divide(mat,vec))</code></pre><p>In the above example, mat is a 2 x 2 matrix while vec is a 2 x 1 vector. It can be seen from the results that each column of the matrix has been divided element-wise by the vector. 
Or in other terms, the vector has been broadcast over the matrix during division.</p><h3 id="heading-matrix-transpose"><strong>Matrix Transpose</strong></h3><p>Transpose is nothing but reversing the axes of a matrix. When a matrix is transposed, the row elements become the column elements and the column elements become the row elements. The transpose of a matrix can be obtained using the <strong>np.transpose()</strong> function.</p><pre><code class="lang-python">mat = np.random.randint(low = <span class="hljs-number">1</span>, high= <span class="hljs-number">10</span>, size = (<span class="hljs-number">2</span>,<span class="hljs-number">2</span>))
print(<span class="hljs-string">"Original Matrix\n"</span>, mat)
print(<span class="hljs-string">"Transposed Matrix\n"</span>, np.transpose(mat))</code></pre><h3 id="heading-matrix-inverse"><strong>Matrix inverse</strong></h3><p>The <strong>np.linalg.inv()</strong> function provides the inverse of a matrix.</p><p><em>Calculating the inverse of a matrix is an involved job and requires a solid grounding in the basics of matrices. This tutorial mainly focuses on how we can leverage the NumPy package to perform these kinds of scientific computations rather than getting into its mathematical aspects, which is why we won't be going through its mathematical background.</em></p><pre><code class="lang-python">mat = np.array([[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>],[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>,<span class="hljs-number">4</span>],[<span class="hljs-number">5</span>,<span class="hljs-number">6</span>,<span class="hljs-number">0</span>]])
print(<span class="hljs-string">"Original matrix \n"</span>, mat)
print(<span class="hljs-string">"The inverse of the matrix is\n"</span>, np.linalg.inv(mat))</code></pre><h3 id="heading-matrix-power"><strong>Matrix Power</strong></h3><p>Matrix powers do not work the same way as powers of ordinary numbers. 
If A is a 2 x 2 matrix, then A^2 is A times A, where A times A means matrix multiplication and not element-wise multiplication. Such matrix powers can be computed using the <strong>np.linalg.matrix_power()</strong> function in NumPy. In some cases, we might just need the individual squares of all the elements in the matrix. In such cases, we can use the <strong>**</strong> operator. </p><pre><code class="lang-python">mat = np.random.randint(low=<span class="hljs-number">1</span>, high=<span class="hljs-number">10</span>, size=(<span class="hljs-number">2</span>,<span class="hljs-number">2</span>))
print(<span class="hljs-string">'The original matrix is \n'</span>, mat)
mat_square = np.linalg.matrix_power(mat,<span class="hljs-number">2</span>)
print(<span class="hljs-string">'The square of the matrix is \n'</span>, mat_square)
print(<span class="hljs-string">'Square of all the elements in the matrix is \n'</span>, mat**<span class="hljs-number">2</span>)</code></pre><h3 id="heading-extracting-diagonal-of-a-matrix-in-numpy"><strong>Extracting Diagonal of a matrix in NumPy</strong></h3><p>The main diagonal of a square matrix runs from the top left to the bottom right. These diagonal elements can be extracted from the matrix using the <strong>np.diag()</strong> function. The parameter k denotes the diagonal that is required. When k=0, it returns the main diagonal elements. When k=1, it returns the diagonal elements one step above the main diagonal, and when k=-1, it returns the diagonal elements one step below the main diagonal. The value of k is 0 by default. 
Below is an example.</p><pre><code class="lang-python">mat = np.array([[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>],[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>,<span class="hljs-number">4</span>],[<span class="hljs-number">5</span>,<span class="hljs-number">6</span>,<span class="hljs-number">0</span>]])
print(<span class="hljs-string">"Original matrix \n"</span>, mat)
print(<span class="hljs-string">"Main Diagonal elements of the matrix are: \n"</span>, np.diag(mat))
print(<span class="hljs-string">"Elements of the matrix one step above the main diagonal are: \n"</span>, np.diag(mat, k=<span class="hljs-number">1</span>))
print(<span class="hljs-string">"Elements of the matrix one step below the main diagonal are: \n"</span>, np.diag(mat, k=<span class="hljs-number">-1</span>))</code></pre><h3 id="heading-evaluating-upper-and-lower-triangular-matrix-in-numpy"><strong>Evaluating Upper and lower triangular matrix in NumPy</strong></h3><p>Generally in a square matrix, the elements on and above the main diagonal form the upper triangle and the elements on and below the main diagonal form the lower triangle. These upper and lower triangular elements can be easily extracted using the <strong>np.triu()</strong> and <strong>np.tril()</strong> functions respectively. 
Just like we specified the k parameter for the <strong>np.diag()</strong> function, we can also specify the k parameter here to return a matrix with the elements above/below the specified diagonal set to 0.</p><pre><code class="lang-python">mat = np.array([[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>],[<span class="hljs-number">0</span>,<span class="hljs-number">1</span>,<span class="hljs-number">4</span>],[<span class="hljs-number">5</span>,<span class="hljs-number">6</span>,<span class="hljs-number">0</span>]])
print(<span class="hljs-string">"Original matrix \n"</span>, mat)
print(<span class="hljs-string">'The upper triangular matrix \n'</span>, np.triu(mat))
print(<span class="hljs-string">'The lower triangular matrix \n'</span>, np.tril(mat))
print(<span class="hljs-string">'The lower triangular matrix with k=1 \n'</span>, np.tril(mat, k=<span class="hljs-number">1</span>))
print(<span class="hljs-string">'The upper triangular matrix with k=-1 \n'</span>, np.triu(mat, k=<span class="hljs-number">-1</span>))</code></pre><h2 id="heading-indexing-numpy-arrays"><strong>Indexing NumPy arrays</strong></h2><p>Indexing is the most crucial part when it comes to array manipulations. Just like list indexing in Python, indexing in NumPy also begins at 0. The NumPy package has really powerful indexing methods. There are various kinds of indexing in NumPy, but in this tutorial, we will be categorising the indexing methods in the following manner.</p><ul><li><p>Basic indexing</p></li><li><p>Indexing using slicing operator</p></li><li><p>Indexing 2D arrays</p></li><li><p>Indexing 3D arrays</p></li><li><p>Advanced indexing using integer arrays</p></li><li><p>Advanced indexing using Boolean conditions</p></li></ul><h3 id="heading-basic-indexing"><strong>Basic Indexing</strong></h3><p>Let us start with the basics of indexing. 
Let us create an array using the <strong>np.arange()</strong> function.</p><pre><code class="lang-python">arr = np.arange(<span class="hljs-number">0</span>,<span class="hljs-number">150</span>,<span class="hljs-number">10</span>)
print(<span class="hljs-string">'The original array is \n'</span>, arr)
<span class="hljs-comment"># Output:
# The original array is
#  [  0  10  20  30  40  50  60  70  80  90 100 110 120 130 140]</span></code></pre><p>Let us try to grab the element at index 5 of the array. We can easily do that in the following manner</p><pre><code class="lang-python">print(<span class="hljs-string">'The element at index 5 of the array is :'</span>, arr[<span class="hljs-number">5</span>])</code></pre><h3 id="heading-indexing-using-slicing-operator"><strong>Indexing using slicing operator</strong></h3><p>We can also obtain values within a range of indices using the slicing technique. Indexing using slicing works in the <strong>array[start:stop:step]</strong> manner. The start and stop specify the index range's lower and upper limits (the stop index itself is excluded), and the step specifies the spacing between each index. 
Let us understand it with some examples.</p><pre><code class="lang-python">print(<span class="hljs-string">'The elements of the array from 4th index to 10th index: \n'</span>, arr[<span class="hljs-number">4</span>:<span class="hljs-number">10</span>])
print(<span class="hljs-string">'The elements of the array from 2nd index to 12th index in steps of 2: \n'</span>, arr[<span class="hljs-number">2</span>:<span class="hljs-number">12</span>:<span class="hljs-number">2</span>])
print(<span class="hljs-string">'The elements of the array from 14th index to 6th index in steps of -2:\n'</span>, arr[<span class="hljs-number">14</span>:<span class="hljs-number">6</span>:<span class="hljs-number">-2</span>])
print(<span class="hljs-string">'The elements of the array from 3rd index to the end of the array in steps of 2:\n'</span>, arr[<span class="hljs-number">3</span>::<span class="hljs-number">2</span>])
<span class="hljs-comment"># returns the array in a reversed manner</span>
print(<span class="hljs-string">'All the elements of the array in steps of -1: \n'</span>, arr[::<span class="hljs-number">-1</span>])</code></pre><h3 id="heading-indexing-2d-arrays"><strong>Indexing 2D arrays</strong></h3><p>Indexing a two-dimensional array is always done in an <strong>array[row, col]</strong> manner. All the indexing techniques that we used previously with 1D arrays, like slicing and indexing with index lists, can also be used here. 
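One caveat that applies to all basic slices, including the 1D slices above: slicing returns a view into the original array rather than a copy, so writing through a slice modifies the source array.</p>

```python
import numpy as np

arr = np.arange(5)
view = arr[1:4]         # basic slicing returns a view, not a copy
view[0] = 99            # writing through the view...
print(arr)              # ...the original now contains 99 at index 1
print(arr[1:4].copy())  # use .copy() when an independent array is needed
```

<p>Keep this in mind whenever a slice is modified in place. 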
Let us see a few examples to understand it better.</p><pre><code class="lang-python">arr = np.arange(<span class="hljs-number">0</span>,<span class="hljs-number">250</span>,<span class="hljs-number">10</span>).reshape(<span class="hljs-number">5</span>,<span class="hljs-number">5</span>)
print(<span class="hljs-string">'The original array is \n'</span>, arr)</code></pre><pre><code class="lang-python">print(<span class="hljs-string">'The element in 2nd row and 3rd column is:'</span>, arr[<span class="hljs-number">2</span>, <span class="hljs-number">3</span>])
print(<span class="hljs-string">'The element in 3rd row and 1st column is:'</span>, arr[<span class="hljs-number">3</span>, <span class="hljs-number">1</span>])
print(<span class="hljs-string">'All the elements in 3rd row are:'</span>, arr[<span class="hljs-number">3</span>, :])
print(<span class="hljs-string">'All the elements in 2nd column are:'</span>, arr[:, <span class="hljs-number">2</span>])
print(<span class="hljs-string">'Elements in 2nd column with row indices ranging between 1 and 3 are:'</span>, arr[<span class="hljs-number">1</span>:<span class="hljs-number">3</span>, <span class="hljs-number">2</span>])
print(<span class="hljs-string">'Elements in 4th row with column indices ranging between 0 and 3 are:'</span>, arr[<span class="hljs-number">4</span>, <span class="hljs-number">0</span>:<span class="hljs-number">3</span>])
print(<span class="hljs-string">'Elements with row indices ranging between 1 and 3 and column indices ranging between 2 and 4 are: \n'</span>, arr[<span class="hljs-number">1</span>:<span class="hljs-number">3</span>, <span class="hljs-number">2</span>:<span class="hljs-number">4</span>])</code></pre><h3 id="heading-indexing-3d-arrays"><strong>Indexing 3D arrays</strong></h3><p>Imagine 3D arrays as different matrices stacked one on top of the other. 
So while indexing 3D arrays, we don't just mention the row and column index, but also mention in which matrix we should look for the specified row and column indices. 3D arrays are indexed in an <strong>array[matrix, row, col]</strong> manner. Let us see a few examples.</p><pre><code class="lang-python">arr = np.arange(<span class="hljs-number">45</span>).reshape(<span class="hljs-number">3</span>,<span class="hljs-number">3</span>,<span class="hljs-number">5</span>)
print(<span class="hljs-string">'The original array is \n'</span>, arr)</code></pre><pre><code class="lang-python">print(<span class="hljs-string">'The 0th 2D matrix: \n'</span>, arr[<span class="hljs-number">0</span>])
print(<span class="hljs-string">'The 1st 2D matrix: \n'</span>, arr[<span class="hljs-number">1</span>])
print(<span class="hljs-string">'The 2nd row of the 1st 2D matrix: \n'</span>, arr[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,:])
print(<span class="hljs-string">'The 1st row of the 2nd 2D matrix: \n'</span>, arr[<span class="hljs-number">2</span>,<span class="hljs-number">1</span>,:])
print(<span class="hljs-string">'The 0th column of the 1st 2D matrix: \n'</span>, arr[<span class="hljs-number">1</span>,:,<span class="hljs-number">0</span>])
print(<span class="hljs-string">'The 3rd column of the 0th 2D matrix: \n'</span>, arr[<span class="hljs-number">0</span>,:,<span class="hljs-number">3</span>])
print(<span class="hljs-string">'The element present in the 2nd row and 4th column of the 1st 2D matrix:'</span>, arr[<span class="hljs-number">1</span>,<span class="hljs-number">2</span>,<span class="hljs-number">4</span>])
print(<span class="hljs-string">'The element present in the 0th row and 3rd column of the 0th 2D matrix:'</span>, arr[<span class="hljs-number">0</span>,<span class="hljs-number">0</span>,<span class="hljs-number">3</span>])</code></pre><h3 id="heading-advanced-indexing-using-integer-arrays"><strong>Advanced indexing using integer 
arrays</strong></h3><p>As we discussed earlier, NumPy has really powerful and sophisticated indexing methods, and indexing using integer arrays is one among them. Let us consider the following array.</p><pre><code class="lang-python">arr = np.arange(<span class="hljs-number">10</span>)
print(<span class="hljs-string">'The original array is \n'</span>, arr)</code></pre><p>If we want only the elements at indices 3, 5 and 9, we can easily extract those using integer array indexing. To do that, we will first have to create an integer array of indices.</p><pre><code class="lang-python">index_arr_1 = np.array([<span class="hljs-number">3</span>,<span class="hljs-number">5</span>,<span class="hljs-number">9</span>])
print(<span class="hljs-string">'The index array is \n'</span>, index_arr_1)

Output:
The index array <span class="hljs-keyword">is</span> [<span class="hljs-number">3</span> <span class="hljs-number">5</span> <span class="hljs-number">9</span>]</code></pre><p>Now we can easily pass this index array to our original array as follows.</p><pre><code class="lang-python">print(<span class="hljs-string">'The elements at indices 3, 5 and 9 are \n'</span>, arr[index_arr_1])</code></pre><p>We can also repeat an index more than once using index arrays!</p><pre><code class="lang-python">index_arr_2 = np.array([<span class="hljs-number">2</span>,<span class="hljs-number">3</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>,<span class="hljs-number">2</span>,<span class="hljs-number">3</span>])
print(<span class="hljs-string">'The array returned after indexing is \n'</span>, arr[index_arr_2])</code></pre><h3 id="heading-advanced-indexing-using-boolean-conditions"><strong>Advanced indexing using boolean conditions</strong></h3><p>We can also specify a logical condition to extract elements from the array. It returns the elements of the array for which the specified condition is true. 
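Multiple conditions can also be combined element-wise with the & (and) and | (or) operators; each comparison must be wrapped in parentheses, because & and | bind more tightly than the comparison operators. A quick sketch:</p>

```python
import numpy as np

arr = np.arange(12)
# parentheses around each comparison are required
print('Between 3 and 8 (exclusive):', arr[(arr > 3) & (arr < 8)])  # 4, 5, 6, 7
print('Below 2 or above 9:', arr[(arr < 2) | (arr > 9)])           # 0, 1, 10, 11
```

<p>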
The following example explains how to extract elements that are greater than 5 from an array.</p><pre><code class="lang-python">arr = np.arange(<span class="hljs-number">12</span>)
print(<span class="hljs-string">'The original array is \n'</span>, arr)
print(<span class="hljs-string">'The elements that are greater than 5 are \n'</span>, arr[arr><span class="hljs-number">5</span>])
print(<span class="hljs-string">'The elements that are lesser than 5 are \n'</span>, arr[arr<<span class="hljs-number">5</span>])
print(<span class="hljs-string">'The elements that are equal to 5 are \n'</span>, arr[arr==<span class="hljs-number">5</span>])
print(<span class="hljs-string">'The elements that are even are \n'</span>, arr[arr%<span class="hljs-number">2</span>==<span class="hljs-number">0</span>])</code></pre><h2 id="heading-saving-numpy-arrays"><strong>Saving NumPy arrays</strong></h2><p>We can easily save any NumPy array as a .npy file using the <strong>np.save()</strong> function, and here is how to do it!</p><pre><code class="lang-python">arr_to_save = np.arange(<span class="hljs-number">1</span>,<span class="hljs-number">10</span>).reshape(<span class="hljs-number">3</span>,<span class="hljs-number">3</span>)
np.save(file=<span class="hljs-string">'array.npy'</span>, arr=arr_to_save)</code></pre>]]>https://cdn.hashnode.com/res/hashnode/image/upload/v1685873098644/db70c021-1e75-4a80-bc4b-2a18e82ca357.png<![CDATA[Get to know - What is Amazon CloudFront?]]>https://blog.hemath.com/get-to-know-what-is-amazon-cloudfronthttps://blog.hemath.com/get-to-know-what-is-amazon-cloudfrontTue, 01 Nov 2022 10:46:21 GMT<![CDATA[<h2 id="heading-what-is-cloudfront">What is CloudFront?</h2><p>Amazon CloudFront is a fast content delivery network (CDN) service that securely delivers data, videos, applications, and APIs to customers worldwide with low latency and high transfer speeds, all while remaining developer-friendly.</p><p>Amazon CloudFront is a 
web service that accelerates the distribution of static and dynamic web content, such as .html, .css, .js, and image files, to your users. CloudFront distributes your content via a global network of data centers known as edge locations.</p><p>When a user requests content that you're serving with CloudFront, the request is routed to the edge location with the lowest latency (time delay), ensuring that the content is delivered as quickly as possible.</p><p>CloudFront delivers content immediately if it is already in the edge location with the lowest latency.</p><p>If the content isn't in that edge location, CloudFront retrieves it from a predefined origin, such as an Amazon S3 bucket, a MediaPackage channel, or an HTTP server (such as a web server) that you've designated as the source for the definitive version of your content.</p><p>CloudFront accelerates content distribution by routing each user request through the AWS backbone network to the nearest edge location that can best serve your content. This is typically a CloudFront edge server that provides the quickest delivery to the viewer.</p><p>Using the AWS network reduces the number of networks that your users' requests must traverse, improving performance. Users benefit from reduced latency (the time it takes to load the first byte of a file) and faster data transfer rates.</p><p>You also benefit from increased reliability and availability because copies of your files (also known as objects) are now stored (or cached) in multiple edge locations around the world.</p><h2 id="heading-how-does-amazon-cloudfront-work">How Does Amazon CloudFront Work?</h2><p>CloudFront integrates with any AWS origin, including Amazon S3, Amazon EC2, Elastic Load Balancing, and any custom HTTP origin. 
CloudFront's secure and programmable edge computing features, CloudFront Functions and AWS Lambda@Edge, allow you to customize your content delivery.</p><h2 id="heading-key-benefits-of-cloudfront">Key Benefits of CloudFront</h2><ul><li><p><strong>Global Scaled Network for Fast Content Delivery</strong></p><p> Amazon CloudFront is massively scalable and distributed globally. The CloudFront network has over 225 points of presence (PoPs) that are interconnected via the AWS backbone to provide your end users with ultra-low latency performance and high availability.</p><p> The Amazon Web Services backbone is a private network built on a global, fully redundant, parallel 100 GbE metro fiber network connected by trans-oceanic cables across the Atlantic, Pacific, and Indian Oceans, as well as the Mediterranean, Red Sea, and South China Seas.</p><p> Amazon CloudFront intelligently routes your users' traffic to the most performant AWS edge location to serve cached or dynamic content based on network conditions. CloudFront includes a multi-tiered caching architecture by default, which provides improved cache width and origin protection.</p></li><li><p><strong>Deep Integration with AWS</strong></p><p> Amazon CloudFront is easily configured with AWS services like Amazon S3, Amazon EC2, Elastic Load Balancing, Amazon Route 53, and AWS Elemental Media Services.</p><p> As a developer, you can use the AWS Management Console or familiar tools like CloudFormation templates, the AWS Cloud Development Kit, and APIs. The integration of CloudFront with Amazon CloudWatch and Kinesis provides real-time observability via metrics and logs.</p></li></ul><ul><li><p><strong>Security at the Edge</strong></p><p> Amazon CloudFront is a highly secure CDN that protects both the network and the application. 
AWS Shield Standard protects all of your CloudFront distributions by default against the most common network and transport layer DDoS attacks that target your websites or applications.</p><p> To defend against more complex attacks, integrate CloudFront with AWS Shield Advanced and AWS Web Application Firewall (WAF) to create a flexible, layered security perimeter. AWS Managed Rules for AWS WAF provide you with firewall rules curated and managed by Amazon security experts to protect against common CVEs and OWASP Top 10 security risks.</p></li><li><p><strong>Highly Programmable and Secure Edge Computing</strong></p><p> You can easily run code across AWS locations globally with the edge compute features CloudFront Functions and Lambda@Edge, allowing you to personalize content and respond to your end users with reduced latency.</p><p> For example, you can use CloudFront Functions to deliver personalized content based on visitor attributes, generate custom responses, or run A/B testing on AWS infrastructure using your own custom code. You can supplement or completely replace your origin servers with Lambda@Edge. Lambda@Edge can be used to render web pages on the server, manipulate streaming manifest files on the fly for ad insertion, or add security tokens. With built-in security isolation, CloudFront Functions and Lambda@Edge both protect your data from attack.</p></li></ul><ul><li><p><strong>Cost-Effective</strong></p><p> Amazon CloudFront provides global content delivery at a low cost. There are no transfer fees for origin fetches from any AWS origin, and AWS Certificate Manager (ACM) provides free custom TLS certificates.</p><p> CloudFront offers a variety of pricing options, including simple pay-as-you-go pricing with no upfront fees and the CloudFront Security Savings Bundle, which can save you up to 30% more. Custom pricing is available for minimum traffic commitments (typically 10 TB/month or higher) for steeper discounts. 
Your existing AWS Support subscription includes CDN support.</p></li></ul>]]>https://cdn.hashnode.com/res/hashnode/image/upload/v1685961852849/0a5aa447-2821-4fa7-b10d-15cd9a7bf67d.png