
In this blog, we will start with the basics of the famous Python library called Pandas and gradually advance to complex and advanced topics. Let us begin this tutorial on Pandas with a brief introduction to what the library is all about.

What’s Pandas in Python for?

Data science involves collecting, storing, and aggregating data, followed by cleaning, exploring, and analyzing it. Cleaning receives heavy emphasis because data must be in good shape before it can be processed further, so a thorough exploratory data analysis is usually carried out to produce a high-quality dataset. Python offers the Pandas library, whose built-in features support data pre-processing throughout the lifecycle of the data analysis process. A clean dataset is an excellent starting point for hypothesis testing and can also be used for further modeling with data analysis and machine learning algorithms.

Developed by Wes McKinney, Pandas is a high-level data manipulation library for the Python programming language. It is a fast, powerful, versatile, and easy-to-use open-source tool for data analysis and manipulation. It is built on the NumPy package, and the DataFrame is its primary data structure.

Python Pandas - Series

A Series is quite similar to a NumPy array (it is built on top of the NumPy array object). A Series can include axis labels, which means it can be indexed by a label instead of just a number location, which distinguishes the NumPy array from a Series. It can also hold any arbitrary Python Object rather than just numeric data.

Create a Pandas Series

We can convert a NumPy array, dictionary, or list to a Series:

#Importing libraries
import numpy as np
import pandas as pd
my_labels = ['x','y','z']
demo_list = [1,2,3]
demo_array = np.array([1,2,3])
demo_dict = {'x':1,'y':2,'z':3}
#Using a NumPy array
pd.Series(data = demo_array,index = my_labels)
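
The same constructor also accepts a list or a dictionary. Here is a minimal sketch that reuses the names from the example above; with a dictionary, the keys become the index labels, so no separate index argument is needed.

```python
import pandas as pd

my_labels = ['x', 'y', 'z']
demo_list = [1, 2, 3]
demo_dict = {'x': 1, 'y': 2, 'z': 3}

# From a list, with explicit labels
series_from_list = pd.Series(data=demo_list, index=my_labels)

# From a dictionary: keys become the index, values become the data
series_from_dict = pd.Series(demo_dict)

# Label-based access works the same way for both
print(series_from_list['y'])
print(series_from_dict['z'])
```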

What is the main difference between a Pandas series and a single-column Dataframe in Python?

A Pandas Series has only one dimension, whereas a DataFrame has two. A Series carries a single optional name (its name attribute), while a single-column DataFrame keeps its column label in a separate column index, so it still has both rows and columns. Every column of a DataFrame can be extracted as a Series.

There are a few intriguing points to consider -

1. Indexes and columns in Pandas Dataframes and Series make data access and retrieval simple. Their labels can also be renamed or replaced.

2. In a Dataframe, a column is essentially a Series. Series operations are used when you simply wish to manipulate a single column of data. They are commonly used in graph plotting.

3. Dataframes are often used to represent data in a tabular format, which simplifies the analysis, extraction, and alteration of two-dimensional data.
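
To make the difference in dimensions concrete, here is a small sketch using a made-up "Students" Dataframe (the data is only for illustration). Selecting a column with single brackets gives a Series, while double brackets give a single-column Dataframe.

```python
import pandas as pd

# Hypothetical data, used only for illustration
demo_dataframe = pd.DataFrame({"Name": ["Roy", "Jason", "Sancho"], "Marks": [90, 87, 95]})

# Single brackets return a one-dimensional Series
marks_series = demo_dataframe["Marks"]

# Double brackets return a two-dimensional, single-column DataFrame
marks_frame = demo_dataframe[["Marks"]]

print(type(marks_series).__name__, marks_series.ndim, marks_series.shape)
print(type(marks_frame).__name__, marks_frame.ndim, marks_frame.shape)
```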

How to convert a Pandas Series to a list in Python?

To convert a Series to a list, use the Pandas tolist() method. The object starts out as type pandas.core.series.Series, and tolist() returns an ordinary Python list.

#Importing library
import pandas as pd
#Creating series
demo_series = pd.Series([1,2,3,4,5])
#Converting series to list
demo_list = demo_series.tolist()
print("Data type before converting = ",type(demo_series))
print("Data type after converting = ",type(demo_list))

How to convert a list to a Pandas Series in Python?

We can convert a list to a Pandas Series directly by passing the list object to the Series constructor.

#Importing library
import pandas as pd
#Creating list
demo_list = [1,2,3]
#Converting list to series
demo_series = pd.Series(demo_list)
print("Data type before converting = ",type(demo_list))
print("Data type after converting = ",type(demo_series))

How to convert Pandas series to Dataframe in Python?

To convert a Series to a Dataframe, use the Pandas to_frame() method. The object starts out as type pandas.core.series.Series, and to_frame() returns a pandas.core.frame.DataFrame.

#Importing library
import pandas as pd
#Creating series
demo_series = pd.Series([1,2,3,4,5])
#Converting series to dataframe
demo_dataframe = demo_series.to_frame()
print("Data type before converting = ",type(demo_series))
print("Data type after converting = ",type(demo_dataframe))

How to sort a Pandas series in Python?

The Series.sort_values() function sorts a Series by its values in ascending or descending order. Its kind parameter also lets you choose which of the built-in sorting algorithms (quicksort, mergesort, heapsort, or stable) is used.

#Importing library
import pandas as pd
#Creating series
demo_series = pd.Series([23,10,5,16,30])
print("Original Series =>\n",demo_series)
#Sorting series in ascending order
asc_series = demo_series.sort_values()
#Sorting series in descending order
dsc_series = demo_series.sort_values(ascending=False)
#To make changes in original series
demo_series.sort_values(inplace=True)
print("Sorted Series in ascending order =>\n",asc_series)
print("Sorted Series in descending order =>\n",dsc_series)
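
A few other sort_values() options are worth knowing. The sketch below, on a made-up Series with one missing value and one tie, shows the kind, na_position, and ignore_index parameters; treat it as an illustration rather than a recommended default.

```python
import pandas as pd

# Hypothetical values: one missing entry and a tie between labels 1 and 4
demo_series = pd.Series([23, 10, None, 16, 10])

# kind='stable' keeps tied values in their original order;
# na_position='first' places missing values at the top
sorted_series = demo_series.sort_values(kind='stable', na_position='first')

# ignore_index=True relabels the result 0..n-1 instead of keeping the old labels
relabelled = demo_series.sort_values(ignore_index=True)

print(sorted_series)
print(relabelled)
```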

Python Pandas - Dataframes

DataFrames are Pandas' workhorses, and they are modeled on the data frames of the R programming language. A DataFrame can be thought of as a collection of Series objects that share the same index. It is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labelled axes (rows and columns).

Key features -

1. It can consist of columns with different data types.

2. We can perform arithmetic operations on rows and columns.

3. It is mutable.

How to create Pandas Dataframe in Python?

We can create a Pandas Dataframe from a dictionary or list:

#Example 1
#Importing libraries
import pandas as pd
#Creating one dimensional list
demo_list = [1,2,3]
#Creating Dataframe from list
pd.DataFrame(demo_list)

#Example 2
#Importing libraries
import pandas as pd
#Creating two dimensional list
demo_list = [['Roy',1],['Jason',2],['Sancho',3]]
#Creating dataframe from list
demo_dataframe = pd.DataFrame(demo_list,columns=['Name','Roll No'])

Using "Dictionary"
#Importing libraries
import pandas as pd
#Creating dictionary
demo_dict = {"Name":["Roy","Jason","Sancho"],"Roll No":[1,2,3]}
#Creating dataframe from dictionary
demo_dataframe = pd.DataFrame(demo_dict)

How to add a column to Pandas Dataframe in Python?

Let's assume we have a "Students" Dataframe with two columns, "Name" and "Roll No". Now we want to add a new "Marks" column to this existing Dataframe.

#Importing libraries
import pandas as pd
#Creating dictionary
demo_dict = {"Name":["Roy","Jason","Sancho"],"Roll No":[1,2,3]}
#Creating dataframe from dictionary
demo_dataframe = pd.DataFrame(demo_dict)
print("Original dataframe =>\n",demo_dataframe)
#Adding new column "Marks"
demo_dataframe["Marks"] = [90,87,95]
#Dataframe after adding a new column.
demo_dataframe

How to append a Pandas Dataframe to another Dataframe in Python?

The concat() function stacks the rows of one Dataframe beneath another and returns a new Dataframe object. (The older DataFrame.append() method did the same job, but it was deprecated in pandas 1.4 and removed in pandas 2.0.) Columns that are missing from one of the Dataframes are created in the result, and the cells with no data are filled with NaN.

#Importing libraries
import pandas as pd
#Creating dataframe1
demo_dataframe1 = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'], 'Roll No': [1,2,3,4]})
#Creating dataframe2
demo_dataframe2 = pd.DataFrame({'Name': ['E', 'F', 'G', 'H'], 'Roll No': [5,6,7,8]})
#Appending dataframe2 to dataframe1 (DataFrame.append() was removed in pandas 2.0)
pd.concat([demo_dataframe1, demo_dataframe2])
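
To see the NaN filling in action, here is a small sketch with pd.concat() in which only the second Dataframe has a "Marks" column. The data is made up for illustration.

```python
import pandas as pd

demo_dataframe1 = pd.DataFrame({'Name': ['A', 'B'], 'Roll No': [1, 2]})
# The 'Marks' column exists only in the second frame (hypothetical data)
demo_dataframe2 = pd.DataFrame({'Name': ['C', 'D'], 'Roll No': [3, 4], 'Marks': [88, 92]})

# ignore_index=True gives the combined rows a fresh 0..n-1 index
combined = pd.concat([demo_dataframe1, demo_dataframe2], ignore_index=True)
print(combined)
```

The first two rows have no "Marks" value, so they show NaN in that column.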

How to sort Pandas Dataframe in Python?

The Dataframe.sort_values() function sorts a Dataframe by the values in one or more columns, in ascending or descending order. Its kind parameter also lets you pick one of the built-in sorting algorithms.

Let's assume we have a "Students" Dataframe. Now we want to sort this Dataframe according to the "Marks" column.

#Importing libraries
import pandas as pd
#Creating dictionary
demo_dict = {"Name":["Roy","Jason","Sancho"],"Roll No":[1,2,3],"Marks":[90,87,95]}
#Creating dataframe from dictionary
demo_dataframe = pd.DataFrame(demo_dict)
print("Original dataframe =>\n",demo_dataframe)
#Sorting on the basis of the "Marks" column in ascending order.
demo_dataframe = demo_dataframe.sort_values(by="Marks")
print("\n Sorted dataframe in ascending order of Marks =>\n",demo_dataframe)
#Sorting on the basis of the "Marks" column in descending order.
demo_dataframe = demo_dataframe.sort_values(by="Marks",ascending=False)
print("\n Sorted dataframe in descending order of Marks =>\n",demo_dataframe)

How to export Pandas Dataframe to CSV in Python?

We can use the to_csv() function to export Pandas Dataframe to CSV.

Let's assume we have a "Students" Dataframe. Now we want to export this Dataframe to CSV.

#Importing libraries
import pandas as pd
#Creating dictionary
demo_dict = {"Name":["Roy","Jason","Sancho"],"Roll No":[1,2,3],"Marks":[90,87,95]}
#Creating dataframe from dictionary
demo_dataframe = pd.DataFrame(demo_dict)
#Exporting dataframe to "Student.csv" file.
demo_dataframe.to_csv("Student.csv")
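
By default, to_csv() also writes the row index as the first column. The sketch below shows the difference that index=False makes; it calls to_csv() without a path so the CSV text comes back as a string and no file is created.

```python
import io
import pandas as pd

demo_dataframe = pd.DataFrame({"Name": ["Roy", "Jason", "Sancho"], "Roll No": [1, 2, 3]})

# With no path argument, to_csv() returns the CSV text as a string
with_index = demo_dataframe.to_csv()
without_index = demo_dataframe.to_csv(index=False)

# The header of the first version starts with an empty field for the index
print(with_index.splitlines()[0])
print(without_index.splitlines()[0])

# Reading the text back gives the same two columns
round_trip = pd.read_csv(io.StringIO(without_index))
```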

What is Pandas Dataframe index?

Indexing, also called subset selection, involves picking specific rows and columns of data from a DataFrame. You can select all rows and a few columns, all columns and a few rows, or a few rows and columns as needed.

Let's assume we have a "Students" Dataframe. Now let's play with the index to get some part of the data from a Dataframe.

#Importing libraries
import pandas as pd
#Creating dictionary
demo_dict = {"Name":["Roy","Jason","Sancho"],"Roll No":[1,2,3],"Marks":[90,87,95]}
#Creating dataframe from dictionary
demo_dataframe = pd.DataFrame(demo_dict)
demo_dataframe
#Selecting only the "Name" column
demo_dataframe['Name']
#Selecting only 2nd row
demo_dataframe.iloc[1,:]
#Selecting top 2 rows
demo_dataframe.iloc[0:2,:]
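
Besides position-based iloc, Pandas also offers label-based selection with loc, which accepts boolean masks too. A short sketch on the same "Students" data:

```python
import pandas as pd

demo_dataframe = pd.DataFrame(
    {"Name": ["Roy", "Jason", "Sancho"], "Roll No": [1, 2, 3], "Marks": [90, 87, 95]}
)

# Label-based: rows labelled 0 through 1 (the end label is included), two named columns
subset = demo_dataframe.loc[0:1, ["Name", "Marks"]]

# Boolean mask: names of the rows where Marks is at least 90
high_scorers = demo_dataframe.loc[demo_dataframe["Marks"] >= 90, "Name"]

print(subset)
print(high_scorers)
```

Note that a loc slice includes its end label, whereas an iloc slice excludes its end position.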

How to read CSV files using Pandas in Python?

We can use the read_csv() function to read CSV files in Pandas.

#Importing libraries
import pandas as pd
#Reading data using read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")

Python Pandas - Aggregations

How to use Pandas to perform aggregations such as sum, min, max on Dataframe columns in Python?

Python Pandas provides built-in functions to perform aggregations on Dataframe columns. To find the minimum and maximum elements, we can use the min() and max() functions, respectively. The sum() function can be used to find the sum of elements.

#Importing libraries
import pandas as pd
#Reading data using read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")
#Displaying data
data
#Sum aggregation function
data['Points'].sum()

#Min aggregation function
data['Points'].min()

#Max aggregation function
data['Points'].max()
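
Several aggregations can also be computed in one call with agg(). The sketch below uses a tiny made-up "Points" column in place of the real dataset, so the numbers are only illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the 'Points' column of the dataset
data = pd.DataFrame({"Points": [120, 45, 300, 87]})

# One call returns a Series indexed by the aggregation names
summary = data["Points"].agg(["sum", "min", "max"])
print(summary)
```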

How to apply the Pandas group by aggregation in Python?

In Pandas, the groupby operation splits the data frame into groups based on some criterion, applies a function to each group, and combines the results. It is used to group large amounts of data and perform computations on the groups created.

#Importing libraries
import pandas as pd
#Reading data using read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")
#Displaying data
data
#Find out total points scored by each category of gender.
data[['Artist.gender','Points']].groupby('Artist.gender').sum()

#Find out the minimum point scored by each category of gender.
data[['Artist.gender','Points']].groupby('Artist.gender').min()

#Find out the maximum point scored by each category of gender.
data[['Artist.gender','Points']].groupby('Artist.gender').max()

How to use Pandas to apply groupby with two different aggregations(sum and mean) on Dataframe columns in Python?

Here, we demonstrate the use of grouping the dataset to perform further aggregations based on the grouping.

#Importing libraries
import pandas as pd
#Reading data using read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")
#Displaying data
data
#Find out mean points scored as well as total points scored by each category of gender.
data.groupby("Artist.gender").agg({"Points": ["mean", "sum"]})
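
If you would like readable column names in the result, named aggregation is one option. This is a sketch on made-up data; the output names mean_points and total_points are our own choice, not part of the dataset.

```python
import pandas as pd

# Hypothetical rows standing in for the real dataset
data = pd.DataFrame({
    "Artist.gender": ["Female", "Male", "Female", "Male"],
    "Points": [100, 50, 200, 150],
})

# Each keyword becomes an output column: name=(source column, aggregation)
result = data.groupby("Artist.gender").agg(
    mean_points=("Points", "mean"),
    total_points=("Points", "sum"),
)
print(result)
```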

How to apply Pandas median aggregation in Python?

The Pandas library has the built-in function median() that can find the median value in a particular column.

#Importing libraries
import pandas as pd
#Reading data using read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")
#Displaying data
data
#Median aggregation function
data['Points'].median()

Python Pandas - Missing Data

How to find missing values in Pandas Dataframe?

Once data is gathered, it is often found that there are several missing values that can interfere with the analysis. It is essential first to identify the missing values so that they can be handled accordingly.

#Importing libraries
import pandas as pd
#Reading data using read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")
#Displaying data
data
#Checking for missing values
data.isna().sum()
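
Beyond the per-column counts, it is often handy to see the share of missing values and the affected rows. A small sketch on a made-up frame (the column names mirror the dataset, the values do not):

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few gaps
data = pd.DataFrame({
    "Artist.gender": ["Female", np.nan, "Male", np.nan],
    "Points": [100.0, 50.0, np.nan, 150.0],
})

# Number of missing values per column
missing_per_column = data.isna().sum()

# Fraction of missing values per column (the mean of True/False flags)
missing_share = data.isna().mean()

# Rows that contain at least one missing value
rows_with_missing = data[data.isna().any(axis=1)]

print(missing_per_column)
print(missing_share)
print(rows_with_missing)
```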

How to remove missing data using Pandas in Python?

When a CSV file contains null values, they appear as NaN in the Dataframe. The dropna() method in Pandas lets the user drop rows or columns that contain null values in several different ways.

#Importing libraries
import pandas as pd
#Reading data using read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")
#Displaying data
data
#Checking for missing values
data.isna().sum()

The data contains many missing values. We have 422 missing values in both the 'Artist.gender' and 'Group.Solo' columns, 367 in the 'Semi.Final.Number' column, and so on. Let's see how we can remove all missing values.

#Remove missing values.
data.dropna(inplace=True)

#Checking for missing values
data.isna().sum()
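
Dropping every row with any gap can throw away a lot of data, so dropna() offers finer control. The sketch below shows the subset, axis/how, and thresh parameters on a made-up frame; the "Notes" column is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data; 'Notes' is entirely empty
data = pd.DataFrame({
    "Artist.gender": ["Female", np.nan, "Male"],
    "Points": [100.0, 50.0, np.nan],
    "Notes": [np.nan, np.nan, np.nan],
})

# Drop rows only when a chosen column is missing
by_subset = data.dropna(subset=["Artist.gender"])

# Drop columns in which every value is missing
no_empty_columns = data.dropna(axis=1, how="all")

# Keep only rows that have at least two non-missing values
by_thresh = data.dropna(thresh=2)
```

All three calls return new Dataframes and leave data unchanged.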

How to iterate over rows to find missing data in Pandas Dataframe?

You can use the iterrows() function to find missing values by iterating over the rows in a Python Pandas data frame.

#Importing libraries
import pandas as pd
#Reading data using read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")
#Displaying data
data
#Let's try to find out which rows contain missing data in the 'Artist.gender' column.
for i,d in data.iterrows():
    if pd.isna(d['Artist.gender']):
        print("Row number => ",i,"Data => ",d['Artist.gender'])
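
Row-by-row iteration can be slow on large Dataframes. The same row numbers can usually be found without a loop by using a boolean mask, as in this sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'Artist.gender' column
data = pd.DataFrame({"Artist.gender": ["Female", np.nan, "Male", np.nan]})

# isna() gives a True/False mask; indexing data.index with it keeps the matching labels
missing_rows = data.index[data["Artist.gender"].isna()].tolist()
print(missing_rows)
```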

How to drop rows which contain missing data in Pandas Dataframe?

When a CSV file contains null values, they appear as NaN in the Dataframe. The dropna() method in Pandas lets the user drop rows or columns that contain null values in several different ways. By default, dropna() removes every row that contains at least one missing value.

#Importing libraries
import pandas as pd
#Reading data using read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")
#Displaying data
data
#Checking for missing values
data.isna().sum()

The data contains many missing values. We have 422 missing values in both the 'Artist.gender' and 'Group.Solo' columns, 367 in the 'Semi.Final.Number' column, and so on. Let's see how we can remove all missing values.

#Remove missing values.
data.dropna(inplace=True)
#Checking for missing values
data.isna().sum()

How to transform missing data in Pandas Dataframe?

The fillna() method is used to replace missing data with some value. For numerical columns, we can replace missing values either with mean or median. For categorical columns, we can replace missing values with mode.

#Importing libraries
import pandas as pd
#Reading data using read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")
#Displaying data
data
#Checking for missing values
data.isna().sum()

As we can see, we have 422 missing values in the Artist.gender column, which is a categorical variable. We will replace all the missing values with the mode of this column.

#Finding out the Mode of 'Artist.gender' column.
mode = data['Artist.gender'].mode()[0]
#Replacing all the missing value with Mode of 'Artist.gender' column.
data['Artist.gender'] = data['Artist.gender'].fillna(value=mode)
#Checking for missing values
data.isna().sum()

Now let's see one numerical column. We have 344 missing values in the Happiness column, which is a numerical variable. We will replace all the missing values with the mean of this column.

#Finding out the Mean of the 'Happiness' column.
mean = data['Happiness'].mean()
#Replacing all the missing values with the mean of the 'Happiness' column.
data['Happiness'] = data['Happiness'].fillna(value=mean)
#Checking for missing values
data.isna().sum()
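
Both columns can also be filled in a single call by passing fillna() a dictionary that maps column names to fill values. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; column names mirror the dataset
data = pd.DataFrame({
    "Artist.gender": ["Female", np.nan, "Female", "Male"],
    "Happiness": [6.0, np.nan, 8.0, 7.0],
})

# Mode for the categorical column, mean for the numerical one
fill_values = {
    "Artist.gender": data["Artist.gender"].mode()[0],
    "Happiness": data["Happiness"].mean(),
}

# fillna() returns a new Dataframe; columns not in the dictionary are left as they are
filled = data.fillna(value=fill_values)
print(filled)
```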

How to replace missing data with the mean of the column in the Pandas Dataframe?

The fillna() method is used to replace missing data with some value. For numerical columns, we can replace missing values either with mean or median. For the categorical column, we can replace missing values with mode.

#Importing libraries
import pandas as pd
#Reading data using read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")
#Displaying data
data
#Checking for missing values
data.isna().sum()

As we can see, we have 344 missing values in the Happiness column, which is a numerical variable. We will replace all the missing values with the mean of this column.

#Finding out the Mean of 'Happiness' column.
mean = data['Happiness'].mean()
#Replacing all the missing values with the mean of the 'Happiness' column.
data['Happiness'] = data['Happiness'].fillna(value=mean)
#Checking for missing values
data.isna().sum()

How to impute values if there are missing values in the particular column in Pandas?

#Importing libraries
import pandas as pd
#Reading data using read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")
#Displaying data
data
#Checking for missing values
data.isna().sum()

As we can see, we have 422 missing values in the Artist.gender column, which is a categorical variable. We will replace all the missing values with the mode of this column.

#Finding out the Mode of 'Artist.gender' column.
mode = data['Artist.gender'].mode()[0]
#Replacing all the missing values with Mode of 'Artist.gender' column.
data['Artist.gender'] = data['Artist.gender'].fillna(value=mode)
#Checking for missing values
data.isna().sum()

Now let's see one numerical column. We have 344 missing values in the Happiness column, which is a numerical variable. We will replace all the missing values with the mean of this column.

#Finding out the mean of the 'Happiness' column.
mean = data['Happiness'].mean()
#Replacing all the missing values with the mean of the 'Happiness' column.
data['Happiness'] = data['Happiness'].fillna(value=mean)
#Checking for missing values
data.isna().sum()

Python Pandas - Reindexing

How to reindex a Dataframe in Pandas?

Reindexing changes a DataFrame's row labels or column labels. The term "reindex" refers to the process of conforming data to a specific set of labels along a particular axis.

It rearranges the data to correspond to a new set of labels.

It adds missing value (NA) markers to label positions if there is no data for the label.

#Importing libraries
import pandas as pd
index = [1, 2, 3]
#Creating dictionary
demo_dict = {"Name":["Roy","Jason","Sancho"],"Marks":[90,95,90]}
#Creating dataframe from dictionary
demo_dataframe = pd.DataFrame(demo_dict,index=index)
#Reindexing
new_index = [2, 1, 3]
demo_dataframe.reindex(new_index)
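
The example above only reorders existing labels. The sketch below asks for a label that does not exist (4, chosen for illustration) to show the NaN marker, and then uses fill_value to supply a different placeholder.

```python
import pandas as pd

demo_dict = {"Name": ["Roy", "Jason", "Sancho"], "Marks": [90, 95, 90]}
demo_dataframe = pd.DataFrame(demo_dict, index=[1, 2, 3])

# Label 4 is not in the original index, so its row is filled with NaN
with_new_label = demo_dataframe.reindex([2, 1, 4])

# fill_value replaces the default NaN marker for labels with no data
marks_filled = demo_dataframe["Marks"].reindex([2, 1, 4], fill_value=0)

print(with_new_label)
print(marks_filled)
```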

How to reset the index using concatenation in Pandas?

We can also reset the index while concatenating, by passing ignore_index=True to the concat() function. For example:

#Importing libraries
import pandas as pd

demo_df1 = pd.DataFrame([[1,'Roy',90],[2,'James',95]],columns=['Roll no', 'Name', 'Marks'])

demo_df2 = pd.DataFrame([[3,'Cris',98],[4,'Jeff',80]],columns=['Roll no', 'Name', 'Marks'])

#reset index while concatenation
final_df = pd.concat([demo_df1, demo_df2], ignore_index=True)

#print dataframe
print(final_df)

How to reset index after sorting in Pandas Dataframe?

The reset_index() function resets the index of the data frame to the default integer index.

#Importing libraries
import pandas as pd
#Reading data using read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")
#Displaying data
data
#Let's try to sort our data on the basis of the 'Place' column.
data = data.sort_values(by="Place")
data.head()
#Resetting index
#Passing 'drop=True' drops the existing index rather than adding it as a new column to your dataframe.
data = data.reset_index(drop=True)

Python Pandas - Categorical Data

How to bin continuous data into categories in Pandas Dataframe?

When working with continuous numeric data, it's common to divide it into different buckets for additional analysis. The cut function is used to convert data to a set of discrete buckets.

#Importing libraries
import pandas as pd
#Reading data using read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")
#Displaying data
data
#Let's remove all the rows which contain missing data
data.dropna(inplace=True)

As we can see, the 'Happiness' column contains continuous numeric data. Let's try to divide it into different buckets/bins using the cut function. We are creating a new column called 'Happiness_bins', in which the Happiness data is divided into Low, Medium, and High categories.

#Performing binning
data['Happiness_bins'] = pd.cut(data['Happiness'],3,labels=['Low','Medium','High'])
data.head()
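
Passing an integer to cut() asks Pandas to pick equal-width bins from the range of the data. You can also pass the bin edges yourself. The sketch below compares both on a few made-up Happiness scores; the edges 0, 4, 7, and 10 are assumed thresholds, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical happiness scores
happiness = pd.Series([3.2, 5.1, 6.8, 7.9])

# Three equal-width bins computed from the minimum and maximum of the data
auto_bins = pd.cut(happiness, 3, labels=['Low', 'Medium', 'High'])

# Explicit edges: (0, 4], (4, 7], (7, 10]; the right edge is included by default
manual_bins = pd.cut(happiness, bins=[0, 4, 7, 10], labels=['Low', 'Medium', 'High'])

print(auto_bins.tolist())
print(manual_bins.tolist())
```

The score 6.8 lands in different buckets in the two versions, which shows how much the choice of edges matters.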

How to list categorical variables in the data in Pandas Dataframe?

Here, we place all the columns that are not numerical columns into the list of categorical columns. The final list of categorical variables contains the name of all the columns that are not numerical columns.

#Importing libraries
import pandas as pd
#Reading data using read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")
#Displaying data
data
#List of all columns
all_columns = list(data.columns)
#List of only numeric columns
numerical_columns = list(data.select_dtypes(include='number').columns)
#Columns that are not there in numerical columns are categorical columns
#Creating an empty list for categorical columns.
categorical_columns = list()
for column in all_columns:
    #Checking if the column is not there in the list of numerical_columns.
    if column not in numerical_columns:
        #Appending it into categorical_column list
        categorical_columns.append(column)
#List of only categorical columns
categorical_columns
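
The same list can be built in one line by asking select_dtypes() to exclude the numeric columns. A sketch on a small made-up frame ("Country" is a hypothetical column):

```python
import pandas as pd

# Hypothetical rows with two text columns and one numeric column
data = pd.DataFrame({
    "Country": ["Italy", "Sweden"],
    "Artist.gender": ["Male", "Female"],
    "Points": [524, 438],
})

# Everything that is not numeric is treated as categorical here
categorical_columns = data.select_dtypes(exclude="number").columns.tolist()
print(categorical_columns)
```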

How to convert categorical data into dummy variables in Pandas Dataframe?

The get_dummies() function converts categorical data into dummy or indicator variables.

#Importing libraries
import pandas as pd
#Reading data using read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")
#Displaying data
data
pd.get_dummies(data['Artist.gender'])
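
Two optional parameters are often useful here: prefix labels the new columns, and drop_first removes one level to avoid a redundant column when the dummies feed a model. A sketch on made-up values:

```python
import pandas as pd

# Hypothetical stand-in for the 'Artist.gender' column
data = pd.DataFrame({"Artist.gender": ["Female", "Male", "Both", "Male"]})

# One indicator column per category, each named with the given prefix
dummies = pd.get_dummies(data["Artist.gender"], prefix="gender")

# drop_first=True drops the first category (in sorted order)
reduced = pd.get_dummies(data["Artist.gender"], prefix="gender", drop_first=True)

print(dummies)
print(reduced)
```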

How to convert categorical data to numeric using cat.codes in Pandas Dataframe?

The cat.codes accessor converts categorical data into integer codes. It can only be used on columns whose data type is 'category'.

#Importing libraries
import pandas as pd
#Reading data using read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")
#Displaying data
data
#Removing all rows having missing values.
data.dropna(inplace=True)
#Converting data type of 'Artist.gender' column into category
data['Artist.gender'] = data['Artist.gender'].astype('category')
#Converting categorical data to numeric using cat.codes
data['Artist.gender_codes'] = data['Artist.gender'].cat.codes
data.head()
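
To check which code belongs to which category, you can look at cat.categories. The sketch below, on made-up values, also shows that a missing value receives the code -1, which is one reason the rows with missing data were removed above.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing entry
gender = pd.Series(["Male", "Female", np.nan, "Male"]).astype("category")

# Integer code for each row; missing values are coded as -1
codes = gender.cat.codes

# Code -> category lookup, built from the ordered list of categories
mapping = dict(enumerate(gender.cat.categories))

print(codes.tolist())
print(mapping)
```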

How to plot a countplot for categorical data in a Pandas Dataframe?

The value_counts() function returns a Series containing the count of each unique value. The resulting object is sorted in descending order, so the first element is the most common value. By default, NA values are excluded.

#Importing libraries
import pandas as pd
#Reading data using read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")
#Displaying data
data

Let's try to plot a countplot for 'Artist.gender' column.

#Printing the value count of each category
print(data['Artist.gender'].value_counts())
#Plotting countplot
data['Artist.gender'].value_counts().plot(kind="bar")
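
value_counts() can also report proportions or include the missing values. A short sketch on made-up data showing the normalize and dropna parameters:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'Artist.gender' column
gender = pd.Series(["Male", "Female", "Male", np.nan])

# Plain counts, most common value first; NaN is left out
counts = gender.value_counts()

# Proportions instead of counts
shares = gender.value_counts(normalize=True)

# Include missing values as their own entry
with_missing = gender.value_counts(dropna=False)

print(counts)
print(shares)
print(with_missing)
```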

Data visualization with Pandas

Exploratory data analysis requires the use of data visualization. When it comes to delivering an overview or summary of data, it is more effective than just numbers. Data visualizations assist us in comprehending the underlying structure of a dataset or investigating the correlations between variables. Let's see how to visualize data with Pandas.

#Importing libraries
import pandas as pd
#Reading data using read_csv() function
data = pd.read_csv("https://data.hemath.com/access/file_csv/euro_vision.csv")
#Displaying data
data
#Scatter plot
data.plot(x='Place', y='Points', kind='scatter', figsize=(10,6), title='Place x Points')
#Histogram
data['Points'].plot(kind='hist', figsize=(10,6), title='Distribution of Points')
#Box plot
data.boxplot(column='Points', by='Artist.gender', figsize=(10,6))
#Bar plot
data['Artist.gender'].value_counts().plot(kind='bar')
