Handling Missing Values, Managing Duplicates, and Data Filtering and Sorting in Pandas
Customizing the Index
In Pandas, the index is like a label for each row in your DataFrame. It helps you quickly locate and manipulate specific rows of data.
By default, Pandas assigns a numeric index (0, 1, 2, …) to your DataFrame. However, you can customize the index to be more meaningful. For example, you might want to use one of your columns as the index, such as a unique identifier or a date.
Example:
Let’s say you want to use the book titles as the index for your DataFrame.
import pandas as pd

# Let's create a dataset of books
books_data = {'Title': ['The Hobbit', 'To Kill a Mockingbird', '1984'],
              'Author': ['J.R.R. Tolkien', 'Harper Lee', 'George Orwell'],
              'Publication_Year': [1937, 1960, 1949],
              'Genre': ['Fantasy', 'Fiction', 'Dystopian']}
books_df = pd.DataFrame(books_data)

# You want to use 'Title' as the index
books_df.set_index('Title', inplace=True)

# Now, 'Title' is your index, and your DataFrame is organized accordingly
print(books_df)
Here, we used the set_index() function to set the 'Title' column as the index.
Output:
Author Publication_Year Genre
Title
The Hobbit J.R.R. Tolkien 1937 Fantasy
To Kill a Mockingbird Harper Lee 1960 Fiction
1984 George Orwell 1949 Dystopian
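With 'Title' as the index, you can now look up a row directly by its label. A minimal sketch, using the books_df built above:
# Look up a row by its new label instead of an integer position
print(books_df.loc['1984'])
This returns the Author, Publication_Year, and Genre for that title as a Series.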
Resetting the Index
If you’ve customized the index but later want to go back to the default numeric index, you can reset it.
Example:
import pandas as pd

# Let's create a dataset of books
books_data = {'Title': ['The Hobbit', 'To Kill a Mockingbird', '1984'],
              'Author': ['J.R.R. Tolkien', 'Harper Lee', 'George Orwell'],
              'Publication_Year': [1937, 1960, 1949],
              'Genre': ['Fantasy', 'Fiction', 'Dystopian']}
books_df = pd.DataFrame(books_data)

# You want to use 'Title' as the index
books_df.set_index('Title', inplace=True)

# Now, 'Title' is your index, and your DataFrame is organized accordingly
print(books_df)
print("------------------------------------")

# You want to reset the index to the default integer index
books_df.reset_index(drop=False, inplace=True)

# Now, you have the regular integer index again
print(books_df)
The reset_index() function resets the index back to the default numeric index. This can be useful if you no longer need the customized index or want to start fresh.
Output:
Author Publication_Year Genre
Title
The Hobbit J.R.R. Tolkien 1937 Fantasy
To Kill a Mockingbird Harper Lee 1960 Fiction
1984 George Orwell 1949 Dystopian
------------------------------------
Title Author Publication_Year Genre
0 The Hobbit J.R.R. Tolkien 1937 Fantasy
1 To Kill a Mockingbird Harper Lee 1960 Fiction
2 1984 George Orwell 1949 Dystopian
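Here drop=False (the default) moved 'Title' back in as a regular column. If you would rather discard the old index entirely, you can pass drop=True instead; a minimal sketch:
import pandas as pd

books_df = pd.DataFrame({'Title': ['The Hobbit', 'To Kill a Mockingbird', '1984'],
                         'Author': ['J.R.R. Tolkien', 'Harper Lee', 'George Orwell']})
books_df.set_index('Title', inplace=True)

# drop=True throws the old index away instead of restoring it as a column
books_df.reset_index(drop=True, inplace=True)
print(books_df)  # only 'Author' remains, with a fresh 0, 1, 2 index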
Selecting Columns by Label
In Pandas, selecting specific data from a DataFrame is crucial for data analysis and manipulation.
To select one or multiple columns from a DataFrame based on their labels (column names), you can use the square bracket notation [].
Example:
Suppose you have a dataset of books. You can access specific columns by their labels, which is handy when you’re interested in particular book attributes.
import pandas as pd

# Let's create a dataset of books
books_data = {'Title': ['The Hobbit', 'To Kill a Mockingbird', '1984'],
              'Author': ['J.R.R. Tolkien', 'Harper Lee', 'George Orwell'],
              'Publication_Year': [1937, 1960, 1949],
              'Genre': ['Fantasy', 'Fiction', 'Dystopian']}
books_df = pd.DataFrame(books_data)

# Selecting the 'Title' and 'Author' columns
selected_columns = books_df[['Title', 'Author']]

# This will give you a DataFrame with only those columns
print(selected_columns)
Output:
Title Author
0 The Hobbit J.R.R. Tolkien
1 To Kill a Mockingbird Harper Lee
2 1984 George Orwell
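Note the double brackets above: passing a list of labels always returns a DataFrame, even for a single label, while single brackets with one label return a Series. A quick sketch, reusing the books_df from the example above:
# Single brackets with one label give you a Series
titles = books_df['Title']
print(type(titles))       # <class 'pandas.core.series.Series'>

# Double brackets with one label still give you a one-column DataFrame
titles_df = books_df[['Title']]
print(type(titles_df))    # <class 'pandas.core.frame.DataFrame'>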
Selecting Rows by Condition
You can use boolean indexing to select rows from a DataFrame based on certain conditions. This means you create a condition that evaluates to True or False for each row and then use that condition to select the desired rows.
Example:
Now, let’s imagine you want to find books published after 1950. Pandas allows you to filter your data with conditional statements.
import pandas as pd

# Let's create a dataset of books
books_data = {'Title': ['The Hobbit', 'To Kill a Mockingbird', '1984'],
              'Author': ['J.R.R. Tolkien', 'Harper Lee', 'George Orwell'],
              'Publication_Year': [1937, 1960, 1949],
              'Genre': ['Fantasy', 'Fiction', 'Dystopian']}
books_df = pd.DataFrame(books_data)

# You're searching for books published after 1950
selected_rows = books_df[books_df['Publication_Year'] > 1950]

# This creates a new DataFrame with only the rows that meet the condition
print(selected_rows)
Output:
Title Author Publication_Year Genre
1 To Kill a Mockingbird Harper Lee 1960 Fiction
Selecting with loc[] and iloc[]
Pandas provides two methods, loc[] and iloc[], for more precise selection of data based on labels or integer positions, respectively.
loc[] is primarily label-based, meaning you specify rows and columns by their labels; its slices include both endpoints. iloc[] is primarily integer position-based, meaning you specify rows and columns by their integer positions; its slices exclude the stop position. This is why loc[1:2] and iloc[1:3] select the same two rows in the example below.
Here’s how you can use these methods:
Example:
import pandas as pd

# Let's create a dataset of books
books_data = {'Title': ['The Hobbit', 'To Kill a Mockingbird', '1984'],
              'Author': ['J.R.R. Tolkien', 'Harper Lee', 'George Orwell'],
              'Publication_Year': [1937, 1960, 1949],
              'Genre': ['Fantasy', 'Fiction', 'Dystopian']}
books_df = pd.DataFrame(books_data)

# loc[] allows label-based selection (both endpoints included)
selected_data_label = books_df.loc[1:2, ['Title', 'Author']]

# iloc[] enables integer-based selection (stop position excluded)
selected_data_integer = books_df.iloc[1:3, 0:2]

# These methods give you flexibility in choosing data
print(selected_data_label)
print("----------------------------------")
print(selected_data_integer)
Output:
Title Author
1 To Kill a Mockingbird Harper Lee
2 1984 George Orwell
----------------------------------
Title Author
1 To Kill a Mockingbird Harper Lee
2 1984 George Orwell
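Both methods also accept single labels or positions instead of slices. A small sketch, continuing with the same books_df:
# A single cell by label: row label 0, column label 'Title'
print(books_df.loc[0, 'Title'])   # The Hobbit

# The same cell by integer position: row 0, column 0
print(books_df.iloc[0, 0])        # The Hobbit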
Finding Missing Values
Handling missing data is a common and important task in data analysis. Pandas makes it easy to identify and deal with missing values in your dataset.
In Pandas, missing values are typically represented as NaN (Not a Number). To find these missing values, you can use the isna() function, which returns a DataFrame of the same size with True where data is missing and False where data is present.
Example:
import pandas as pd

# Suppose you have a dataset with missing values
data = {'A': [1, 2, None, 4, 5],
        'B': [None, 2, 3, None, 5]}
df = pd.DataFrame(data)

# Use .isna() to identify missing values
missing_values = df.isna()

# This creates a DataFrame with 'True' where values are missing
print(missing_values)
Output:
A B
0 False True
1 False False
2 True False
3 False True
4 False False
Counting Missing Values
To count the number of missing values in each column, you can use isna().sum(). This tells you how many missing values there are in each column, helping you understand the extent of the missing data issue.
Example:
import pandas as pd

# Suppose you have a dataset with missing values
data = {'A': [1, 2, None, 4, 5],
        'B': [None, 2, 3, None, 5]}
df = pd.DataFrame(data)

# Count missing values in each column
missing_count = df.isna().sum()

# This gives you a count of missing values for each column
print(missing_count)
Output:
A 1
B 2
dtype: int64
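If you just want one overall number rather than a per-column breakdown, you can chain a second sum(); a one-line sketch with the same df:
# Total number of missing values across the whole DataFrame
total_missing = df.isna().sum().sum()
print(total_missing)  # 3 here: 1 in column 'A' plus 2 in column 'B'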
Removing Missing Data
Sometimes, it’s easiest to remove rows or columns that contain missing values. The dropna() function allows you to do this. By default, dropna() removes any row that contains at least one missing value.
Example:
import pandas as pd

# Suppose you have a dataset with missing values
data = {'A': [1, 2, None, 4, 5],
        'B': [None, 2, 3, None, 5]}
df = pd.DataFrame(data)

# Remove rows with missing values
df_cleaned = df.dropna()

# This creates a new DataFrame without the rows containing missing values
print(df_cleaned)
Output:
A B
1 2.0 2.0
4 5.0 5.0
If you want to remove columns that have missing values instead, you can use the axis=1 parameter.
# Remove columns with any missing values
df_dropped_columns = df.dropna(axis=1)
print(df_dropped_columns)
With this sample data, both columns contain at least one missing value, so dropna(axis=1) leaves an empty DataFrame. Dropping incomplete rows or columns is a quick way to clean your data, but it can also lead to a significant reduction in your data if many values are missing.
Filling Missing Data
Instead of removing data, you might want to fill in the missing values. The fillna() function allows you to replace them with a specified value.
Example:
import pandas as pd

# Suppose you have a dataset with missing values
data = {'A': [1, 2, None, 4, 5],
        'B': [None, 2, 3, None, 5]}
df = pd.DataFrame(data)

# Replace missing values with a specific value, let's say 0
df_filled = df.fillna(0)

# This creates a new DataFrame with missing values replaced by 0
print(df_filled)
Output:
A B
0 1.0 0.0
1 2.0 2.0
2 0.0 3.0
3 4.0 0.0
4 5.0 5.0
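A constant like 0 is not always appropriate. As a sketch of two common alternatives, you might fill each column with its own mean, or carry the previous valid value forward with ffill():
import pandas as pd

df = pd.DataFrame({'A': [1, 2, None, 4, 5],
                   'B': [None, 2, 3, None, 5]})

# Fill each column's gaps with that column's mean
df_mean_filled = df.fillna(df.mean())
print(df_mean_filled)

# Forward fill: copy the previous valid value downward; note the first
# value of 'B' stays missing because there is nothing above it to copy
df_ffilled = df.ffill()
print(df_ffilled)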
Finding Duplicates
Handling duplicate data is crucial for maintaining the quality and accuracy of your data analysis. Duplicate entries can distort your results, so it’s important to find and remove them. Pandas provides easy-to-use functions to identify and eliminate duplicate rows from your DataFrame. Let’s explore these functions.
To find duplicate rows in your DataFrame, use the duplicated() function. This function checks each row and returns a Series of True or False values, where True indicates that the row is a duplicate of a previous row.
Example:
import pandas as pd

# Create a sample DataFrame with duplicate sales records
data = {
    'TransactionID': [1, 2, 3, 4, 1],
    'Product': ['Laptop', 'Tablet', 'Laptop', 'Smartphone', 'Laptop'],
    'Price': [1000, 500, 1000, 800, 1000]
}
df = pd.DataFrame(data)

# Find duplicate rows
duplicates = df.duplicated()
print(duplicates)
In this example, duplicates will show True for the last row because it is a duplicate of the first row based on all columns.
Output:
0 False
1 False
2 False
3 False
4 True
dtype: bool
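Because duplicated() returns a boolean Series, summing it gives a quick count of how many duplicate rows you have; a one-line sketch with the same df:
# True counts as 1, so the sum is the number of duplicate rows
num_duplicates = df.duplicated().sum()
print(num_duplicates)  # 1 for this dataset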
Removing Duplicates
To remove duplicate rows from your DataFrame, use the drop_duplicates() function. By default, this function removes all but the first occurrence of each duplicate row.
Example:
# Remove duplicate rows
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)
Output:
TransactionID Product Price
0 1 Laptop 1000
1 2 Tablet 500
2 3 Laptop 1000
3 4 Smartphone 800
Customizing Duplicate Removal
If you want to remove duplicates based on specific columns rather than the entire row, you can use the subset parameter. This allows you to check for duplicates only in the specified columns.
Example:
import pandas as pd

# Create a sample DataFrame with duplicate employee records
data = {
    'EmployeeID': [101, 102, 103, 104, 101],
    'Department': ['HR', 'Finance', 'IT', 'Marketing', 'HR'],
    'Salary': [60000, 65000, 70000, 55000, 80000]
}
df = pd.DataFrame(data)

# Remove duplicate rows based on 'EmployeeID' only
df_no_duplicates_subset = df.drop_duplicates(subset=['EmployeeID'])
print(df_no_duplicates_subset)
This will remove duplicates based on the 'EmployeeID' column, ensuring each employee is listed only once.
Output:
EmployeeID Department Salary
0 101 HR 60000
1 102 Finance 65000
2 103 IT 70000
3 104 Marketing 55000
Keeping the Last Occurrence
By default, drop_duplicates() keeps the first occurrence of each duplicate and removes the subsequent ones. If you prefer to keep the last occurrence and remove the earlier ones, you can use the keep='last' parameter.
Example:
import pandas as pd

# Create a sample DataFrame with duplicate inventory records
data = {
    'ItemCode': ['A123', 'B456', 'A123', 'C789', 'A123'],
    'Quantity': [10, 20, 10, 30, 10],
    'Warehouse': ['WH1', 'WH2', 'WH1', 'WH3', 'WH1']
}
df = pd.DataFrame(data)

# Remove duplicate rows but keep the last occurrence
df_keep_last = df.drop_duplicates(keep='last')
print(df_keep_last)
In this example, df_keep_last will keep the most recent duplicate record of ItemCode 'A123'.
Output:
ItemCode Quantity Warehouse
1 B456 20 WH2
3 C789 30 WH3
4 A123 10 WH1
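There is also keep=False, which drops every copy of a duplicated row rather than preserving one; a short sketch with the same df:
# keep=False removes all copies of any duplicated row
df_unique_only = df.drop_duplicates(keep=False)
print(df_unique_only)  # only the B456 and C789 rows remain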
Filtering Your Data
Filtering data in Pandas allows you to extract specific subsets of your dataset based on certain conditions. Whether you want to find rows that meet a single condition or several at once, Pandas makes this straightforward.
Filtering Rows by Condition
When you need to filter rows based on a single condition, you can use boolean indexing. This technique involves creating a boolean mask that indicates which rows satisfy the condition, and then using this mask to select the desired rows from the DataFrame.
Example:
import pandas as pd

# Suppose you have a dataset of movies
movies_data = {'Title': ['Movie1', 'Movie2', 'Movie3', 'Movie4'],
               'Genre': ['Action', 'Comedy', 'Drama', 'Horror'],
               'Rating': [8.0, 7.5, 9.2, 6.4]}
movies_df = pd.DataFrame(movies_data)

# Let's say you want to filter only the high-rated movies (rating > 8.0)
high_rated_movies = movies_df.loc[movies_df['Rating'] > 8.0]

# Your selection now includes only the top-rated movies
print(high_rated_movies)
In this example, loc[] allows you to locate rows that meet the condition, in this case, movies with a rating greater than 8.0.
Output:
Title Genre Rating
2 Movie3 Drama 9.2
Filtering Rows by Multiple Conditions
For more complex filtering requirements, you can filter rows based on multiple conditions by combining boolean masks using logical operators like & (and) and | (or).
Example:
import pandas as pd

# Suppose you have a dataset of movies
movies_data = {'Title': ['Movie1', 'Movie2', 'Movie3', 'Movie4'],
               'Genre': ['Action', 'Comedy', 'Drama', 'Horror'],
               'Rating': [8.0, 7.5, 9.2, 6.4]}
movies_df = pd.DataFrame(movies_data)

# You're in the mood for both action and drama movies
action_drama_movies = movies_df.loc[(movies_df['Genre'] == 'Action') | (movies_df['Genre'] == 'Drama')]

# Your selection now contains a mix of action and drama movies
print(action_drama_movies)
Output:
Title Genre Rating
0 Movie1 Action 8.0
2 Movie3 Drama 9.2
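When you are matching one column against several allowed values, chaining many == comparisons with | gets verbose. As an equivalent sketch, the isin() method accepts a list of values:
# Equivalent to the (== 'Action') | (== 'Drama') mask above
action_drama_movies = movies_df.loc[movies_df['Genre'].isin(['Action', 'Drama'])]
print(action_drama_movies)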
Sorting Your Data
Sorting data is a fundamental part of data analysis, allowing you to organize your DataFrame in a specific order based on the values in one or more columns. Pandas provides a powerful function, sort_values(), to help you achieve this. Let’s explore how to use it.
Sorting by Column Values
To sort your DataFrame by the values in a single column, you use the sort_values() function. This is useful when you want to rank or arrange your data based on the values in one column.
Example:
In this example, we will sort our data based on the 'Rating' column. By default, sort_values() sorts in ascending order (smallest to largest). To sort in descending order (largest to smallest), you can set the ascending parameter to False.
import pandas as pd

# Suppose you have a dataset of movies
movies_data = {'Title': ['Movie1', 'Movie2', 'Movie3', 'Movie4'],
               'Genre': ['Action', 'Comedy', 'Drama', 'Horror'],
               'Rating': [8.0, 7.5, 9.2, 6.4]}
movies_df = pd.DataFrame(movies_data)

# Let's sort your movies by rating in descending order
sorted_movies = movies_df.sort_values(by='Rating', ascending=False)

# Now you have your top-rated movies at the top
print(sorted_movies)
Output:
Title Genre Rating
2 Movie3 Drama 9.2
0 Movie1 Action 8.0
1 Movie2 Comedy 7.5
3 Movie4 Horror 6.4
Sorting by Multiple Columns
You might need to sort your DataFrame based on more than one column. You can pass a list of column names to sort_values(). This is useful when you want to sort your data primarily by one column and then sort by another column within those groups.
Example:
import pandas as pd

# Suppose you have a dataset of movies
movies_data = {'Title': ['Movie1', 'Movie2', 'Movie3', 'Movie4'],
               'Genre': ['Action', 'Comedy', 'Drama', 'Horror'],
               'Rating': [8.0, 7.5, 9.2, 6.4]}
movies_df = pd.DataFrame(movies_data)

# You want to sort first by genre, then by rating within each genre
sorted_movies = movies_df.sort_values(by=['Genre', 'Rating'], ascending=[True, False])

# Your movies are now sorted first by genre and then by rating within each genre
print(sorted_movies)
In this code, we sort the data first by 'Genre' in ascending order and then, within each genre, by 'Rating' in descending order.
Output:
Title Genre Rating
0 Movie1 Action 8.0
1 Movie2 Comedy 7.5
2 Movie3 Drama 9.2
3 Movie4 Horror 6.4
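A related method worth knowing: sort_values() reorders rows by column values, while sort_index() reorders them by the index. A minimal sketch that puts the sorted_movies DataFrame from above back into index order:
# Sort the rows back into index order (0, 1, 2, ...)
restored = sorted_movies.sort_index()
print(restored)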