Handling Missing Values, Managing Duplicates, and Filtering and Sorting Data in Pandas

Customizing the Index

In Pandas, the index is like a label for each row in your DataFrame. It helps you quickly locate and manipulate specific rows of data.

By default, Pandas assigns a numeric index (0, 1, 2, …) to your DataFrame. However, you can customize the index to be more meaningful. For example, you might want to use one of your columns as the index, such as a unique identifier or a date.

Example:

Let’s say you want to use the book titles as the index for your DataFrame.

import pandas as pd

# Let's create a dataset of books
books_data = {'Title': ['The Hobbit', 'To Kill a Mockingbird', '1984'],
              'Author': ['J.R.R. Tolkien', 'Harper Lee', 'George Orwell'],
              'Publication_Year': [1937, 1960, 1949],
              'Genre': ['Fantasy', 'Fiction', 'Dystopian']}

books_df = pd.DataFrame(books_data)

# You want to use 'Title' as the index
books_df.set_index('Title', inplace=True)

# Now, 'Title' is your index, and your DataFrame is organized accordingly
print(books_df)

Here, we used the set_index() method to set the 'Title' column as the index. Passing inplace=True modifies books_df directly instead of returning a new DataFrame.

Output:

                               Author  Publication_Year      Genre
Title                                                             
The Hobbit             J.R.R. Tolkien              1937    Fantasy
To Kill a Mockingbird      Harper Lee              1960    Fiction
1984                    George Orwell              1949  Dystopian
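
With 'Title' as the index, you can now look up a row directly by its label. For instance, using loc[] (covered in more detail later in this section):

# Retrieve a single book by its title label
print(books_df.loc['1984'])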

Resetting the Index

If you’ve customized the index but later want to go back to the default numeric index, you can reset it.

Example:

import pandas as pd

# Let's create a dataset of books
books_data = {'Title': ['The Hobbit', 'To Kill a Mockingbird', '1984'],
              'Author': ['J.R.R. Tolkien', 'Harper Lee', 'George Orwell'],
              'Publication_Year': [1937, 1960, 1949],
              'Genre': ['Fantasy', 'Fiction', 'Dystopian']}

books_df = pd.DataFrame(books_data)

# You want to use 'Title' as the index
books_df.set_index('Title', inplace=True)

# Now, 'Title' is your index, and your DataFrame is organized accordingly
print(books_df)

print("------------------------------------")
# You want to reset the index to the default integer index
books_df.reset_index(drop=False, inplace=True)

# Now, you have the regular integer index again
print(books_df)

The reset_index() function resets the index back to the default numeric index. Because drop=False is used (it is also the default), the old 'Title' index is restored as a regular column rather than discarded. This can be useful if you no longer need the customized index or want to start fresh.

Output:

                               Author  Publication_Year      Genre
Title                                                             
The Hobbit             J.R.R. Tolkien              1937    Fantasy
To Kill a Mockingbird      Harper Lee              1960    Fiction
1984                    George Orwell              1949  Dystopian
------------------------------------
                   Title          Author  Publication_Year      Genre
0             The Hobbit  J.R.R. Tolkien              1937    Fantasy
1  To Kill a Mockingbird      Harper Lee              1960    Fiction
2                   1984   George Orwell              1949  Dystopian
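
If you would rather discard the custom index entirely instead of restoring it as a column, pass drop=True. A small sketch, continuing from the DataFrame above:

# Set 'Title' as the index again, then throw it away on reset
books_df.set_index('Title', inplace=True)
books_df.reset_index(drop=True, inplace=True)

# Only 'Author', 'Publication_Year', and 'Genre' remain
print(books_df)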

Selecting Columns by Label

In Pandas, selecting specific data from a DataFrame is crucial for data analysis and manipulation.

To select one or multiple columns from a DataFrame based on their labels (column names), you can use the square bracket notation [].

Example:

Suppose you have a dataset of books. You can access specific columns by their labels, which is handy when you’re interested in particular book attributes.

import pandas as pd

# Let's create a dataset of books
books_data = {'Title': ['The Hobbit', 'To Kill a Mockingbird', '1984'],
              'Author': ['J.R.R. Tolkien', 'Harper Lee', 'George Orwell'],
              'Publication_Year': [1937, 1960, 1949],
              'Genre': ['Fantasy', 'Fiction', 'Dystopian']}

books_df = pd.DataFrame(books_data)

# Selecting the 'Title' and 'Author' columns
selected_columns = books_df[['Title', 'Author']]

# This will give you a DataFrame with only those columns
print(selected_columns)

Output:

                   Title          Author
0             The Hobbit  J.R.R. Tolkien
1  To Kill a Mockingbird      Harper Lee
2                   1984   George Orwell
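
Note the double brackets: passing a list of labels returns a DataFrame, while single brackets with one label return a Series. For example:

# Single brackets -> a Series
titles_series = books_df['Title']

# Double brackets -> a one-column DataFrame
titles_frame = books_df[['Title']]

print(type(titles_series))  # <class 'pandas.core.series.Series'>
print(type(titles_frame))   # <class 'pandas.core.frame.DataFrame'>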

Selecting Rows by Condition

You can use boolean indexing to select rows from a DataFrame based on certain conditions. This means you create a condition that evaluates to True or False for each row and then use that condition to select the desired rows.

Example:

Now, let’s imagine you want to find books published after 1950. Pandas allows you to filter your data with conditional statements.

import pandas as pd

# Let's create a dataset of books
books_data = {'Title': ['The Hobbit', 'To Kill a Mockingbird', '1984'],
              'Author': ['J.R.R. Tolkien', 'Harper Lee', 'George Orwell'],
              'Publication_Year': [1937, 1960, 1949],
              'Genre': ['Fantasy', 'Fiction', 'Dystopian']}

books_df = pd.DataFrame(books_data)

# You're searching for books published after 1950
selected_rows = books_df[books_df['Publication_Year'] > 1950]

# This creates a new DataFrame with only the rows that meet the condition
print(selected_rows)

Output:

                   Title      Author  Publication_Year    Genre
1  To Kill a Mockingbird  Harper Lee              1960  Fiction

Selecting with loc[] and iloc[]

Pandas provides two indexers, loc[] and iloc[], for more precise selection of data based on labels or integer positions, respectively.

  • loc[] is primarily label-based: you specify rows and columns by their labels. Label slices are inclusive of both endpoints.
  • iloc[] is primarily integer position-based: you specify rows and columns by their integer positions. As in standard Python slicing, the end position is excluded.

That is why loc[1:2] and iloc[1:3] in the example below select the same two rows. Here’s how you can use them:

Example:

import pandas as pd

# Let's create a dataset of books
books_data = {'Title': ['The Hobbit', 'To Kill a Mockingbird', '1984'],
              'Author': ['J.R.R. Tolkien', 'Harper Lee', 'George Orwell'],
              'Publication_Year': [1937, 1960, 1949],
              'Genre': ['Fantasy', 'Fiction', 'Dystopian']}

books_df = pd.DataFrame(books_data)

# loc[] allows label-based selection
selected_data_label = books_df.loc[1:2, ['Title', 'Author']]

# iloc[] enables integer-based selection
selected_data_integer = books_df.iloc[1:3, 0:2]

# These methods give you flexibility in choosing data
print(selected_data_label)
print("----------------------------------")
print(selected_data_integer)

Output:

                   Title         Author
1  To Kill a Mockingbird     Harper Lee
2                   1984  George Orwell
----------------------------------
                   Title         Author
1  To Kill a Mockingbird     Harper Lee
2                   1984  George Orwell

Finding Missing Values

Handling missing data is a common and important task in data analysis. Pandas makes it easy to identify and deal with missing values in your dataset.

In Pandas, missing values are typically represented as NaN (Not a Number). To find them, you can use the isna() function (isnull() is an equivalent alias), which returns a DataFrame of the same shape with True where data is missing and False where data is present.

Example:

import pandas as pd

# Suppose you have a dataset with missing values
data = {'A': [1, 2, None, 4, 5],
        'B': [None, 2, 3, None, 5]}

df = pd.DataFrame(data)

# Use .isna() to identify missing values
missing_values = df.isna()

# This creates a DataFrame with 'True' where values are missing
print(missing_values)

Output:

       A      B
0  False   True
1  False  False
2   True  False
3  False   True
4  False  False
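
Pandas also provides notna(), the complement of isna(): it returns True wherever a value is present. For example:

# True where data is present, False where it is missing
present_values = df.notna()
print(present_values)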

Counting Missing Values

To count the number of missing values in each column, you can chain isna() with sum(). This tells you how many missing values each column contains, helping you understand the extent of the missing data issue.

Example:

import pandas as pd

# Suppose you have a dataset with missing values
data = {'A': [1, 2, None, 4, 5],
        'B': [None, 2, 3, None, 5]}

df = pd.DataFrame(data)

# Count missing values in each column
missing_count = df.isna().sum()

# This gives you a count of missing values for each column
print(missing_count)

Output:

A    1
B    2
dtype: int64
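
If you only need the total number of missing values across the whole DataFrame, you can chain a second sum():

# Total missing values in the entire DataFrame
total_missing = df.isna().sum().sum()
print(total_missing)  # 3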

Removing Missing Data

Sometimes, it’s easiest to remove rows or columns that contain missing values. The dropna() function allows you to do this. By default, dropna() removes any row that contains at least one missing value.

Example:

import pandas as pd

# Suppose you have a dataset with missing values
data = {'A': [1, 2, None, 4, 5],
        'B': [None, 2, 3, None, 5]}

df = pd.DataFrame(data)

# Remove rows with missing values
df_cleaned = df.dropna()

# This creates a new DataFrame without the rows containing missing values
print(df_cleaned)

Output:

     A    B
1  2.0  2.0
4  5.0  5.0

If you want to remove columns that have missing values, you can use the axis=1 parameter.

# Remove columns with any missing values
df_dropped_columns = df.dropna(axis=1)

print(df_dropped_columns)

This can help clean your data by removing incomplete rows or columns, but it can also lead to a significant reduction in your data if many values are missing.
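
dropna() also accepts parameters that make the removal less aggressive. Two useful ones are how and thresh, sketched briefly below:

# Drop a row only when ALL of its values are missing
# (no row qualifies here, so the DataFrame is unchanged)
print(df.dropna(how='all'))

# Keep only rows that have at least 2 non-missing values
print(df.dropna(thresh=2))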

Filling Missing Data

Instead of removing data, you might want to fill in the missing values. The fillna() function replaces missing values with a value you specify.

Example:

import pandas as pd

# Suppose you have a dataset with missing values
data = {'A': [1, 2, None, 4, 5],
        'B': [None, 2, 3, None, 5]}

df = pd.DataFrame(data)

# Replace missing values with a specific value, let's say 0
df_filled = df.fillna(0)

# This creates a new DataFrame with missing values replaced by 0
print(df_filled)

Output:

     A    B
0  1.0  0.0
1  2.0  2.0
2  0.0  3.0
3  4.0  0.0
4  5.0  5.0
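
You are not limited to a single constant. One common alternative, sketched below, is to fill each column’s missing values with that column’s mean:

# fillna() accepts a Series: each column is filled with its own mean
df_mean_filled = df.fillna(df.mean())
print(df_mean_filled)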

Finding Duplicates

Handling duplicate data is crucial for maintaining the quality and accuracy of your data analysis. Duplicate entries can distort your results, so it’s important to find and remove them. Pandas provides easy-to-use functions to identify and eliminate duplicate rows from your DataFrame. Let’s explore them.

To find duplicate rows in your DataFrame, use the duplicated() function. This function checks each row and returns a Series of True or False values, where True indicates that the row is a duplicate of a previous row.

Example:

import pandas as pd

# Create a sample DataFrame with duplicate sales records
data = {
    'TransactionID': [1, 2, 3, 4, 1],
    'Product': ['Laptop', 'Tablet', 'Laptop', 'Smartphone', 'Laptop'],
    'Price': [1000, 500, 1000, 800, 1000]
}
df = pd.DataFrame(data)

# Find duplicate rows
duplicates = df.duplicated()

print(duplicates)

In this example, duplicates will show True for the last row because it is a duplicate of the first row based on all columns.

Output:

0    False
1    False
2    False
3    False
4     True
dtype: bool
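
Because duplicated() returns a boolean Series, chaining sum() gives you a quick count of the duplicate rows:

# Count duplicate rows (True is counted as 1)
print(df.duplicated().sum())  # 1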

Removing Duplicates

To remove duplicate rows from your DataFrame, use the drop_duplicates() function. By default, this function removes all but the first occurrence of each duplicate row.

Example:

# Remove duplicate rows
df_no_duplicates = df.drop_duplicates()

print(df_no_duplicates)

Output:

   TransactionID     Product  Price
0              1      Laptop   1000
1              2      Tablet    500
2              3      Laptop   1000
3              4  Smartphone    800

Customizing Duplicate Removal

If you want to remove duplicates based on specific columns rather than the entire row, you can use the subset parameter. This allows you to check for duplicates only in the specified columns.

Example:

import pandas as pd

# Create a sample DataFrame with duplicate employee records
data = {
    'EmployeeID': [101, 102, 103, 104, 101],
    'Department': ['HR', 'Finance', 'IT', 'Marketing', 'HR'],
    'Salary': [60000, 65000, 70000, 55000, 80000]
}
df = pd.DataFrame(data)

# Remove duplicate rows based on 'EmployeeID' only
df_no_duplicates_subset = df.drop_duplicates(subset=['EmployeeID'])

print(df_no_duplicates_subset)

This will remove duplicates based on the 'EmployeeID' column, ensuring each employee is listed only once.

Output:

   EmployeeID Department  Salary
0         101         HR   60000
1         102    Finance   65000
2         103         IT   70000
3         104  Marketing   55000

Keeping the Last Occurrence

By default, drop_duplicates() keeps the first occurrence of each duplicate and removes the subsequent ones. If you prefer to keep the last occurrence and remove the earlier ones, you can use the keep='last' parameter.

Example:

import pandas as pd

# Create a sample DataFrame with duplicate inventory records
data = {
    'ItemCode': ['A123', 'B456', 'A123', 'C789', 'A123'],
    'Quantity': [10, 20, 10, 30, 10],
    'Warehouse': ['WH1', 'WH2', 'WH1', 'WH3', 'WH1']
}
df = pd.DataFrame(data)

# Remove duplicate rows but keep the last occurrence
df_keep_last = df.drop_duplicates(keep='last')

print(df_keep_last)

In this example, df_keep_last keeps only the last occurrence (row 4) of the duplicated ItemCode 'A123' rows.

Output:

  ItemCode  Quantity Warehouse
1     B456        20       WH2
3     C789        30       WH3
4     A123        10       WH1
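
If you instead want to drop every row that has a duplicate, including the first occurrence, pass keep=False:

# Remove ALL rows that appear more than once
df_unique_only = df.drop_duplicates(keep=False)

# Only 'B456' and 'C789' remain; every 'A123' row is gone
print(df_unique_only)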

Filtering Your Data

Filtering data in Pandas allows you to extract specific subsets of your dataset based on certain conditions. Whether you want rows that meet a single condition or several at once, Pandas makes the task straightforward.

Filtering Rows by Condition

When you need to filter rows based on a single condition, you can use boolean indexing. This technique involves creating a boolean mask that indicates which rows satisfy the condition, and then using this mask to select the desired rows from the DataFrame.

Example:

import pandas as pd

# Suppose you have a dataset of movies
movies_data = {'Title': ['Movie1', 'Movie2', 'Movie3', 'Movie4'],
               'Genre': ['Action', 'Comedy', 'Drama', 'Horror'],
               'Rating': [8.0, 7.5, 9.2, 6.4]}

movies_df = pd.DataFrame(movies_data)

# Let's say you want to filter only the high-rated movies (rating > 8.0)
high_rated_movies = movies_df.loc[movies_df['Rating'] > 8.0]

# Your selection now includes only the top-rated movies
print(high_rated_movies)

In this example, loc[] locates the rows that meet the condition, in this case movies with a rating greater than 8.0. For plain row filtering you can also drop loc[] and write movies_df[movies_df['Rating'] > 8.0]; both forms return the same result.

Output:

    Title  Genre  Rating
2  Movie3  Drama     9.2

Filtering Rows by Multiple Conditions

For more complex filtering requirements, you can filter rows based on multiple conditions by combining boolean masks with the logical operators & (and) and | (or). Each comparison must be wrapped in parentheses, because & and | bind more tightly than comparison operators in Python.

Example:

import pandas as pd

# Suppose you have a dataset of movies
movies_data = {'Title': ['Movie1', 'Movie2', 'Movie3', 'Movie4'],
               'Genre': ['Action', 'Comedy', 'Drama', 'Horror'],
               'Rating': [8.0, 7.5, 9.2, 6.4]}

movies_df = pd.DataFrame(movies_data)

# You're in the mood for both action and drama movies
action_drama_movies = movies_df.loc[(movies_df['Genre'] == 'Action') | (movies_df['Genre'] == 'Drama')]

# Your selection now contains a mix of action and drama movies
print(action_drama_movies)

Output:

    Title   Genre  Rating
0  Movie1  Action     8.0
2  Movie3   Drama     9.2
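
For membership-style conditions like this one, isin() is a more compact alternative to chaining | comparisons:

# Equivalent filter written with isin()
action_drama_movies = movies_df.loc[movies_df['Genre'].isin(['Action', 'Drama'])]
print(action_drama_movies)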

Sorting Your Data

Sorting data is a fundamental part of data analysis, allowing you to organize your DataFrame in a specific order based on the values in one or more columns. Pandas provides a powerful function, sort_values(), to help you achieve this. Let’s explore how to use it.

Sorting by Column Values

To sort your DataFrame by the values in a single column, you use the sort_values() function. This is useful when you want to rank or arrange your data based on the values in one column.

Example:

In this example, we will sort our data based on the 'Rating' column. By default, sort_values() sorts in ascending order (smallest to largest). To sort in descending order (largest to smallest), you can set the ascending parameter to False.

import pandas as pd

# Suppose you have a dataset of movies
movies_data = {'Title': ['Movie1', 'Movie2', 'Movie3', 'Movie4'],
               'Genre': ['Action', 'Comedy', 'Drama', 'Horror'],
               'Rating': [8.0, 7.5, 9.2, 6.4]}

movies_df = pd.DataFrame(movies_data)

# Let's sort your movies by rating in descending order
sorted_movies = movies_df.sort_values(by='Rating', ascending=False)

# Now you have your top-rated movies at the top
print(sorted_movies)

Output:

    Title   Genre  Rating
2  Movie3   Drama     9.2
0  Movie1  Action     8.0
1  Movie2  Comedy     7.5
3  Movie4  Horror     6.4

Sorting by Multiple Columns

You might need to sort your DataFrame based on more than one column. You can pass a list of column names to sort_values(). This is useful when you want to sort your data primarily by one column and then sort by another column within those groups.

Example:

import pandas as pd

# Suppose you have a dataset of movies
movies_data = {'Title': ['Movie1', 'Movie2', 'Movie3', 'Movie4'],
               'Genre': ['Action', 'Comedy', 'Drama', 'Horror'],
               'Rating': [8.0, 7.5, 9.2, 6.4]}

movies_df = pd.DataFrame(movies_data)

# You want to sort first by genre, then by rating within each genre
sorted_movies = movies_df.sort_values(by=['Genre', 'Rating'], ascending=[True, False])

# Your movies are now sorted first by genre and then by rating within each genre
print(sorted_movies)

In this code, we sort the data first by 'Genre' in ascending order and then, within each genre, by 'Rating' in descending order.

Output:

    Title   Genre  Rating
0  Movie1  Action     8.0
1  Movie2  Comedy     7.5
2  Movie3   Drama     9.2
3  Movie4  Horror     6.4
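
As the earlier rating example showed, sorting preserves the original row labels. If you would rather get a fresh 0, 1, 2, … index in the sorted result, sort_values() accepts ignore_index=True (available in pandas 1.0 and later):

# Sort by rating and renumber the rows from 0
sorted_movies = movies_df.sort_values(by='Rating', ascending=False, ignore_index=True)
print(sorted_movies)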
