Working With Text Data and Statistical Analysis with Pandas

Pandas exposes vectorized string methods through the .str accessor of a Series, so you can apply a text operation to every value in a column at once. Let’s explore the most common ones.

Finding Text Length

To analyze the length of strings, for example to count the number of characters in each text entry, use the .str.len() method.

Example:

import pandas as pd

# Imagine you have a dataset of customer reviews
reviews_data = {'Review': ['Great product!', 'Not satisfied with this item.', 'Awesome experience']}
reviews_df = pd.DataFrame(reviews_data)

# You can extract the length of each review using .str.len()
reviews_df['Review Length'] = reviews_df['Review'].str.len()

# Now, your DataFrame includes the length of each review
print(reviews_df)

Output:

                          Review  Review Length
0                 Great product!             14
1  Not satisfied with this item.             29
2             Awesome experience             18
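Real review data often contains missing entries. As a minimal sketch (the data here is made up for illustration), .str.len() passes missing values through as NaN rather than raising an error, and the resulting column can then be used to filter rows:

```python
import pandas as pd

# Hypothetical review data with one missing entry
reviews_df = pd.DataFrame({'Review': ['Great product!', None, 'Awesome experience']})

# Missing values stay missing in the length column instead of raising an error
reviews_df['Review Length'] = reviews_df['Review'].str.len()

# Keep only reviews longer than 15 characters; rows with a missing length are dropped
long_reviews = reviews_df[reviews_df['Review Length'] > 15]
print(long_reviews)
```

Because a comparison against NaN evaluates to False, the row with the missing review is excluded from the filtered result.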

Extracting Substrings

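If you need only part of the text, you have two main options. When the part sits at a fixed position, use .str.slice() and specify the start and end positions. When it follows a pattern, like mentions or hashtags in social media posts, use .str.extract() with a regular expression; it returns the first match of the capture group, or NaN when there is no match. The example below uses .str.extract().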

Example:

import pandas as pd

# Let's say you have a dataset of tweets
tweets_data = {'Tweet': ['Excited for the #Weekend!', 'Just met @JohnDoe at the park.', 'Loving this weather. #Sunshine']}
tweets_df = pd.DataFrame(tweets_data)

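# You can extract hashtags and mentions using .str.extract()
# expand=False returns a Series (rather than a one-column DataFrame) for a single capture group
tweets_df['Hashtags'] = tweets_df['Tweet'].str.extract(r'#(\w+)', expand=False)
tweets_df['Mentions'] = tweets_df['Tweet'].str.extract(r'@(\w+)', expand=False)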

# Now, you have separate columns for hashtags and mentions
print(tweets_df)

Output:

                           Tweet  Hashtags Mentions
0       Excited for the #Weekend!   Weekend      NaN
1  Just met @JohnDoe at the park.       NaN  JohnDoe
2  Loving this weather. #Sunshine  Sunshine      NaN
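For text with a fixed layout, positional slicing with .str.slice() is often enough. The following is a small sketch with hypothetical order codes (the column names and the code format are assumptions made for illustration):

```python
import pandas as pd

# Hypothetical order codes: 2-letter region, dash, 4-digit year, dash, sequence number
orders_df = pd.DataFrame({'Order Code': ['US-2023-0001', 'DE-2024-0042', 'JP-2022-0310']})

# Positions 0 and 1 hold the region; the end position is exclusive
orders_df['Region'] = orders_df['Order Code'].str.slice(0, 2)

# Positions 3 to 6 hold the year; .str[3:7] is shorthand for .str.slice(3, 7)
orders_df['Year'] = orders_df['Order Code'].str[3:7]

print(orders_df)
```

Note that the extracted year is still a string; convert it with .astype(int) if you need to do arithmetic on it.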

Converting Text to Uppercase or Lowercase

To convert text to uppercase or lowercase, you can use .str.upper() and .str.lower(), respectively. This is handy for making text consistent, for example before comparing or grouping values.

Example:

import pandas as pd

# Suppose you have a dataset of book titles
books_data = {'Title': ['The Great Gatsby', 'To Kill a Mockingbird', '1984']}
books_df = pd.DataFrame(books_data)

# You can convert the titles to uppercase
books_df['Title Uppercase'] = books_df['Title'].str.upper()

# Or to lowercase
books_df['Title Lowercase'] = books_df['Title'].str.lower()

# Now, you have titles in both uppercase and lowercase
print(books_df)

Output:

                   Title        Title Uppercase        Title Lowercase
0       The Great Gatsby       THE GREAT GATSBY       the great gatsby
1  To Kill a Mockingbird  TO KILL A MOCKINGBIRD  to kill a mockingbird
2                   1984                   1984                   1984
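One common reason to normalize case is counting values that differ only in capitalization. A minimal sketch, using made-up city names:

```python
import pandas as pd

# Hypothetical data where the same city is spelled with different capitalization
cities_df = pd.DataFrame({'City': ['Paris', 'PARIS', 'paris', 'Berlin']})

# Lowercase everything so that the variants collapse into one value
cities_df['City Normalized'] = cities_df['City'].str.lower()

# Count occurrences of each normalized value
counts = cities_df['City Normalized'].value_counts()
print(counts)
```

Without the normalization step, the three spellings of Paris would be counted as three separate cities.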

Removing Whitespace

Extra whitespace at the beginning or end of text can cause failed matches and duplicate categories. Use .str.strip() to remove it from both ends. To remove only leading or only trailing whitespace, use .str.lstrip() or .str.rstrip(), respectively.

Example:

import pandas as pd

# Let's say you have a dataset of user names
user_data = {'Name': ['  Eren  ', 'Annie   ', '  Neil']}
user_df = pd.DataFrame(user_data)

# You can strip the extra spaces using .str.strip()
user_df['Cleaned Name'] = user_df['Name'].str.strip()

# Now, your user names are clean and tidy
print(user_df)

Output:

      Name Cleaned Name
0    Eren           Eren
1  Annie           Annie
2      Neil         Neil
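The one-sided variants can be sketched the same way. This example reuses the names from above and prints the results as lists so that the remaining spaces are visible inside the quotes:

```python
import pandas as pd

user_df = pd.DataFrame({'Name': ['  Eren  ', 'Annie   ', '  Neil']})

# Remove whitespace only from the left (leading) side
left_stripped = user_df['Name'].str.lstrip()

# Remove whitespace only from the right (trailing) side
right_stripped = user_df['Name'].str.rstrip()

print(left_stripped.tolist())   # ['Eren  ', 'Annie   ', 'Neil']
print(right_stripped.tolist())  # ['  Eren', 'Annie', '  Neil']
```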

Statistical Analysis with Pandas

Pandas helps you uncover insights from your data with very little code. Let’s explore how to use it for summary statistics and for understanding correlations between variables.

Summary Statistics

Pandas provides a quick way to get an overview of your data with summary statistics, which describe the distribution and central tendency of your dataset. Use the describe() method for this. For a numeric column, it reports the count, mean, standard deviation, minimum, quartiles, and maximum.

Example:

import pandas as pd

# Let's say you have a dataset of exam scores
scores_data = {'Student': ['Eren', 'Annie', 'Neil', 'Richard'],
               'Score': [85, 92, 76, 98]}

scores_df = pd.DataFrame(scores_data)

# Get summary statistics
summary = scores_df['Score'].describe()

# This reveals statistics like mean, standard deviation, and quartiles
print(summary)

Output:

count     4.000000
mean     87.750000
std       9.464847
min      76.000000
25%      82.750000
50%      88.500000
75%      93.500000
max      98.000000
Name: Score, dtype: float64
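The result of describe() is itself a Series, so individual statistics can be looked up by label. As a small sketch using the same scores, the interquartile range (an addition for illustration, not part of the describe() output) can be derived from the two quartiles:

```python
import pandas as pd

scores_df = pd.DataFrame({'Student': ['Eren', 'Annie', 'Neil', 'Richard'],
                          'Score': [85, 92, 76, 98]})

summary = scores_df['Score'].describe()

# Look up a single statistic by its label
mean_score = summary['mean']

# Interquartile range: spread of the middle half of the data
iqr = summary['75%'] - summary['25%']

print(mean_score)  # 87.75
print(iqr)         # 10.75
```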

Calculate Mean, Mode, And Median

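These measures help you understand the average (mean), most common (mode), and middle (median) values in your data. Note that mode() returns a Series rather than a single number, because more than one value can be tied for most common.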

Example:

import pandas as pd

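# Load the dataset (assumes sample_imdb_data.csv is in the working directory
# and contains a numeric 'score' column)
imdb_df = pd.read_csv("sample_imdb_data.csv")

mean = imdb_df['score'].mean()
median = imdb_df['score'].median()
# mode() returns a Series, since several values can be tied for most common
mode = imdb_df['score'].mode()

print("Mean of data =", mean)
print("Median of data =", median)
# Print the first (smallest) mode
print("Mode of data =", mode[0])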

Output:

Mean of data = 67.33333333333333
Median of data = 69.0
Mode of data = 73.0
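If you do not have a CSV file at hand, the same calls can be tried on a small in-memory Series. The values below are made up for illustration and are chosen so that two values tie for the mode:

```python
import pandas as pd

# Hypothetical scores; 60 and 73 each appear twice
scores = pd.Series([60, 73, 73, 65, 60, 80])

mean = scores.mean()
median = scores.median()
# mode() returns every value tied for most common, sorted in ascending order
modes = scores.mode()

print("Mean of data =", mean)              # Mean of data = 68.5
print("Median of data =", median)          # Median of data = 69.0
print("Modes of data =", modes.tolist())   # Modes of data = [60, 73]
```

Indexing the result with [0] would silently report only 60 here, so it is worth checking the length of the mode Series before reducing it to a single value.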

Correlation Analysis

Correlation reveals how variables change together, helping you spot patterns and relationships.

Example:

import pandas as pd

# Suppose you have a dataset of sales and advertising expenses
sales_data = {'Sales': [1000, 1500, 800, 1200],
              'Ad_Expense': [200, 300, 100, 150]}

sales_df = pd.DataFrame(sales_data)

# Calculate the correlation coefficient
correlation = sales_df['Sales'].corr(sales_df['Ad_Expense'])

# This provides insight into how sales and advertising expenses relate
print(correlation)

Output:

0.8660639443333457

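The corr() method returns a value between -1 and 1; by default this is the Pearson correlation coefficient. Values near 1 indicate a strong positive linear relationship, values near -1 a strong negative one, and values near 0 little or no linear relationship. Here, a value of about 0.87 suggests that higher advertising expenses tend to go along with higher sales, although four data points are far too few to draw firm conclusions.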
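When a DataFrame has several numeric columns, calling corr() on the whole DataFrame returns a matrix of pairwise correlations. The sketch below reuses the sales data and also shows the method parameter, which switches from Pearson to a rank-based (Spearman) correlation:

```python
import pandas as pd

sales_df = pd.DataFrame({'Sales': [1000, 1500, 800, 1200],
                         'Ad_Expense': [200, 300, 100, 150]})

# Pairwise Pearson correlations between all numeric columns
corr_matrix = sales_df.corr()
print(corr_matrix)

# Spearman correlation compares the ranks of the values instead of the raw values
spearman_matrix = sales_df.corr(method='spearman')
print(spearman_matrix)
```

The diagonal of each matrix is 1, since every column is perfectly correlated with itself. Spearman correlation is less sensitive to outliers and also captures monotonic relationships that are not linear.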
