Working With Text Data and Statistical Analysis with Pandas
Pandas exposes vectorized string methods through the .str accessor, which apply an operation to every value in a text column at once. Let’s explore these methods.
Finding Text Length
To analyze the length of strings, such as counting the number of characters in each text entry, use the .str.len() method.
Example:
import pandas as pd

# Imagine you have a dataset of customer reviews
reviews_data = {'Review': ['Great product!', 'Not satisfied with this item.', 'Awesome experience']}
reviews_df = pd.DataFrame(reviews_data)

# You can extract the length of each review using .str.len()
reviews_df['Review Length'] = reviews_df['Review'].str.len()

# Now, your DataFrame includes the length of each review
print(reviews_df)
Output:
Review Review Length
0 Great product! 14
1 Not satisfied with this item. 29
2 Awesome experience 18
Extracting Substrings
If you need only part of the text, such as mentions or hashtags in social media posts, you can extract substrings. Use .str.slice() with start and end positions when the portion sits at a fixed location, or .str.extract() with a regular expression when it follows a pattern, as in the example here.
Example:
import pandas as pd

# Let's say you have a dataset of tweets
tweets_data = {'Tweet': ['Excited for the #Weekend!', 'Just met @JohnDoe at the park.', 'Loving this weather. #Sunshine']}
tweets_df = pd.DataFrame(tweets_data)

# You can extract hashtags and mentions using .str.extract()
# expand=False returns a Series; only the first match per row is captured
tweets_df['Hashtags'] = tweets_df['Tweet'].str.extract(r'#(\w+)', expand=False)
tweets_df['Mentions'] = tweets_df['Tweet'].str.extract(r'@(\w+)', expand=False)

# Now, you have separate columns for hashtags and mentions
print(tweets_df)
Output:
Tweet Hashtags Mentions
0 Excited for the #Weekend! Weekend NaN
1 Just met @JohnDoe at the park. NaN JohnDoe
2 Loving this weather. #Sunshine Sunshine NaN
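For position-based extraction, .str.slice() takes a start and a stop offset. The following is a minimal sketch, assuming a hypothetical column of fixed-format order codes; the data and column names are illustrative and not part of the original dataset.

```python
import pandas as pd

# Hypothetical order codes: a 3-letter region prefix, a dash, then digits
orders_df = pd.DataFrame({'Code': ['USA-10234', 'DEU-55871', 'JPN-00412']})

# Characters at positions 0, 1, and 2 (the stop position is exclusive)
orders_df['Region'] = orders_df['Code'].str.slice(0, 3)

# Everything from position 4 to the end of the string
orders_df['Number'] = orders_df['Code'].str.slice(start=4)

print(orders_df)
```

The shorthand orders_df['Code'].str[0:3] should give the same result as .str.slice(0, 3).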
Converting Text to Uppercase or Lowercase
To convert text to uppercase or lowercase, you can use .str.upper() and .str.lower(), respectively. It’s handy for making text consistent.
Example:
import pandas as pd

# Suppose you have a dataset of book titles
books_data = {'Title': ['The Great Gatsby', 'To Kill a Mockingbird', '1984']}
books_df = pd.DataFrame(books_data)

# You can convert the titles to uppercase
books_df['Title Uppercase'] = books_df['Title'].str.upper()

# Or to lowercase
books_df['Title Lowercase'] = books_df['Title'].str.lower()

# Now, you have titles in both uppercase and lowercase
print(books_df)
Output:
Title Title Uppercase Title Lowercase
0 The Great Gatsby THE GREAT GATSBY the great gatsby
1 To Kill a Mockingbird TO KILL A MOCKINGBIRD to kill a mockingbird
2 1984 1984 1984
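One place where consistent casing matters is counting or comparing values. As a small sketch, assuming a hypothetical column of city names typed with mixed capitalization (the data is made up for illustration):

```python
import pandas as pd

# Hypothetical city entries with inconsistent capitalization
cities_df = pd.DataFrame({'City': ['Paris', 'PARIS', 'paris', 'Rome']})

# Without normalization, each spelling counts as a distinct value
raw_unique = cities_df['City'].nunique()

# Lowercasing first makes the three spellings of Paris match
normalized_unique = cities_df['City'].str.lower().nunique()

print(raw_unique, normalized_unique)
```

Here the raw column has four distinct values, while the lowercased column has two.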
Removing Whitespace
Extra spaces at the beginning or end of text can be annoying. Use .str.strip() to remove them. For only leading or trailing spaces, use .str.lstrip() or .str.rstrip().
Example:
import pandas as pd

# Let's say you have a dataset of user names
user_data = {'Name': [' Eren ', 'Annie ', ' Neil']}
user_df = pd.DataFrame(user_data)

# You can strip the extra spaces using .str.strip()
user_df['Cleaned Name'] = user_df['Name'].str.strip()

# Now, your user names are clean and tidy
print(user_df)
Output:
Name Cleaned Name
0 Eren Eren
1 Annie Annie
2 Neil Neil
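The one-sided variants can be sketched the same way. This example uses illustrative names padded with two spaces and prints Python lists so the remaining whitespace stays visible inside the quotes:

```python
import pandas as pd

# Illustrative names padded with two spaces on one or both sides
names = pd.Series(['  Eren  ', 'Annie  ', '  Neil'])

# Remove only the leading whitespace
left_trimmed = names.str.lstrip()

# Remove only the trailing whitespace
right_trimmed = names.str.rstrip()

# Printing lists keeps the quotes, so leftover spaces are easy to spot
print(left_trimmed.tolist())
print(right_trimmed.tolist())
```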
Statistical Analysis with Pandas
Pandas helps you uncover insights from your data with only a few method calls. Let’s explore how to use it for summary statistics and for understanding correlations between variables.
Summary Statistics
Pandas provides a quick way to get an overview of your data with summary statistics. It gives you insights into the distribution and central tendency of your dataset. You can use the describe() method for this.
Example:
import pandas as pd

# Let's say you have a dataset of exam scores
scores_data = {'Student': ['Eren', 'Annie', 'Neil', 'Richard'], 'Score': [85, 92, 76, 98]}
scores_df = pd.DataFrame(scores_data)

# Get summary statistics
summary = scores_df['Score'].describe()

# This reveals statistics like mean, standard deviation, and quartiles
print(summary)
Output:
count 4.000000
mean 87.750000
std 9.464847
min 76.000000
25% 82.750000
50% 88.500000
75% 93.500000
max 98.000000
Name: Score, dtype: float64
Calculate Mean, Mode, And Median
These measures help you understand the average, most common, and middle values in your data.
Example:
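import pandas as pd

# Load the dataset; it is expected to contain a numeric 'score' column
csv_file = pd.read_csv("sample_imdb_data.csv")

mean = csv_file['score'].mean()
median = csv_file['score'].median()
# mode() returns a Series, because several values can tie for most common
mode = csv_file['score'].mode()

print("Mean of data =", mean)
print("Median of data =", median)
# Take the first (smallest) mode
print("Mode of data =", mode[0])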
Output:
Mean of data = 67.33333333333333
Median of data = 69.0
Mode of data = 73.0
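Because the example above depends on an external CSV file, here is a self-contained sketch with made-up scores (not taken from sample_imdb_data.csv). It also shows why mode() returns a Series: when several values tie for most frequent, all of them come back, sorted in ascending order.

```python
import pandas as pd

# Illustrative scores; 65 and 73 each appear twice
scores = pd.Series([60, 73, 73, 65, 80, 65])

mean_score = scores.mean()
median_score = scores.median()
# Every value tied for most frequent is returned
modes = scores.mode()

print("Mean =", mean_score)
print("Median =", median_score)
print("Modes =", modes.tolist())
```

Indexing with modes[0] would keep only the smallest of the tied values, so it is worth checking len(modes) before relying on a single mode.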
Correlation Analysis
Correlation reveals how variables change together, helping you spot patterns and relationships.
Example:
import pandas as pd # Suppose you have a dataset of sales and advertising expenses sales_data = {'Sales': [1000, 1500, 800, 1200], 'Ad_Expense': [200, 300, 100, 150]} sales_df = pd.DataFrame(sales_data) # Calculate the correlation coefficient correlation = sales_df['Sales'].corr(sales_df['Ad_Expense']) # This provides insight into how sales and advertising expenses relate print(correlation)
Output:
0.8660639443333457
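The corr() method computes the Pearson correlation coefficient by default and returns a value between -1 and 1. The sign indicates the direction of the relationship and the magnitude indicates its strength. Here, a value of about 0.87 points to a strong positive association between sales and advertising expenses.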
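To make the coefficient less of a black box, the sketch below recomputes it by hand using the standard Pearson formula (sum of the products of deviations, divided by the square root of the product of the summed squared deviations) and compares it with the pandas result. The variable names manual_r and pandas_r are introduced here for illustration only.

```python
import pandas as pd

sales_df = pd.DataFrame({
    'Sales': [1000, 1500, 800, 1200],
    'Ad_Expense': [200, 300, 100, 150],
})

# Deviation of each value from its column mean
sales_dev = sales_df['Sales'] - sales_df['Sales'].mean()
ad_dev = sales_df['Ad_Expense'] - sales_df['Ad_Expense'].mean()

# Pearson r: covariation of the two columns scaled by their spreads
manual_r = (sales_dev * ad_dev).sum() / (
    (sales_dev ** 2).sum() * (ad_dev ** 2).sum()
) ** 0.5

pandas_r = sales_df['Sales'].corr(sales_df['Ad_Expense'])
print(manual_r, pandas_r)

# DataFrame.corr() gives the full pairwise correlation matrix
print(sales_df.corr())
```

With more than two numeric columns, calling corr() on the whole DataFrame is usually more convenient than correlating pairs one at a time.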