Introduction to Pandas: Understanding Series and DataFrames
Pandas is an open-source data analysis and manipulation library for Python. It provides data structures and functions needed to work with structured data seamlessly. It’s particularly useful for working with data from various sources like CSV files, SQL databases, or Excel spreadsheets.
Here are some reasons why you should use Pandas:
- User-Friendly: Pandas is easy to learn and use. It lets you perform complex data operations with simple commands, making it accessible even if you’re new to programming.
- Data Cleaning and Preparation: Pandas makes it easy to clean and prepare data. You can handle missing values, filter data, and merge datasets effortlessly.
- Performance: Pandas is built on NumPy, and much of its core is implemented in C and Cython, so vectorized operations are fast and datasets that fit in memory are handled efficiently.
- Integration: Pandas works well with other Python libraries like NumPy, Matplotlib, and Scikit-Learn, making it a versatile tool for data science tasks.
- Community and Documentation: Pandas has a large community and plenty of documentation, so you can easily find resources for learning and troubleshooting.
Table of Contents
- Installing Pandas
- Series in Pandas
- Series Attributes in Pandas
- Series with Custom Indexes
- Create Series From a Dictionary
- Add Prefix to Series Indexes
- Operations on Series
- Broadcasting
- Understanding DataFrames in Pandas
- DataFrame With Custom Indexes
- Creating a DataFrame from a List of Lists
- DataFrame Attributes
- What Should You Learn Next?
- Best Practices in Pandas
Installing Pandas
Here’s how you can install Pandas using various methods and verify the installation:
Using pip
Pip is the package installer for Python. You can install Pandas using pip by running the following command in your terminal or command prompt:
pip install pandas
This command will download and install the latest version of Pandas and its dependencies from the Python Package Index (PyPI).
Using conda
If you have Anaconda or Miniconda installed, you can install Pandas using conda by running the following command in your terminal or command prompt:
conda install pandas
Importing Pandas and Verifying the Installation
To use Pandas in your Python scripts, you need to import it. It’s common practice to import Pandas with the alias pd. You can check the version of Pandas you’ve installed by running:
import pandas as pd
print(pd.__version__)
This will print the version number of Pandas installed on your system.
Series in Pandas
A Series is a one-dimensional array-like object that can hold many types of data, including numbers, strings, or even other objects. It is similar to a column in an Excel spreadsheet or a database table. Each item in a Series has a unique label, called an index, which makes it easy to access and manipulate the data.
You can create a Series by passing a list or array of values to the pd.Series() constructor:
Example:
import pandas as pd
data = [1, 2, 3, 4, 5]
s = pd.Series(data)
print(s)
In this snippet, we import Pandas and then create a Series named s from a Python list called data. This Series now contains the numbers 1 through 5, and each number is associated with an index label. If you print s, you’ll see both the data and the corresponding labels.
Output:
0 1
1 2
2 3
3 4
4 5
dtype: int64
Series Attributes in Pandas
A Series comes with several useful attributes and methods. They give you information about the Series, such as its size, data type, and index labels. The table below summarizes the key ones; the example outputs assume series = pd.Series([1, 2, 3, 4, 5]).
Attribute | Description | Example Usage | Example Output |
---|---|---|---|
index | The labels (indexes) of the Series, similar to row labels in a table. | series.index | RangeIndex(start=0, stop=5, step=1) |
values | The data values in the Series are returned as a NumPy array. | series.values | array([1, 2, 3, 4, 5]) |
dtype | The data type of the values in the Series. | series.dtype | dtype('int64') |
size | The number of elements (items) in the Series. | series.size | 5 |
name | The name of the Series, which is useful for labeling. | series.name | None (or a name if set, e.g., ‘My Series’) |
is_unique | True if all values in the Series are unique, otherwise False. | series.is_unique | True or False |
is_monotonic_increasing | True if the Series values never decrease from one element to the next, otherwise False. (The older is_monotonic attribute was removed in pandas 2.0.) | series.is_monotonic_increasing | True or False |
isnull() | Returns a Series of the same shape indicating if each value is missing (NaN). | series.isnull() | 0 False 1 False 2 False 3 False 4 False dtype: bool |
notnull() | Returns a Series of the same shape indicating if each value is not missing. | series.notnull() | 0 True 1 True 2 True 3 True 4 True dtype: bool |
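The entries in the table can be tried out directly. The following is a minimal sketch, assuming a small integer Series; the variable name series and the name "numbers" are chosen only for illustration.

```python
import pandas as pd

# Hypothetical example Series; the name "numbers" is illustrative only.
series = pd.Series([1, 2, 3, 4, 5], name="numbers")

print(series.index)                    # default labels 0 through 4
print(series.values)                   # underlying NumPy array
print(series.dtype)                    # integer dtype
print(series.size)                     # number of elements
print(series.name)                     # the name set above
print(series.is_unique)                # no repeated values
print(series.is_monotonic_increasing)  # values never decrease
print(series.isnull().any())           # whether anything is missing
```

Note that isnull() and notnull() are methods, so they need parentheses, while the other entries are plain attributes.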
Series with Custom Indexes
You can customize the index of a Series. This is useful if you want to label the data in a meaningful way instead of using the default numerical indexes.
Example:
import pandas as pd
fruits_list = ["apple", "banana", "watermelon", "grapes", "orange"]
fruit_series = pd.Series(fruits_list, index=["fruit 1", "fruit 2", "fruit 3", "fruit 4", "fruit 5"])
print(fruit_series)
print("-----------------------")
print(fruit_series["fruit 1"])
print(fruit_series["fruit 4"])
print(fruit_series["fruit 2"])
Output:
fruit 1 apple
fruit 2 banana
fruit 3 watermelon
fruit 4 grapes
fruit 5 orange
dtype: object
-----------------------
apple
grapes
banana
Create Series From a Dictionary
Another way to create a Series is by using a dictionary. The keys of the dictionary become the indexes, and the values become the data in the Series.
Using a dictionary is helpful when your data is naturally in key-value pairs, like the population of cities or scores of players.
Example:
import pandas as pd
days_dict = {"Day 1": "Sunday", "Day 2": "Monday", "Day 3": "Tuesday", "Day 4": "Wednesday", "Day 5": "Thursday", "Day 6": "Friday", "Day 7": "Saturday"}
days_series = pd.Series(days_dict)
print(days_series)
print("---------------------------")
# Passing index selects (and orders) only the listed keys
print(pd.Series(days_dict, index=["Day 5", "Day 2", "Day 4"]))
Output:
Day 1 Sunday
Day 2 Monday
Day 3 Tuesday
Day 4 Wednesday
Day 5 Thursday
Day 6 Friday
Day 7 Saturday
dtype: object
---------------------------
Day 5 Thursday
Day 2 Monday
Day 4 Wednesday
dtype: object
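It may help to see what happens when the index argument lists a label that is not a key in the dictionary. In the sketch below, "Day 9" is a hypothetical label used only to show this case: Pandas keeps the label and fills its value with NaN instead of raising an error.

```python
import pandas as pd

days_dict = {"Day 1": "Sunday", "Day 2": "Monday"}

# "Day 9" is a hypothetical label that is not a key in days_dict.
partial_series = pd.Series(days_dict, index=["Day 2", "Day 9"])
print(partial_series)

# The missing entry can be detected with isna()
print(partial_series.isna())
```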
Add Prefix to Series Indexes
By using the add_prefix() method, you can add a prefix to the index labels of a Pandas Series. This can make the labels more descriptive or help avoid conflicts when merging with other data.
Example:
import pandas as pd
days_list = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]
days_series = pd.Series(days_list).add_prefix("Day_")
print(days_series)
Output:
Day_0 Sunday
Day_1 Monday
Day_2 Tuesday
Day_3 Wednesday
Day_4 Thursday
Day_5 Friday
Day_6 Saturday
dtype: object
If you want to add a suffix instead of a prefix, use the add_suffix() method.
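A short sketch of add_suffix() follows; the suffix "_day" and the shortened list are assumptions made for illustration.

```python
import pandas as pd

# The suffix "_day" is an arbitrary, illustrative choice.
days_series = pd.Series(["Sunday", "Monday", "Tuesday"]).add_suffix("_day")
print(days_series)
print(list(days_series.index))
```

As with add_prefix(), only the index labels change; the values stay the same.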
Operations on Series
You can perform arithmetic operations on Series just like you would with numbers. These operations happen element by element, and Pandas pairs up the elements by index label. With the default indexes, this means the first element of the first Series is added to the first element of the second Series, and so on.
Example:
import pandas as pd
s1 = pd.Series([1, 2, 3])
s2 = pd.Series([10, 20, 30])
result = s1 + s2
print(result)
Output:
0 11
1 22
2 33
dtype: int64
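One detail worth knowing is that arithmetic between two Series matches elements by index label, not by position. The sketch below uses hypothetical labels "a" through "d" to show what happens when the indexes only partly overlap: labels present in just one Series produce NaN, unless you supply a fill value through the add() method.

```python
import pandas as pd

# Hypothetical labels chosen so that only "b" and "c" overlap.
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Labels found in only one Series give NaN
aligned_sum = s1 + s2
print(aligned_sum)

# add() with fill_value treats a missing counterpart as 0
filled_sum = s1.add(s2, fill_value=0)
print(filled_sum)
```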
Broadcasting
Broadcasting is a feature in Pandas that allows you to apply an operation between a Series and a single value (called a scalar). It helps you to perform the same operation on every element in a Series and saves you from writing loops.
Example:
import pandas as pd
data = [1, 2, 3, 4, 5]
s = pd.Series(data)
print(s)
# The scalar 2 is applied to every element
squared = s ** 2
print("_________________________")
print(squared)
Output:
0 1
1 2
2 3
3 4
4 5
dtype: int64
_________________________
0 1
1 4
2 9
3 16
4 25
dtype: int64
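Broadcasting is not limited to exponents. As a minimal sketch, the same idea applies to other arithmetic operators and to comparisons, and the result of a comparison can be used to filter the Series.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

plus_ten = s + 10      # adds 10 to every element
doubled = s * 2        # doubles every element
is_large = s > 2       # element-wise comparison, gives a boolean Series
large_only = s[is_large]  # boolean Series used as a filter

print(plus_ten)
print(doubled)
print(is_large)
print(large_only)
```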
Understanding DataFrames in Pandas
DataFrames are like tables in Pandas. They’re used to store data in a structured way, with rows and columns, just like you’d see in a spreadsheet or a database table. Each column in a DataFrame represents a different type of information, like names, ages, or cities, and each row represents a separate record or observation.
One simple way to create a DataFrame is from a dictionary. In this dictionary, the keys are the column names, and the values are lists or arrays containing the data for those columns.
Example:
import pandas as pd
student_dict = {"Name": ["Joe", "Gon", "Eren", "Loid"], "Age": [14, 12, 16, 18], "Class": ["10th", "8th", "11th", "12th"]}
new_dataframe = pd.DataFrame(student_dict)
print(new_dataframe)
In this example, we have a dictionary called student_dict, and we’ve created a DataFrame called new_dataframe from it. The keys (‘Name’, ‘Age’, and ‘Class’) are the column names, and the lists are the data.
Output:
Name Age Class
0 Joe 14 10th
1 Gon 12 8th
2 Eren 16 11th
3 Loid 18 12th
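To connect DataFrames back to Series: selecting a single column from a DataFrame gives you a Series, so everything covered in the Series sections applies to it. The sketch below reuses the student data from the example; the variable name ages is an illustrative choice.

```python
import pandas as pd

student_dict = {"Name": ["Joe", "Gon", "Eren", "Loid"],
                "Age": [14, 12, 16, 18],
                "Class": ["10th", "8th", "11th", "12th"]}
new_dataframe = pd.DataFrame(student_dict)

# Selecting one column returns a Series
ages = new_dataframe["Age"]
print(isinstance(ages, pd.Series))
print(ages.mean())  # average of 14, 12, 16, 18
```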
DataFrame With Custom Indexes
You can also set custom indexes (row labels) for your DataFrame. This can be useful if you want to give each row a meaningful label instead of just using the default numbers.
Example:
import pandas as pd
student_dict = {"Name": ["Joe", "Gon", "Eren", "Loid"], "Age": [14, 12, 16, 18], "Class": ["10th", "8th", "11th", "12th"]}
student_dataframe = pd.DataFrame(student_dict, index=["Student_1", "Student_2", "Student_3", "Student_4"])
print(student_dataframe)
print("----------------------")
# .loc selects a row by its label
print(student_dataframe.loc["Student_3"])
In this example, instead of the default row numbers (0, 1, 2, 3), the rows are labeled with “Student_1”, “Student_2”, “Student_3”, and “Student_4”.
Output:
Name Age Class
Student_1 Joe 14 10th
Student_2 Gon 12 8th
Student_3 Eren 16 11th
Student_4 Loid 18 12th
----------------------
Name Eren
Age 16
Class 11th
Name: Student_3, dtype: object
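The .loc[] accessor used in the example selects by label. Its counterpart, .iloc[], selects by integer position, which still works when the rows have custom labels. A minimal sketch using the same student data:

```python
import pandas as pd

student_dict = {"Name": ["Joe", "Gon", "Eren", "Loid"],
                "Age": [14, 12, 16, 18],
                "Class": ["10th", "8th", "11th", "12th"]}
student_dataframe = pd.DataFrame(
    student_dict,
    index=["Student_1", "Student_2", "Student_3", "Student_4"],
)

# Label-based: row "Student_3", column "Name"
by_label = student_dataframe.loc["Student_3", "Name"]
# Position-based: third row (position 2), first column (position 0)
by_position = student_dataframe.iloc[2, 0]

print(by_label)
print(by_position)
```

Both lookups return the same cell; which one to use depends on whether you know the label or the position.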
Creating a DataFrame from a List of Lists
You can create a DataFrame from a list of lists too. Each inner list represents a row of data, and you can specify the column names separately.
Example:
import pandas as pd
# Each inner list is one row
data = [[1, 'Max'], [2, 'Gon'], [3, 'Barbie']]
df = pd.DataFrame(data, columns=['ID', 'Name'])
print(df)
In this snippet, we’ve used a list of lists to create a DataFrame. The columns parameter allows us to specify custom column names.
Output:
ID Name
0 1 Max
1 2 Gon
2 3 Barbie
DataFrame Attributes
Just like Series, DataFrames come with their own set of attributes. They provide valuable insights into the structure, content, and characteristics of your DataFrame:
Attribute | Description |
---|---|
index | Represents the row labels, which identify each row in the DataFrame. These labels are typically either integers or strings. |
values | Contains the actual data in the DataFrame, displayed as a 2D array. |
dtypes | Displays the data types of each column in the DataFrame. |
size | Represents the total number of elements (cells) in the DataFrame. |
columns | The column labels of the DataFrame, representing the different types of data in the DataFrame. |
shape | Shows the size of the DataFrame, indicating its number of rows and columns. |
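The attributes in the table can be inspected on any DataFrame. Here is a minimal sketch that reuses the student data from earlier in this article:

```python
import pandas as pd

student_dataframe = pd.DataFrame({
    "Name": ["Joe", "Gon", "Eren", "Loid"],
    "Age": [14, 12, 16, 18],
    "Class": ["10th", "8th", "11th", "12th"],
})

print(student_dataframe.index)    # row labels, 0 through 3 by default
print(student_dataframe.columns)  # the three column labels
print(student_dataframe.dtypes)   # one data type per column
print(student_dataframe.shape)    # (rows, columns)
print(student_dataframe.size)     # rows multiplied by columns
print(student_dataframe.values)   # 2D NumPy array of the cell values
```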
What Should You Learn Next?
- Saving and Loading Data with Pandas and Basic Data Exploration
- Handling Missing Values, Managing Duplicates, and Data Filtering and Sorting in Pandas
- Advanced Data Handling Techniques with Pandas
- Working With Text Data and Statistical Analysis with Pandas
Best Practices in Pandas
Pandas is a versatile tool for data analysis, but to harness its full potential, it’s important to follow some best practices.
Use Meaningful Variable Names
When you’re working with Pandas, it’s tempting to use short, concise variable names. However, using meaningful names for your DataFrames and Series makes your code more readable and self-explanatory.
Example:
# Avoid:
df = pd.read_csv('data.csv')
# Prefer:
customer_data = pd.read_csv('customer_data.csv')
Using descriptive variable names helps you and others understand the purpose of your data.
Check for Missing Data
Before starting the data analysis, check for missing data. Pandas provides useful methods for this, such as isna(), isnull() (an alias of isna()), and notna(). Handle missing data appropriately with methods like dropna() or fillna(). Both return a new object by default; pass inplace=True if you want to modify the original instead.
Example:
# Check for missing data
missing_data = df.isna().sum()
# Fill missing values in numeric columns with the column mean
df.fillna(df.mean(numeric_only=True), inplace=True)
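To make the snippet above concrete, here is a self-contained sketch. The DataFrame and its column names (price, quantity) are hypothetical and exist only to illustrate the calls.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column.
df = pd.DataFrame({
    "price": [10.0, np.nan, 30.0],
    "quantity": [1.0, 2.0, np.nan],
})

# Count missing values per column
missing_counts = df.isna().sum()
print(missing_counts)

# fillna() returns a new DataFrame here; df itself is left unchanged
filled = df.fillna(df.mean(numeric_only=True))
print(filled)

# dropna() removes every row that contains at least one missing value
dropped = df.dropna()
print(dropped)
```

Whether to fill or drop depends on the analysis; filling with the mean is only one possible strategy.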
Avoid Chained Indexing
Chained indexing, like df['column']['row_label'], can lead to unpredictable results, especially when assigning values, because the first lookup may return a copy rather than a view of the original data. Instead, use loc[] or iloc[] to select data by label or by integer position in a single step.
Example:
# Avoid chained indexing
value = df['column']['row_label']
# Prefer .loc[] (labels) or .iloc[] (integer positions)
value = df.loc['row_label', 'column']
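The difference matters most when assigning values. The following sketch uses a hypothetical DataFrame (the column name score and the row labels are illustrative) to show reading and writing through a single .loc[] call.

```python
import pandas as pd

# Hypothetical data; "score" and the row labels are illustrative names.
df = pd.DataFrame({"score": [50, 60, 70]}, index=["a", "b", "c"])

# Reading: one .loc call with (row label, column label)
value = df.loc["b", "score"]
print(value)

# Writing: a single .loc assignment updates df itself.
# Here every score below 65 is raised to 65.
df.loc[df["score"] < 65, "score"] = 65
print(df)
```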
Use Vectorized Operations
Pandas is optimized for vectorized operations, which are significantly faster than using loops. Whenever possible, perform operations on the entire Series or DataFrames instead of iterating through rows.
Example:
# Avoid iterating through rows
for index, row in df.iterrows():
    df.at[index, 'new_column'] = row['old_column'] * 2
# Prefer vectorized operations
df['new_column'] = df['old_column'] * 2
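As a quick check that the two approaches are equivalent, the sketch below builds a hypothetical 1,000-row DataFrame, computes the doubled values both ways, and compares the results. It does not measure timing; it only shows that the vectorized form produces the same values in a single expression.

```python
import pandas as pd

# Hypothetical data: 1,000 rows of consecutive integers.
df = pd.DataFrame({"old_column": range(1_000)})

# Loop version: one Python-level step per row
looped = [row["old_column"] * 2 for _, row in df.iterrows()]

# Vectorized version: one operation over the whole column
df["new_column"] = df["old_column"] * 2

print(df["new_column"].tolist() == looped)
```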
Minimize Memory Usage
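Large datasets can consume a lot of memory. Use the smallest data type that can hold your values (e.g., int8 for integers between -128 and 127, float32 when reduced precision is acceptable) to minimize memory usage. The info() method is helpful for assessing memory consumption.
Example:
# Check memory usage
df.info()
# Convert to a more memory-efficient data type
# (only safe if every value fits in the int8 range, -128 to 127)
df['column'] = df['column'].astype('int8')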
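To see the effect of a smaller data type, you can compare memory usage before and after the conversion. This sketch assumes a hypothetical column of small integers and uses the memory_usage() method of a Series, which reports the size in bytes.

```python
import pandas as pd

# Hypothetical column of small integers that all fit in int8 (-128 to 127).
df = pd.DataFrame({"column": [1, 2, 3, 4, 5] * 1000})

# Bytes used by the column's values, excluding the index
before = df["column"].memory_usage(index=False)

df["column"] = df["column"].astype("int8")
after = df["column"].memory_usage(index=False)

print(before, after)
```

With int8, each value takes a single byte, so the column shrinks considerably while the values stay the same.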
Document Your Code
Maintain clear documentation for your code. Explain the purpose of your analysis, your data sources, and the methods used. Good documentation makes it easier for you and your team to revisit and understand the analysis later.
Example:
# Include comments to explain your code
# This section calculates the average revenue per customer
avg_revenue = df['Revenue'].mean()
Use Version Control
Version control, like Git, is invaluable when working on data analysis projects. It allows you to track changes, collaborate with others, and revert to previous versions if needed.
These best practices will not only enhance your productivity but also help you avoid common pitfalls in data analysis using Pandas.