Pandas in Python: Learn Series and DataFrame Structures

Pandas is an open-source data analysis and manipulation library for Python. It provides data structures and functions needed to work with structured data seamlessly. It’s particularly useful for working with data from various sources like CSV files, SQL databases, or Excel spreadsheets.

Here are some reasons why you should use Pandas:

User-Friendly: Pandas is easy to learn and use. It lets you perform complex data operations with simple commands, making it accessible even if you’re new to programming.
Data Cleaning and Preparation: Pandas makes it easy to clean and prepare data. You can handle missing values, filter data, and merge datasets effortlessly.
Performance: Pandas is built on NumPy, and its performance-critical parts are written in Cython and C, so operations on whole columns are fast for datasets that fit in memory.
Integration: Pandas works well with other Python libraries like NumPy, Matplotlib, and Scikit-Learn, making it a versatile tool for data science tasks.
Community and Documentation: Pandas has a large community and plenty of documentation, so you can easily find resources for learning and troubleshooting.

Installing Pandas
Series in Pandas
Series Attributes in Pandas
Series with Custom Indexes
Create Series From a Dictionary
Add Prefix to Series Indexes
Operations on Series
Broadcasting
Understanding DataFrames in Pandas
DataFrame With Custom Indexes
Creating a DataFrame from a List of Lists
DataFrame Attributes
What You Should Learn Next?
Best Practices in Pandas

Installing Pandas

Here’s how you can install Pandas using various methods and verify the installation:

Using pip

Pip is the package installer for Python. You can install Pandas using pip by running the following command in your terminal or command prompt:

pip install pandas

This command will download and install the latest version of Pandas and its dependencies from the Python Package Index (PyPI).

Using conda

If you have Anaconda or Miniconda installed, you can install Pandas using conda by running the following command in your terminal or command prompt:

conda install pandas

Importing Pandas and Verifying the Installation

To use Pandas in your Python scripts, you need to import it. It’s common practice to import Pandas with the alias pd. You can check the version of Pandas you’ve installed by typing:

import pandas as pd

print(pd.__version__)

This will print the version number of Pandas installed on your system.

Series in Pandas

A Series is a one-dimensional array-like object that can hold many types of data, including numbers, strings, or even other objects. It is similar to a column in an Excel spreadsheet or a database table. Each item in a Series has a label, and together these labels form the index, which makes it easy to access and manipulate the data. Labels are usually unique, although Pandas does not require it.

You can create a Series by passing a list or array of values to the pd.Series() constructor:

Example:

import pandas as pd

data = [1, 2, 3, 4, 5]
s = pd.Series(data)

print(s)

In this snippet, we import Pandas and then create a Series named s from a Python list called data. This Series now contains the numbers 1 through 5, and each number is associated with an index label. If you peek inside s, you’ll see both the data and the corresponding labels.

Output:

0    1
1    2
2    3
3    4
4    5
dtype: int64

Series Attributes in Pandas

A Series comes with several useful attributes and methods. They give you information about the Series, like its size, data type, and index labels. Here’s a simple guide to some key ones; entries written with parentheses, such as isnull(), are methods that you call.

Attribute	Description	Example Usage	Example Output
`index`	The index labels of the Series, similar to row labels in a table.	`series.index`	`RangeIndex(start=0, stop=5, step=1)`
`values`	The data values in the Series, returned as a NumPy array.	`series.values`	`array([1, 2, 3, 4, 5])`
`dtype`	The data type of the values in the Series.	`series.dtype`	`dtype('int64')`
`size`	The number of elements (items) in the Series.	`series.size`	`5`
`name`	The name of the Series, which is useful for labeling.	`series.name`	`None` (or a name if set, e.g., `'My Series'`)
`is_unique`	True if all values in the Series are unique, otherwise False.	`series.is_unique`	`True` or `False`
`is_monotonic_increasing`	True if each value is greater than or equal to the one before it, otherwise False. (Older versions also offered `is_monotonic`, which was removed in pandas 2.0.)	`series.is_monotonic_increasing`	`True` or `False`
`isnull()`	Method that returns a boolean Series of the same shape indicating if each value is missing (NaN).	`series.isnull()`	`0 False 1 False 2 False 3 False 4 False dtype: bool`
`notnull()`	Method that returns a boolean Series of the same shape indicating if each value is not missing.	`series.notnull()`	`0 True 1 True 2 True 3 True 4 True dtype: bool`
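
To see these in action, here is a minimal sketch that reads each one from the five-element Series created earlier. The variable name series is chosen to match the Example Usage column, and the sketch uses is_monotonic_increasing because recent pandas releases no longer accept the shorter is_monotonic spelling.

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5])

print(series.index)       # RangeIndex(start=0, stop=5, step=1)
print(series.values)      # [1 2 3 4 5]
print(series.dtype)       # int64
print(series.size)        # 5
print(series.name)        # None
print(series.is_unique)   # True
print(series.is_monotonic_increasing)  # True

# isnull() is a method, so it needs parentheses; sum() counts the True values
print(series.isnull().sum())           # 0
```

Note that print() shows the values array without the array(...) wrapper that you would see in an interactive session.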

Series with Custom Indexes

You can customize the index of a Series. This is useful if you want to label the data in a meaningful way instead of using the default numerical indexes.

Example:

import pandas as pd

fruits_list = ["apple", "banana", "watermelon", "grapes", "orange"]

fruit_series = pd.Series(fruits_list, index=["fruit 1", "fruit 2", "fruit 3", "fruit 4", "fruit 5"])

print(fruit_series)

print("-----------------------")

print(fruit_series["fruit 1"])
print(fruit_series["fruit 4"])
print(fruit_series["fruit 2"])

Output:

fruit 1         apple
fruit 2        banana
fruit 3    watermelon
fruit 4        grapes
fruit 5        orange
dtype: object
-----------------------
apple
grapes
banana
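
A Series with custom labels can still be read by position. The short sketch below, which uses a trimmed three-item version of the fruit list, contrasts loc[] (lookup by label) with iloc[] (lookup by integer position). It is meant as an illustration rather than the only way to access values.

```python
import pandas as pd

fruit_series = pd.Series(
    ["apple", "banana", "watermelon"],
    index=["fruit 1", "fruit 2", "fruit 3"],
)

by_label = fruit_series.loc["fruit 2"]   # look up by index label
by_position = fruit_series.iloc[1]       # look up by integer position (0-based)

print(by_label)      # banana
print(by_position)   # banana
```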

Create Series From a Dictionary

Another way to create a Series is by using a dictionary. The keys of the dictionary become the indexes, and the values become the data in the Series.

Using a dictionary is helpful when your data is naturally in key-value pairs, like the population of cities or scores of players.

Example:

import pandas as pd

days_dict = {"Day 1" : "Sunday", "Day 2" : "Monday", "Day 3" : "Tuesday", "Day 4" : "Wednesday", "Day 5" : "Thursday", "Day 6" : "Friday", "Day 7" : "Saturday"}

days_series = pd.Series(days_dict)

print(days_series)

print("---------------------------")

print(pd.Series(days_dict, index = ["Day 5", "Day 2", "Day 4"]))

Output:

Day 1       Sunday
Day 2       Monday
Day 3      Tuesday
Day 4    Wednesday
Day 5     Thursday
Day 6       Friday
Day 7     Saturday
dtype: object
---------------------------
Day 5     Thursday
Day 2       Monday
Day 4    Wednesday
dtype: object
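
The second print in the example passes an index list along with the dictionary, which selects and orders the keys. As a small sketch of what happens when a requested label is not a key in the dictionary, the following uses a made-up label, "Day 9"; Pandas keeps the label and stores a missing value for it.

```python
import pandas as pd

days_dict = {"Day 1": "Sunday", "Day 2": "Monday"}

# "Day 9" is a hypothetical label that does not exist in days_dict
partial = pd.Series(days_dict, index=["Day 2", "Day 9"])

print(partial)
print(partial.isna())   # True only for the "Day 9" entry
```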

Add Prefix to Series Indexes

By using the add_prefix() method, you can add a prefix to each index label of a Pandas Series. This can make the labels more descriptive or help avoid conflicts when combining the Series with other data.

Example:

import pandas as pd

days_list = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]

days_series = pd.Series(days_list).add_prefix("Day_")

print(days_series)

Output:

Day_0       Sunday
Day_1       Monday
Day_2      Tuesday
Day_3    Wednesday
Day_4     Thursday
Day_5       Friday
Day_6     Saturday
dtype: object

If you want to add a suffix instead of a prefix, then you should use the add_suffix() method.
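
A brief sketch of add_suffix() on a shortened list of days follows. The suffix "_day" is just an illustrative choice.

```python
import pandas as pd

days_series = pd.Series(["Sunday", "Monday", "Tuesday"]).add_suffix("_day")

# The default integer labels become strings with the suffix appended
print(days_series.index.tolist())   # ['0_day', '1_day', '2_day']
```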

Operations on Series

You can perform arithmetic operations on Series just like you would with numbers. These operations happen element by element, and Pandas matches the elements by index label. For example, if you add two Series that both use the default index, Pandas adds the first element of the first Series to the first element of the second Series, and so on.

Example:

import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([10, 20, 30])

result = s1 + s2

print(result)

Output:

0    11
1    22
2    33
dtype: int64
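
The example above pairs values up neatly because both Series share the default index 0, 1, 2. When the labels differ, Pandas aligns on the labels rather than on position. The sketch below uses made-up labels "a" through "d" to show this, and it shows the add() method with fill_value as one possible way to treat a missing partner as 0.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Labels "a" and "d" exist in only one Series, so their results are NaN
result = s1 + s2
print(result)

# add() with fill_value=0 treats a missing partner as 0 instead
filled = s1.add(s2, fill_value=0)
print(filled)
```

Because NaN is a floating-point value, both results come back as float64 rather than int64.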

Broadcasting

Broadcasting is a feature in Pandas that allows you to apply an operation between a Series and a single value (called a scalar). It helps you to perform the same operation on every element in a Series and saves you from writing loops.

Example:

import pandas as pd

data = [1, 2, 3, 4, 5]
s = pd.Series(data)

print(s)

squared = s ** 2
print("_________________________")
print(squared)

Output:

0    1
1    2
2    3
3    4
4    5
dtype: int64
_________________________
0     1
1     4
2     9
3    16
4    25
dtype: int64
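
Broadcasting is not limited to powers. As a rough illustration, the sketch below applies an addition and a comparison between the same Series and a scalar; the variable names plus_ten and is_even are only illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

plus_ten = s + 10      # adds 10 to every element
is_even = s % 2 == 0   # tests every element, producing booleans

print(plus_ten.tolist())   # [11, 12, 13, 14, 15]
print(is_even.tolist())    # [False, True, False, True, False]
```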

Understanding DataFrames in Pandas

DataFrames are like tables in Pandas. They’re used to store data in a structured way, with rows and columns, just like you’d see in a spreadsheet or a database table. Each column in a DataFrame represents a different type of information, like names, ages, or cities, and each row represents a separate record or observation.

One simple way to create a DataFrame is from a dictionary. In this dictionary, the keys are the column names, and the values are lists or arrays containing the data for those columns.

Example:

import pandas as pd

student_dict = {"Name" : ["Joe", "Gon", "Eren", "Loid"], 
                "Age" : [14, 12, 16, 18], 
                "Class" : ["10th", "8th", "11th", "12th"]}

new_dataframe = pd.DataFrame(student_dict)
print(new_dataframe)

In this example, we have a dictionary called student_dict, and we’ve created a DataFrame called new_dataframe from it. The keys ('Name', 'Age', and 'Class') are column names, and the lists are the data.

Output:

   Name  Age Class
0   Joe   14  10th
1   Gon   12   8th
2  Eren   16  11th
3  Loid   18  12th
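
To connect the two structures: selecting a single column from a DataFrame gives you a Series, so everything covered in the Series sections applies to it. A minimal sketch, using a two-column version of the student data and the illustrative names students and ages:

```python
import pandas as pd

student_dict = {"Name": ["Joe", "Gon", "Eren", "Loid"],
                "Age": [14, 12, 16, 18]}
students = pd.DataFrame(student_dict)

ages = students["Age"]        # a single column is a Series
print(type(ages).__name__)    # Series
print(ages.mean())            # 15.0
```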

DataFrame With Custom Indexes

You can also set custom indexes (row labels) for your DataFrame. This can be useful if you want to give each row a meaningful label instead of just using the default numbers.

Example:

import pandas as pd

student_dict = {"Name" : ["Joe", "Gon", "Eren", "Loid"], 
                "Age" : [14, 12, 16, 18], 
                "Class" : ["10th", "8th", "11th", "12th"]}

student_dataframe = pd.DataFrame(student_dict, index = ["Student_1", "Student_2", "Student_3", "Student_4"])

print(student_dataframe)

print("----------------------")

print(student_dataframe.loc["Student_3"])

In this example, instead of the default row numbers (0, 1, 2, 3), the rows are labeled with “Student_1”, “Student_2”, “Student_3”, and “Student_4”.

Output:

           Name  Age Class
Student_1   Joe   14  10th
Student_2   Gon   12   8th
Student_3  Eren   16  11th
Student_4  Loid   18  12th
----------------------
Name     Eren
Age        16
Class    11th
Name: Student_3, dtype: object
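
loc[] can also take a column label to pick out a single cell, and iloc[] does the same by integer position. The following sketch, built on a smaller version of the student table, shows both forms side by side.

```python
import pandas as pd

student_dataframe = pd.DataFrame(
    {"Name": ["Joe", "Gon", "Eren"], "Age": [14, 12, 16]},
    index=["Student_1", "Student_2", "Student_3"],
)

# Row label first, then column label
age_by_label = student_dataframe.loc["Student_3", "Age"]

# Row position first, then column position (both 0-based)
name_by_position = student_dataframe.iloc[2, 0]

print(age_by_label)       # 16
print(name_by_position)   # Eren
```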

Creating a DataFrame from a List of Lists

You can create a DataFrame from a list of lists too. Each inner list represents a row of data, and you can specify the column names separately.

Example:

import pandas as pd

data = [[1, 'Max'], [2, 'Gon'], [3, 'Barbie']]
df = pd.DataFrame(data, columns=['ID', 'Name'])

print(df)

In this snippet, we’ve used a list of lists to create a DataFrame. The columns parameter allows us to specify custom column names.

Output:

   ID    Name
0   1     Max
1   2     Gon
2   3  Barbie

DataFrame Attributes

Just like Series, DataFrames come with their own set of attributes. They provide valuable insights into the structure, content, and characteristics of your DataFrame:

Attribute	Description
`index`	Represents the row labels, which identify each row in the DataFrame. These labels are typically either integers or strings.
`values`	Contains the actual data in the DataFrame, returned as a 2D NumPy array.
`dtypes`	Displays the data type of each column in the DataFrame.
`size`	Represents the total number of elements (cells) in the DataFrame, that is, rows multiplied by columns.
`columns`	The column labels of the DataFrame, representing the different types of data in the DataFrame.
`shape`	Shows the size of the DataFrame as a tuple of (number of rows, number of columns).
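
As a quick illustration, this sketch reads each of these attributes from the small ID and Name DataFrame built in the previous section.

```python
import pandas as pd

df = pd.DataFrame([[1, "Max"], [2, "Gon"], [3, "Barbie"]], columns=["ID", "Name"])

print(df.shape)              # (3, 2)
print(df.size)               # 6
print(df.columns.tolist())   # ['ID', 'Name']
print(df.index)              # RangeIndex(start=0, stop=3, step=1)
print(df.dtypes)             # one data type per column
print(df.values)             # the cell data as a 2D array
```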

What You Should Learn Next?

Best Practices in Pandas

Pandas is a versatile tool for data analysis, but to harness its full potential, it’s important to follow some best practices.

Use Meaningful Variable Names

When you’re working with Pandas, it’s tempting to use short, concise variable names. However, using meaningful names for your DataFrames and Series makes your code more readable and self-explanatory.

Example:

# Avoid:
df = pd.read_csv('data.csv')

# Prefer:
customer_data = pd.read_csv('customer_data.csv')

Using descriptive variable names helps you and others understand the purpose of your data.

Check for Missing Data

Before starting the data analysis, check for missing data. Pandas provides useful methods for this, such as isna(), its alias isnull(), and notna(). Handle missing data appropriately with methods like dropna(), which removes rows or columns that contain missing values, or fillna(), which replaces them with a value you choose.

Example:

# Count missing values in each column
missing_data = df.isna().sum()

# Fill missing values in numeric columns with each column's mean
df = df.fillna(df.mean(numeric_only=True))
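
For a version you can run end to end, here is a small sketch on invented data. The scores DataFrame, its player and score columns, and the values in it are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing score
scores = pd.DataFrame({"player": ["Ann", "Bo", "Cy"],
                       "score": [10.0, np.nan, 20.0]})

# Count missing values per column
missing_per_column = scores.isna().sum()
print(missing_per_column["score"])   # 1

# Replace the gap with the mean of the non-missing values (15.0)
scores["score"] = scores["score"].fillna(scores["score"].mean())
print(scores["score"].tolist())      # [10.0, 15.0, 20.0]
```

Filling with the mean is only one option; whether it suits your data depends on the analysis.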

Avoid Chained Indexing

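Chained indexing means applying two indexing operations back to back, like df['column1']['row_label']. The first operation returns an intermediate object that may be a copy of the data, so assigning through a chain can silently fail to update the original DataFrame. Instead, use loc[] or iloc[] to select rows and columns in a single step, based on labels or integer positions.

Example:

# Avoid chained indexing: selects the column, then the row
value = df['column1']['row_label']

# Prefer .loc[] (labels) or .iloc[] (integer positions): row first, then column
value = df.loc['row_label', 'column1']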
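
The difference matters most when you assign values. The sketch below uses a hypothetical price column and sets every price above 10 to 0 with a single loc[] call. The chained form of the same assignment, shown only in a comment, may act on a temporary copy and leave the DataFrame unchanged.

```python
import pandas as pd

df = pd.DataFrame({"price": [5, 15, 25]}, index=["a", "b", "c"])

# Chained form to avoid (may modify a copy):
#   df[df["price"] > 10]["price"] = 0

# One .loc call selects the rows and the column together, then assigns
df.loc[df["price"] > 10, "price"] = 0

print(df["price"].tolist())   # [5, 0, 0]
```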

Use Vectorized Operations

Pandas is optimized for vectorized operations, which are significantly faster than using loops. Whenever possible, perform operations on the entire Series or DataFrames instead of iterating through rows.

Example:

# Avoid iterating through rows
for index, row in df.iterrows():
    df.at[index, 'new_column'] = row['old_column'] * 2

# Prefer vectorized operations
df['new_column'] = df['old_column'] * 2

Minimize Memory Usage

Large datasets can consume a lot of memory. Use the appropriate data types (e.g., int8, float32) to minimize memory usage. The info() method is helpful for assessing memory consumption.

Example:

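# Check dtypes and memory usage (memory_usage='deep' also measures string contents)
df.info(memory_usage='deep')

# Convert to a smaller data type; int8 only holds values from -128 to 127
df['column'] = df['column'].astype('int8')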
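
If you are unsure which type is safe, one option is to let Pandas choose. The sketch below uses pd.to_numeric() with its downcast parameter on a made-up rating column and compares memory usage before and after. The column name and values are hypothetical.

```python
import pandas as pd

# Hypothetical column of small integers, stored as int64 by default
df = pd.DataFrame({"rating": [1, 5, 3, 4, 2] * 1000})

before = df["rating"].memory_usage(deep=True)

# downcast="integer" picks the smallest integer type that holds every value
df["rating"] = pd.to_numeric(df["rating"], downcast="integer")

after = df["rating"].memory_usage(deep=True)

print(df["rating"].dtype)   # int8
print(before > after)       # True
```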

Document Your Code

Maintain clear documentation for your code. Explain the purpose of your analysis, your data sources, and the methods used. Good documentation makes it easier for you and your team to revisit and understand the analysis later.

Example:

# Include comments to explain your code
# This section calculates the average revenue per customer
avg_revenue = df['Revenue'].mean()

Use Version Control

Version control, like Git, is invaluable when working on data analysis projects. It allows you to track changes, collaborate with others, and revert to previous versions if needed.

These best practices will not only enhance your productivity but also help you avoid common pitfalls in data analysis using Pandas.

