Saving and Loading Data with Pandas and Basic Data Exploration
Loading Data with Pandas
One of the core functionalities of Pandas is its ability to load data from various file formats and sources. Here, we’ll explain how to load data into Pandas from different types of files: CSV, JSON, Excel, and databases.
Loading Data From CSV Files
CSV (Comma-Separated Values) files are one of the most common data formats. Each line in a CSV file corresponds to a row in the data, and the values in the rows are separated by commas.
To load a CSV file into a Pandas DataFrame, you use the read_csv() function. You need to pass the file path or URL of the CSV file to this function:
Example:
import pandas as pd

csv_file = pd.read_csv("sample_imdb_data.csv")
print(csv_file.to_string())
Output:
names date_x score orig_lang country
0 Creed III 3/2/2023 73.0 English AU
1 Avatar: The Way of Water 12/15/2022 78.0 English AU
2 The Super Mario Bros. Movie 4/5/2023 76.0 English AU
3 Mummies 1/5/2023 70.0 Spanish, Castilian AU
4 Supercell 3/17/2023 61.0 English US
5 Supercell 3/17/2023 61.0 English US
6 John Wick: Chapter 4 3/23/2023 80.0 NaN AU
7 Puss in Boots: The Last Wish 12/26/2022 83.0 English AU
8 Attack on Titan 9/30/2022 NaN English US
9 The Park 3/2/2023 58.0 English US
10 Winnie the Pooh: Blood and Honey 2/14/2023 58.0 English AU
11 The Exorcist 11/2/2022 55.0 Spanish, Castilian MX
12 The Exorcist 11/2/2022 55.0 Spanish, Castilian MX
13 Black Panther: Wakanda Forever 11/10/2022 73.0 English AU
14 The Pope's Exorcist 4/6/2023 72.0 English AU
15 Prizefighter: The Life of Jem Belcher 7/22/2022 62.0 English GB
16 Knock at the Cabin 11/16/2507 64.0 English AU
17 The Devil Conspiracy 1/13/2023 65.0 English US
18 Cazadora 1/19/2023 57.0 Spanish, Castilian CL
19 Gold Run 12/15/2022 65.0 Norwegian NO
20 The Magician's Elephant 3/17/2023 73.0 English AU
21 The Magician's Elephant 3/17/2023 73.0 English AU
22 The Passion of the Christ 2/25/2004 74.0 English AU
23 Batman: The Doom That Came to Gotham NaN NaN English US
24 Shazam! Fury of the Gods 3/16/2023 69.0 English AU
25 Consecration 2/10/2023 65.0 English US
26 Shark Side of the Moon 8/12/2022 53.0 English US
27 Black Adam 10/20/2022 71.0 English AU
28 M3GAN 1/12/2023 74.0 English AU
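read_csv() also accepts optional parameters that are often needed in practice, such as sep, header, names, and nrows. The snippet below is only a sketch; the file name and column names are assumptions for illustration and are not part of the dataset above:

import pandas as pd

# A hypothetical semicolon-separated file with no header row (illustrative only)
data = pd.read_csv(
    'other_data.csv',                  # assumed file name
    sep=';',                           # column separator, the default is ','
    header=None,                       # the file has no header row
    names=['name', 'date', 'score'],   # so we supply column names ourselves
    nrows=100                          # read only the first 100 rows
)
print(data.head())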
Loading Data From JSON Files
JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for humans to read and write and easy for machines to parse and generate.
To load a JSON file into a Pandas DataFrame, you use the read_json() function.
Example:
import pandas as pd

jsonfile = pd.read_json("sample_jsonfile.json")
print(jsonfile)
print("--------------------------")
print(jsonfile.head(2))
print("--------------------------")
print(jsonfile.tail(2))
Output:
Name Class Age
Student_1 Eren 10th 14
Student_2 Mikasa 10th 14
Student_3 Sammy 12th 17
Student_4 Kevin 9th 12
Student_5 Joey 8th 11
--------------------------
Name Class Age
Student_1 Eren 10th 14
Student_2 Mikasa 10th 14
--------------------------
Name Class Age
Student_4 Kevin 9th 12
Student_5 Joey 8th 11
Loading Data From Excel Files
Excel files are widely used for data storage and manipulation. Pandas can read Excel files using the read_excel() function.
Example:
import pandas as pd

# Load data from an Excel file
data = pd.read_excel('your_data.xlsx')

# Display the first few rows of the DataFrame
print(data.head())
Replace 'your_data.xlsx' with the actual Excel file path and specify the sheet name as needed.
You can also specify the sheet name if your Excel file has multiple sheets:
# Load data from Sheet1
data = pd.read_excel('your_data.xlsx', sheet_name='Sheet1')
Loading Data From Databases
Pandas can also load data directly from SQL databases. To do this, you'll need a database connector library like sqlite3 for SQLite databases or sqlalchemy for various other databases.
Example:
import pandas as pd
import sqlite3

# Connect to a SQLite database
conn = sqlite3.connect('your_database.db')

# Load data from a database table
data = pd.read_sql_query('SELECT * FROM your_table', conn)

# Display the first few rows of the DataFrame
print(data.head())
In this example, 'your_database.db' is the SQLite database file, and 'your_table' is the name of the table you want to query. The read_sql_query() function executes the SQL query and returns a DataFrame.
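For databases other than SQLite, the usual pattern is to create a SQLAlchemy engine and pass it to Pandas. The connection string below is purely illustrative (a hypothetical PostgreSQL database), and it assumes sqlalchemy and a suitable database driver are installed:

import pandas as pd
from sqlalchemy import create_engine

# Illustrative connection string; replace with your own database credentials
engine = create_engine('postgresql://user:password@localhost:5432/your_database')

# Load the result of a SQL query into a DataFrame
data = pd.read_sql('SELECT * FROM your_table', engine)
print(data.head())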
Saving Data in Pandas
After working with data in Pandas, you often need to save your data for later use, sharing, or further analysis. Pandas makes it easy to save data in various formats, such as CSV, Excel, JSON, and databases. Let's explore how to save data in these formats.
Saving Data To CSV
CSV files are widely used because they are easy to read and write. To save a DataFrame to a CSV file in Pandas, use the to_csv() function. Here's how:
Example:
import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['Ashish', 'Lucy', 'Meera'],
    'Age': [25, 30, 35],
    'City': ['Mumbai', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df.to_csv('data.csv', index=False)
In this example:
- 'data.csv' is the name of the file you're saving the data to.
- index=False means the row numbers (index) will not be saved in the file, making it cleaner.
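A quick way to confirm the file was written as expected is to read it straight back in:

import pandas as pd

# Read the saved file back to verify its contents
check = pd.read_csv('data.csv')
print(check)

You should see the same three rows (Ashish, Lucy, and Meera) without an extra index column, since the file was saved with index=False.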
Saving Data To Excel
Excel files are another common format, especially useful when sharing data with others who might use Excel to view or analyze the data.
You can save a DataFrame to an Excel file by using the to_excel() function.
Example:
# Save the DataFrame to an Excel file
df.to_excel('data.xlsx', index=False)
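If you need to write more than one DataFrame to the same workbook, you can give each one its own sheet with pd.ExcelWriter. This is only a sketch: df is the DataFrame from the earlier example, df2 is a hypothetical second DataFrame, and writing .xlsx files assumes an Excel engine such as openpyxl is installed.

import pandas as pd

# A hypothetical second DataFrame to store on its own sheet
df2 = pd.DataFrame({'Product': ['Pen', 'Book'], 'Price': [10, 250]})

# Write both DataFrames into one workbook, each on a separate sheet
with pd.ExcelWriter('data.xlsx') as writer:
    df.to_excel(writer, sheet_name='People', index=False)
    df2.to_excel(writer, sheet_name='Products', index=False)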
Saving Data To a Database
Databases are great for storing large amounts of data and allow for efficient querying and data management. SQLite is a lightweight database that’s easy to use and doesn’t require a server. Here’s how to save a DataFrame to an SQLite database:
Example:
import pandas as pd
import sqlite3

# Connect to an SQLite database (or create it if it doesn't exist)
conn = sqlite3.connect('data.db')

# Save the DataFrame to a table named 'table_name'
df.to_sql('table_name', conn, if_exists='replace', index=False)
In this example:
- 'data.db' is the SQLite database file.
- 'table_name' is the name of the table in the database.
- if_exists='replace' means if the table already exists, it will be replaced with the new data.
- index=False ensures the row numbers are not included.
Saving Data To JSON
JSON (JavaScript Object Notation) is a lightweight data format that is commonly used for data exchange between servers and web applications.
To save a DataFrame to a JSON file, use the to_json() function.
Example:
# Save the DataFrame to a JSON file
df.to_json('data.json', orient='records', lines=True)
In this example:
- 'data.json' is the name of the JSON file.
- orient='records' saves the DataFrame as a list of dictionaries (records).
- lines=True writes each record on a new line, making it easier to read and process.
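Because the file was written in the JSON Lines layout, you can read it back by passing the same options to read_json():

import pandas as pd

# Read the JSON Lines file back into a DataFrame
check = pd.read_json('data.json', orient='records', lines=True)
print(check)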
Saving Data To HTML
Pandas also supports saving data in other formats. You can save a DataFrame as an HTML file, which can be useful for displaying data in a web browser.
Example:
# Save the DataFrame to an HTML file
df.to_html('data.html')
This will save the DataFrame as an HTML table in 'data.html'.
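Tables saved this way can also be read back. pd.read_html() returns a list of DataFrames, one per table found in the page, so you take the first element; note that it relies on an HTML parser such as lxml being installed. A small sketch:

import pandas as pd

# read_html returns a list of DataFrames, one for each <table> in the file
tables = pd.read_html('data.html')
print(tables[0])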
Saving Data To Pickle
Pickle is a Python-specific format that saves objects in a binary format, useful for saving complex data structures.
Example:
# Save the DataFrame to a pickle file
df.to_pickle('data.pkl')
This saves the DataFrame in data.pkl, which can be loaded back into a DataFrame later using pd.read_pickle('data.pkl').
Exploring Your Data in Pandas – head() and tail()
Understanding your dataset is the first step in any data analysis task. Pandas provides several functions to help you explore your data quickly and gain valuable insights. Let's explore some of these functions: head(), tail(), info(), describe(), and value_counts().
When you first load a dataset into Pandas, you might want to take a quick look at the beginning and end of it to get an idea of its contents. The head() function allows you to see the first few rows of your DataFrame, while the tail() function shows you the last few rows; both return five rows by default, and you can pass a number to change that. This can help you understand the structure of your data and check if it has been loaded correctly.
Example:
import pandas as pd

csv_file = pd.read_csv("sample_imdb_data.csv")

print("-----------------------------First 5 Rows-----------------------")
# Display the first few rows of the DataFrame
print(csv_file.head())

print("-----------------------------Last 5 Rows-----------------------")
# Display the last few rows of the DataFrame
print(csv_file.tail())
Output:
-----------------------------First 5 Rows-----------------------
names date_x score orig_lang country
0 Creed III 3/2/2023 73.0 English AU
1 Avatar: The Way of Water 12/15/2022 78.0 English AU
2 The Super Mario Bros. Movie 4/5/2023 76.0 English AU
3 Mummies 1/5/2023 70.0 Spanish, Castilian AU
4 Supercell 3/17/2023 61.0 English US
-----------------------------Last 5 Rows-----------------------
names date_x score orig_lang country
24 Shazam! Fury of the Gods 3/16/2023 69.0 English AU
25 Consecration 2/10/2023 65.0 English US
26 Shark Side of the Moon 8/12/2022 53.0 English US
27 Black Adam 10/20/2022 71.0 English AU
28 M3GAN 1/12/2023 74.0 English AU
Data Overview Using info()
The info() function provides a concise summary of your DataFrame. It tells you the number of rows and columns, the data types of each column, and the number of non-null values. This is particularly useful for checking if there are any missing values in your dataset and understanding the data types of each column, which can influence your data analysis later on.
Example:
import pandas as pd

csv_file = pd.read_csv("sample_imdb_data.csv")

# Display information about the DataFrame
print(csv_file.info())
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 names 29 non-null object
1 date_x 28 non-null object
2 score 27 non-null float64
3 orig_lang 28 non-null object
4 country 29 non-null object
dtypes: float64(1), object(4)
memory usage: 760.0+ bytes
None
Statistics Snapshot Using describe()
The describe() function generates descriptive statistics for numerical columns in your DataFrame. It gives you key summary statistics such as count, mean, standard deviation, minimum, maximum, and quartile values. This helps you understand the distribution of numerical data in your dataset and identify any outliers or unusual patterns.
Example:
import pandas as pd

csv_file = pd.read_csv("sample_imdb_data.csv")

# Display descriptive statistics for numerical columns
print(csv_file.describe())
Output:
score
count 27.000000
mean 67.333333
std 8.283440
min 53.000000
25% 61.000000
50% 69.000000
75% 73.000000
max 83.000000
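By default, describe() summarizes only the numeric columns. If you also want counts, unique values, and the most frequent value for text columns such as orig_lang and country, you can pass the include argument; a small sketch:

import pandas as pd

csv_file = pd.read_csv("sample_imdb_data.csv")

# Summarize the non-numeric (object) columns as well
print(csv_file.describe(include='object'))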
Counting Categories Using value_counts()
When you're working with categorical data, it's essential to understand the distribution of values within each category. The value_counts() function counts the occurrences of unique values in a column and returns a Series with the counts. This allows you to see which categories are most common and which are rare, helping you make informed decisions about how to handle categorical data in your analysis.
Example:
import pandas as pd

csv_file = pd.read_csv("sample_imdb_data.csv")

# Count the occurrences of unique values in a column
print(csv_file['score'].value_counts())
Output:
score
73.0 4
65.0 3
61.0 2
74.0 2
58.0 2
55.0 2
64.0 1
53.0 1
69.0 1
57.0 1
72.0 1
62.0 1
78.0 1
83.0 1
80.0 1
70.0 1
76.0 1
71.0 1
Name: count, dtype: int64
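value_counts() is often most useful on genuinely categorical columns. For example, counting the country column of the same dataset shows how many movies come from each country, and normalize=True turns the counts into proportions:

import pandas as pd

csv_file = pd.read_csv("sample_imdb_data.csv")

# Count how many rows there are for each country
print(csv_file['country'].value_counts())

# The same counts expressed as proportions of the total
print(csv_file['country'].value_counts(normalize=True))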