Saving and Loading Data with Pandas and Basic Data Exploration

Loading Data with Pandas

One of the core functionalities of Pandas is its ability to load data from various file formats and sources. Here, we’ll explain how to load data into Pandas from different types of files: CSV, JSON, Excel, and databases.

Loading Data From CSV Files

CSV (Comma-Separated Values) files are one of the most common data formats. Each line in a CSV file corresponds to a row in the data, and the values in the rows are separated by commas.

To load a CSV file into a Pandas DataFrame, you use the read_csv() function. You need to pass the file path or URL of the CSV file to this function:

Example:

import pandas as pd

csv_file = pd.read_csv("sample_imdb_data.csv")

print(csv_file.to_string())

Output:

                                    names      date_x  score            orig_lang country
0                               Creed III    3/2/2023   73.0              English      AU
1                Avatar: The Way of Water  12/15/2022   78.0              English      AU
2             The Super Mario Bros. Movie    4/5/2023   76.0              English      AU
3                                 Mummies    1/5/2023   70.0   Spanish, Castilian      AU
4                               Supercell   3/17/2023   61.0              English      US
5                               Supercell   3/17/2023   61.0              English      US
6                    John Wick: Chapter 4   3/23/2023   80.0                  NaN      AU
7            Puss in Boots: The Last Wish  12/26/2022   83.0              English      AU
8                         Attack on Titan   9/30/2022    NaN              English      US
9                                The Park    3/2/2023   58.0              English      US
10       Winnie the Pooh: Blood and Honey   2/14/2023   58.0              English      AU
11                           The Exorcist   11/2/2022   55.0   Spanish, Castilian      MX
12                           The Exorcist   11/2/2022   55.0   Spanish, Castilian      MX
13         Black Panther: Wakanda Forever  11/10/2022   73.0              English      AU
14                    The Pope's Exorcist    4/6/2023   72.0              English      AU
15  Prizefighter: The Life of Jem Belcher   7/22/2022   62.0              English      GB
16                     Knock at the Cabin  11/16/2507   64.0              English      AU
17                   The Devil Conspiracy   1/13/2023   65.0              English      US
18                               Cazadora   1/19/2023   57.0   Spanish, Castilian      CL
19                               Gold Run  12/15/2022   65.0            Norwegian      NO
20                The Magician's Elephant   3/17/2023   73.0              English      AU
21                The Magician's Elephant   3/17/2023   73.0              English      AU
22              The Passion of the Christ   2/25/2004   74.0              English      AU
23   Batman: The Doom That Came to Gotham         NaN    NaN              English      US
24               Shazam! Fury of the Gods   3/16/2023   69.0              English      AU
25                           Consecration   2/10/2023   65.0              English      US
26                 Shark Side of the Moon   8/12/2022   53.0              English      US
27                             Black Adam  10/20/2022   71.0              English      AU
28                                  M3GAN   1/12/2023   74.0              English      AU

Loading Data From JSON Files

JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for humans to read and write and easy for machines to parse and generate.

To load a JSON file into a Pandas DataFrame, you use the read_json() function.

Example:

import pandas as pd

jsonfile = pd.read_json("sample_jsonfile.json")

print(jsonfile)

print("--------------------------")

print(jsonfile.head(2))

print("--------------------------")

print(jsonfile.tail(2))

Output:

             Name Class  Age
Student_1    Eren  10th   14
Student_2  Mikasa  10th   14
Student_3   Sammy  12th   17
Student_4   Kevin   9th   12
Student_5    Joey   8th   11
--------------------------
             Name Class  Age
Student_1    Eren  10th   14
Student_2  Mikasa  10th   14
--------------------------
            Name Class  Age
Student_4  Kevin   9th   12
Student_5   Joey   8th   11

Loading Data From Excel Files

Excel files are widely used for data storage and manipulation. Pandas can read Excel files using the read_excel() function.

Example:

import pandas as pd

# Load data from an Excel file
data = pd.read_excel('your_data.xlsx')

# Display the first few rows of the DataFrame
print(data.head())

Replace 'your_data.xlsx' with the actual Excel file path and specify the sheet name as needed.

You can also specify the sheet name if your Excel file has multiple sheets:

# Load data from the Sheet1
data = pd.read_excel('your_data.xlsx', sheet_name='Sheet1')

Loading Data From Databases

Pandas can also load data directly from SQL databases. To do this, you’ll need a database connector library like sqlite3 for SQLite databases or sqlalchemy for various other databases.

Example:

import pandas as pd
import sqlite3

# Connect to a SQLite database
conn = sqlite3.connect('your_database.db')

# Load data from a database table
data = pd.read_sql_query('SELECT * FROM your_table', conn)

# Display the first few rows of the DataFrame
print(data.head())

In this example, 'your_database.db' is the SQLite database file, and 'your_table' is the name of the table you want to query. The read_sql_query() function executes the SQL query and returns a DataFrame.

Saving Data in Pandas

After working with data in Pandas, you often need to save your data for later use, sharing, or further analysis. Pandas make saving data in various formats like CSV, Excel, databases, JSON, and more easy. Let’s explore how to save data in these formats.

Saving Data To CSV

CSV files are widely used because they are easy to read and write. To save a DataFrame to a CSV file in Pandas, use the to_csv() function. Here’s how:

Example:

import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['Ashish', 'Lucy', 'Meera'],
    'Age': [25, 30, 35],
    'City': ['Mumbai', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df.to_csv('data.csv', index=False)

In this example:

  • 'data.csv' is the name of the file you’re saving the data to.
  • index=False means the row numbers (index) will not be saved in the file, making it cleaner.

Saving Data To Excel

Excel files are another common format, especially useful when sharing data with others who might use Excel to view or analyze the data.

You can save a DataFrame to an Excel file, by using the to_excel() function.

Example:

# Save the DataFrame to an Excel file
df.to_excel('data.xlsx', index=False)

Saving Data to Database

Databases are great for storing large amounts of data and allow for efficient querying and data management. SQLite is a lightweight database that’s easy to use and doesn’t require a server. Here’s how to save a DataFrame to an SQLite database:

Example:

import pandas as pd
import sqlite3

# Connect to an SQLite database (or create it if it doesn't exist)
conn = sqlite3.connect('data.db')

# Save the DataFrame to a table named 'table_name'
df.to_sql('table_name', conn, if_exists='replace', index=False)

In this example:

  • 'data.db' is the SQLite database file.
  • 'table_name' is the name of the table in the database.
  • if_exists='replace' means if the table already exists, it will be replaced with the new data.
  • index=False ensures the row numbers are not included.

Saving Data To JSON

JSON (JavaScript Object Notation) is a lightweight data format that is commonly used for data exchange between servers and web applications.

You have to use the to_json() function for saving a DataFrame to a JSON file.

Example:

# Save the DataFrame to a JSON file
df.to_json('data.json', orient='records', lines=True)

In this example:

  • 'data.json' is the name of the JSON file.
  • orient='records' saves the DataFrame as a list of dictionaries (records).
  • lines=True writes each record on a new line, making it easier to read and process.

Saving Data To HTML

Pandas also supports saving data in other formats. You can save a DataFrame as an HTML file, which can be useful for displaying data in a web browser.

Example:

# Save the DataFrame to an HTML file
df.to_html('data.html')

This will save the DataFrame as an HTML table in 'data.html'.

Saving Data To Pickle

Pickle is a Python-specific format that saves objects in a binary format, useful for saving complex data structures.

Example:

This saves the DataFrame in data.pkl, which can be loaded back into a DataFrame later using pd.read_pickle('data.pkl').

This saves the DataFrame in data.pkl, which can be loaded back into a DataFrame later using pd.read_pickle('data.pkl').

Exploring Your Data in Pandas – head() and tail()

Understanding your dataset is the first step in any data analysis task. Pandas provides several functions to help you explore your data quickly and gain valuable insights. Let’s explore some of these functions: head(), tail(), info(), describe(), and value_counts().

When you first load a dataset into Pandas, you might want to take a quick look at the beginning and end of it to get an idea of its contents. The head() function allows you to see the first few rows of your DataFrame, while the tail() function shows you the last few rows. This can help you understand the structure of your data and check if it has been loaded correctly.

Example:

import pandas as pd

csv_file = pd.read_csv("sample_imdb_data.csv")

print("-----------------------------First 5 Rows-----------------------")
# Display the first few rows of the DataFrame
print(csv_file.head())

print("-----------------------------Last 5 Rows-----------------------")
# Display the last few rows of the DataFrame
print(csv_file.tail())

Output:

-----------------------------First 5 Rows-----------------------
                         names      date_x  score            orig_lang country
0                    Creed III    3/2/2023   73.0              English      AU
1     Avatar: The Way of Water  12/15/2022   78.0              English      AU
2  The Super Mario Bros. Movie    4/5/2023   76.0              English      AU
3                      Mummies    1/5/2023   70.0   Spanish, Castilian      AU
4                    Supercell   3/17/2023   61.0              English      US
-----------------------------Last 5 Rows-----------------------
                       names      date_x  score orig_lang country
24  Shazam! Fury of the Gods   3/16/2023   69.0   English      AU
25              Consecration   2/10/2023   65.0   English      US
26    Shark Side of the Moon   8/12/2022   53.0   English      US
27                Black Adam  10/20/2022   71.0   English      AU
28                     M3GAN   1/12/2023   74.0   English      AU

Data Overview Using info()

The info() function provides a concise summary of your DataFrame. It tells you the number of rows and columns, the data types of each column, and the number of non-null values. This is particularly useful for checking if there are any missing values in your dataset and understanding the data types of each column, which can influence your data analysis later on.

Example:

import pandas as pd

csv_file = pd.read_csv("sample_imdb_data.csv")

# Display information about the DataFrame
print(csv_file.info())

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   names      29 non-null     object 
 1   date_x     28 non-null     object 
 2   score      27 non-null     float64
 3   orig_lang  28 non-null     object 
 4   country    29 non-null     object 
dtypes: float64(1), object(4)
memory usage: 760.0+ bytes
None

Statistics Snapshot using describe()

The describe() function generates descriptive statistics for numerical columns in your DataFrame. It gives you key summary statistics such as count, mean, standard deviation, minimum, maximum, and quartile values. This helps you understand the distribution of numerical data in your dataset and identify any outliers or unusual patterns.

Example:

import pandas as pd

csv_file = pd.read_csv("sample_imdb_data.csv")

# Display descriptive statistics for numerical columns
print(csv_file.describe())

Output:

           score
count  27.000000
mean   67.333333
std     8.283440
min    53.000000
25%    61.000000
50%    69.000000
75%    73.000000
max    83.000000

Counting Categories using value_counts()

When you’re working with categorical data, it’s essential to understand the distribution of values within each category. The value_counts() function counts the occurrences of unique values in a column and returns a Series with the counts. This allows you to see which categories are most common and which are rare, helping you make informed decisions about how to handle categorical data in your analysis.

Example:

import pandas as pd

csv_file = pd.read_csv("sample_imdb_data.csv")

# Count the occurrences of unique values in a column
print(csv_file['score'].value_counts())

Output:

score
73.0    4
65.0    3
61.0    2
74.0    2
58.0    2
55.0    2
64.0    1
53.0    1
69.0    1
57.0    1
72.0    1
62.0    1
78.0    1
83.0    1
80.0    1
70.0    1
76.0    1
71.0    1
Name: count, dtype: int64

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *