Pandas Data Cleaning Cheat Sheet

Posted : admin On 1/3/2022

A data scientist can clean, visualize, wrangle, and build predictive models only after importing the data. In this cheat sheet, you will learn tips and techniques for importing data from CSV files, text files, Excel files, URLs, and SQL databases into Python. The need to extract relevant data from huge datasets is becoming more and more important with the rise of big data and complex “raw” sources, and this is where data wrangling tools such as Python and R excel. Both offer numerous functions dedicated to cleaning or merging data.

Pandas is arguably the most important Python package for data science. Not only does it give you lots of methods and functions that make working with data easier, but it has also been optimized for speed, which gives you a significant advantage over working with numeric data using Python’s built-in functions.

It’s common when first learning pandas to have trouble remembering all the functions and methods that you need, and while at Dataquest we advocate getting used to consulting the pandas documentation, sometimes it’s nice to have a handy reference, so we’ve put together this cheat sheet to help you out!

Regular expression character sets, denoted by a pair of brackets, match any one of the characters included within the brackets. For example, the regular expression con[sc]en[sc]us will match any of the spellings consensus, concensus, consencus, and concencus. The Pandas cheat sheet will guide you through the basics of pandas, going from the data structures to reading, writing, selection, dropping indices or columns, sorting and ranking, and retrieving basic info about the data structures you’re working with, through to applying functions and data alignment.
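As a rough illustration of the character-set idea applied to data cleaning, here is a minimal sketch. The sample Series and the extra word "consent" are made up for the example; only the four spellings come from the text above.

```python
import pandas as pd

# Hypothetical sample values: four spellings plus one unrelated word.
words = pd.Series(["consensus", "concensus", "consencus", "concencus", "consent"])

# [sc] matches exactly one character that is either "s" or "c".
pattern = r"con[sc]en[sc]us"

# True where the entire string matches the pattern.
matches = words.str.fullmatch(pattern)

# Normalize every matching spelling to the correct one; other values are untouched.
cleaned = words.str.replace(pattern, "consensus", regex=True)
```

Passing regex=True explicitly matters here, because recent pandas versions treat the pattern as a literal string by default.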

If you’re interested in learning pandas, you can consult our two-part pandas tutorial blog post, or you can sign up for free and start learning pandas through our interactive pandas for data science course.

Key and Imports

In this cheat sheet, we use the following shorthand:

df | Any pandas DataFrame object
s | Any pandas Series object

You’ll also need to perform the following imports to get started:

import pandas as pd
import numpy as np

Importing Data

pd.read_csv(filename) | From a CSV file
pd.read_table(filename) | From a delimited text file (like TSV)
pd.read_excel(filename) | From an Excel file
pd.read_sql(query, connection_object) | Read from a SQL table/database
pd.read_json(json_string) | Read from a JSON-formatted string, URL or file
pd.read_html(url) | Parse an HTML URL, string or file and extract its tables to a list of DataFrames
pd.read_clipboard() | Take the contents of your clipboard and pass it to read_table()
pd.DataFrame(dict) | From a dict: keys for column names, values for data as lists
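A minimal sketch of a few of these readers follows. To keep it runnable without any files on disk, it reads from in-memory text through io.StringIO; the column names and values are hypothetical.

```python
import io
import pandas as pd

# Hypothetical CSV content held in memory, so no file is needed.
csv_text = "name,score\nada,0.9\nbob,0.4\n"

# read_csv accepts a path or any file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))

# The same reader handles tab-separated text when sep is set.
tsv_text = "name\tscore\nada\t0.9\nbob\t0.4\n"
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# From a dict: keys become column names, lists become column values.
df_dict = pd.DataFrame({"name": ["ada", "bob"], "score": [0.9, 0.4]})
```

With a real file, the io.StringIO wrapper would simply be replaced by the file path.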

Exporting Data

df.to_csv(filename) | Write to a CSV file
df.to_excel(filename) | Write to an Excel file
df.to_sql(table_name, connection_object) | Write to a SQL table
df.to_json(filename) | Write to a file in JSON format
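The writers can be tried without touching the file system, since to_csv and to_json return a string when no path is given. The sketch below uses a hypothetical two-row DataFrame and checks the CSV output by reading it back.

```python
import io
import pandas as pd

# Hypothetical data for the example.
df = pd.DataFrame({"name": ["ada", "bob"], "score": [0.9, 0.4]})

# With no path argument, to_csv returns the CSV text.
# index=False leaves the row labels out of the output.
csv_text = df.to_csv(index=False)

# Reading the text back should reproduce the original values.
round_trip = pd.read_csv(io.StringIO(csv_text))

# to_json also returns a string when no path is given;
# orient="records" produces a list with one object per row.
json_text = df.to_json(orient="records")
```

Forgetting index=False is a common source of a stray unnamed first column when the file is read back.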

Create Test Objects

Useful for testing code segments

pd.DataFrame(np.random.rand(20,5)) | 5 columns and 20 rows of random floats
pd.Series(my_list) | Create a Series from an iterable my_list
df.index = pd.date_range('1900/1/30', periods=df.shape[0]) | Add a date index
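Put together, these three lines might be used as follows. This sketch swaps np.random.rand for a seeded generator so the values are reproducible; the seed and the contents of my_list are arbitrary choices for the example.

```python
import numpy as np
import pandas as pd

# Seeded generator so repeated runs give the same values (seed is arbitrary).
rng = np.random.default_rng(0)

# 20 rows and 5 columns of random floats in [0, 1).
df = pd.DataFrame(rng.random((20, 5)))

# A Series from any iterable; my_list is a hypothetical example.
my_list = [10, 20, 30]
s = pd.Series(my_list)

# One date per row, starting on 1900-01-30 and advancing one day at a time.
df.index = pd.date_range("1900/1/30", periods=df.shape[0])
```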

Viewing/Inspecting Data

df.head(n) | First n rows of the DataFrame
df.tail(n) | Last n rows of the DataFrame
df.shape | Number of rows and columns (an attribute, so no parentheses)
df.info() | Index, datatype and memory information
df.describe() | Summary statistics for numerical columns
s.value_counts(dropna=False) | View unique values and counts
df.apply(pd.Series.value_counts) | Unique values and counts for all columns
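A short sketch of these inspection calls on a small, made-up DataFrame with one missing value:

```python
import pandas as pd

# Hypothetical data; the None becomes a missing value.
df = pd.DataFrame({
    "city": ["Oslo", "Rome", "Oslo", None],
    "temp": [3.5, 18.0, 4.1, 11.2],
})

first_two = df.head(2)       # first 2 rows
n_rows, n_cols = df.shape    # shape is an attribute, not a method

# dropna=False keeps missing values in the tally.
counts = df["city"].value_counts(dropna=False)

# describe() summarizes only the numeric columns by default.
summary = df.describe()
```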

Selection

df[col] | Return the column with label col as a Series
df[[col1, col2]] | Return columns as a new DataFrame
s.iloc[0] | Selection by position
s.loc['index_one'] | Selection by index label
df.iloc[0,:] | First row
df.iloc[0,0] | First element of first column
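The difference between position-based and label-based selection is easiest to see on a DataFrame with a non-numeric index. The labels and values below are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]},
    index=["index_one", "index_two", "index_three"],
)

col = df["a"]            # one label -> Series
sub = df[["a", "b"]]     # list of labels -> DataFrame

s = df["b"]
by_position = s.iloc[0]          # first value, whatever its label is
by_label = s.loc["index_one"]    # value stored under this index label

first_row = df.iloc[0, :]        # first row, returned as a Series
first_cell = df.iloc[0, 0]       # first row of the first column
```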

Data Cleaning

df.columns = ['a','b','c'] | Rename columns
pd.isnull(df) | Check for null values; returns a Boolean array
pd.notnull(df) | Opposite of pd.isnull()
df.dropna() | Drop all rows that contain null values
df.dropna(axis=1) | Drop all columns that contain null values
df.dropna(axis=1,thresh=n) | Drop all columns that have fewer than n non-null values
df.fillna(x) | Replace all null values with x
s.fillna(s.mean()) | Replace all null values with the mean (mean can be replaced with almost any function from the statistics section)
s.astype(float) | Convert the datatype of the Series to float
s.replace(1,'one') | Replace all values equal to 1 with 'one'
s.replace([1,3],['one','three']) | Replace all 1 with 'one' and 3 with 'three'
df.rename(columns=lambda x: x + 1) | Mass renaming of columns
df.rename(columns={'old_name': 'new_name'}) | Selective renaming
df.set_index('column_one') | Change the index
df.rename(index=lambda x: x + 1) | Mass renaming of index
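A few of these cleaning steps in sequence, as a sketch on a made-up DataFrame. Column names and values are hypothetical, and each call returns a new object rather than modifying df in place.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 8.0],
    "c": ["1", "2", "3", "4"],
})

null_mask = df.isnull()                    # True wherever a value is missing
complete_rows = df.dropna()                # rows with no missing values
dense_cols = df.dropna(axis=1, thresh=3)   # columns with at least 3 non-null values

# Fill the gaps in one column with that column's mean.
a_filled = df["a"].fillna(df["a"].mean())

# Convert numeric strings to floats.
c_float = df["c"].astype(float)

# Rename only the columns listed in the mapping.
renamed = df.rename(columns={"a": "alpha"})
```

Note that thresh counts non-null values that must be present for a row or column to be kept, so thresh=3 drops column b (one non-null value) and keeps a and c.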

Filter, Sort & Groupby

df[df[col] > 0.5] | Rows where the col column is greater than 0.5
df[(df[col] > 0.5) & (df[col] < 0.7)] | Rows where 0.7 > col > 0.5
df.sort_values(col1) | Sort values by col1 in ascending order
df.sort_values(col2,ascending=False) | Sort values by col2 in descending order
df.sort_values([col1,col2],ascending=[True,False]) | Sort values by col1 in ascending order then col2 in descending order
df.groupby(col) | Return a groupby object for values from one column
df.groupby([col1,col2]) | Return a groupby object for values from multiple columns
df.groupby(col1)[col2].mean() | Return the mean of the values in col2, grouped by the values in col1 (mean can be replaced with almost any function from the statistics section)
df.pivot_table(index=col1,values=[col2,col3],aggfunc='mean') | Create a pivot table that groups by col1 and calculates the mean of col2 and col3
df.groupby(col1).agg(np.mean) | Find the average across all columns for every unique col1 group
df.apply(np.mean) | Apply a function across each column
df.apply(np.max,axis=1) | Apply a function across each row
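The sketch below runs a range filter, a two-key sort, and a grouped mean on a small hypothetical DataFrame. Each comparison in a combined filter needs its own parentheses and its own reference to the column.

```python
import pandas as pd

# Hypothetical data for the example.
df = pd.DataFrame({
    "team": ["x", "y", "x", "y"],
    "score": [0.6, 0.9, 0.3, 0.55],
    "games": [3, 1, 2, 4],
})

# & combines the two Boolean Series element by element.
mid = df[(df["score"] > 0.5) & (df["score"] < 0.7)]

# team ascending, then score descending within each team.
ordered = df.sort_values(["team", "score"], ascending=[True, False])

# Mean score per team.
team_mean = df.groupby("team")["score"].mean()

# The same grouping as a pivot table over two value columns.
pivot = df.pivot_table(index="team", values=["score", "games"], aggfunc="mean")
```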

Join/Combine

pd.concat([df1, df2]) | Add the rows in df2 to the end of df1 (columns should be identical). This replaces df1.append(df2), which was removed in pandas 2.0
pd.concat([df1, df2],axis=1) | Add the columns in df2 to the end of df1 (rows should be identical)
df1.join(df2,on=col1,how='inner') | SQL-style join of the columns in df1 with the columns in df2, matching the values of col1 in df1 against the index of df2. how can be one of 'left', 'right', 'outer', 'inner'
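A minimal sketch of the three ways of combining follows, using hypothetical frames. It uses pd.concat for stacking rows because DataFrame.append was removed in pandas 2.0, and it shows that join with on= matches a column of the left frame against the index of the right one.

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["a", "b"], "left_val": [1, 2]})
df2 = pd.DataFrame({"key": ["c", "d"], "left_val": [3, 4]})

# Stack rows; ignore_index=True renumbers the result 0..n-1.
stacked = pd.concat([df1, df2], ignore_index=True)

# Place columns side by side; rows are matched on their index.
extra = pd.DataFrame({"right_val": [10, 20]})
side_by_side = pd.concat([df1, extra], axis=1)

# join matches df1["key"] against the index of lookup.
lookup = pd.DataFrame({"right_val": [100, 300]}, index=["a", "c"])
joined = df1.join(lookup, on="key", how="inner")
```

Only the row with key "a" survives the inner join, because "b" has no entry in the lookup index.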


Statistics

These can all be applied to a series as well.

df.describe() | Summary statistics for numerical columns
df.mean() | Return the mean of all columns
df.corr() | Find the correlation between columns in a DataFrame
df.count() | Count the number of non-null values in each DataFrame column
df.max() | Find the highest value in each column
df.min() | Find the lowest value in each column
df.median() | Find the median of each column
df.std() | Find the standard deviation of each column
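As a quick check of what these return, here is a sketch on a hypothetical DataFrame where y is exactly twice x, so the correlation between the two columns should come out as 1.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 6.0, 8.0]})

means = df.mean()        # one value per column
medians = df.median()
spread = df.std()        # sample standard deviation (ddof=1 by default)
corr = df.corr()         # pairwise correlation matrix
non_null = df.count()    # non-null values per column

# The same methods work on a single Series.
x_max = df["x"].max()
```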


Pandas is an open-source Python library that is powerful and flexible for data analysis. If there is something you want to do with data, the chances are it will be possible in pandas. There are a vast number of possibilities within pandas, but most users find themselves using the same methods time after time. In this article, we compiled the best cheat sheets from across the web, which show you these core methods at a glance.

The primary data structure in pandas is the DataFrame, used to store two-dimensional data along with a label for each corresponding column and row. If you are familiar with Excel spreadsheets or SQL databases, you can think of the DataFrame as being the pandas equivalent. If we take a single column from a DataFrame, we have one-dimensional data. In pandas, this is called a Series.

DataFrames can be created from scratch in your code, or loaded into Python from some external location, such as a CSV. This is often the first stage in any data analysis task. We can then do any number of things with our DataFrame in pandas, including removing or editing values, filtering our data, or combining this DataFrame with another DataFrame. Each line of code in these cheat sheets lets you do something different with a DataFrame. Also, if you are coming from an Excel background, you will enjoy the performance pandas has to offer. After you get over the learning curve, you will be even more impressed with the functionality.

Whether you are already familiar with pandas and are looking for a handy reference you can print out, or you have never used pandas and are looking for a resource to help you get a feel for the library, there is a cheat sheet here for you!


1. The Most Comprehensive Cheat Sheet

This one is from the pandas guys, so it makes sense that this is a comprehensive and inclusive cheat sheet. It covers the vast majority of what most pandas users will ever need to do to a DataFrame. Have you already used pandas for a little while? And are you looking to up your game? This is your cheat sheet! However, if you are newer to pandas and this cheat sheet is a bit overwhelming, don’t worry! You definitely don’t need to understand everything in this cheat sheet to get started. Instead, check out the next cheat sheet on this list.

2. The Beginner’s Cheat Sheet

Dataquest is an online platform that teaches Data Science using interactive coding challenges. I love this cheat sheet they have put together. It has everything the pandas beginner needs to start using pandas right away in a friendly, neat list format. It covers the bare essentials of each stage in the data analysis process:

  • Importing and exporting your data from an Excel file, CSV, HTML table or SQL database
  • Cleaning your data of any empty rows, changing data formats to allow for further analysis or renaming columns
  • Filtering your data or removing anomalous values
  • Different ways to view the data and see its dimensions
  • Selecting any combination of columns and rows within the DataFrame using loc and iloc
  • Using the .apply method to apply a formula to a particular column in the DataFrame
  • Creating summary statistics for columns in the DataFrame. This includes the median, mean and standard deviation
  • Combining DataFrames

3. The Excel User’s Cheat Sheet

Ok, this isn’t quite a cheat sheet, it’s more of an entire manifesto on the pandas DataFrame! If you have a little time on your hands, this will help you get your head around some of the theory behind DataFrames. It will take you all the way from loading in your trusty CSV from Microsoft Excel to viewing your data in Jupyter and handling the basics. The article finishes off by using the DataFrame to create a histogram and bar chart. For migrating your spreadsheet work from Excel to pandas, this is a fantastic guide. It will teach you how to perform many of the Excel basics in pandas. If you are also looking for how to perform the pandas equivalent of a VLOOKUP in Excel, check out Shane’s article on the merge method.
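For readers who think in VLOOKUP terms, a left merge is the closest pandas counterpart. The following is only a sketch with hypothetical tables and column names, not code taken from the article mentioned above.

```python
import pandas as pd

# Hypothetical "sheet" of orders and a separate price list to look up from.
orders = pd.DataFrame({"product_id": [1, 2, 3], "qty": [5, 1, 2]})
prices = pd.DataFrame({"product_id": [1, 2], "price": [9.99, 4.50]})

# how="left" keeps every order row; products missing from the
# price list get NaN in the price column.
result = orders.merge(prices, on="product_id", how="left")
```

Product 3 has no entry in the price list, so its price is NaN, much like a VLOOKUP that finds no match.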

4. The Most Beautiful Cheat Sheet


If you’re more of a visual learner, try this cheat sheet! Many common pandas tasks have intricate, color-coded illustrations showing how the operation works. On page 3, there is a fantastic section called ‘Computation with Series and DataFrames’, which provides an intuitive explanation for how DataFrames work and shows how the index is used to align data when DataFrames are combined and how element-wise operations work in contrast to operations which work on each row or column. At 8 pages long, it’s more of a booklet than a cheat sheet, but it can still make for a great resource!
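The contrast between element-wise operations and operations that work along rows or columns can be sketched in a few lines. The DataFrame is hypothetical and is not taken from the cheat sheet described above.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Element-wise: every cell is transformed independently.
doubled = df * 2

# Along an axis: each column (axis=0) or each row (axis=1)
# is reduced to a single value.
col_totals = df.sum(axis=0)
row_totals = df.sum(axis=1)
```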

5. The Best Machine Learning Cheat Sheet

Much like the other cheat sheets, there is comprehensive coverage of the pandas basics in here. So, that includes filtering, sorting, importing, exploring, and combining DataFrames. However, where this cheat sheet differs is that it finishes off with an excellent section on scikit-learn, Python’s machine learning library. In this section, the DataFrame is used to train a machine learning model. This cheat sheet will be perfect for anybody who is already familiar with machine learning and is transitioning from a different technology, such as R.

6. The Most Compact Cheat Sheet


Data Camp is an online platform that teaches Data Science with videos and coding exercises. They have made cheat sheets on a bunch of the most popular Python libraries, which you can also check out here. This cheat sheet nicely introduces the DataFrame, and then gives a quick overview of the basics. Unfortunately, it doesn’t provide any information on the various ways you can combine DataFrames, but it does all fit on one page and looks great. So, if you are looking to stick a pandas cheat sheet on your bedroom wall and nail home the basics, this one might be for you! The cheat sheet finishes with a small section introducing NaN values, which come from NumPy. These indicate a null value; in the cheat sheet’s example, they arise when the indices of two Series don’t quite match up.
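How mismatched indices produce NaN can be reproduced with two small Series. This is a sketch with made-up labels and values, not the example from the cheat sheet itself.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Addition aligns on index labels. "a" and "d" each exist in only
# one Series, so their results are NaN.
total = s1 + s2

# fill_value treats the missing side as 0 instead of producing NaN.
total_filled = s1.add(s2, fill_value=0)
```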

7. The Best Statistics Cheat Sheet

While there aren’t any pictures to be found in this sheet, it is an incredibly detailed set of notes on the pandas DataFrame. This cheat sheet shines with its complete section on time series and statistics. There are methods for calculating covariance, correlation, and regression here. So, if you are using pandas for some advanced statistics or any kind of scientific work, this is going to be your cheat sheet.

Where to go from here?


For just automating a few tedious tasks at work, or using pandas to replace your crashing Excel spreadsheet, everything covered in these cheat sheets should be entirely sufficient for your purposes.

If you are looking to use pandas for Data Science, then you are only going to be limited by your knowledge of statistics and probability. This is the area that most people lack when they try to enter this field. I highly recommend checking out Think Stats by Allen B Downey, which provides an introduction to statistics using Python.

For those a little more advanced, looking to do some machine learning, you will want to start taking a look at the scikit-learn library. Data Camp has a great cheat sheet for this. You will also want to pick up a linear algebra textbook to understand the theory of machine learning. For something more practical, perhaps give the famous Kaggle Titanic machine learning competition a try.


Learning about pandas has many uses, and can be interesting simply for its own sake. However, Python is massively in demand right now, and for that reason, it is a high-income skill. At any given time, there are thousands of people searching for somebody to solve their problems with Python. So, if you are looking to use Python to work as a freelancer, then check out the Finxter Python Freelancer Course. This provides the step by step path to go from nothing to earning a full-time income with Python in a few months, and gives you the tools to become a six-figure developer!
