Pandas in Python for Data Manipulation (an Introduction)

One of Python’s most dynamic and powerful libraries for working with data is Pandas. Pandas lets you read data from txt, csv, and Excel files. A video of this tutorial is available below.

https://youtube.com/watch?v=YBDKuuu2TTM%3Ffeature%3Doembed

You must first import pandas in python.

import pandas as pd

df=pd.read_csv(‘kc_house_data.csv’)

The read_csv command by default looks into the same folder as the python project files. I generally use Jupyter Notebooks to program Python. In Jupyter Notebook I create a project folder and place all of my projects within it. I also upload data files into the same folder.

Once you set a data frame equal to the data that is read from the csv file, it is best not to change this data frame.

Remember, the dataframe remains in memory from one run to another as long as you keep Jupyter Notebook open and running. If you do any significant data manipulation, especially if you change the dataframe, you may want to run the read_csv method each time to ensure that you can start out with the same dataframe. This is what I do to make sure I am starting with the same data.

Learning About the Data

After reading in a data file, first, check the shape of the data frame. The shape property will show how many rows and columns it has.

print(df.shape)

One often overlooked benefit of using Jupyter notebook is when you want to find out about a fucntion or property you just put a question mark directly after it.

For example, running the line,

df2=df.shape?

will give you a thorough explanation of the shape data frame property.

Pandas even has the capacity to reveal and manipulate big data by reading the data in chunks of defined size. Chunks are more advanced than what I want to cover in this tutorial.

It is a good idea to print out the first few rows of a data frame with the head function. For example, the following command prints out the first five rows.

print(df.head(5))

The columns property reveals all of the column names. Be aware that column names are case-specific.

print(df.columns)

Run the dtypes property to show the data types of the columns.

print(df.dtypes)

Next, use the describe function to calculate the averages, find the minimums, maximums, as well as the values that are 25, 50, and 75 percentile.

print(df.describe())

Notice that because describe is a data frame function, it has a set of parentheses afterward. Some of these calculated numbers are meaningless. It is up to the data scientist to know which ones have meaning and which ones do not.

An average of an index column never has meaning. Also, averages of numbers that can only be integers may not have meaning. Additionally, the results of averaging boolean data that is one or zero don’t make sense. Such as whether or not it is on waterfront property, and does it have a view or not?

Looking at the data from kc_house_data.csv, I found that some homes have one-have or even one-quarter bathrooms. I had to look up what it meant. A bathroom, a sink, a toilet, a bath, and a shower are each one-quarter of a bathroom.

So the average number of bathrooms may or may not have meaning. Likewise, the average number of bedrooms may or may not make sense. Unlike bathrooms, which apparently can come in quarters, the number of bedrooms has to be an integer.

Pandas data frame Iloc and loc

Iloc, which stands for index locate, uses an index to select a value. Iloc can look at either a row index, a column index, or even both. Both row and column indexes begin at zero.

For example, the statement,

df2=df.iloc[16, 5]

sets the variable df2 equal to the value at the 17th row and 6th column.

And the command,

df2=df.iloc[:, 8]

sets the vaiable df2 equal to all of the rows of the 9th column.

The loc property is even more powerful and can be used to select data based on its values.

The Python command,

df2=df7.loc[df[‘sqft_living’] > 2500]

sets the variable df2 equal to all the homes in the data frame that have more than 2500 square feet in sqft_living space.

Notice I set it equal to a variable, and the loc function also has to have brackets and should be set equal to the column name ‘sqft_living’ in this case.

In addition to >, the command can use other comparison operators such as <, <=,>=, or == some amount. This comparison works because the data type of sqft_living is an integer. So, it can be compared to another number. Comparing two equal things requires two equal signs, as opposed to one equal sign used for assigning a value to a variable. If the values are not equal, the != symbol combination is used for the comparison.

Pandas is also good for data wrangling. For example, to eliminate the time from the datetime object, use the following.

df[‘date’] = pd.to_datetime(df[‘date’]).dt.date

This statement overwrites the datatime object with a date object. This is more complex than what I want to cover in this tutorial, but just know that it works.

Pandas Sort_values and Groupby

Sort_values puts the values of particular columns in ascending or descending order.

This sorting can work on several different columns. For example,

df.sort_values([‘sqft_living’, ‘price’], ascending=False)

Both sqft_living and price are sorted ascending.

The following example,

df.sort_values([‘sqft_living’, ‘price’], ascending=[0,1])

sorts sqft_living in descending order and price in ascending order.

Similarly, the groupby command groups the rows by specific column values.

However, the groupby statement, doesn’t actually work on data that is all numerical values. I will save that one for a future tutorial.

In combination, these Pandas commands are a powerful tool for data scientists.

The code and the data for this brief tutorial can be downloaded from my github page. I got the data from Kaggle.com.

Annotated Reference

R. A. How do you append a column and drop a column with pandas data frames? Stackoverflow.com retrieved oct. 5, 2019, answered, Oct 4, 2019. https://stackoverflow.com/questions/58238973/how-do-you-append-a-column-and-drop-a-column-with-pandas-dataframes-cant-figur

I found out how to convert a datetime object to a date object from Anthony R’s answer to my question in Stackoverflow.com. Despite my lack of knowledge and a poorly worded question about attempting to append and delete columns, Anthony R. answered my question with a single line of code. Instead of the slow process of appending and deleting columns. His single line of code overwrote the existing ‘date’ column.