Pandas DataFrames

pandas was created by Wes McKinney as a high-level data manipulation tool. Its central structure, the DataFrame, is similar to a spreadsheet or a database table: rows represent entities and columns represent attributes.

You need to install pandas into your development environment using:

  • pip3 install pandas

and import it into your project with:

  • import pandas as pd

You can build dataframes from lists, arrays, or dictionaries.
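As a minimal sketch of the three construction routes just mentioned, the snippet below builds small dataframes from a dictionary, a list of rows, and a NumPy array. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a dictionary: keys become column names, values become the column data.
from_dict = pd.DataFrame({"name": ["Ana", "Ben"], "age": [31, 25]})

# From a list of rows: column names are supplied separately.
from_list = pd.DataFrame([["Ana", 31], ["Ben", 25]], columns=["name", "age"])

# From a NumPy array: all cells share the array's dtype.
from_array = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=["a", "b"])

print(from_dict)
print(from_array)
```

The dictionary form is usually the most readable when you think in columns, while the list form is convenient when the data arrives row by row.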

You can also import data into a dataframe from a .csv file using:

  • my_dataframe = pd.read_csv("filename.csv")
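To try read_csv without a file on disk, the sketch below feeds it an in-memory string through io.StringIO. The CSV content is hypothetical; with a real file you would pass its path instead.

```python
import io

import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a file on disk.
csv_text = "name,age\nAna,31\nBen,25\n"

# The first line is used as the header row by default.
my_dataframe = pd.read_csv(io.StringIO(csv_text))

print(my_dataframe.head())   # first few rows
print(my_dataframe.dtypes)   # column types inferred by pandas
```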

To access a single column, use square brackets with the column name. The result is a Series, a one-dimensional labeled array backed by a NumPy array:

  • my_dataframe['column_name']

To keep the column as a dataframe, pass a list of column names inside the brackets:

  • my_dataframe[['column_name']]

Or even select multiple columns into a new dataframe:

  • my_dataframe[['column1', 'column2']]
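The difference between single and double brackets is easiest to see by checking the type of the result. This is a small sketch on a made-up dataframe; the column names are placeholders.

```python
import pandas as pd

my_dataframe = pd.DataFrame(
    {"column1": [1, 2, 3], "column2": [4, 5, 6], "column3": [7, 8, 9]}
)

# Single brackets with a name: a one-dimensional Series.
as_series = my_dataframe["column1"]

# Double brackets (a list of names): a DataFrame, even for one column.
as_frame = my_dataframe[["column1"]]

# A longer list selects several columns into a new DataFrame.
two_columns = my_dataframe[["column1", "column2"]]

print(type(as_series).__name__)   # Series
print(type(as_frame).__name__)    # DataFrame
print(as_series.to_numpy())       # the underlying values as a NumPy array
```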

To select rows, use slice notation as you would with a list or NumPy array. The slice works by position and excludes the end position:

  • my_dataframe[1:4]
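A short sketch of row slicing on a made-up dataframe follows. The rows are given letter labels here (an assumption for illustration) to make it visible that the slice counts positions rather than labels.

```python
import pandas as pd

my_dataframe = pd.DataFrame(
    {"value": [10, 20, 30, 40, 50]},
    index=["a", "b", "c", "d", "e"],
)

# Positions 1, 2 and 3 are returned; position 4 is excluded.
subset = my_dataframe[1:4]

print(subset)
```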

To access a block of a dataframe you need to select both rows and columns, using either the .loc or the .iloc indexer. Both take the row selection first and the column selection second.

.loc accesses rows and columns by their index labels, rows first and then columns:

  • my_dataframe.loc["row1"]
  • my_dataframe.loc[["row1", "row2"], ["column2", "column4"]]
  • my_dataframe.loc[:, ["column1", "column2"]]
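Here is a minimal runnable sketch of label-based selection with .loc. The row and column labels are placeholders chosen for this example. Note that .loc takes the row labels first and the column labels second, and that label slices include the end label.

```python
import pandas as pd

my_dataframe = pd.DataFrame(
    {"column1": [1, 2, 3], "column2": [4, 5, 6], "column3": [7, 8, 9]},
    index=["row1", "row2", "row3"],
)

# A single row label returns that row as a Series.
one_row = my_dataframe.loc["row1"]

# Lists of row labels and column labels return a block.
block = my_dataframe.loc[["row1", "row2"], ["column2", "column3"]]

# A bare colon keeps every row.
all_rows = my_dataframe.loc[:, ["column1", "column2"]]

# Unlike positional slices, a label slice includes its end label.
inclusive = my_dataframe.loc["row1":"row2"]

print(block)
```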

.iloc works the same way, except that it uses integer positions instead of index labels:

  • my_dataframe.iloc[[1, 2, 3], [0, 1]]
  • my_dataframe.iloc[:, [0, 1]]
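The same kind of selection by position can be sketched with .iloc. The dataframe below is made up for illustration. Positions start at 0, and positional slices exclude the end, as in plain Python.

```python
import pandas as pd

my_dataframe = pd.DataFrame(
    {"column1": [1, 2, 3], "column2": [4, 5, 6], "column3": [7, 8, 9]},
    index=["row1", "row2", "row3"],
)

# Rows at positions 1 and 2, columns at positions 0 and 1.
block = my_dataframe.iloc[[1, 2], [0, 1]]

# Every row, first two columns.
all_rows = my_dataframe.iloc[:, [0, 1]]

# Positional slice: rows 0 and 1 only, because the end is excluded.
first_two = my_dataframe.iloc[0:2]

print(block)
```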