3  pandas

In this lesson we introduce the two core objects in the pandas library, the pandas.Series and the pandas.DataFrame.

3.1 pandas

pandas is a Python package to wrangle and analyze tabular data. It is built on top of NumPy and has become the core tool for doing data analysis in Python.

The convention to import it is:

import pandas as pd

# we will also import numpy 
import numpy as np

There is so much to learn about pandas. While we won’t be able to cover every single functionality of this package in the next three lecutres, the goal is to get you started with the basic tools for data wrangling and give you a solid basis on which you can explore further.

3.2 Series

The first core data structure of pandas is the series. A series is a one-dimensional array of indexed data. A pandas.Series having an index is the main difference between a pandas.Series and a numpy array. See the difference:

# a numpy array
# np.random.randn returns values from the std normal distribution
arr = np.random.randn(4) 
print(type(arr))
print(arr, "\n")

# a pandas series made from the previous array
s = pd.Series(arr)
print(type(s))
print(s)
<class 'numpy.ndarray'>
[-0.25619978 -0.29003821  0.86499254 -1.00247858] 

<class 'pandas.core.series.Series'>
0   -0.256200
1   -0.290038
2    0.864993
3   -1.002479
dtype: float64

3.2.1 Creating a pandas.Series

The basic method to create a pandas.Series is to call

s = pd.Series(data, index=index)

The data parameter can be:

The index parameter is a list of index labels.

For now, we will create a pandas.Series from a numpy array or list. To use this method we need to pass a numpy array (or a list of objects that can be converted to NumPy types) as data and a list of indices of the same length as data.

# a Series from a numpy array 
pd.Series(np.arange(3), index=['a','b','c'])
a    0
b    1
c    2
dtype: int64

The index parameter is optional. If we don’t include it, the default is to make the index equal to [0,...,len(data)-1]. For example:

# a Series from a list of strings with default index
pd.Series(['EDS 220', 'EDS 222', 'EDS 223', 'EDS 242'])
0    EDS 220
1    EDS 222
2    EDS 223
3    EDS 242
dtype: object

3.2.1.1 From a dictionary

Remember a dictionary is a set of key-value pairs. If we create a pandas.Series via a dictionary the keys will become the index and the values the corresponding data.

# construct dictionary
d = {'a':0, 'b':1, 'c':2}

# initialize a sries using a dictionary
pd.Series(d)
a    0
b    1
c    2
dtype: int64

3.2.1.2 From a number

If we only provide a number as the data for the series, we need to provide an index. The number will be repeated to match the length of the index.

pd.Series(3.0, index = ['A', 'B', 'C'])
A    3.0
B    3.0
C    3.0
dtype: float64

3.2.2 Simple operations

Arithmetic operations work on series and also most NumPy functions. For example:

# define a series
s = pd.Series([98,73,65],index=['Andrea', 'Beth', 'Carolina'])

# divide each element in series by 10
print(s /10, '\n')

# take the exponential of each element in series
print(np.exp(s), '\n')

# notice this doesn't change the values of our series
print(s)
Andrea      9.8
Beth        7.3
Carolina    6.5
dtype: float64 

Andrea      3.637971e+42
Beth        5.052394e+31
Carolina    1.694889e+28
dtype: float64 

Andrea      98
Beth        73
Carolina    65
dtype: int64

We can also produce new pandas.Series with True/False values indicating whether the elements in a series satisfy a condition or not:

s > 10
Andrea      True
Beth        True
Carolina    True
dtype: bool

This kind of simple conditions on pandas.Series will be key when we are selecting data from data frames.

3.2.3 Attributes & Methods

pandas.Series have many attributes and methods, you can see a full list in the pandas documentation. For now we will cover two examples that have to do with identifying missing values.

pandas represents a missing or NA value with NaN, which stands for not a number. Let’s construct a small series with some NA values:

# series with NAs in it
s = pd.Series([1, 2, np.NaN, 4, np.NaN])

A pandas.Series has an attribute called hasnans that returns True if there are any NaNs:

# check if series has NAs
s.hasnans
True

Then we might be intersted in knowing which elements in the series are NAs. We can do this using the isna method:

s.isna()
0    False
1    False
2     True
3    False
4     True
dtype: bool

We can see the ouput is a pd.Series of boolean values indicating if an element in the row at the given index is NA (True = is NA) or not (False = not NA).

moving on

There’s much more to say about pandas.Series, but this is enought to get us going. At this point, we mainly want to know about pandas.Series because pandas.Series are the columns of pandas.DataFrames.

slicing with loc

Notice that when use slicing with loc we get both the start and the end of the indices we indicated. This is different to slicing in numpy arrays or lists where we do not get the element at the end of the slice. Compare the following:

x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(x)

# slicing will return elements at indices 2 trhough 4 (inclusive)
x[2:5]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[2, 3, 4]
# define a np array with integers from 0 to 9
y = np.arange(10)
print(y)

# slicing will return elements at indices 2 trhough 4 (inclusive)
y[2:5]
[0 1 2 3 4 5 6 7 8 9]
array([2, 3, 4])
 z = pd.Series(y)
 print(z)

# slicing will return elements with index labels 2 through 5 (inclusive)
 z.loc[2:5]
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64
2    2
3    3
4    4
5    5
dtype: int64

3.3 Data Frames

The Data Frame is the most used pandas object. It represents tabular data and we can think of it as a spreadhseet. Each column of a pandas.DataFrame is a pandas.Series.

3.3.1 Creating a pandas.DataFrame

There are many ways of creating a pandas.DataFrame.

We already mentioned each column of a pandas.DataFrame is a pandas.Series. In fact, the pandas.DataFrame is a dictionary of pandas.Series, with each column name being the key and the column values being the key’s value. Thus, we can create a pandas.DataFrame in this way:

# initialize dictionary with columns' data 
d = {'col_name_1' : pd.Series(np.arange(3)),
     'col_name_2' : pd.Series([3.1, 3.2, 3.3]),
     }

# create data frame
df = pd.DataFrame(d)
df
col_name_1 col_name_2
0 0 3.1
1 1 3.2
2 2 3.3

We can change the index and column names by changing the index and columns attributes in the data frame.

# print original index
print(df.index)

# change the index
df.index = ['a','b','c']
df
RangeIndex(start=0, stop=3, step=1)
col_name_1 col_name_2
a 0 3.1
b 1 3.2
c 2 3.3
# print original column names
print(df.columns)

# change column names 
df.columns = ['C1','C2']
df
Index(['col_name_1', 'col_name_2'], dtype='object')
C1 C2
a 0 3.1
b 1 3.2
c 2 3.3