3 pandas
In this lesson we introduce the two core objects in the pandas library, the pandas.Series and the pandas.DataFrame.
3.1 pandas
pandas is a Python package to wrangle and analyze tabular data. It is built on top of NumPy and has become the core tool for doing data analysis in Python.
The convention to import it is:
import pandas as pd
# we will also import numpy
import numpy as np
There is so much to learn about pandas. While we won’t be able to cover every feature of this package in the next three lectures, the goal is to get you started with the basic tools for data wrangling and give you a solid basis on which you can explore further.
3.2 Series
The first core data structure of pandas is the series. A series is a one-dimensional array of indexed data. Having an index is the main difference between a pandas.Series and a numpy array. See the difference:
# a numpy array
# np.random.randn returns values from the std normal distribution
arr = np.random.randn(4)
print(type(arr))
print(arr, "\n")
# a pandas series made from the previous array
s = pd.Series(arr)
print(type(s))
print(s)
<class 'numpy.ndarray'>
[-0.25619978 -0.29003821 0.86499254 -1.00247858]
<class 'pandas.core.series.Series'>
0 -0.256200
1 -0.290038
2 0.864993
3 -1.002479
dtype: float64
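To make the role of the index concrete, here is a minimal sketch that looks at the two parts of a series separately. The array values are illustrative (fixed rather than random so the result is reproducible) and are not taken from the lesson.

```python
import numpy as np
import pandas as pd

# illustrative data; a fixed array keeps the example reproducible
arr = np.array([0.5, -1.2, 3.0, 0.0])
s = pd.Series(arr)

# the index holds the labels, here the default 0, 1, 2, 3
print(s.index)

# to_numpy() returns the values as a numpy array, without the index
print(s.to_numpy())
```

The index is what a plain numpy array lacks: the values are the same, but the series carries a label for each one.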
3.2.1 Creating a pandas.Series
The basic method to create a pandas.Series is to call
s = pd.Series(data, index=index)
The data parameter can be a numpy array or list, a dictionary, or a single number. Each case is covered in this section.
The index parameter is a list of index labels.
For now, we will create a pandas.Series from a numpy array or list. To use this method we need to pass a numpy array (or a list of objects that can be converted to NumPy types) as data and a list of index labels of the same length as data.
# a Series from a numpy array
pd.Series(np.arange(3), index=['a','b','c'])
a 0
b 1
c 2
dtype: int64
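The requirement that the index has the same length as the data can be checked with a small sketch. The variable names below are illustrative, and the labels list is deliberately one element too short.

```python
import numpy as np
import pandas as pd

data = np.arange(3)      # three values
labels = ['a', 'b']      # only two labels (deliberately wrong)

try:
    pd.Series(data, index=labels)
    mismatch_raised = False
except ValueError as err:
    # pandas refuses to build the series when the lengths differ
    mismatch_raised = True
    print("ValueError:", err)
```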
The index parameter is optional. If we don’t include it, the default is to make the index equal to [0,...,len(data)-1]. For example:
# a Series from a list of strings with default index
pd.Series(['EDS 220', 'EDS 222', 'EDS 223', 'EDS 242'])
0 EDS 220
1 EDS 222
2 EDS 223
3 EDS 242
dtype: object
3.2.1.1 From a dictionary
Remember that a dictionary is a set of key-value pairs. If we create a pandas.Series from a dictionary, the keys become the index and the values become the corresponding data.
# construct dictionary
d = {'a':0, 'b':1, 'c':2}
# initialize a series using a dictionary
pd.Series(d)
a 0
b 1
c 2
dtype: int64
3.2.1.2 From a number
If we provide a single number as the data for the series, we also provide an index. The number will be repeated to match the length of the index.
pd.Series(3.0, index=['A', 'B', 'C'])
A 3.0
B 3.0
C 3.0
dtype: float64
3.2.2 Simple operations
Arithmetic operations and most NumPy functions work on series. For example:
# define a series
s = pd.Series([98,73,65], index=['Andrea', 'Beth', 'Carolina'])
# divide each element in series by 10
print(s / 10, '\n')
# take the exponential of each element in series
print(np.exp(s), '\n')
# notice this doesn't change the values of our series
print(s)
Andrea 9.8
Beth 7.3
Carolina 6.5
dtype: float64
Andrea 3.637971e+42
Beth 5.052394e+31
Carolina 1.694889e+28
dtype: float64
Andrea 98
Beth 73
Carolina 65
dtype: int64
We can also produce new pandas.Series with True/False values indicating whether the elements in a series satisfy a condition or not:
s > 10
Andrea True
Beth True
Carolina True
dtype: bool
These kinds of simple conditions on pandas.Series will be key when we are selecting data from data frames.
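As a preview of why such conditions matter for selection, here is a minimal sketch that uses a boolean series to keep only some elements of a series. The threshold of 70 is an arbitrary choice for illustration.

```python
import pandas as pd

s = pd.Series([98, 73, 65], index=['Andrea', 'Beth', 'Carolina'])

# boolean series: True where the value is greater than 70
mask = s > 70

# passing the boolean series inside [] keeps only the rows marked True
high = s[mask]
print(high)
```

Here only the entries for Andrea and Beth are kept, because 65 does not satisfy the condition.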
3.2.3 Attributes & Methods
pandas.Series have many attributes and methods; you can see a full list in the pandas documentation. For now we will cover two examples that have to do with identifying missing values.
pandas represents a missing or NA value with NaN, which stands for not a number. Let’s construct a small series with some NA values:
# series with NAs in it
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
s = pd.Series([1, 2, np.nan, 4, np.nan])
A pandas.Series has an attribute called hasnans that returns True if there are any NaNs:
# check if series has NAs
s.hasnans
True
Then we might be interested in knowing which elements in the series are NAs. We can do this using the isna method:
s.isna()
0 False
1 False
2 True
3 False
4 True
dtype: bool
We can see the output is a pd.Series of boolean values indicating whether the element at the given index is NA (True = is NA) or not (False = not NA).
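Building on the boolean output of isna, the following sketch counts the missing values and lists where they are. It assumes the same small series as above, and the variable names n_missing and missing_labels are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan, 4, np.nan])

# True counts as 1 and False as 0, so summing the boolean series counts the NAs
n_missing = s.isna().sum()
print(n_missing)

# select the NA elements with the boolean series, then read off their index labels
missing_labels = s[s.isna()].index.tolist()
print(missing_labels)
```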
There’s much more to say about pandas.Series, but this is enough to get us going. At this point, we mainly want to know about pandas.Series because they are the columns of pandas.DataFrames.
loc
Notice that when we slice with loc, we get both the start and the end of the indices we indicated. This is different from slicing numpy arrays or lists, where we do not get the element at the end of the slice. Compare the following:
# define a list with integers from 0 to 9
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(x)
# slicing will return elements at indices 2 through 4 (inclusive)
x[2:5]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[2, 3, 4]
# define a np array with integers from 0 to 9
y = np.arange(10)
print(y)
# slicing will return elements at indices 2 through 4 (inclusive)
y[2:5]
[0 1 2 3 4 5 6 7 8 9]
array([2, 3, 4])
# define a pandas series from the previous array
z = pd.Series(y)
print(z)
# slicing will return elements with index labels 2 through 5 (inclusive)
z.loc[2:5]
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
dtype: int64
2 2
3 3
4 4
5 5
dtype: int64
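The inclusive end point is easier to see when the index labels are not integers. The sketch below uses a hypothetical series w with string labels; the values are illustrative.

```python
import pandas as pd

# a series with string labels (illustrative values)
w = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

# label-based slice: both the start label 'b' and the end label 'd' are included
subset = w.loc['b':'d']
print(subset)
```

Since loc slices by label rather than by position, there is no "one past the end" label to stop at, which is one way to remember why the end is included.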
3.3 Data Frames
The Data Frame is the most used pandas object. It represents tabular data and we can think of it as a spreadsheet. Each column of a pandas.DataFrame is a pandas.Series.
3.3.1 Creating a pandas.DataFrame
There are many ways of creating a pandas.DataFrame.
We already mentioned that each column of a pandas.DataFrame is a pandas.Series. In fact, we can think of a pandas.DataFrame as a dictionary of pandas.Series, with each column name being a key and the column values being that key’s value. Thus, we can create a pandas.DataFrame in this way:
# initialize dictionary with columns' data
d = {'col_name_1' : pd.Series(np.arange(3)),
     'col_name_2' : pd.Series([3.1, 3.2, 3.3]),
     }
# create data frame
df = pd.DataFrame(d)
df
|   | col_name_1 | col_name_2 |
|---|---|---|
| 0 | 0 | 3.1 |
| 1 | 1 | 3.2 |
| 2 | 2 | 3.3 |
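To check the claim that each column is a pandas.Series, here is a minimal sketch that rebuilds the same data frame and pulls out one column by its name, in the same way we would look up a key in a dictionary. The variable name col is illustrative.

```python
import numpy as np
import pandas as pd

d = {'col_name_1': pd.Series(np.arange(3)),
     'col_name_2': pd.Series([3.1, 3.2, 3.3])}
df = pd.DataFrame(d)

# selecting a column by its name returns a pandas.Series
col = df['col_name_2']
print(type(col))
print(col)
```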
We can change the index and column names by changing the index and columns attributes in the data frame.
# print original index
print(df.index)
# change the index
df.index = ['a','b','c']
df
RangeIndex(start=0, stop=3, step=1)
|   | col_name_1 | col_name_2 |
|---|---|---|
| a | 0 | 3.1 |
| b | 1 | 3.2 |
| c | 2 | 3.3 |
# print original column names
print(df.columns)
# change column names
df.columns = ['C1','C2']
df
Index(['col_name_1', 'col_name_2'], dtype='object')
|   | C1 | C2 |
|---|---|---|
| a | 0 | 3.1 |
| b | 1 | 3.2 |
| c | 2 | 3.3 |
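One thing to keep in mind when assigning to the index or columns attributes is that the new list must have one label per row or column. The sketch below rebuilds a data frame like the one above (directly from arrays, for brevity) and tries to assign a single, hypothetical column name to a two-column data frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'C1': np.arange(3), 'C2': [3.1, 3.2, 3.3]},
                  index=['a', 'b', 'c'])

try:
    # the data frame has two columns, so a list with one name is rejected
    df.columns = ['only_one_name']
    rename_failed = False
except ValueError as err:
    rename_failed = True
    print("ValueError:", err)

# the column names are unchanged after the failed assignment
print(list(df.columns))
```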