dimension Link to heading
| Data Structure | Dimensions | Description |
|---|---|---|
| Series | 1 | 1d array |
| Data Frames | 2 | 2d array |
| Panel | 3 | 3d array |
- Series: size immutable, data mutable
- Data Frames: size mutable, data mutable
- Panel: size mutable, data mutable
In sum, all three data structures are data mutable; only Series is size immutable.
Data Structures Link to heading
pandas.Series Link to heading
pandas.Series(data, index, dtype, copy)
data is array-like
index values must be unique and hashable, with the same length as data.
dtype is for data type.
- create a Series from dict
import pandas as pd
import numpy as np
data = {'a':0, 'b':1., 'c':2.}
s = pd.Series(data)
print(s)
a 0.0
b 1.0
c 2.0
dtype: float64
observe that the dictionary keys are used to construct the index
import pandas as pd
import numpy as np
data = {'a':0, 'b':1., 'c':2.}
s = pd.Series(data, index=['b', 'c', 'd', 'a'])
print(s)
b 1.0
c 2.0
d NaN
a 0.0
dtype: float64
observe that the index order is preserved and the missing element is filled with NaN
accessing data from a Series by position
very similar to array data access
if the Series is indexed, one can also access elements using index label values.
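a minimal sketch of both access styles (the values and labels here are illustrative, not from the original):

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# positional access, like a plain array
print(s.iloc[0])      # 10
print(s.iloc[:2])     # first two elements

# label-based access using the index
print(s['b'])         # 20
print(s[['a', 'c']])  # a subset of labels
```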
pandas.DataFrame Link to heading
- size is mutable
- labeled axes
one can think of it as an SQL-table-like data representation
pandas.DataFrame(data, index, columns, dtype, copy)
- create a DataFrame from dict or arrays/lists
import pandas as pd
data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])
print(df)
the output is as follows
Age Name
rank1 28 Tom
rank2 34 Jack
rank3 29 Steve
rank4 42 Ricky
column selection, addition and deletion
first, let’s see a script
import pandas as pd
d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df)
and it will print
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
if we do selection
print(df['one'])
the output is as follows
a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64
if we do addition
df['three'] = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(df)
# and even this
df['four'] = df['one'] + df['three']
print(df)
the output is as follows
Adding a new column by passing a Series:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
Adding a new column using the existing columns in the DataFrame:
   one  two  three  four
a  1.0    1   10.0   11.0
b  2.0    2   20.0   22.0
c  3.0    3   30.0   33.0
d  NaN    4    NaN    NaN
if we do deletion
print(df)
# using del function
del df['one']
print(df)
# using pop function
df.pop('two')
print(df)
the output is as follows
Our dataframe is:
   one  three  two
a  1.0   10.0    1
b  2.0   20.0    2
c  3.0   30.0    3
d  NaN    NaN    4
Deleting the first column using DEL function:
   three  two
a   10.0    1
b   20.0    2
c   30.0    3
d    NaN    4
Deleting another column using POP function:
   three
a   10.0
b   20.0
c   30.0
d    NaN
row selection, addition and deletion
selection
- selection by label using loc
- selection by integer location using iloc
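a short sketch of both selectors, reusing the frame from the column-selection example above:

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

# label-based row selection
print(df.loc['b'])

# integer-location based row selection
print(df.iloc[2])

# integer slices work too
print(df.iloc[1:3])
```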
addition
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])
df = df.append(df2)  # append() was removed in pandas 2.0; use pd.concat([df, df2]) there
print(df)
output is as follows
   a  b
0  1  2
1  3  4
0  5  6
1  7  8
deletion
using drop(<label>)
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])
df = df.append(df2)
# Drop rows with label 0
df = df.drop(0)
print(df)
output is as follows
   a  b
1  3  4
1  7  8
pandas.Panel() Link to heading
pandas.Panel(data, items, major_axis, dtype, copy)
- items - axis 0, each item in DataFrame
- major_axis - axis 1, row (index) of each DataFrame
- minor_axis - axis 2, columns of each DataFrame
data can be accessed by item label, or across items using the major_xs()/minor_xs() methods. (Panel itself was removed in pandas 0.25; a MultiIndex DataFrame is the modern replacement.)
Basic Functionality Link to heading
| | axes | dtype | empty | ndim | size | values | head() | tail() | T | shape |
|---|---|---|---|---|---|---|---|---|---|---|
| Series | x | x | x | x | x | x | x | x | ||
| DataFrame | x | x | x | x | x | x | x | x | x | x |
- ndim - return the number of dimensions of the underlying data (1 by definition for a Series)
- head() - return the first n rows
- tail() - return the last n rows
- T - transpose rows and columns
- shape - tuple representing the dimensionality of the DataFrame
statistic methods Link to heading
most of them are aggregation functions and take an axis argument
| Sr.No. | Function | Description |
|---|---|---|
| 1 | count() | Number of non-null observations |
| 2 | sum() | Sum of values |
| 3 | mean() | Mean of Values |
| 4 | median() | Median of Values |
| 5 | mode() | Mode of values |
| 6 | std() | Standard Deviation of the Values |
| 7 | min() | Minimum Value |
| 8 | max() | Maximum Value |
| 9 | abs() | Absolute Value |
| 10 | prod() | Product of Values |
| 11 | cumsum() | Cumulative Sum |
| 12 | cumprod() | Cumulative Product |
the describe() function computes a summary of statistics pertaining to the DataFrame columns
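a quick sketch of describe() (the column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Age': [28, 34, 29, 42],
                   'Rating': [4.23, 3.24, 3.98, 2.56]})

# one line: count, mean, std, min, quartiles, max per numeric column
print(df.describe())
```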
function application Link to heading
table wise function application : pipe() Link to heading
df.pipe(<my_function>, ...)
the above applies my_function to the whole DataFrame
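a minimal sketch of pipe(); the helper adder and its amount argument are made up for illustration:

```python
import pandas as pd
import numpy as np

# hypothetical helper: add a constant to every element of the frame
def adder(frame, amount):
    return frame + amount

df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['col1', 'col2'])

# pipe() passes the whole DataFrame as the first argument to adder
result = df.pipe(adder, 10)
print(result)
```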
row or column wise function application : apply() Link to heading
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
df = df.apply(np.mean,axis=1)
print(df)
the output is as follows
0 0.248585
1 -0.552151
2 -0.579080
3 -0.351066
4 -0.401736
dtype: float64
the axis parameter can be passed: axis=0 applies the function to each column, whereas axis=1 applies it to each row (as above)
element wise function application : applymap() Link to heading
the methods applymap() on DataFrame and analogously map() on Series accept any Python function taking a single value and returning a single value.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
# My custom function
print(df)
df = df['col1'].map(lambda x:x*100)
print(df)
the output is as follows
col1 col2 col3
0 -0.728627 -0.101886 1.025377
1 0.530011 -0.999194 1.060044
2 -0.284739 -0.423325 -0.599201
3 -0.064372 1.523775 -0.321916
4 1.391365 0.436669 -0.342277
0 -72.862671
1 53.001126
2 -28.473895
3 -6.437233
4 139.136522
Name: col1, dtype: float64
to tell the truth, I think applymap() and map() are quite similar
reindex Link to heading
Reindexing changes the row labels and column labels of a DataFrame.
usage is like the following:
# reindex using index 0, 2, 5
df.reindex(index=[0, 2, 5], columns=['A', 'B', 'C'])
# reindex to align with another DataFrame
df1 = df1.reindex_like(df2)
# fill methods: ffill/bfill/nearest
# ffill pads NaN(s) forward
df2.reindex_like(df1, method='ffill')
# limit how many rows to fill
df2.reindex_like(df1, method='ffill', limit=1)
# rename column names
df1.rename(columns={'col1': 'c1', 'col2': 'c2'})
iterations Link to heading
iterate
- Series - on values
- DataFrame - on column labels
- Panel - on item labels
three functions can be used
- iteritems() (renamed to items() in newer pandas)
- iterrows()
- itertuples()
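a minimal sketch of the two DataFrame iterators (the frame contents are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# iterrows(): yields (index, row-as-Series) pairs
for idx, row in df.iterrows():
    print(idx, row['a'], row['b'])

# itertuples(): yields namedtuples, usually faster than iterrows()
for t in df.itertuples():
    print(t.Index, t.a, t.b)
```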
sort Link to heading
sort_index(ascending=False, axis=<axis>)
sort_values(by=[<which_cols>], kind='mergesort/quicksort/heapsort')
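a short sketch of both sort calls (the index labels and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'col1': [2, 1, 1], 'col2': [1, 3, 2]}, index=[10, 5, 7])

# sort by index labels
print(df.sort_index())

# sort by column values; mergesort is the only stable kind
print(df.sort_values(by=['col1', 'col2'], kind='mergesort'))
```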
working with text data Link to heading
these all apply through the .str accessor on a Series/Index
| Sr.No | Function & Description |
|---|---|
| 1 | lower() - converts strings in the Series/Index to lower case |
| 2 | upper() - converts strings in the Series/Index to upper case |
| 3 | len() - computes the length of each string |
| 4 | strip() - strips whitespace (including newlines) from both sides of each string in the Series/Index |
| 5 | split(' ') - splits each string with the given pattern |
| 6 | cat(sep=' ') - concatenates the Series/Index elements with the given separator |
| 7 | get_dummies() - returns a DataFrame with one-hot encoded values |
| 8 | contains(pattern) - returns True for each element that contains the pattern, else False |
| 9 | replace(a, b) - replaces the value a with the value b |
| 10 | repeat(value) - repeats each element the specified number of times |
| 11 | count(pattern) - returns the count of appearances of the pattern in each element |
| 12 | startswith(pattern) - returns True if the element in the Series/Index starts with the pattern |
| 13 | endswith(pattern) - returns True if the element in the Series/Index ends with the pattern |
| 14 | find(pattern) - returns the position of the first occurrence of the pattern |
| 15 | findall(pattern) - returns a list of all occurrences of the pattern |
| 16 | swapcase() - swaps lower/upper case |
| 17 | islower() - checks whether all characters in each string are lower case; returns Boolean |
| 18 | isupper() - checks whether all characters in each string are upper case; returns Boolean |
| 19 | isnumeric() - checks whether all characters in each string are numeric; returns Boolean |
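a few of the table's methods in action via the .str accessor (the sample strings are illustrative):

```python
import pandas as pd

s = pd.Series(['Tom ', ' Jack', 'Steve'])

print(s.str.lower())        # lower-case every string
print(s.str.strip())        # trim whitespace on both sides
print(s.str.len())          # length of each (untrimmed) string
print(s.str.contains('a'))  # elementwise substring test
```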
indexing and selecting data Link to heading
- .loc[] - label based
- .iloc[] - 0-based integer indexing
- .ix[] - both label and integer based (removed in newer pandas)
| Object | Indexers | Return Type |
|---|---|---|
| Series | s.loc[indexer] | Scalar value |
| DataFrame | df.loc[row_index,col_index] | Series object |
| Panel | p.loc[item_index,major_index,minor_index] | Scalar value |
window functions Link to heading
imagine a sliding window that works on top of DataFrame objects
- .rolling() - specify the window=n argument and apply the appropriate statistical function on top of it. usage: df.rolling(window=x)...
- .expanding() - can be applied on a series of data. specify the min_periods=n argument and apply the appropriate statistical function on top of it. usage: df.expanding(min_periods=x)...
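a minimal sketch of both windows (the series values are illustrative):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# rolling mean over a window of 3; the first two entries are NaN
print(s.rolling(window=3).mean())

# expanding mean: each entry averages everything seen so far
print(s.expanding(min_periods=1).mean())
```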
Missing Data Link to heading
- cleaning / filling missing data
df.fillna(0) - fill NaN with 0; df.fillna(method='pad') / df.fillna(method='bfill') - propagate the previous/next value (value and method cannot be combined)
- drop missing value
df.dropna(axis=<axis>)
- replace missing or generic values
df.replace({<generic>:<replaced>, ...})
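the three approaches side by side (the NaN layout here is illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 2.0, 3.0]})

# fill NaN with a constant
print(df.fillna(0))

# forward-fill NaN from the previous row
print(df.ffill())

# drop any row that contains a NaN
print(df.dropna())

# replace specific values
print(df.replace({3.0: 30.0}))
```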
Group by Link to heading
using the groupby() function, one can group by one col or multiple cols
given a groupby object in hand, we can iterate through the object
- get_group() - select a single group
- agg() - can be performed on the grouped data; multiple aggregation functions can be applied at once, e.g. agg([np.sum, np.mean, np.std])
- transform(<my_function>) - transformation on a group or a column returns an object indexed the same size as the one being grouped; thus the transform should return a result that is the same size as the group chunk
- filter() - filters the data on a defined criteria and returns the subset of data
merge() == sql join Link to heading
merge(left_data, right_data, how='inner/left/right/outer', on=[<columns>])
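a minimal sketch of two join modes (the id/name/score frames are illustrative):

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})
right = pd.DataFrame({'id': [2, 3, 4], 'score': [90, 80, 70]})

# inner join keeps only ids present in both frames
print(pd.merge(left, right, how='inner', on='id'))

# left join keeps every row of left, filling missing scores with NaN
print(pd.merge(left, right, how='left', on='id'))
```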
concat() == stack() and/or union() Link to heading
pd.concat([df1, df2, ...], keys=['x', 'y', ...], ignore_index=True/False, axis=<axis>)
concatenating using append
df1.append([df2, df3, ...])  # append() was removed in pandas 2.0; use pd.concat instead
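a short sketch of concat with and without new keys (the frames are illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'a': [3, 4]})

# stack frames vertically, tagging each block with a key (MultiIndex)
print(pd.concat([df1, df2], keys=['x', 'y']))

# or renumber the result from 0
print(pd.concat([df1, df2], ignore_index=True))
```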
Time in pandas Link to heading
datetime Link to heading
usage
import pandas as pd
print(pd.Timestamp.now())  # pd.datetime was removed in newer pandas
# create a range of time
pd.date_range("11:00", "13:30", freq="30min").time
# output is [datetime.time(11, 0) datetime.time(11, 30) datetime.time(12, 0) datetime.time(12, 30) datetime.time(13, 0) datetime.time(13, 30)]
# converting to Timestamps
pd.to_datetime(<pandas_object>)
NaT (Not a Time) is the time equivalent of NaN
dates Link to heading
pd.date_range('1/1/2011', periods=5, freq='M')
# output is a DatetimeIndex; M stands for month-end frequency
some offset aliases are listed below
| Alias | Description |
|---|---|
| B | business day frequency |
| D | calendar day frequency |
| W | weekly frequency |
| M | month end frequency |
| SM | semi-month end frequency |
| BM | business month end frequency |
| MS | month start frequency |
| SMS | semi-month start frequency |
| BMS | business month start frequency |
| Q | quarter end frequency |
| BQ | business quarter end frequency |
| QS | quarter start frequency |
| BQS | business quarter start frequency |
| A | annual(Year) end frequency |
| BA | business year end frequency |
| BAS | business year start frequency |
| BH | business hour frequency |
| H | hourly frequency |
| T, min | minutely frequency |
| S | secondly frequency |
| L, ms | milliseconds |
| U, us | microseconds |
| N | nanoseconds |
Timedelta Link to heading
import pandas as pd
pd.Timedelta('2 days 2 hours 15 minutes 30 seconds')
# 2 days 02:15:30
pd.Timedelta(6, unit='h')
# 0 days 06:00:00
pd.Timedelta(days=2)
date objects can also be added to or subtracted from other date objects
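a minimal sketch of that arithmetic (the dates are illustrative):

```python
import pandas as pd

start = pd.Timestamp('2020-01-01')
delta = pd.Timedelta(days=2, hours=6)

# Timestamps and Timedeltas support +/- arithmetic
print(start + delta)
print(start - delta)

# subtracting two Timestamps yields a Timedelta
print(pd.Timestamp('2020-01-03') - start)
```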
Categorical data Link to heading
Visualization Link to heading
basic plot Link to heading
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4), index=pd.date_range('1/1/2000', periods=10), columns=list('ABCD'))
df.plot()
bar plot Link to heading
df.plot.bar(stacked=True)
df.plot.barh(stacked=True)
histograms Link to heading
df.plot.hist(bins=20)
df.diff().hist(bins=20)
box plots Link to heading
df.plot.box() # df can be Series/DataFrame
area plots Link to heading
df.plot.area() # df can be Series/DataFrame
scatter plots Link to heading
df.plot.scatter(x='cola', y='colb') # DataFrame
pie charts Link to heading
df.plot.pie(subplots=True)
I/O Link to heading
pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer',
names=None, index_col=None, usecols=None, dtype={'colx': np.numeric_type}, skiprows=[<row_nums>])
one can change the dtype on the fly while reading the csv, change header names, and skip rows by row number
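a self-contained sketch using an in-memory buffer in place of a file (the CSV content is made up):

```python
import pandas as pd
from io import StringIO

# a small in-memory CSV standing in for a file on disk
csv = StringIO("junk line\nid,name\n1,Tom\n2,Jack\n")

# skip the junk first row, and force id to float on the fly
df = pd.read_csv(csv, skiprows=[0], dtype={'id': float})
print(df)
```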
sparse data Link to heading
Sparse objects are “compressed” when any data matching a specific value (NaN / missing value, though any value can be chosen) is omitted.
the to_sparse() method creates sparse objects, which exist for memory-efficiency reasons.
Any sparse object can be converted back to the standard dense form by calling to_dense()
sparse and dense objects have the same dtype
other functions Link to heading
- any(): return whether any element is True.
- isin(): whether each element in the DataFrame is contained in values
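both in one short sketch (the frame is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 0], 'b': [0, 0]})

# any(): per column, is any element truthy?
print(df.any())

# isin(): elementwise membership test against a list of values
print(df.isin([1]))
```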
pandas vs. SQL Link to heading
top N Link to heading
SELECT total_bill, tip, smoker, time
FROM tips
LIMIT 5;
in pandas
tips[['total_bill', 'tip', 'smoker', 'time']].head(5)
where clause Link to heading
SELECT * FROM tips WHERE time = 'Dinner' LIMIT 5;
in pandas
tips[tips['time'] == 'Dinner'].head(5)
group by Link to heading
SELECT sex, count(*)
FROM tips
GROUP BY sex;
in pandas
tips.groupby('sex').size()