dimension Link to heading

Data Structure | Dimensions | Description
Series         | 1          | 1-d array
Data Frames    | 2          | 2-d array
Panel          | 3          | 3-d array

  • Series: size immutable, data mutable
  • Data Frames: size mutable, data mutable
  • Panel: size mutable, data mutable

To sum up, all of these data structures are data mutable.

Data Structures Link to heading

pandas.Series Link to heading

pandas.Series(data, index, dtype, copy)

data is array-like, a dict, or a scalar value

index values must be unique and hashable, and the same length as data

dtype specifies the data type

  • create a Series from dict
import pandas as pd
import numpy as np
data = {'a':0, 'b':1., 'c':2.}
s = pd.Series(data)
print(s)
a 0.0
b 1.0
c 2.0
dtype: float64

observe that the dictionary keys are used to construct the index

import pandas as pd
import numpy as np

data = {'a':0, 'b':1., 'c':2.}
s = pd.Series(data, index=['b', 'c', 'd', 'a'])
print(s)
b 1.0
c 2.0
d NaN
a 0.0
dtype: float64

observe that the index order is preserved and the missing element is filled with NaN

  • accessing data from Series with position

    • very similar to array data access

    • if the Series has a labeled index, elements can also be accessed by index label
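for example, both access styles side by side (data invented here):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# positional access, like a plain array (iloc avoids label/position ambiguity)
print(s.iloc[0])       # first element -> 1
print(s.iloc[:3])      # first three elements

# label-based access via the index
print(s['c'])          # -> 3
print(s[['a', 'e']])   # sub-Series for the given labels
```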

pandas.DataFrame Link to heading

  • size is mutable
  • labeled axes

one can think of it as an SQL table representation of data

pandas.DataFrame(data, index, columns, dtype, copy)

  • create a DataFrame from dict or arrays/lists
import pandas as pd

data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])
print(df)

the output is as follows

         Age   Name
rank1     28   Tom
rank2     34   Jack
rank3     29   Steve
rank4     42   Ricky
  • column selection, addition and deletion

    first, let’s see a script

    import pandas as pd
    
    d = {'one':pd.Series([1,2,3], index=['a', 'b', 'c']),
    	 'two':pd.Series([1,2,3,4], index=['a', 'b', 'c', 'd'])}
    
    df = pd.DataFrame(d)
    print(df)
    

    and it will print

          one    two
    a     1.0    1
    b     2.0    2
    c     3.0    3
    d     NaN    4
    
    • selection

      print(df['one'])
      

      the output is as follows

      a     1.0
      b     2.0
      c     3.0
      d     NaN
      Name: one, dtype: float64
      
    • addition

      df['three'] = pd.Series([10,20,30], index=['a', 'b', 'c'])
      print(df)
      
      # and even this
      df['four'] = df['one']+df['three']
      print(df)
      

      the output is as follows

      Adding a new column by passing as Series:
           one   two   three
      a    1.0    1    10.0
      b    2.0    2    20.0
      c    3.0    3    30.0
      d    NaN    4    NaN
      
      Adding a new column using the existing columns in DataFrame:
            one   two   three    four
      a     1.0    1    10.0     11.0
      b     2.0    2    20.0     22.0
      c     3.0    3    30.0     33.0
      d     NaN    4     NaN     NaN
      
    • deletion

      print(df)
      
      # using del function
      del df['one']
      print(df)
      
      # using pop function
      df.pop('two')
      print(df)
      

      the output is as follows

      Our dataframe is:
            one   three  two
      a     1.0    10.0   1
      b     2.0    20.0   2
      c     3.0    30.0   3
      d     NaN     NaN   4
      
      Deleting the first column using DEL function:
            three    two
      a     10.0     1
      b     20.0     2
      c     30.0     3
      d     NaN      4
      
      Deleting another column using POP function:
         three
      a  10.0
      b  20.0
      c  30.0
      d  NaN
      
  • row selection, addition and deletion

    • selection

      1. selection by label using loc function
      2. selection by integer location using iloc function
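for example (data invented here, reusing the frame from the script above):

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0],
                   'two': [10, 20, 30]},
                  index=['a', 'b', 'c'])

row_by_label = df.loc['b']      # select row 'b' by label
row_by_position = df.iloc[1]    # same row, selected by integer position
print(row_by_label)
print(row_by_position)
```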
    • addition

      import pandas as pd
      
      df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
      df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])
      
      df = df.append(df2)
      print(df)
      

      output is as follows

         a  b
      0  1  2
      1  3  4
      0  5  6
      1  7  8
      
    • deletion

      using drop(<label>)

      import pandas as pd
      
      df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
      df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])
      
      df = df.append(df2)
      
      # Drop rows with label 0
      df = df.drop(0)
      
      print(df)
      

      output is as follows

        a b
      1 3 4
      1 7 8
      
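note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on a current pandas, the two examples above can be sketched with pd.concat instead:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])

# pd.concat produces the same stacked frame (with duplicate labels 0 and 1)
df = pd.concat([df, df2])
print(df)

# drop() still removes every row carrying the given label
print(df.drop(0))
```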

pandas.Panel() Link to heading

pandas.Panel(data, items, major_axis, minor_axis, dtype, copy)

  • items - axis 0; each item corresponds to a DataFrame
  • major_axis - axis 1; the rows (index) of each DataFrame
  • minor_axis - axis 2; the columns of each DataFrame

data can be sliced along an axis using the major_xs() / minor_xs() methods (note: Panel was deprecated and removed in pandas 0.25)

Basic Functionality Link to heading

          | axes | dtype | empty | ndim | size | values | head() | tail() | T | shape
Series    | x    | x     | x     | x    | x    | x      | x      | x      |   |
DataFrame | x    | x     | x     | x    | x    | x      | x      | x      | x | x

  • ndim - returns the number of dimensions of the underlying data (1 for a Series, by definition)
  • head() - returns the first n rows
  • tail() - returns the last n rows
  • T - transposes rows and columns
  • shape - tuple representing the dimensionality of the DataFrame

statistic methods Link to heading

most of them are aggregation functions that take an axis argument

Sr.No. | Function  | Description
1      | count()   | Number of non-null observations
2      | sum()     | Sum of values
3      | mean()    | Mean of values
4      | median()  | Median of values
5      | mode()    | Mode of values
6      | std()     | Standard deviation of the values
7      | min()     | Minimum value
8      | max()     | Maximum value
9      | abs()     | Absolute value
10     | prod()    | Product of values
11     | cumsum()  | Cumulative sum
12     | cumprod() | Cumulative product
  • describe()

    function computes a summary of statistics pertaining to the DataFrame columns
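a few of these in action (data invented here):

```python
import pandas as pd

df = pd.DataFrame({'Age': [28, 34, 29, 42]})

print(df['Age'].sum())    # 133
print(df['Age'].mean())   # 33.25
print(df.describe())      # count, mean, std, min, quartiles, max per numeric column
```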

function application Link to heading

table wise function application : pipe() Link to heading

...
df.pipe(<my_function>, ...)
...

the above script applies my_function to the whole DataFrame
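a runnable version of the pattern (adder is a made-up helper, not a pandas function):

```python
import pandas as pd
import numpy as np

def adder(frame, amount):
    # hypothetical helper: add `amount` to every element of the frame
    return frame + amount

df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['a', 'b'])

# pipe() passes the whole DataFrame as the first argument: equivalent to adder(df, 10)
result = df.pipe(adder, 10)
print(result)
```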

row or column wise function application : apply() Link to heading

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
df = df.apply(np.mean,axis=1)
print(df)

the output is as follows

0    0.248585
1   -0.552151
2   -0.579080
3   -0.351066
4   -0.401736
dtype: float64

an axis parameter can be passed: axis=0 applies the function to each column, whereas axis=1 applies it to each row

element wise function application : applymap() Link to heading

the methods applymap() on DataFrame and analogously map() on Series accept any Python function taking a single value and returning a single value.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])

# My custom function
print(df)
df = df['col1'].map(lambda x:x*100)
print(df)

the output is as follows

       col1      col2      col3
0 -0.728627 -0.101886  1.025377
1  0.530011 -0.999194  1.060044
2 -0.284739 -0.423325 -0.599201
3 -0.064372  1.523775 -0.321916
4  1.391365  0.436669 -0.342277
0    -72.862671
1     53.001126
2    -28.473895
3     -6.437233
4    139.136522
Name: col1, dtype: float64

to tell the truth, I think pipe() and map() are quite similar

reindex Link to heading

Reindexing changes the row labels and column labels of a DataFrame.

usage is like following:

# reindex using index 0, 2, 5
...
df.reindex(index=[0, 2, 5], columns=['A', 'B', 'C'])
...

# reindex align to other
...
df1 = df1.reindex_like(df2)
...

# ffill/bfill/nearest
# ffill pads NaN(s) forward
...
df2.reindex_like(df1, method='ffill')
...

# limit how many rows to fill
...
df2.reindex_like(df1, method='ffill', limit=1)
...

# rename column name
...
df1.rename(columns={'col1':'c1', 'col2': 'c2'})
...
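a small runnable example of reindex and rename (data invented here):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=[0, 1, 2])

# keep rows 0 and 2, add a missing row 5 and column C (both filled with NaN)
r = df.reindex(index=[0, 2, 5], columns=['A', 'B', 'C'])
print(r)

# rename columns without touching the data
renamed = df.rename(columns={'A': 'a1', 'B': 'b1'})
print(renamed.columns.tolist())
```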

iterations Link to heading

iterate

  • Series - on values
  • DataFrame - on column labels
  • Panel - on item labels

three functions can be used

  • iteritems() (removed in pandas 2.0; use items() instead)
  • iterrows()
  • itertuples()
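for example, the two DataFrame row iterators (data invented here):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2], 'y': [10, 20]})

# iterrows(): yields (index, row-as-Series) pairs
total = 0
for idx, row in df.iterrows():
    total += row['x'] + row['y']

# itertuples(): yields namedtuples, usually faster than iterrows()
pairs = [(t.x, t.y) for t in df.itertuples(index=False)]
print(total)
print(pairs)
```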

sort Link to heading

sort_index(ascending=False, axis=<axis>)

sort_values(by=[<which_cols>], kind='mergesort/quicksort/heapsort')
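both sorts in action (data invented here):

```python
import pandas as pd

df = pd.DataFrame({'col1': [2, 1, 1], 'col2': [1, 3, 2]}, index=[10, 2, 5])

# sort by the row labels
by_index = df.sort_index(ascending=False)
print(by_index.index.tolist())   # [10, 5, 2]

# sort by one or more columns; kind selects the algorithm
by_value = df.sort_values(by=['col1', 'col2'], kind='mergesort')
print(by_value)
```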

working with text data Link to heading

these all apply through the .str accessor on a Series/Index

Sr.No | Function            | Description
1     | lower()             | Converts strings in the Series/Index to lower case
2     | upper()             | Converts strings in the Series/Index to upper case
3     | len()               | Computes the length of each string
4     | strip()             | Strips whitespace (including newlines) from both sides of each string in the Series/Index
5     | split(' ')          | Splits each string on the given pattern
6     | cat(sep=' ')        | Concatenates the Series/Index elements with the given separator
7     | get_dummies()       | Returns a DataFrame with one-hot encoded values
8     | contains(pattern)   | Returns True for each element that contains the pattern, else False
9     | replace(a,b)        | Replaces the value a with the value b
10    | repeat(value)       | Repeats each element the specified number of times
11    | count(pattern)      | Returns the count of occurrences of the pattern in each element
12    | startswith(pattern) | Returns True if the element in the Series/Index starts with the pattern
13    | endswith(pattern)   | Returns True if the element in the Series/Index ends with the pattern
14    | find(pattern)       | Returns the position of the first occurrence of the pattern
15    | findall(pattern)    | Returns a list of all occurrences of the pattern
16    | swapcase()          | Swaps case lower/upper
17    | islower()           | Checks whether all characters in each string are lower case; returns Boolean
18    | isupper()           | Checks whether all characters in each string are upper case; returns Boolean
19    | isnumeric()         | Checks whether all characters in each string are numeric; returns Boolean
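a few of these chained together (data invented here):

```python
import pandas as pd

s = pd.Series(['Tom', 'Jack ', 'Steve'])

print(s.str.lower().tolist())             # ['tom', 'jack ', 'steve']
print(s.str.strip().str.len().tolist())   # lengths after trimming whitespace
print(s.str.contains('a').tolist())       # only 'Jack ' contains an 'a'
```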

indexing and selecting data Link to heading

  • .loc[] - label based
  • .iloc[] - 0-based integer position based
  • .ix[] - both label and integer based (deprecated; removed in pandas 1.0)

Object    | Indexers                                    | Return Type
Series    | s.loc[indexer]                              | Scalar value
DataFrame | df.loc[row_index, col_index]                | Series object
Panel     | p.loc[item_index, major_index, minor_index] | Scalar value

window functions Link to heading

imagine a sliding window that works on top of DataFrame objects

  • .rolling() - Specify the window=n argument and apply the appropriate statistical function on top of it. usage df.rolling(window=x)...
  • .expanding() - This function can be applied on a series of data. Specify the min_periods=n argument and apply the appropriate statistical function on top of it. usage df.expanding(min_periods=x)...
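both windows over the same Series (data invented here):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# rolling: statistic over a fixed-size sliding window (NaN until the window fills)
roll = s.rolling(window=3).sum()
print(roll.tolist())

# expanding: statistic over all rows seen so far
exp = s.expanding(min_periods=1).sum()
print(exp.tolist())
```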

Missing Data Link to heading

  • cleaning / filling missing data

df.fillna(0) - fill NaN with 0; df.ffill() / df.bfill() fill forward or backward (a value and a fill method cannot be combined in one call)

  • drop missing value

df.dropna(axis=<axis>)

  • replace missing or generic values

df.replace({<generic>:<replaced>, ...})
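the three approaches side by side (data invented here):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 5.0, 6.0]})

filled = df.fillna(0)              # replace NaN with a constant
padded = df.ffill()                # forward-fill previous valid value
dropped = df.dropna()              # drop any row containing NaN
replaced = df.replace({3.0: 30.0}) # swap a generic value
print(filled)
print(dropped)
```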

Group by Link to heading

  • the groupby() function can group by one column or by multiple columns

  • given a groupby object in hand, we can iterate through it

  • use get_group() to select a single group

  • agg() can be performed on the grouped data; multiple aggregation functions can be applied at once, eg: agg([np.sum, np.mean, np.std])

  • transform(<my_function>) - Transformation on a group or a column returns an object that is indexed the same size as the group being transformed. Thus, the transform function should return a result that is the same size as the group chunk.

  • filter() - filters the data on a defined criterion and returns the subset of data
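all of the above on a toy frame (data invented here):

```python
import pandas as pd

df = pd.DataFrame({'Team': ['A', 'A', 'B', 'B'],
                   'Points': [10, 20, 5, 15]})

grouped = df.groupby('Team')

print(grouped.get_group('A'))                  # rows of a single group
print(grouped['Points'].agg(['sum', 'mean']))  # several aggregations at once
print(grouped['Points'].transform('sum'))      # same length as the original frame

# keep only groups whose total is above a threshold
print(df.groupby('Team').filter(lambda g: g['Points'].sum() > 25))
```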

merge() == sql join Link to heading

merge(left_data, right_data, how='inner/left/right/outer', on=[<columns>])
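for example, inner vs. outer joins (data invented here):

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'name': ['Tom', 'Jack', 'Steve']})
right = pd.DataFrame({'id': [1, 2, 4], 'score': [90, 80, 70]})

inner = pd.merge(left, right, how='inner', on='id')   # ids present in both: 1 and 2
outer = pd.merge(left, right, how='outer', on='id')   # all ids, NaN where missing
print(inner)
print(outer)
```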

concat() == stack() and/or union() Link to heading

pd.concat([df1, df2, ...], keys=['x', 'y', ...], ignore_index=True/False, axis=<axis>)

concatenating using append (removed in pandas 2.0; pd.concat covers it now)

df1.append([df2, df3, ...])

Time in pandas Link to heading

datetime Link to heading

usage

import pandas as pd

print(pd.Timestamp.now())  # pd.datetime.now() was removed in pandas 2.0

# create a range of time
pd.date_range("11:00", "13:30", freq="30min").time
# output is [datetime.time(11, 0) datetime.time(11, 30) datetime.time(12, 0) datetime.time(12, 30) datetime.time(13, 0) datetime.time(13, 30)]

# converting to Timestamps
pd.to_datetime(<pandas_object>)

NaT (Not a Time) is the equivalent of NaN for datetimes

dates Link to heading

pd.date_range('1/1/2011', periods=5, freq='M')
# output is a DatetimeIndex; M stands for month end

some offset aliases are listed below

Alias  | Description
B      | business day frequency
D      | calendar day frequency
W      | weekly frequency
M      | month end frequency
SM     | semi-month end frequency
BM     | business month end frequency
MS     | month start frequency
SMS    | semi-month start frequency
BMS    | business month start frequency
Q      | quarter end frequency
BQ     | business quarter end frequency
QS     | quarter start frequency
BQS    | business quarter start frequency
A      | annual (year) end frequency
BA     | business year end frequency
BAS    | business year start frequency
BH     | business hour frequency
H      | hourly frequency
T, min | minutely frequency
S      | secondly frequency
L, ms  | milliseconds
U, us  | microseconds
N      | nanoseconds

Timedelta Link to heading

import pandas as pd

pd.Timedelta('2 days 2 hours 15 minutes 30 seconds')
# 2 days 02:15:30

pd.Timedelta(6, unit='h')
# 0 days 06:00:00

pd.Timedelta(days=2)

date objects can also be added to or subtracted from other date objects
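for example, Timestamp/Timedelta arithmetic (dates invented here):

```python
import pandas as pd

ts = pd.Timestamp('2011-01-01')
delta = pd.Timedelta(days=2, hours=6)

print(ts + delta)                        # shift a timestamp forward
print(ts - pd.Timedelta('1 day'))        # shift it backward
print(pd.Timestamp('2011-01-03') - ts)   # difference of two timestamps is a Timedelta
```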

Categorical data Link to heading
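a minimal sketch of the categorical dtype (example values invented here):

```python
import pandas as pd

# the categorical dtype stores repeated labels as integer codes plus a category list
s = pd.Series(['low', 'high', 'low', 'medium'], dtype='category')

print(s.cat.categories.tolist())  # unique categories, sorted
print(s.cat.codes.tolist())       # integer code per element
```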

Visualization Link to heading

basic plot Link to heading

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(10, 4), index=pd.date_range('1/1/2000', periods=10), columns=list('ABCD'))

df.plot()

bar plot Link to heading

df.plot.bar(stacked=True)
df.plot.barh(stacked=True)

histograms Link to heading

df.plot.hist(bins=20)
df.diff().hist(bins=20)

box plots Link to heading

df.plot.box() # df can be Series/DataFrame

area plots Link to heading

df.plot.area() # df can be Series/DataFrame

scatter plots Link to heading

df.plot.scatter(x='cola', y='colb') # DataFrame

pie charts Link to heading

df.plot.pie(subplots=True)

I/O Link to heading

pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer',
names=None, index_col=None, usecols=None, dtype={'colx': np.numeric_type}, skiprows=[<row_nums>])

dtype can be converted on the fly while reading the csv; header names can be changed, and rows can be skipped by row number
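a small demonstration of those parameters using an in-memory csv (data invented here):

```python
import pandas as pd
from io import StringIO

csv_text = "a,b\n1,x\n2,y\n3,z\n"

# dtype coerces a column on the fly
df = pd.read_csv(StringIO(csv_text), dtype={'a': float})
print(df['a'].tolist())

# names replaces the header; skiprows=1 skips the original header row
df2 = pd.read_csv(StringIO(csv_text), names=['col1', 'col2'], skiprows=1)
print(df2['col1'].tolist())
```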

sparse data Link to heading

Sparse objects are “compressed” when any data matching a specific value (NaN / missing value, though any value can be chosen) is omitted.

the to_sparse() method creates sparse objects, which exist for memory-efficiency reasons

Any sparse object can be converted back to the standard dense form by calling to_dense()

sparse and dense objects have the same dtype

other functions Link to heading

  • any(): returns whether any element is True
  • isin(): returns whether each element in the DataFrame is contained in values

pandas vs. SQL Link to heading

top N Link to heading

SELECT total_bill, tip, smoker, time
FROM tips
LIMIT 5;

in pandas

tips[['total_bill', 'tip', 'smoker', 'time']].head(5)

where clause Link to heading

SELECT * FROM tips WHERE time = 'Dinner' LIMIT 5;

in pandas

tips[tips['time'] == 'Dinner'].head(5)

group by Link to heading

SELECT sex, count(*)
FROM tips
GROUP BY sex;

in pandas

tips.groupby('sex').size()