dimension Link to heading
| Data Structure | Dimensions | Description |
|---|---|---|
| Series | 1 | 1d array |
| Data Frames | 2 | 2d array |
| Panel | 3 | 3d array |
- Series: size immutable, data mutable
- Data Frames: size mutable, data mutable
- Panel: size mutable, data mutable
In sum, all three data structures are data mutable; only Series is size immutable.
Data Structures Link to heading
pandas.Series Link to heading
pandas.Series(data, index, dtype, copy)
data is array-like
index values must be unique and hashable, with the same length as data.
dtype is for data type.
- create a Series from dict
import pandas as pd
import numpy as np
data = {'a':0, 'b':1., 'c':2.}
s = pd.Series(data)
print(s)
a 0.0
b 1.0
c 2.0
dtype: float64
observe that the dictionary keys are used to construct the index
import pandas as pd
import numpy as np
data = {'a':0, 'b':1., 'c':2.}
s = pd.Series(data, index=['b', 'c', 'd', 'a'])
print(s)
b 1.0
c 2.0
d NaN
a 0.0
dtype: float64
observe that the index order is preserved and the missing element is filled with NaN
accessing data from a Series by position
very similar to array data access
if the Series is indexed, one can also access elements using index label values.
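a minimal sketch of both access styles (the values and labels here are illustrative, not from the original):

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# positional access, like a plain array
print(s.iloc[0])      # 10
print(s.iloc[:2])     # first two elements

# label-based access using the index
print(s['b'])         # 20
print(s[['a', 'c']])  # a subset of labels
```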
pandas.DataFrame Link to heading
- size is mutable
- labeled axes
one can think of it as an SQL-table-like data representation
pandas.DataFrame(data, index, columns, dtype, copy)
- create a DataFrame from dict or arrays/lists
import pandas as pd
data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 'Age': [28, 34, 29, 42]}
df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])
print(df)
the output is as follows
Age Name
rank1 28 Tom
rank2 34 Jack
rank3 29 Steve
rank4 42 Ricky
column selection, addition and deletion
first, let’s see a script
import pandas as pd
d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df)
and it will print
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
if we do selection
print(df['one'])
the output is as follows
a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64
if we do addition
df['three'] = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(df)
# and even this
df['four'] = df['one'] + df['three']
print(df)
the output is as follows
Adding a new column by passing a Series:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
Adding a new column using the existing columns in the DataFrame:
   one  two  three  four
a  1.0    1   10.0   11.0
b  2.0    2   20.0   22.0
c  3.0    3   30.0   33.0
d  NaN    4    NaN    NaN
if we do deletion
print(df)
# using del function
del df['one']
print(df)
# using pop function
df.pop('two')
print(df)
the output is as follows
Our dataframe is:
   one  three  two
a  1.0   10.0    1
b  2.0   20.0    2
c  3.0   30.0    3
d  NaN    NaN    4
Deleting the first column using DEL function:
   three  two
a   10.0    1
b   20.0    2
c   30.0    3
d    NaN    4
Deleting another column using POP function:
   three
a   10.0
b   20.0
c   30.0
d    NaN
row selection, addition and deletion
selection
- selection by label using loc
- selection by integer location using iloc
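a short sketch of both selectors, reusing the frame from the column-selection example above:

```python
import pandas as pd

d = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

# label-based row selection
print(df.loc['b'])

# integer-location based row selection
print(df.iloc[2])

# integer slices work too
print(df.iloc[1:3])
```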
addition
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])
df = df.append(df2)  # append() was removed in pandas 2.0; use pd.concat([df, df2]) there
print(df)
output is as follows
   a  b
0  1  2
1  3  4
0  5  6
1  7  8
deletion
using drop(<label>)
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=['a', 'b'])
df = df.append(df2)
# Drop rows with label 0
df = df.drop(0)
print(df)
output is as follows
   a  b
1  3  4
1  7  8
pandas.Panel() Link to heading
pandas.Panel(data, items, major_axis, dtype, copy)
- items - axis 0, each item in DataFrame
- major_axis - axis 1, row (index) of each DataFrame
- minor_axis - axis 2, columns of each DataFrame
data can be accessed by item label, or across items using the major_xs()/minor_xs() methods. (Panel itself was removed in pandas 0.25; a MultiIndex DataFrame is the modern replacement.)
Basic Functionality Link to heading
| | axes | dtype | empty | ndim | size | values | head() | tail() | T | shape |
|---|---|---|---|---|---|---|---|---|---|---|
| Series | x | x | x | x | x | x | x | x | ||
| DataFrame | x | x | x | x | x | x | x | x | x | x |
- ndim - return the number of dimensions of the underlying data (1 by definition for a Series)
- head() - return the first n rows
- tail() - return the last n rows
- T - transpose rows and columns
- shape - tuple representing the dimensionality of the DataFrame
statistic methods Link to heading
most of them are aggregation functions and take an axis argument
| Sr.No. | Function | Description |
|---|---|---|
| 1 | count() | Number of non-null observations |
| 2 | sum() | Sum of values |
| 3 | mean() | Mean of Values |
| 4 | median() | Median of Values |
| 5 | mode() | Mode of values |
| 6 | std() | Standard Deviation of the Values |
| 7 | min() | Minimum Value |
| 8 | max() | Maximum Value |
| 9 | abs() | Absolute Value |
| 10 | prod() | Product of Values |
| 11 | cumsum() | Cumulative Sum |
| 12 | cumprod() | Cumulative Product |
the describe() function computes a summary of statistics pertaining to the DataFrame columns
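a quick sketch of describe() (the column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Age': [28, 34, 29, 42],
                   'Rating': [4.23, 3.24, 3.98, 2.56]})

# one line: count, mean, std, min, quartiles, max per numeric column
print(df.describe())
```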
function application Link to heading
table wise function application : pipe() Link to heading
df.pipe(<my_function>, ...)
the above applies my_function to the whole DataFrame
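a minimal sketch of pipe(); the helper adder and its amount argument are made up for illustration:

```python
import pandas as pd
import numpy as np

# hypothetical helper: add a constant to every element of the frame
def adder(frame, amount):
    return frame + amount

df = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['col1', 'col2'])

# pipe() passes the whole DataFrame as the first argument to adder
result = df.pipe(adder, 10)
print(result)
```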
row or column wise function application : apply() Link to heading
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
df = df.apply(np.mean,axis=1)
print(df)
the output is as follows
0 0.248585
1 -0.552151
2 -0.579080
3 -0.351066
4 -0.401736
dtype: float64
the axis parameter can be passed: axis=0 applies the function to each column, whereas axis=1 applies it to each row (as above)
element wise function application : applymap() Link to heading
the methods applymap() on DataFrame and analogously map() on Series accept any Python function taking a single value and returning a single value.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
# My custom function
print(df)
df = df['col1'].map(lambda x:x*100)
print(df)
the output is as follows
col1 col2 col3
0 -0.728627 -0.101886 1.025377
1 0.530011 -0.999194 1.060044
2 -0.284739 -0.423325 -0.599201
3 -0.064372 1.523775 -0.321916
4 1.391365 0.436669 -0.342277
0 -72.862671
1 53.001126
2 -28.473895
3 -6.437233
4 139.136522
Name: col1, dtype: float64
to tell the truth, I think applymap() and map() are quite similar
reindex Link to heading
Reindexing changes the row labels and column labels of a DataFrame.
usage is like the following:
# reindex using index 0, 2, 5
df.reindex(index=[0, 2, 5], columns=['A', 'B', 'C'])
# reindex to align with another DataFrame
df1 = df1.reindex_like(df2)
# fill methods: ffill/bfill/nearest
# ffill pads NaN(s) forward
df2.reindex_like(df1, method='ffill')
# limit how many rows to fill
df2.reindex_like(df1, method='ffill', limit=1)
# rename column names
df1.rename(columns={'col1': 'c1', 'col2': 'c2'})
iterations Link to heading
iterate
- Series - on values
- DataFrame - on column labels
- Panel - on item labels
three functions can be used
- iteritems() (renamed to items() in newer pandas)
- iterrows()
- itertuples()
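a minimal sketch of the two DataFrame iterators (the frame contents are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# iterrows(): yields (index, row-as-Series) pairs
for idx, row in df.iterrows():
    print(idx, row['a'], row['b'])

# itertuples(): yields namedtuples, usually faster than iterrows()
for t in df.itertuples():
    print(t.Index, t.a, t.b)
```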
sort Link to heading
sort_index(ascending=False, axis=<axis>)
sort_values(by=[<which_cols>], kind='mergesort/quicksort/heapsort')
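a short sketch of both sort calls (the index labels and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'col1': [2, 1, 1], 'col2': [1, 3, 2]}, index=[10, 5, 7])

# sort by index labels
print(df.sort_index())

# sort by column values; mergesort is the only stable kind
print(df.sort_values(by=['col1', 'col2'], kind='mergesort'))
```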
working with text data Link to heading
these all apply through the .str accessor on a Series/Index
| Sr.No | Function & Description |
|---|---|
| 1 | lower() - converts strings in the Series/Index to lower case |
| 2 | upper() - converts strings in the Series/Index to upper case |
| 3 | len() - computes the length of each string |
| 4 | strip() - strips whitespace (including newlines) from both sides of each string in the Series/Index |
| 5 | split(' ') - splits each string with the given pattern |
| 6 | cat(sep=' ') - concatenates the Series/Index elements with the given separator |
| 7 | get_dummies() - returns a DataFrame with one-hot encoded values |
| 8 | contains(pattern) - returns True for each element that contains the pattern, else False |
| 9 | replace(a, b) - replaces the value a with the value b |
| 10 | repeat(value) - repeats each element the specified number of times |
| 11 | count(pattern) - returns the count of appearances of the pattern in each element |
| 12 | startswith(pattern) - returns True if the element in the Series/Index starts with the pattern |
| 13 | endswith(pattern) - returns True if the element in the Series/Index ends with the pattern |
| 14 | find(pattern) - returns the position of the first occurrence of the pattern |
| 15 | findall(pattern) - returns a list of all occurrences of the pattern |
| 16 | swapcase() - swaps lower/upper case |
| 17 | islower() - checks whether all characters in each string are lower case; returns Boolean |
| 18 | isupper() - checks whether all characters in each string are upper case; returns Boolean |
| 19 | isnumeric() - checks whether all characters in each string are numeric; returns Boolean |
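a few of the table's methods in action via the .str accessor (the sample strings are illustrative):

```python
import pandas as pd

s = pd.Series(['Tom ', ' Jack', 'Steve'])

print(s.str.lower())        # lower-case every string
print(s.str.strip())        # trim whitespace on both sides
print(s.str.len())          # length of each (untrimmed) string
print(s.str.contains('a'))  # elementwise substring test
```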
indexing and selecting data Link to heading
- .loc[] - label based
- .iloc[] - 0-based integer indexing
- .ix[] - both label and integer based (removed in newer pandas)
| Object | Indexers | Return Type |
|---|---|---|
| Series | s.loc[indexer] | Scalar value |
| DataFrame | df.loc[row_index,col_index] | Series object |
| Panel | p.loc[item_index,major_index,minor_index] | Scalar value |
window functions Link to heading
imagine a sliding window that works on top of DataFrame objects
- .rolling() - specify the window=n argument and apply the appropriate statistical function on top of it. usage: df.rolling(window=x)...
- .expanding() - can be applied on a series of data. specify the min_periods=n argument and apply the appropriate statistical function on top of it. usage: df.expanding(min_periods=x)...
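a minimal sketch of both windows (the series values are illustrative):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# rolling mean over a window of 3; the first two entries are NaN
print(s.rolling(window=3).mean())

# expanding mean: each entry averages everything seen so far
print(s.expanding(min_periods=1).mean())
```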
Missing Data Link to heading
- cleaning / filling missing data
df.fillna(0) - fill NaN with 0; df.fillna(method='pad') / df.fillna(method='bfill') - propagate the previous/next value (value and method cannot be combined)
- drop missing value
df.dropna(axis=<axis>)
- replace missing or generic values
df.replace({<generic>:<replaced>, ...})
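the three approaches side by side (the NaN layout here is illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 2.0, 3.0]})

# fill NaN with a constant
print(df.fillna(0))

# forward-fill NaN from the previous row
print(df.ffill())

# drop any row that contains a NaN
print(df.dropna())

# replace specific values
print(df.replace({3.0: 30.0}))
```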
Group by Link to heading
using the groupby() function, one can group by one col or multiple cols
given a groupby object in hand, we can iterate through the object
- get_group() - select a single group
- agg() - can be performed on the grouped data; multiple aggregation functions can be applied at once, e.g. agg([np.sum, np.mean, np.std])
- transform(<my_function>) - transformation on a group or a column returns an object indexed the same size as the one being grouped; thus the transform should return a result that is the same size as the group chunk
- filter() - filters the data on a defined criteria and returns the subset of data
merge() == sql join Link to heading
merge(left_data, right_data, how='inner/left/right/outer', on=[<columns>])
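a minimal sketch of two join modes (the id/name/score frames are illustrative):

```python
import pandas as pd

left = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})
right = pd.DataFrame({'id': [2, 3, 4], 'score': [90, 80, 70]})

# inner join keeps only ids present in both frames
print(pd.merge(left, right, how='inner', on='id'))

# left join keeps every row of left, filling missing scores with NaN
print(pd.merge(left, right, how='left', on='id'))
```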
concat() == stack() and/or union() Link to heading
pd.concat([df1, df2, ...], keys=['x', 'y', ...], ignore_index=True/False, axis=<axis>)
concatenating using append
df1.append([df2, df3, ...])  # append() was removed in pandas 2.0; use pd.concat instead
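a short sketch of concat with and without new keys (the frames are illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'a': [3, 4]})

# stack frames vertically, tagging each block with a key (MultiIndex)
print(pd.concat([df1, df2], keys=['x', 'y']))

# or renumber the result from 0
print(pd.concat([df1, df2], ignore_index=True))
```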
Time in pandas Link to heading
datetime Link to heading
usage
import pandas as pd
print(pd.Timestamp.now())  # pd.datetime was removed in newer pandas
# create a range of time
pd.date_range("11:00", "13:30", freq="30min").time
# output is [datetime.time(11, 0) datetime.time(11, 30) datetime.time(12, 0) datetime.time(12, 30) datetime.time(13, 0) datetime.time(13, 30)]
# converting to Timestamps
pd.to_datetime(<pandas_object>)
NaT (Not a Time) is the time equivalent of NaN
dates Link to heading
pd.date_range('1/1/2011', periods=5, freq='M')
# output is a DatetimeIndex; M stands for month-end frequency
some offset aliases are listed below
| Alias | Description |
|---|---|
| B | business day frequency |
| D | calendar day frequency |
| W | weekly frequency |
| M | month end frequency |
| SM | semi-month end frequency |
| BM | business month end frequency |
| MS | month start frequency |
| SMS | semi-month start frequency |
| BMS | business month start frequency |
| Q | quarter end frequency |
| BQ | business quarter end frequency |
| QS | quarter start frequency |
| BQS | business quarter start frequency |
| A | annual(Year) end frequency |
| BA | business year end frequency |
| BAS | business year start frequency |
| BH | business hour frequency |
| H | hourly frequency |
| T, min | minutely frequency |
| S | secondly frequency |
| L, ms | milliseconds |
| U, us | microseconds |
| N | nanoseconds |
Timedelta Link to heading
import pandas as pd
pd.Timedelta('2 days 2 hours 15 minutes 30 seconds')
# 2 days 02:15:30
pd.Timedelta(6, unit='h')
# 0 days 06:00:00
pd.Timedelta(days=2)
date objects can also be added to or subtracted from other date objects
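a minimal sketch of that arithmetic (the dates are illustrative):

```python
import pandas as pd

start = pd.Timestamp('2020-01-01')
delta = pd.Timedelta(days=2, hours=6)

# Timestamps and Timedeltas support +/- arithmetic
print(start + delta)
print(start - delta)

# subtracting two Timestamps yields a Timedelta
print(pd.Timestamp('2020-01-03') - start)
```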
Categorical data Link to heading
Visualization Link to heading
basic plot Link to heading
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(10, 4), index=pd.date_range('1/1/2000', periods=10), columns=list('ABCD'))
df.plot()
bar plot Link to heading
df.plot.bar(stacked=True)
df.plot.barh(stacked=True)
histograms Link to heading
df.plot.hist(bins=20)
df.diff().hist(bins=20)
box plots Link to heading
df.plot.box() # df can be Series/DataFrame
area plots Link to heading
df.plot.area() # df can be Series/DataFrame
scatter plots Link to heading
df.plot.scatter(x='cola', y='colb') # DataFrame
pie charts Link to heading
df.plot.pie(subplots=True)
I/O Link to heading
pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer',
names=None, index_col=None, usecols=None, dtype={'colx': np.numeric_type}, skiprows=[<row_nums>])
one can change the dtype on the fly while reading the csv, change header names, and skip rows by row number
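a self-contained sketch using an in-memory buffer in place of a file (the CSV content is made up):

```python
import pandas as pd
from io import StringIO

# a small in-memory CSV standing in for a file on disk
csv = StringIO("junk line\nid,name\n1,Tom\n2,Jack\n")

# skip the junk first row, and force id to float on the fly
df = pd.read_csv(csv, skiprows=[0], dtype={'id': float})
print(df)
```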
sparse data Link to heading
Sparse objects are “compressed” when any data matching a specific value (NaN / missing value, though any value can be chosen) is omitted.
the to_sparse() method creates sparse objects, which exist for memory-efficiency reasons.
Any sparse object can be converted back to the standard dense form by calling to_dense()
sparse and dense objects have the same dtype
other functions Link to heading
- any(): return whether any element is True.
- isin(): whether each element in the DataFrame is contained in values
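both in one short sketch (the frame is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 0], 'b': [0, 0]})

# any(): per column, is any element truthy?
print(df.any())

# isin(): elementwise membership test against a list of values
print(df.isin([1]))
```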
pandas vs. SQL Link to heading
top N Link to heading
SELECT total_bill, tip, smoker, time
FROM tips
LIMIT 5;
in pandas
tips[['total_bill', 'tip', 'smoker', 'time']].head(5)
where clause Link to heading
SELECT * FROM tips WHERE time = 'Dinner' LIMIT 5;
in pandas
tips[tips['time'] == 'Dinner'].head(5)
group by Link to heading
SELECT sex, count(*)
FROM tips
GROUP BY sex;
in pandas
tips.groupby('sex').size()