Introduction to Pandas
Python Library : Pandas
PANDAS : An open-source data analysis library for Python. It builds on the speed of NumPy arrays to make data analysis and EDA (Exploratory Data Analysis) much easier for a data scientist.
Analogy : Just as we clean vegetables bought from the market before cooking them, pandas is used to clean raw data before it is analysed.
Feature means : Column
Record means : Row
Importing pandas and numpy libraries :
import pandas as pd
import numpy as np
Importing Dataset from local directory
data = pd.read_csv("/train.csv")
Dataset Datatype : DataFrame, which is a tabular (row and column) form of data
type(data) # data structure type
Output : pandas.core.frame.DataFrame
Preview of the first five records:
data.head()
Output :
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
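The loading step can be tried without the real file. The sketch below is a minimal illustration, assuming a hypothetical three-row CSV string (`csv_text`) with a few of the columns shown above; it stands in for train.csv and is not the actual dataset.

```python
import io

import pandas as pd

# Hypothetical sample standing in for train.csv (only a few columns and rows)
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22.0
2,1,1,female,38.0
3,1,3,female,26.0
"""

# pd.read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample))    # class of the loaded object
print(sample.head(2))  # first two rows; head() shows five rows by default
```

Passing a number to head() is handy when five rows are more than you need for a quick check.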
Number of null values in each feature of the dataset
data.isnull().sum() # counts the null values in each feature (column)
Output :
| | 0 |
| PassengerId | 0 |
| Survived | 0 |
| Pclass | 0 |
| Name | 0 |
| Sex | 0 |
| Age | 177 |
| SibSp | 0 |
| Parch | 0 |
| Ticket | 0 |
| Fare | 0 |
| Cabin | 687 |
| Embarked | 2 |
dtype: int64
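To see why summing works here, note that isnull() returns a table of True/False values and sum() counts each True as 1. A minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing entries
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

null_counts = toy.isnull().sum()   # missing values per column
null_share = toy.isnull().mean()   # fraction of missing values per column

print(null_counts)
print(null_share)
```

The fraction is often more useful than the raw count when deciding whether a column such as Cabin is worth keeping.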
Shape of the dataset : number of rows and columns
data.shape # gives (rows, columns)
Output :
(891, 12) #891 records, 12 features
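Because shape is a plain tuple, it can be unpacked into two variables. A small sketch, assuming a hypothetical toy frame rather than the Titanic data:

```python
import pandas as pd

# Hypothetical toy frame: 3 records, 2 features
toy = pd.DataFrame({"Pclass": [3, 1, 3], "Fare": [7.25, 71.2833, 7.925]})

n_records, n_features = toy.shape  # shape is an attribute, so no parentheses
print(n_records, n_features)       # 3 2
print(len(toy))                    # len() gives the number of records only: 3
```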
Dictionary to DataFrame data type
Creating Dictionary :
dic1 = {"name": ["Priya", "Nikhil", "Sanjay"], "age": [25, 25, 37], "city": ["chandigarh", "Bengaluru", "Delhi"]}
Print dic1 dictionary:
dic1
Output :
{'name': ['Priya', 'Nikhil', 'Sanjay'],
'age': [25, 25, 37],
'city': ['chandigarh', 'Bengaluru', 'Delhi']}
Converting dictionary to DataFrame
dataFrame = pd.DataFrame(dic1) #convert dic1 to DataFrame
Printing dataFrame
dataFrame
Output :
| | name | age | city |
| 0 | Priya | 25 | chandigarh |
| 1 | Nikhil | 25 | Bengaluru |
| 2 | Sanjay | 37 | Delhi |
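In the conversion above, each dictionary key becomes a column and each list becomes that column's values. As a hedged aside, the same table can also be built row by row from a list of dictionaries; the `records` list below is an illustrative restatement of dic1, not part of the original notebook.

```python
import pandas as pd

# Column-oriented: one key per feature (same dic1 as above)
dic1 = {"name": ["Priya", "Nikhil", "Sanjay"],
        "age": [25, 25, 37],
        "city": ["chandigarh", "Bengaluru", "Delhi"]}
by_columns = pd.DataFrame(dic1)

# Row-oriented: one dictionary per record (illustrative alternative)
records = [
    {"name": "Priya", "age": 25, "city": "chandigarh"},
    {"name": "Nikhil", "age": 25, "city": "Bengaluru"},
    {"name": "Sanjay", "age": 37, "city": "Delhi"},
]
by_rows = pd.DataFrame(records)

print(by_rows)
```

Both forms give three records and three features; the column-oriented form requires every list to have the same length.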
List of all features present in a given dataset
data.columns
Output :
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
Datatype of every column
data.dtypes # datatype of every feature
Output :
| | 0 |
| PassengerId | int64 |
| Survived | int64 |
| Pclass | int64 |
| Name | object |
| Sex | object |
| Age | float64 |
| SibSp | int64 |
| Parch | int64 |
| Ticket | object |
| Fare | float64 |
| Cabin | object |
| Embarked | object |
dtype: object
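Once the datatypes are known, features can be split into numerical and non-numerical groups. The sketch below uses select_dtypes on a hypothetical toy frame; the column names mirror the dataset but the frame itself is made up.

```python
import pandas as pd

# Hypothetical toy frame with integer, text, and float features
toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Name": ["Braund", "Cumings", "Heikkinen"],
    "Fare": [7.25, 71.2833, 7.925],
})

numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)  # ['Pclass', 'Fare']
print(other_cols)    # ['Name']
```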
Summary statistics of the dataset with respect to numerical features
data.describe() # count, mean, std, min, quartiles and max of numerical features
Output :
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
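The statistics table above comes from describe(), which is easy to confuse with info(). A minimal sketch of the difference, assuming a hypothetical toy frame with one missing Age value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one numerical and one text feature
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Sex": ["male", "female", "female", "female"],
})

summary = toy.describe()  # statistics for numerical features only
print(summary)

toy.info()  # prints column names, non-null counts and datatypes
```

Note that count in describe() ignores missing values, so here Age has a count of 3, not 4. Both are methods and need parentheses; writing data.info without them only refers to the method and does not run it.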
Series and DataFrame Data Structure
Data type of a single selected column:
type(data['Age']) # data type is Series, because a single column is one-dimensional
Output : pandas.core.series.Series
| | Age |
| 0 | 22.0 |
| 1 | 38.0 |
| 2 | 26.0 |
| 3 | 35.0 |
| 4 | 35.0 |
| ... | ... |
| 886 | 27.0 |
| 887 | 19.0 |
| 888 | NaN |
| 889 | 26.0 |
| 890 | 32.0 |
891 rows × 1 columns
Data type of a selection of several columns:
type(data[["Name", "Age", "Ticket"]]) # data type is DataFrame, because selecting with a list of columns returns two-dimensional data
Output : pandas.core.frame.DataFrame
| | Name | Age | Ticket |
| 0 | Braund, Mr. Owen Harris | 22.0 | A/5 21171 |
| 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 38.0 | PC 17599 |
| 2 | Heikkinen, Miss. Laina | 26.0 | STON/O2. 3101282 |
| 3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 35.0 | 113803 |
| 4 | Allen, Mr. William Henry | 35.0 | 373450 |
| ... | ... | ... | ... |
| 886 | Montvila, Rev. Juozas | 27.0 | 211536 |
| 887 | Graham, Miss. Margaret Edith | 19.0 | 112053 |
| 888 | Johnston, Miss. Catherine Helen "Carrie" | NaN | W./C. 6607 |
| 889 | Behr, Mr. Karl Howell | 26.0 | 111369 |
| 890 | Dooley, Mr. Patrick | 32.0 | 370376 |
891 rows × 3 columns
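The number of brackets decides the result, not the number of columns: single brackets give a Series, while a list inside the brackets gives a DataFrame even for one column. A small sketch on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({
    "Name": ["Braund", "Cumings", "Heikkinen"],
    "Age": [22.0, 38.0, 26.0],
})

age_series = toy["Age"]    # single brackets: one-dimensional Series
age_frame = toy[["Age"]]   # list of columns: two-dimensional DataFrame

print(age_series.ndim, age_series.shape)  # 1 (3,)
print(age_frame.ndim, age_frame.shape)    # 2 (3, 1)
```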
Difference between loc and iloc: loc selects by index label, iloc selects by integer position
Row indexing with loc : preview only the records labelled 0 and 4, with all features
data.loc[[0,4]]
Output :
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
Row and column indexing with iloc : select records and features by integer position ranges; the end position of each range is excluded
data.iloc[2:5, 0:4]
Output :
| | PassengerId | Survived | Pclass | Name |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry |
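With the default 0, 1, 2, ... index, labels and positions look the same, which hides the difference between loc and iloc. The sketch below makes it visible by assuming a hypothetical toy frame whose index labels are 10, 20, 30, 40.

```python
import pandas as pd

# Hypothetical toy frame with non-default index labels
toy = pd.DataFrame(
    {"Name": ["Braund", "Cumings", "Heikkinen", "Futrelle"],
     "Age": [22.0, 38.0, 26.0, 35.0]},
    index=[10, 20, 30, 40],
)

# loc: label-based, and a label slice includes both endpoints
by_label = toy.loc[10:30, "Name":"Age"]

# iloc: position-based, and the end position is excluded
by_position = toy.iloc[0:2, 0:1]

print(by_label.shape)     # (3, 2)
print(by_position.shape)  # (2, 1)
```

Here toy.loc[0] would raise a KeyError because no row is labelled 0, while toy.iloc[0] returns the first row.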