Introduction to Pandas: A Beginner's Guide to Python Data

Python Library : Pandas

Pandas : An open-source data analysis library for Python. It builds on the power and speed of NumPy to make data analysis and EDA (Exploratory Data Analysis) a much easier task for any data scientist.

Analogy : Just as we clean the vegetables we bring home from the market before cooking them, pandas is used to clean the raw data before analysing it.

Feature means : a column

Record means : a row

Importing pandas and numpy libraries :

import pandas as pd
import numpy as np

Importing Dataset from local directory

data = pd.read_csv("/train.csv")
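If train.csv is not available locally, the same call can be tried on a small in-memory sample. The sketch below is only an illustration: the three rows and the reduced set of columns are made up, and io.StringIO stands in for the file on disk.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for train.csv (not the real file)
csv_text = """PassengerId,Survived,Age,Cabin
1,0,22.0,
2,1,38.0,C85
3,1,,
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample)        # empty fields are read as NaN
print(sample.shape)  # (3, 4)
```

Empty fields in the CSV text become NaN in the DataFrame, which is what the null-value count later in this post relies on.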

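Dataset data type : DataFrame, which is basically a tabular (row and column) form of data

type(data) # data structure type

Output : pandas.core.frame.DataFrame

Preview of the first five records :

data.head() # first 5 rows by default

Output :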

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

How many null values are present in each column of the dataset

data.isnull().sum() # counts the null values in every feature (column)

Output :

PassengerId	0
Survived	0
Pclass	0
Name	0
Sex	0
Age	177
SibSp	0
Parch	0
Ticket	0
Fare	0
Cabin	687
Embarked	2

dtype: int64
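To see why isnull().sum() gives a per-column count, it helps to run it on a tiny frame. This is a minimal sketch with made-up values; the names toy, null_counts, null_share and total_nulls are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing values
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# isnull() gives a True/False frame; sum() counts the True values per column
null_counts = toy.isnull().sum()

# mean() of the same True/False frame gives the fraction of missing values
null_share = toy.isnull().mean()

# Summing the per-column counts gives the total for the whole frame
total_nulls = int(null_counts.sum())

print(null_counts)
print(null_share)
print(total_nulls)  # 5
```

The fraction is often more useful than the raw count when deciding whether a column such as Cabin is worth keeping.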

Shape of the dataset : number of rows and columns

data.shape # gives (rows, columns)

Output :

(891, 12) #891 records, 12 features
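Because shape is an ordinary tuple, it can be unpacked directly. A short sketch on a made-up three-row frame (the name toy is illustrative):

```python
import pandas as pd

# Hypothetical 3-row, 2-column frame used only for illustration
toy = pd.DataFrame({"name": ["Priya", "Nikhil", "Sanjay"], "age": [25, 25, 37]})

n_rows, n_cols = toy.shape  # shape is a plain (rows, columns) tuple
print(n_rows, n_cols)       # 3 2
print(len(toy))             # 3, len() counts rows only
print(toy.size)             # 6, total number of cells (rows * columns)
```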

Dictionary to DataFrame data type

Creating Dictionary :

dic1 = {"name": ["Priya", "Nikhil", "Sanjay"], "age": [25, 25, 37], "city": ["chandigarh", "Bengaluru", "Delhi"]}

Print dic1 dictionary:

dic1

Output :

{'name': ['Priya', 'Nikhil', 'Sanjay'],
 'age': [25, 25, 37],
 'city': ['chandigarh', 'Bengaluru', 'Delhi']}

Converting dictionary to DataFrame

dataFrame = pd.DataFrame(dic1) #convert dic1 to DataFrame

Printing dataFrame

dataFrame

Output :

	name	age	city
0	Priya	25	chandigarh
1	Nikhil	25	Bengaluru
2	Sanjay	37	Delhi
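The dictionary above is column-oriented: each key is a column and each list holds that column's values. As a hedged aside, the same table can also be built from row-oriented data, a list with one dictionary per record. The name records below is illustrative only.

```python
import pandas as pd

# Column-oriented: one key per feature
dic1 = {
    "name": ["Priya", "Nikhil", "Sanjay"],
    "age": [25, 25, 37],
    "city": ["chandigarh", "Bengaluru", "Delhi"],
}

# Row-oriented: one dictionary per record (same data, different layout)
records = [
    {"name": "Priya", "age": 25, "city": "chandigarh"},
    {"name": "Nikhil", "age": 25, "city": "Bengaluru"},
    {"name": "Sanjay", "age": 37, "city": "Delhi"},
]

from_columns = pd.DataFrame(dic1)
from_rows = pd.DataFrame(records)

# Both constructions produce the same DataFrame
print(from_columns.equals(from_rows))  # True
```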

List of all features present in a given dataset

data.columns

Output :

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

Data type of every column

data.dtypes # data type of every feature

Output :

PassengerId	int64
Survived	int64
Pclass	int64
Name	object
Sex	object
Age	float64
SibSp	int64
Parch	int64
Ticket	object
Fare	float64
Cabin	object
Embarked	object

dtype: object
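Once the dtypes are known, they can be used to pick columns or to convert them. The following is a small sketch on a three-row frame; the name toy is illustrative, and select_dtypes and astype are the standard pandas methods for these two tasks.

```python
import pandas as pd

# Small illustrative frame with one text column and two numeric columns
toy = pd.DataFrame({
    "Name": ["Braund", "Cumings", "Heikkinen"],
    "Age": [22.0, 38.0, 26.0],
    "Pclass": [3, 1, 3],
})

numeric = toy.select_dtypes(include="number")      # keeps Age and Pclass
non_numeric = toy.select_dtypes(exclude="number")  # keeps Name

# astype returns a converted copy; the original column is left untouched
age_as_int = toy["Age"].astype("int64")

print(list(numeric.columns))      # ['Age', 'Pclass']
print(list(non_numeric.columns))  # ['Name']
print(age_as_int.dtype)           # int64
```

Note that converting Age to an integer type only works here because the toy column has no missing values; a column containing NaN would raise an error with plain int64.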

Summary statistics of the numerical features in the dataset

data.describe() # count, mean, std, min, quartiles and max of every numerical feature

Output :

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

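info() and describe() are easy to mix up, so here is a hedged side-by-side sketch on a made-up frame. info() reports column names, non-null counts and dtypes, while describe() returns the statistics table shown above. The names toy, buffer, info_text and summary are illustrative.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical frame with one missing age
toy = pd.DataFrame({
    "name": ["Priya", "Nikhil", "Sanjay"],
    "age": [25.0, np.nan, 37.0],
    "city": ["chandigarh", "Bengaluru", "Delhi"],
})

# info() writes its report to a stream and returns None,
# so it is captured in a buffer here to keep it as a string
buffer = io.StringIO()
toy.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)

# describe() returns a DataFrame of statistics, numeric columns only by default
summary = toy.describe()
print(summary.loc[["count", "mean"]])
```

The count row skips missing values, which is why Age shows 714 rather than 891 in the Titanic output.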

Series and DataFrame Data Structure

Data type of a single column:

type(data['Age']) # data type of the 'Age' column; it is a Series because a single column is one-dimensional

Output : pandas.core.series.Series

Contents of data['Age'] :

	Age
0	22.0
1	38.0
2	26.0
3	35.0
4	35.0
...	...
886	27.0
887	19.0
888	NaN
889	26.0
890	32.0

891 rows × 1 columns

Data type of a selection of several columns:

type(data[["Name", "Age", "Ticket"]]) # data type is DataFrame, because selecting a list of columns gives two-dimensional data

Output : pandas.core.frame.DataFrame

Contents of data[["Name", "Age", "Ticket"]] :

	Name	Age	Ticket
0	Braund, Mr. Owen Harris	22.0	A/5 21171
1	Cumings, Mrs. John Bradley (Florence Briggs Th...	38.0	PC 17599
2	Heikkinen, Miss. Laina	26.0	STON/O2. 3101282
3	Futrelle, Mrs. Jacques Heath (Lily May Peel)	35.0	113803
4	Allen, Mr. William Henry	35.0	373450
...	...	...	...
886	Montvila, Rev. Juozas	27.0	211536
887	Graham, Miss. Margaret Edith	19.0	112053
888	Johnston, Miss. Catherine Helen "Carrie"	NaN	W./C. 6607
889	Behr, Mr. Karl Howell	26.0	111369
890	Dooley, Mr. Patrick	32.0	370376

891 rows × 3 columns
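What decides between Series and DataFrame is the kind of brackets, not the number of columns. A minimal sketch on a two-row frame (the names toy, one_col_series and one_col_frame are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({"Name": ["Braund", "Cumings"], "Age": [22.0, 38.0]})

one_col_series = toy["Age"]   # single brackets with a name: Series, 1-dimensional
one_col_frame = toy[["Age"]]  # a list of names, even just one: DataFrame, 2-dimensional

print(type(one_col_series).__name__, one_col_series.ndim, one_col_series.shape)
# Series 1 (2,)
print(type(one_col_frame).__name__, one_col_frame.ndim, one_col_frame.shape)
# DataFrame 2 (2, 1)
```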

Difference between loc and iloc:

Row indexing with loc (label-based) : preview only the records labelled 0 and 4, with all features

data.loc[[0,4]]

Output :

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.25	NaN	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.05	NaN	S

Specific row and column indexing with iloc (position-based) : select records and features by integer position ranges. The end of each range is excluded, so 2:5 gives rows 2 to 4 and 0:4 gives the first four columns.

data.iloc[2:5, 0:4]

Output :

	PassengerId	Survived	Pclass	Name
2	3	1	3	Heikkinen, Miss. Laina
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)
4	5	0	3	Allen, Mr. William Henry
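The two examples above can look interchangeable because the default index labels happen to equal the row positions. The sketch below, on a made-up four-row frame (the fourth person is hypothetical), shows where loc and iloc actually differ: loc slices by label and includes the end label, while iloc slices by position and excludes the end.

```python
import pandas as pd

# Hypothetical frame with the default index 0, 1, 2, 3
df = pd.DataFrame({
    "name": ["Priya", "Nikhil", "Sanjay", "Asha"],
    "age": [25, 25, 37, 41],
})

by_label = df.loc[1:3]      # labels 1, 2 and 3: the end label is included
by_position = df.iloc[1:3]  # positions 1 and 2: the end position is excluded

print(len(by_label), len(by_position))  # 3 2

# With a non-numeric index the difference is easier to see
by_name = df.set_index("name")
print(by_name.loc["Sanjay", "age"])  # 37, looked up by row label and column label
print(by_name.iloc[2, 0])            # 37, looked up by row position and column position
```

A rough rule of thumb: reach for loc when you know the names of the rows or columns, and for iloc when you only care about their order.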
