Data exploration and manipulation is the basic building block for data science. We cannot do this without making selections in our table. At first, it was very confusing and took some time for me to get hang of making selections in Pandas DataFrame. Pandas provided different options for selecting rows and columns in a DataFrame i.e. loc(), iloc(). I will discuss these options in this article and will work on some examples.
I am using the Titanic dataset for this exercise which can be downloaded from this Kaggle Competition Page.
We can read the dataset using pandas read_csv() function. Let’s first read the dataset and store it as a table or DataFrame.
#importing library and dataset import pandas as pd data = pd.read_csv('../train.csv') data.head()
We have imported the train.csv and stored it in a DataFrame named as data. We also looked into the top five rows by using df.head() function. We can see that it has twelve columns. Also, we can check the structure of any DataFrame by using df.shape function. It will give us no of rows and columns of that DataFrame.
1. Selecting a single column
We will select a single column i.e. ‘Name’ from this pandas DataFrame.
#select the name column name = data['Name'] type(name) name.head()
type(variable) gives us the datatype of the variable. df[column_name] gives a series as the output. If we want our selection to give output as a DataFrame, we can change it in the following way:-
#select the name column but result should be a dataframe name= data[['Name']] type(name) name.head()
We can check that in this case result of our selection is a DataFrame.
2. Selecting more than one column
In most of the cases, we will need to make a selection involving many columns. So, let’s select ‘Name’ and ‘Sex’ column and save the result in a different DataFrame.
#select name and sex of the passengers name = data[['Name', 'Sex']] name.head()
‘name’ is a DataFrame consisting of two columns only i.e. ‘Name’ and ‘Sex’.
3. Making selection based on the condition on any column
We will extract all the records from the data table of male passengers and will store it in another table.
#select all the record of male passengers male_record = data[(data['Sex']=='male')] male_record.head()
‘male_record’ will have all the records for male passengers.
We can also use more that one condition for selecting data. Let’s extract all the data for 20 years or older male passengers.
#select all the rows where passengers is male and is 20 years or older male_record = data[(data['Sex']=='male')&(data['Age']>= 20)] male_record.head()
‘male_record’ contains all the records where Sex is male and Age is more than or equal to 20.
What if we want to find out all the records where Age is null.
#select all the rows where age is null age_null = data[(data['Age'].isnull())] age_null.head()
‘age_null’ has all the records where age is null. We have used isnull() function for this.
Let’s find out all the records where Cabin is not null.
#select all the rows where cabin is not null cabin_value = data[(data['Cabin'].notnull())] cabin_value.head()
‘cabin_value’ contains all the rows where there is some value and it is not null. We have used notnull() function for this.
4. Exploring .iloc methods
Pandas has a df.iloc method which we can use to select rows and columns by the order in which they appear in the data frame. It takes two arguments where one is to specify rows and other is to specify columns.
You can find the total number of rows present in any DataFrame by using df.shape.
Let’s use df.iloc to select the first row from the table.
#select the first row data.iloc
We have only passed only one argument instead of two arguments. For the column reference, it takes all the column as the default value. Using df.iloc in this way gives output as a series.
We can also give the negative reference for rows position. -1 will refer to the last row.
Extract the last row from the data table by using negative reference in df.iloc.
#select the last row data.iloc[-1]
It also gives the output as a series. We can change it to get the output as a DataFrame.
#Extracting first row as a DataFrame data.iloc[]
This gives the first row as a DataFrame.
We can also use range function as an argument in df.iloc for selecting continuous rows from the table.
#select top 10 rows data.iloc[0:11]
0:11 gives the reference for rows from 0 to 10 and then df.iloc selects these rows and all the columns.
We can also extract particular rows by referencing it using a list.
#select first second, fourth and tenth rows data.iloc[[0,1,3,9]]
We are extracting first, second, fourth and tenth rows from the table. As python reference starts from 0, so for nth rows reference will be n-1.
We have worked on extracting required rows from the table. Now, we will work on selecting columns from the table.
We can select columns by passing the column reference as the second argument in the df.iloc function.
Let’s select all the values of the first column. As mentioned before, we can reference the first column by 0. We are using ‘:’ as our row reference which means all the rows here.
#select first column data.iloc[:,0]
As we are selecting only one column, it is giving output as a series.
We can use the column reference argument to reference more than one column. We can use range function to refer continuous columns.
#select starting three columns data.iloc[:,0:3]
Here, we use 0:3 to refer first, second and third columns. We are still selecting all the rows. We can also refer particular columns by its position in the list. As with the rows reference, n-1 will refer to the nth column. We are selecting first, third and fifth columns by passing [0, 2, 4] as column reference argument.
#select first, thrid and fifth columns data.iloc[:,[0,2,4]]
Now, we can combine both row and column reference together to access any particular cell or group of cells. We can use [0,0] to access the first cell or data point in the table.
#we can access any particluar cell using .iloc data.iloc[0,0]
Selecting data in the fourth and fifth column in the first row of the table by passing 3:6.
#Select 4th to 6th column in the first row data.iloc[0,3:6]
We can also pass range function is both row and column argument to select any particular subset. We are selecting data from first, second and third rows of the fourth and fifth columns.
#select 4th to 6th column for the first three row data.iloc[0:3,3:6]
So, we can select a subsection of data by passing range function in both rows and columns. Now, we will pass a list of columns position to access particular columns.
#select 2nd, 4th and 7th columns for the first three row data.iloc[0:3,[1,3,6]]
And a list of rows references with a list of columns references to select data from needed rows and columns.
#select 2nd, 4th and 7th columns for the first, three and fifth row data.iloc[[0,2,4],[1,3,6]]
As mentioned before, if we are selecting a single row output can be series. If we want DataFrame we can reference that row like this:
#df.iloc will return a pandas series if you are selecting only one row. #It will return a pandas dataframe when multiple rows are selected. type(data.iloc) type(data.iloc[])[/py]
Same also happens while selecting one column.
df.iloc only takes positional reference. If you try to pass the column name as the reference, it will throw an error. You can try the below example and check the error message.
#we cannot use column name here # data.iloc[0,'Sex'] # above will throw an error[/py]
5. Exploring .loc method
Pandas has another function i.e. df.loc for selecting data from DataFrames or table. df.iloc takes the positional references as the argument input while df.loc takes indexes as the argument. As df.loc takes indexes, we can pass strings as an argument whereas it will through an error if used with df.iloc.
Selecting data from the row where the index is equal to zero.
#select the rows where index = 0 data.loc[/py]
It behaves the same as df.iloc and gives a single row as series. We can change it so that it gives single row as a DataFrame by changing the way we pass the argument.
#select the rows where index=0 and get the output in a dataframe data.loc[]
Let’s use a range function to pass the row indexes.
#select all the rows where index is less than or equal to 10 data.loc[0:11]
We can also pass it a list of indexes to select required indexes.
#select all the rows where index is equal to 0, 5, 10, 20, 50 data.loc[[0,5,10,20,50]]
Unlike df.iloc, it takes the column name as column argument. So, we can pass it a column name to select data from that column.
Selecting all the data from the ‘Name’ column.
#select the cell where index = 0 and column = Name data.loc[0,'Name'][/py]
We can also pass multiple column names in a list.
Selecting all the data from the ‘Name’, ‘Sex’ and ‘Ticket’ columns.
#select the cell where index = 0 and columns are Name, Sex, Ticket data.loc[0,['Name', 'Sex', 'Ticket']][/py]
Selecting data from the ‘Name’, ‘Sex’ and ‘Ticket’ columns where the index is from 0 to 10.
#select the cell where index in in the range from 0 to 11 and columns are Name, Sex, Ticket data.loc[0:11,['Name', 'Sex', 'Ticket']]
We can pass a list of indexes in row reference argument and a list of column names in column reference argument to sample data.
#select all the data where index is equal to 0,5,10,20,50 and column names are Name, Sex, Ticket data.loc[[0,5,10,20,50], ['Name', 'Sex', 'Ticket']]
We can also use range function with column names. Here, ‘Name’:’Ticket’ will give the name of all the columns between the ‘Name’ column and the ‘Ticket’ column. This will also include ‘Name’ and ‘Tiger’ columns.
#select all the data where index is in the range of 0 to 11 and all the columns from name to ticket data.loc[0:11,'Name':'Ticket']
These are the basic selection techniques available in pandas library and are very essential in doing data exploration or data modeling. Hopefully, this post will help in making it clearer for you. You can also check pandas official document to explore other options or functionality available. If you want to practice these functions, you can check this Kaggle kernel.