Drop Duplicates Pandas
pandas.DataFrame.drop_duplicates
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
- Return DataFrame with duplicate rows removed.
- Considering certain columns is optional. Indexes, including time indexes, are ignored.
import pandas as pd

students = {
    "name": ["Adam", "Natalie", "Henry", "Adam", "Natalie"],
    "age": [20, 21, 19, 20, 21],
    "marks": [85.10, 77.80, 91.54, 85.10, 77.80],
}

# Create DataFrame from dict
studentDataFrame = pd.DataFrame(students)
print("Before dropping duplicates: \n", studentDataFrame)

# Drop rows that repeat an earlier row in every column
afterDrop = studentDataFrame.drop_duplicates()
print("\nAfter dropping duplicates: \n", afterDrop)

Output:

Before dropping duplicates:
      name  age  marks
0     Adam   20  85.10
1  Natalie   21  77.80
2    Henry   19  91.54
3     Adam   20  85.10
4  Natalie   21  77.80

After dropping duplicates:
      name  age  marks
0     Adam   20  85.10
1  Natalie   21  77.80
2    Henry   19  91.54
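The signature above also lists ignore_index, which the example does not exercise. The following is a minimal sketch, assuming pandas 1.0 or later (where ignore_index is available); the variable names are only illustrative. It uses keep='last' so that the surviving rows carry non-consecutive labels, which makes the effect of renumbering visible.

```python
import pandas as pd

students = {
    "name": ["Adam", "Natalie", "Henry", "Adam", "Natalie"],
    "age": [20, 21, 19, 20, 21],
    "marks": [85.10, 77.80, 91.54, 85.10, 77.80],
}
student_df = pd.DataFrame(students)

# keep="last" retains the final occurrence of each duplicate,
# so the surviving rows keep their original labels 2, 3 and 4
with_labels = student_df.drop_duplicates(keep="last")

# ignore_index=True relabels the result 0, 1, ..., n - 1
renumbered = student_df.drop_duplicates(keep="last", ignore_index=True)

print(with_labels.index.tolist())
print(renumbered.index.tolist())
```

Renumbering can be convenient when later code relies on positional labels; keeping the original labels makes it easier to trace rows back to the source data.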
Drop duplicates from defined columns
- By default, DataFrame.drop_duplicates() removes rows that have the same values in all columns. We can change this behavior with the subset parameter, which limits the comparison to the listed columns.
import pandas as pd

students = {
    "name": ["Jonny", "Natalie", "Henry", "Ben"],
    "age": [20, 21, 19, 21],
    "marks": [85.10, 77.80, 91.54, 77.80],
}

# Create DataFrame from dict
studentDataFrame = pd.DataFrame(students)
print("Before dropping duplicates: \n", studentDataFrame)

# Drop rows whose 'age' and 'marks' both match an earlier row
afterDrop = studentDataFrame.drop_duplicates(subset=['age', 'marks'])
print("\nAfter dropping duplicates: \n", afterDrop)

Output:

Before dropping duplicates:
      name  age  marks
0    Jonny   20  85.10
1  Natalie   21  77.80
2    Henry   19  91.54
3      Ben   21  77.80

After dropping duplicates:
      name  age  marks
0    Jonny   20  85.10
1  Natalie   21  77.80
2    Henry   19  91.54
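The choice of columns in subset decides which rows count as duplicates. As a small sketch on the same data (variable names are illustrative), subset can be a single column label or a list of labels, and adding a column that differs between rows can leave nothing to drop.

```python
import pandas as pd

student_df = pd.DataFrame({
    "name": ["Jonny", "Natalie", "Henry", "Ben"],
    "age": [20, 21, 19, 21],
    "marks": [85.10, 77.80, 91.54, 77.80],
})

# A single label: rows are duplicates when "age" alone repeats,
# so Ben (age 21, same as Natalie) is dropped
by_age = student_df.drop_duplicates(subset="age")

# Including "name" makes every row unique, so all four rows remain
by_name_and_age = student_df.drop_duplicates(subset=["name", "age"])

print(by_age["name"].tolist())
print(len(by_name_and_age))
```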
Drop duplicates but keep last
To control which occurrence of a duplicate row is kept, we can use the keep parameter of DataFrame.drop_duplicates(). It accepts the following values:
- first – Drop duplicates except for the first occurrence of the duplicate row. This is the default behavior.
- last – Drop duplicates except for the last occurrence of the duplicate row.
- False – Drop all the rows which are duplicate.
import pandas as pd

students = {
    "name": ["Jonny", "Natalie", "Henry", "Natalie"],
    "age": [20, 21, 19, 21],
    "marks": [85.10, 77.80, 91.54, 77.80],
}

# Create DataFrame from dict
studentDataFrame = pd.DataFrame(students)
print("Before dropping duplicates: \n", studentDataFrame)

# Drop duplicate rows, keeping the last occurrence of each
afterDrop = studentDataFrame.drop_duplicates(keep='last')
print("\nAfter dropping duplicates: \n", afterDrop)

Output:

Before dropping duplicates:
      name  age  marks
0    Jonny   20  85.10
1  Natalie   21  77.80
2    Henry   19  91.54
3  Natalie   21  77.80

After dropping duplicates:
      name  age  marks
0    Jonny   20  85.10
2    Henry   19  91.54
3  Natalie   21  77.80
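To see the three keep options side by side, the sketch below runs each one on the same data and records which row labels survive. The dictionary name surviving is only illustrative.

```python
import pandas as pd

student_df = pd.DataFrame({
    "name": ["Jonny", "Natalie", "Henry", "Natalie"],
    "age": [20, 21, 19, 21],
    "marks": [85.10, 77.80, 91.54, 77.80],
})

# Map each keep option to the index labels of the rows it retains
surviving = {}
for keep in ("first", "last", False):
    surviving[keep] = student_df.drop_duplicates(keep=keep).index.tolist()
    print(f"keep={keep!r}: rows {surviving[keep]}")
```

Rows 1 and 3 are the duplicate pair, so 'first' retains row 1, 'last' retains row 3, and False retains neither.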
Drop all duplicates
- By default, DataFrame.drop_duplicates() keeps the first occurrence of each duplicate row and removes all others.
- If we need to drop all the duplicate rows, we can pass keep=False.
import pandas as pd

students = {
    "name": ["Jonny", "Natalie", "Henry", "Natalie"],
    "age": [20, 21, 19, 21],
    "marks": [85.10, 77.80, 91.54, 77.80],
}

# Create DataFrame from dict
studentDataFrame = pd.DataFrame(students)
print("Before dropping duplicates: \n", studentDataFrame)

# Drop every row that has a duplicate, keeping none of them
afterDrop = studentDataFrame.drop_duplicates(keep=False)
print("\nAfter dropping duplicates: \n", afterDrop)

Output:

Before dropping duplicates:
      name  age  marks
0    Jonny   20  85.10
1  Natalie   21  77.80
2    Henry   19  91.54
3  Natalie   21  77.80

After dropping duplicates:
    name  age  marks
0  Jonny   20  85.10
2  Henry   19  91.54
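Before discarding rows with keep=False, it can help to look at what would be removed. One way to do this, sketched here with illustrative variable names, is DataFrame.duplicated(keep=False), which returns a boolean mask that marks every row belonging to a duplicate group.

```python
import pandas as pd

student_df = pd.DataFrame({
    "name": ["Jonny", "Natalie", "Henry", "Natalie"],
    "age": [20, 21, 19, 21],
    "marks": [85.10, 77.80, 91.54, 77.80],
})

# True for every row that has at least one identical twin
mask = student_df.duplicated(keep=False)

removed = student_df[mask]      # rows that keep=False would drop
remaining = student_df[~mask]   # rows that keep=False would retain

print(removed.index.tolist())
print(remaining.index.tolist())
```

Selecting with the inverted mask gives the same rows as drop_duplicates(keep=False), while the mask itself shows which rows were affected.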