Delete all duplicate rows across columns Python Pandas

Last Updated On Thursday 14th Oct 2021

pandas.DataFrame.drop_duplicates

	DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
	
  • Return DataFrame with duplicate rows removed.
  • Considering certain columns is optional. Indexes, including time indexes, are ignored.
	import pandas as pd

students = {
    "name": ["Adam", "Natalie", "Henry", "Adam", "Natalie"],
    "age": [20, 21, 19, 20, 21],
    "marks": [85.10, 77.80, 91.54, 85.10, 77.80],
}

# Create DataFrame from dict
studentDataFrame = pd.DataFrame(students)
print("Before dropping duplicates: \n", studentDataFrame)

# Drop rows that repeat an earlier row in every column
afterDrop = studentDataFrame.drop_duplicates()

print("\nAfter dropping duplicates: \n", afterDrop)
	

	Before dropping duplicates:

      name  age  marks
0     Adam   20  85.10
1  Natalie   21  77.80
2    Henry   19  91.54
3     Adam   20  85.10
4  Natalie   21  77.80

After dropping duplicates:

      name  age  marks
0     Adam   20  85.10
1  Natalie   21  77.80
2    Henry   19  91.54
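
The signature above also lists ignore_index and inplace, which the example does not exercise. The following is a minimal sketch of how they behave, assuming pandas 1.0 or later (where ignore_index is available); the string row labels "a" to "d" and the variable names are illustrative choices, not part of the original example. It also shows that row labels play no part in deciding what counts as a duplicate.

```python
import pandas as pd

students = {
    "name": ["Natalie", "Adam", "Natalie", "Henry"],
    "age": [21, 20, 21, 19],
    "marks": [77.80, 85.10, 77.80, 91.54],
}
# Every row label is distinct, but labels are ignored in the comparison.
df = pd.DataFrame(students, index=["a", "b", "c", "d"])

# Default: surviving rows keep their original labels.
kept_labels = df.drop_duplicates()
print(kept_labels.index.tolist())   # ['a', 'b', 'd']

# ignore_index=True relabels the result 0, 1, ..., n-1.
renumbered = df.drop_duplicates(ignore_index=True)
print(renumbered.index.tolist())    # [0, 1, 2]

# inplace=True modifies df itself and returns None.
result = df.drop_duplicates(inplace=True)
print(result, len(df))              # None 3
```

Because inplace=True returns None, assigning its result to a variable and then using that variable as a DataFrame is a common mistake; reassigning the returned copy is usually the safer pattern.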
	

Drop duplicates from defined columns

  • By default, DataFrame.drop_duplicates() removes rows that have the same values in every column. The subset parameter restricts the comparison to the listed columns.
	import pandas as pd

students = {
    "name": ["Jonny", "Natalie", "Henry", "Ben"],
    "age": [20, 21, 19, 21],
    "marks": [85.10, 77.80, 91.54, 77.80],
}

# Create DataFrame from dict
studentDataFrame = pd.DataFrame(students)
print("Before dropping duplicates: \n", studentDataFrame)

# Drop rows whose 'age' and 'marks' values repeat an earlier row
afterDrop = studentDataFrame.drop_duplicates(subset=['age', 'marks'])

print("\nAfter dropping duplicates: \n", afterDrop)
	

	Before dropping duplicates:

      name  age  marks
0    Jonny   20  85.10
1  Natalie   21  77.80
2    Henry   19  91.54
3      Ben   21  77.80

After dropping duplicates:

      name  age  marks
0    Jonny   20  85.10
1  Natalie   21  77.80
2    Henry   19  91.54
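
Before dropping rows by a subset of columns, it can help to see which rows would be removed. The sketch below uses DataFrame.duplicated(), the boolean-mask counterpart of drop_duplicates(), which this article does not otherwise cover; the variable names mask and by_age are illustrative. It also shows that subset accepts a single column label as well as a list.

```python
import pandas as pd

students = pd.DataFrame({
    "name": ["Jonny", "Natalie", "Henry", "Ben"],
    "age": [20, 21, 19, 21],
    "marks": [85.10, 77.80, 91.54, 77.80],
})

# True marks a row whose 'age' and 'marks' repeat an earlier row.
mask = students.duplicated(subset=["age", "marks"])
print(mask.tolist())                      # [False, False, False, True]

# Rows that drop_duplicates(subset=["age", "marks"]) would remove.
print(students[mask]["name"].tolist())    # ['Ben']

# subset also accepts a single column label.
by_age = students.drop_duplicates(subset="age")
print(by_age["name"].tolist())            # ['Jonny', 'Natalie', 'Henry']
```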
	

Drop duplicates but keep last

The keep parameter of DataFrame.drop_duplicates() controls which occurrence of a duplicated row is retained.

  • 'first' – Drop duplicates except for the first occurrence. This is the default behavior.
  • 'last' – Drop duplicates except for the last occurrence.
  • False – Drop every row that has a duplicate, keeping no occurrence.
	import pandas as pd

students = {
    "name": ["Jonny", "Natalie", "Henry", "Natalie"],
    "age": [20, 21, 19, 21],
    "marks": [85.10, 77.80, 91.54, 77.80]
}

# Create DataFrame from dict
studentDataFrame = pd.DataFrame(students)
print("Before dropping duplicates: \n", studentDataFrame)

# Drop duplicate rows, keeping the last occurrence of each
afterDrop = studentDataFrame.drop_duplicates(keep='last')

print("\nAfter dropping duplicates: \n", afterDrop)
	
	Before dropping duplicates:

      name  age  marks
0    Jonny   20  85.10
1  Natalie   21  77.80
2    Henry   19  91.54
3  Natalie   21  77.80

After dropping duplicates:

      name  age  marks
0    Jonny   20  85.10
2    Henry   19  91.54
3  Natalie   21  77.80
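
To compare the three keep options side by side, the sketch below runs each one on the same data and records which row labels survive. The dictionary name surviving is an illustrative choice.

```python
import pandas as pd

students = pd.DataFrame({
    "name": ["Jonny", "Natalie", "Henry", "Natalie"],
    "age": [20, 21, 19, 21],
    "marks": [85.10, 77.80, 91.54, 77.80],
})

# Map each keep option to the index labels that remain afterwards.
surviving = {}
for keep in ("first", "last", False):
    surviving[keep] = students.drop_duplicates(keep=keep).index.tolist()
    print(repr(keep), surviving[keep])
# 'first' [0, 1, 2]
# 'last' [0, 2, 3]
# False [0, 2]
```

Rows 1 and 3 are the duplicated pair, so 'first' retains label 1, 'last' retains label 3, and False retains neither.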
	

Drop all duplicates

  • By default, DataFrame.drop_duplicates() keeps the duplicate row’s first occurrence and removes all others.
  • To drop every occurrence of a duplicated row, including the first, pass keep=False.
	import pandas as pd

students = {
    "name": ["Jonny", "Natalie", "Henry", "Natalie"],
    "age": [20, 21, 19, 21],
    "marks": [85.10, 77.80, 91.54, 77.80]
}

# Create DataFrame from dict
studentDataFrame = pd.DataFrame(students)
print("Before dropping duplicates: \n", studentDataFrame)

# Drop every row that has a duplicate, keeping none of them
afterDrop = studentDataFrame.drop_duplicates(keep=False)

print("\nAfter dropping duplicates: \n", afterDrop)
	
	Before dropping duplicates:

      name  age  marks
0    Jonny   20  85.10
1  Natalie   21  77.80
2    Henry   19  91.54
3  Natalie   21  77.80

After dropping duplicates:

    name  age  marks
0  Jonny   20  85.10
2  Henry   19  91.54
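
keep=False can also be combined with subset. As a minimal sketch, reusing the data from the subset example earlier in this article, the call below removes every row whose 'age' and 'marks' pair appears more than once; the names unique_only and removed are illustrative.

```python
import pandas as pd

students = pd.DataFrame({
    "name": ["Jonny", "Natalie", "Henry", "Ben"],
    "age": [20, 21, 19, 21],
    "marks": [85.10, 77.80, 91.54, 77.80],
})

# Natalie and Ben share the same 'age' and 'marks', so both are dropped.
unique_only = students.drop_duplicates(subset=["age", "marks"], keep=False)
print(unique_only["name"].tolist())   # ['Jonny', 'Henry']

# Number of rows removed by the call.
removed = len(students) - len(unique_only)
print(removed)                        # 2
```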
	

References

pandas.DataFrame.drop_duplicates (pandas API reference)