From the course: Python for Data Science and Machine Learning Essential Training Part 1

Removing duplicates

- [Instructor] It's really important to remove duplicates from your dataset in order to preserve the dataset's accuracy and avoid producing incorrect and misleading statistics. For example, imagine you're analyzing a retail sales table, and shopaholic Sally came in three times and used three different credit cards to make purchases, but gave the cashier the same zip code, 3-2-8-0-3, for each sale. Based on the card number alone, Sally looks like three different customers, all from the 3-2-8-0-3 zip code. If you fail to examine other attributes of the customer so that you can identify and remove duplicates, shopaholic Sally's purchases would skew the results of any customer demographic analysis, because Sally would be counted as three people rather than one. To market to the 3-2-8-0-3 customers effectively, you need to understand their characteristics. Don't let duplicate records skew your analysis. Okay, now let's look at removing duplicates. This notebook is coming preloaded with NumPy…
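The shopaholic Sally scenario can be sketched in a few lines. This is only an illustration, not the course notebook: it assumes pandas is available, and the table, its column names (`name`, `zip_code`, `card_number`), and every value in it are made up for the example. It counts customers by card number, then again after treating rows that share a name and zip code as the same person.

```python
import pandas as pd

# Hypothetical retail sales table; all names and values are invented for illustration.
sales = pd.DataFrame({
    "name": ["Sally", "Sally", "Sally", "Bob"],
    "zip_code": ["32803", "32803", "32803", "32803"],
    "card_number": ["1111", "2222", "3333", "4444"],
})

# Counting by card number alone treats every card as a separate customer.
apparent_customers = sales["card_number"].nunique()

# Assumption: two rows with the same name and zip code are the same person.
identity_cols = ["name", "zip_code"]

# Flag every row that repeats an earlier row on the identity columns.
is_repeat = sales.duplicated(subset=identity_cols, keep="first")

# Keep only the first row for each distinct customer.
customers = sales.drop_duplicates(subset=identity_cols, keep="first").reset_index(drop=True)
actual_customers = len(customers)

print(apparent_customers, actual_customers)
```

Matching on name and zip code is a simplifying assumption. Real customer data usually needs more attributes, since two different people can share both.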
