10. Data Cleaning with Tableau

10. Data Cleaning with Tableau

🎯 Learning Goals

  • Understand the importance of data cleaning in Tableau
  • Identify common data inconsistencies and errors
  • Perform transformations such as standardizing formats, handling missing values, and cleaning text fields
  • Use calculated fields to clean and enhance datasets
 

📗 Technical Vocabulary

  • Data Cleaning
  • Data Standardization
  • Calculated Fields
  • Data Types
  • Filtering
  • Splitting and Trimming
 
🧼

🔍 Why Data Cleaning Matters

Raw data is often messy, inconsistent, and full of errors. Without cleaning, data visualizations can be misleading or incorrect.
For example:
  • A company analyzing customer orders might find that names appear as “ada lovelace” and “Ada Lovelace”, making it hard to track orders correctly.
  • Date inconsistencies like “01/01/24” vs. “2024-01-01” might cause incorrect sorting.
  • Numeric values stored as text can prevent accurate calculations.
Cleaning data ensures accuracy, reliability, and usability for analysis and visualization.
✏️
Try-It | Messy Dataset Alert!
Okay, we know we just started this lesson, but we think you, our data detective, has what it takes to step in and clean up the mess.
  • Look at the dataset and identify at least three inconsistencies.
  • How might these errors affect analysis?
  • What would you do to clean up these errors?
Solutions
  • Inconsistent name formatting (e.g., lowercase names, missing last names)
  • Inconsistent age formats (e.g., numbers as text, missing values)
  • Inconsistent date formats (e.g., different separators, full month names)
  • Inconsistent grade levels (e.g., mix of numbers and words)
  • Inconsistent score values (e.g., numbers as text, letter grades, missing data)

Step 1: Load the Data

  1. Open Tableau Public in the browser.
  1. Click "Upload from computer" and select the Messy_Dataset.csv
  1. Browse and open the messy dataset.
  1. You may have to select ‘Update Now’ or ‘Create extract’

Step 2: Fix Name Formatting

🔴 Issues:
  • Some names are in lowercase.
  • Some entries are missing last names.
  • It might be better to have a column for first names and a column for last names.
Solution:
  1. In a sheet, select the Name column.
  1. Create a calculated field to capitalize names correctly:
    1. 💡
      What’s a calculated field? Calculated fields are a powerful way to level up your Tableau skills. A calculated field is a custom field you create using a formula to manipulate your data. When you create one, Tableau opens a formula editor, it kind of looks like SQL… but it’s not quite! Instead of standard SQL syntax, Tableau uses its own formula language.
      Still, if you’ve used IF, SUM, or LEFT in SQL, some of this will feel familiar. Tableau even helps you out by showing a list of all the possible functions you can use, like IF, ZNULL, LEFT, and more, so you don’t have to memorize everything.
      notion image
      • Right-click on the Name column
      • Select Create > Calculate Field
      • Name your new field Name_Proper
      • Enter the following formula:
      PROPER([Name])
      • Click Apply and then OK
      ✨ This will create a new column with properly capitalized names
  1. Split the name into First Name and Last Name:
      • Right-click on your Name_Proper field
      • Select Create > Calculate Field
      • Name your new field First_Name
      • Enter the following formula:
        • SPLIT([Name_Proper], " ", 1)
      • Click Apply and then OK
      • This extracts the first word before any space as the first name
  1. Create a calculated field for Last Name:
      • Right-click on your Name_Proper field
      • Select Create > Calculate Field
      • Name your new field Last_Name
      • Enter the following formula:
        • // For names with multiple parts, get everything after the first space IFNULL(SPLIT([Name_Proper], " ", 2), "Unknown")
      • Click Apply and then OK
      • This extracts everything after the first space as the last name
  1. Handle missing last names:
      • Use filters to find entries with empty Last_Name:
        • Click on the Last_Name column
        • Click the filter icon (funnel symbol) in the toolbar
        • Filter for blank values
      • For records with missing last names, you could:
        • Add a note in your documentation to manually review these entries
        • Use external reference data if available
        • ✨ Use "Unknown" as a placeholder if appropriate for your analysis. We’re going to use this method today! ✨

Step 3: Standardize Age Values

🔴 Issues:
  • Some ages are stored as text ("seventeen", "Sixteen").
  • Some values are missing.
Solution:
  1. Select the Age column.
  1. Create a calculated field to convert text to numbers:
      • Right-click on the Age column
      • Select Create > Calculate Field
      • Name your new field Age_Standardized
      • Enter the following formula:
      CASE [Age] WHEN "seventeen" THEN 17 WHEN "Sixteen" THEN 16 WHEN "Fifteen" THEN 15 ELSE INT([Age]) // Converts numbers stored as text END
  1. Replace missing values using an estimated average or interpolation.

Step 4: Standardize Join Dates

🔴 Issues:
  • Mixed formats (YYYY-MM-DD, MM/DD/YYYY, Month DD, YYYY).
  • Inconsistent separators (/, , .).
Solution:
  1. Create a calculated field to standardize date formats:
      • Right-click on the Join Date column
      • Select Create > Calculate Field
      • Name your new field Join_Date_Standardized
      • Enter the following formula:
        • DATE([Join Date])
      • Click Apply and then OK
      • This will create a new column with standardized date values
If Tableau doesn't detect formats automatically, create a more complex calculated field:
  • Right-click on the Join Date column
  • Select Create > Calculate Field
  • Name your new field Join_Date_Standardized
  • Enter a formula that handles multiple formats:
    • // Try to parse the date using different formats CASE IF DATEPARSE("yyyy-MM-dd", [Join Date]) IS NOT NULL THEN DATEPARSE("yyyy-MM-dd", [Join Date]) ELSEIF DATEPARSE("MM/dd/yyyy", [Join Date]) IS NOT NULL THEN DATEPARSE("MM/dd/yyyy", [Join Date]) ELSEIF DATEPARSE("MMMM dd, yyyy", [Join Date]) IS NOT NULL THEN DATEPARSE("MMMM dd, yyyy", [Join Date]) ELSE NULL END
  • Click Apply and then OK
  1. After creating your calculated field, right-click on the new field
  • Select Format
  • In the Format pane, select Date
  • Choose your preferred date format (e.g., YYYY-MM-DD)
  • Click OK

Step 5: Normalize Grade Levels

🔴 Issues:
  • Grades appear in mixed formats ("10", "Tenth", "11th").
Solution:
  1. Select the Grade column.
  1. Create a calculated field to standardize grade levels to a consistent text format:
      • Right-click on the Grade column
      • Select Create > Calculate Field
      • Name your new field Grade_Norm
      • Enter the following formula:
        • CASE [Grade] WHEN "Ninth" THEN "9th" WHEN "Tenth" THEN "10th" ELSE [Grade] // Keep correct values END
      • Click Apply and then OK
      • This will create a new column with standardized grade values
  1. Remove text suffixes to create a numeric-only field:
      • Right-click on your Grade_Norm field
      • Select Create > Calculate Field
      • Name your new field Grade_Number
      • Enter the following formula:
        • // Extract numeric part from grade values by removing suffixes REPLACE(REPLACE(REPLACE([Grade_Norm], "th", ""), "nd", ""), "rd", "")
      • Click Apply and then OK
        • 💭
          Why are we creating a numeric-only grade field? This allows us to do calculations with them, like calculating the average or median grade level. This is especially helpful for spotting trends or summarizing data by grade.

Step 6: Fix Programming Language Formatting

🔴 Issues:
  • Extra spaces ("Python " vs. "Python").
  • Inconsistent capitalization ("java" vs. "Java").
  • Multiple values combined ("HTML & CSS").
Solution:
  1. Select the Favorite Language column.
  1. Create a calculated field to remove extra spaces:
      • Right-click on the Favorite Language column
      • Select Create > Calculate Field
      • Name your new field Language_Trimmed
      • Enter the following formula:
        • TRIM([Favorite Language])
      • Click Apply and then OK
      • This removes leading and trailing spaces from the values
  1. Create another calculated field to fix capitalization:
      • Right-click on your Language_Trimmed field
      • Select Create > Calculate Field
      • Name your new field Language_Proper
      • Enter the following formula:
        • PROPER([Language_Trimmed])
      • Click Apply and then OK
      • This capitalizes the first letter of each word
  1. Create a calculated field to separate multiple languages:
      • Right-click on your Language_Proper field
      • Select Create > Calculate Field
      • Name your new field Primary_Language
      • Enter a formula to split at common separators:
        • // Extract first language when multiple are listed SPLIT([Language_Proper], "&", 1)
      • Click Apply and then OK
      • This extracts just the first language before any "&" symbol

 
📝

Practice | Club Memberships Dataset Cleaning

Toggle to see how to export the file as .csv
notion image
The university's student office needs your help! They have a dataset of student club memberships, but the data is a bit inconsistent and needs cleaning before they can generate accurate reports on student engagement.
Toggle for an explanation of the columns
  • StudentID: A unique identifier assigned to each student (e.g., S001, S002). This helps in distinguishing individual students.
  • FullName: The full name of the student. You'll notice some inconsistencies here, like varied capitalization, which will need cleaning.
  • EnrollDate: The date on which the student enrolled in a particular club. These dates are in various formats (e.g., MM/DD/YY, Month DD, YYYY, YYYY-MM-DD) and will require standardization.
  • ClubName: The name of the student club the student is a member of (e.g., Photography Club, Debate Club).
  • FeePaidStatus: Indicates whether the student has paid their club membership fee. This column contains inconsistent entries (e.g., "Paid", "YES", "N", "Unpaid", "pending") that need to be cleaned into categories.
  • ActivityPoints: A numerical score representing the student's engagement level or participation in club activities.
Specifically, you’ll want to:
  • standardize the names
  • standardize the enrollment dates
  • clean up the fee paid status
🌶️🌶️ Medium Challenge: Standardize Activity Points
The student office wants to understand overall student engagement better. They'd like you to create a new category for each member based on their ActivityPoints.
  • Goal: Create a new calculated field named ActivityLevel that categorizes members as "Low", "Medium", or "High" based on their ActivityPoints. You can define the thresholds as follows:
    • Low: ActivityPoints < 60
    • Medium: ActivityPoints between 60 and 80 (inclusive)
    • High: ActivityPoints > 80
 
page icon
For a summary of this lesson, check out the 10. Data Cleaning with Tableau One-Pager!