One of the common tasks in ETL and Data Warehousing is cleaning source system data. One of the common "cleaning" tasks is to weed out duplicate entries before you update dimension tables. Eliminating duplicate primary keys can be dead easy... but it can also be pretty difficult. Here are three ways - from dead easy to pretty hard - to do just that.
All of these methods work inside a single Data Flow Task in Integration Services - either 2005 or 2008, doesn't matter. I've created a SQL 2008 package that demonstrates all three methods - download it from my SkyDrive.
Dead Easy Duplicate Key Removal
The most basic way to detect and "weed out" duplicate rows is simple, but only really works if you don't care about the values in any "non-primary key" columns.
First, feed the rows into a Sort component. Sort by your primary key column(s) and check the "Remove rows with duplicate sort values" box at the bottom. Done!
The problem with this is that you don't have (much) control over which of the duplicate rows gets passed through the Sort component and which one(s) get discarded.
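If it helps to see that behavior in code, here's roughly what that Sort configuration does, written as a tiny C# LINQ sketch outside of SSIS (the CustomerID / Name / UpdatedDate columns are just made up for illustration):

    using System;
    using System.Linq;

    class DeadEasyDedup
    {
        static void Main()
        {
            var rows = new[]
            {
                new { CustomerID = 1, Name = "Acme",      UpdatedDate = new DateTime(2008, 1, 1) },
                new { CustomerID = 1, Name = "Acme Corp", UpdatedDate = new DateTime(2008, 6, 1) },
                new { CustomerID = 2, Name = "Contoso",   UpdatedDate = new DateTime(2008, 3, 1) }
            };

            // Keep one row per key - here whichever happened to come first,
            // mirroring the "you don't get to pick" behavior of the Sort component.
            var deduped = rows
                .GroupBy(r => r.CustomerID)
                .Select(g => g.First());

            foreach (var r in deduped)
                Console.WriteLine(r.CustomerID + ": " + r.Name);
        }
    }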
Moderately Difficult Duplicate Key Removal
If you know you have duplicates, but you are required to use "non-primary key" column values from the "best" version of the row, this method may work. It's appropriate if you have a way of "scoring" a row that lets you identify which row is "best" without looking at the other rows that share the same primary key. (A sketch of the equivalent logic follows the steps below.)
  1. Send your data into a Derived Column component that creates a temporary "score" or "value" column containing some value that will let you (later) pick out the "best" row - perhaps a count of how many non-primary key fields are filled in. You may not even need this Derived Column component if there's already a column that tells you exactly which row you're most interested in - like an "updated date" (where later is always "best"). If you can't score a row without looking at the other rows that share the primary key, skip ahead to the "Pretty Difficult" method.
  2. Multicast your flow into streams "A" and "B". (They aren't labeled that way, but I'll talk about stream "A" and "B" differently.)
  3. Send stream "A" through an Aggregate component, grouping by your primary key columns and ONLY your primary key columns. Use an aggregation operation - like Maximum (for the date example) or Minimum - on your "score" column. (All the other column information will be lost here - that's OK.)
  4. Send stream "A" through a Sort component, sorting by the primary key column(s), then your "score" column. (This is necessary for the Merge Join component coming up.)
  5. Send stream "A" into the LEFT side of a Merge Join component.
  6. Send stream "B" into a Sort component, sorting by the primary key column(s), and then your "score" column. Do NOT check the "Remove rows with duplicate sort values" box.
  7. Send stream "B" into the RIGHT side of the Merge Join component.
  8. Edit the Merge Join, make it an Inner Join (the default), and pick all the columns from the right-hand list. Because both inputs are sorted (and joined) on the primary key and "score" columns, only the row(s) whose score matches the aggregated "best" value make it through.
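If the chain of components above feels abstract, here's the equivalent logic as a small C# LINQ sketch outside of SSIS (again with made-up CustomerID / Name / UpdatedDate columns, using UpdatedDate as the "score" where later is "best"):

    using System;
    using System.Linq;

    class ModeratelyDifficultDedup
    {
        static void Main()
        {
            var rows = new[]
            {
                new { CustomerID = 1, Name = "Acme",      UpdatedDate = new DateTime(2008, 1, 1) },
                new { CustomerID = 1, Name = "Acme Corp", UpdatedDate = new DateTime(2008, 6, 1) },
                new { CustomerID = 2, Name = "Contoso",   UpdatedDate = new DateTime(2008, 3, 1) }
            };

            // Stream "A": the Aggregate - group by the key, keep only the Maximum "score".
            var bestScores = rows
                .GroupBy(r => r.CustomerID)
                .Select(g => new { CustomerID = g.Key, BestDate = g.Max(r => r.UpdatedDate) });

            // The Merge Join (inner) on key + score: only the "best" copy of each row
            // survives, with all of its non-key columns intact.
            var deduped =
                from a in bestScores
                join b in rows
                    on new { a.CustomerID, Score = a.BestDate }
                    equals new { b.CustomerID, Score = b.UpdatedDate }
                select b;

            foreach (var r in deduped)
                Console.WriteLine(r.CustomerID + ": " + r.Name);
        }
    }

One caveat with this technique: if two rows tie for the top score on the same key, both still survive the join - the score needs to be unique within a key if you truly want exactly one row out.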
Pretty Difficult Duplicate Key Removal

If you can't decide which row of the (potential) duplicates to pass through without knowing what's in the other rows, you need to use a Script Transform in asynchronous mode. (Asynchronous in this instance means that we're not outputting one row for each and every row that comes into the component.)
  1. Place a Script component in your flow, identifying it as a transformation component (the default).
  2. Open the Script component editor, pick the "Input Columns" tab. Select the key columns, and any other columns that you want to pass through and/or will be used to determine which row is "best" within the duplicates.
  3. Still in the Script component editor, pick the "Inputs and Outputs" tab. Pick the "Output0" node in the treeview. Change the SynchronousInputID value to "None".
  4. Still on the "Inputs and Outputs" tab of the Script component editor, open the "Output0" node, and select the "Output Columns" folder icon. Press the "Add Column" button, and add your first primary key column to the output - pay attention to data types! You will have to repeat this process and add back ALL of the columns. By making the Script asynchronous, we've essentially removed the entire output column definition that's usually constructed for us by the Script component.
  5. Now you get to edit the Script! (Too much to post here in full - download the package from my SkyDrive. A trimmed-down sketch of the pattern follows below.)
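To give a feel for the shape of the script, here's a heavily trimmed sketch of the documented asynchronous pattern - NOT the exact code in the SkyDrive package - using made-up CustomerID / Name / UpdatedDate columns and "latest UpdatedDate wins" as the rule for picking the best duplicate. (UserComponent, Input0Buffer and Output0Buffer are the wrapper classes the Script component designer generates for you.)

    using System;
    using System.Collections.Generic;

    public class ScriptMain : UserComponent
    {
        // The "best" row seen so far for each primary key value.
        private Dictionary<int, BestRow> best = new Dictionary<int, BestRow>();

        private class BestRow
        {
            public int CustomerID;
            public string Name;
            public DateTime UpdatedDate;
        }

        public override void Input0_ProcessInputRow(Input0Buffer Row)
        {
            // This is where you get to look at what you've already seen for this
            // key before deciding - here, the latest UpdatedDate wins.
            BestRow existing;
            if (!best.TryGetValue(Row.CustomerID, out existing)
                || Row.UpdatedDate > existing.UpdatedDate)
            {
                best[Row.CustomerID] = new BestRow
                {
                    CustomerID = Row.CustomerID,
                    Name = Row.Name,
                    UpdatedDate = Row.UpdatedDate
                };
            }
        }

        public override void Input0_ProcessInput(Input0Buffer Buffer)
        {
            // Drain every row in this buffer into the dictionary.
            while (Buffer.NextRow())
            {
                Input0_ProcessInputRow(Buffer);
            }

            // Only after the last buffer has arrived do we emit output rows -
            // one per primary key value.
            if (Buffer.EndOfRowset())
            {
                foreach (BestRow row in best.Values)
                {
                    Output0Buffer.AddRow();
                    Output0Buffer.CustomerID = row.CustomerID;
                    Output0Buffer.Name = row.Name;
                    Output0Buffer.UpdatedDate = row.UpdatedDate;
                }
                Output0Buffer.SetEndOfRowset();
            }
        }
    }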
One note on the Script component... As it stands, the code in there does NOT behave according to the Microsoft documentation, or the way I'm used to the Script component or custom components working. I've posted a bug at Connect asking for clarification, since I have it in a repeatable form...
Other Methods?
If you know of some other slick ways of eliminating or processing duplicates, please share...
Update
If you're interested in extracting the duplicates from the data flow and doing something with them instead of discarding them, read Extracting Duplicate Primary Key Rows in SSIS.