Microsoft SQL Server Integration Services
Is the Row Sampling Transformation fully blocking?
Is the Row Sampling Transformation fully blocking?
Case
Is the Row Sampling Transformation a (fully) blocking component or a non-blocking component? Various blogs and forums disagree on each other. Is it possible to see it in SSIS?
Solution
By adding some logging to the package you can see the Pipeline Execution Trees of your Data Flow. This will tell you whether a transformation creates a new buffer. When it creates a new buffer then the component is a-synchronously / blocking.
1) Add logging
Open your package and go to the SSIS menu and choose Logging...
2) SQL logging
For this example I selected SQL Server logging, pressed the add button and then selected a Connection Manager.
Next check the Data Flow Task (1) and the enable the SQL Server Log (2) for it. After that go to the Details tab (3).
In the details tab scroll down and select PipelineExecutionTrees. Then click on the advanced button.
In the advanced section make sure that the MessageText is checked. In this column you will find the execution trees text. After that click OK to close the logging window.
3) Run and check log
Run the package and then check the log. In the column Message you can find the Execution Trees.
Now check these examples and see which is blocking. Multiple paths means that there is a transformation that creates new buffers. The last example is the Row Sampling:
4) Partial Blocking or Fully Blocking?
Unfortunately you can't see in the log whether the task is partially blocking (like union or merge join) or fully blocking (like sort and aggregate), but you can see it in the Data Flow Task when running your package in Visual Studio (BIDS/SSDT). The Row Sampling Transformation is fully blocking because it apparently need all data before it sends data to its output. So try not to use it unnecessarily.
The Row Sampling isn't just doing a TOP X, but it spreads the sampled rows over all buffers. Because you don't know the number of records or buffers that is coming you have to have all rows before you can pick randomly X records from the whole set.
An alternative (with less random rows) could be to use a Script Component that adds a row number and then use a Conditional Split to select the first X rows (and perhaps combine it with a modulo expression like: Rownumber % 3 == 0 && Rownumber <= 3000). The Conditional Split is a non-blocking component. Note: this isn't necessarily faster. Check it first for your case!
So why is the Percentage Sampling not blocking? It just takes the same percentage of each buffer and can therefore be synchronous.
Is the Row Sampling Transformation a (fully) blocking component or a non-blocking component? Various blogs and forums disagree on each other. Is it possible to see it in SSIS?
Solution
By adding some logging to the package you can see the Pipeline Execution Trees of your Data Flow. This will tell you whether a transformation creates a new buffer. When it creates a new buffer then the component is a-synchronously / blocking.
1) Add logging
Open your package and go to the SSIS menu and choose Logging...
SSIS Menu - Logging... |
2) SQL logging
For this example I selected SQL Server logging, pressed the add button and then selected a Connection Manager.
SQL Server logging |
Next check the Data Flow Task (1) and the enable the SQL Server Log (2) for it. After that go to the Details tab (3).
Enable logging for the Data Flow Task |
In the details tab scroll down and select PipelineExecutionTrees. Then click on the advanced button.
Logging Details |
In the advanced section make sure that the MessageText is checked. In this column you will find the execution trees text. After that click OK to close the logging window.
Check MessageText |
3) Run and check log
Run the package and then check the log. In the column Message you can find the Execution Trees.
SQL Server Logging. Check message column. |
Now check these examples and see which is blocking. Multiple paths means that there is a transformation that creates new buffers. The last example is the Row Sampling:
Plain Data Flow: 1 path - No blocking |
Data Flow with Sort: 2 paths - Sort is blocking |
Data Flow with Percentage Sampling: 1 path - No blocking |
Data Flow with Row Sampling: 3 paths - Row Sampling is blocking |
4) Partial Blocking or Fully Blocking?
Unfortunately you can't see in the log whether the task is partially blocking (like union or merge join) or fully blocking (like sort and aggregate), but you can see it in the Data Flow Task when running your package in Visual Studio (BIDS/SSDT). The Row Sampling Transformation is fully blocking because it apparently need all data before it sends data to its output. So try not to use it unnecessarily.
Row Sampling is Fully Blocking |
The Row Sampling isn't just doing a TOP X, but it spreads the sampled rows over all buffers. Because you don't know the number of records or buffers that is coming you have to have all rows before you can pick randomly X records from the whole set.
An alternative (with less random rows) could be to use a Script Component that adds a row number and then use a Conditional Split to select the first X rows (and perhaps combine it with a modulo expression like: Rownumber % 3 == 0 && Rownumber <= 3000). The Conditional Split is a non-blocking component. Note: this isn't necessarily faster. Check it first for your case!
So why is the Percentage Sampling not blocking? It just takes the same percentage of each buffer and can therefore be synchronous.
No comments:
Post a Comment