Thursday, October 16, 2014

SQL TIPS

SQL Server triggers execution precedence
Q: I had created two update triggers for two columns in a table. When I use an UPDATE statement on this table, in which order (precedence) will the triggers fire?
Answer:
There is no guarantee as to what order they'll fire in. You can, however, use sp_settriggerorder to define the first and last trigger to execute; for the rest, the order is undefined. If precedence matters to you, put all the logic in one trigger.
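
A minimal sketch of sp_settriggerorder, assuming two hypothetical UPDATE triggers named trg_UpdateColA and trg_UpdateColB on the same table:

-- Make trg_UpdateColA fire first and trg_UpdateColB fire last for UPDATE statements.
-- Any other UPDATE triggers on the table still fire in an undefined order in between.
EXEC sp_settriggerorder
     @triggername = N'trg_UpdateColA',
     @order       = N'First',
     @stmttype    = N'UPDATE';

EXEC sp_settriggerorder
     @triggername = N'trg_UpdateColB',
     @order       = N'Last',
     @stmttype    = N'UPDATE';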

Replace Function: the reason is that in SQL Server 2005 REPLACE trims trailing spaces, whereas in SQL Server 2008 it preserves the space characters.
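
A quick way to check this behaviour on your own instance; the brackets just make any surviving trailing spaces visible:

SELECT '[' + REPLACE('abc   ', 'c', 'x') + ']' AS ReplaceResult;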

Lookup cache modes and case sensitivity:
  • Full cache mode: the comparison happens inside the SSIS engine itself, so it is case sensitive.
  • Partial cache and no cache: the comparison takes place in the T-SQL engine, so by default it is case insensitive.
To get the same case-insensitive behaviour in full cache mode, add a Derived Column (for example, uppercasing the key) before the Lookup and apply the same expression to the reference query.
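
A small T-SQL illustration of why the database-side comparison is usually case insensitive, while a case-sensitive comparison (like the full-cache comparison in the SSIS engine) is not; the collation name is just a common example:

-- Typical case-insensitive database collations return 'equal' here:
SELECT CASE WHEN 'ABC' = 'abc' THEN 'equal' ELSE 'not equal' END AS DefaultCollation;

-- Forcing a case-sensitive collation returns 'not equal':
SELECT CASE WHEN 'ABC' = 'abc' COLLATE Latin1_General_CS_AS THEN 'equal' ELSE 'not equal' END AS CaseSensitive;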



Checkpoint data is not saved for For Loop and Foreach Loop containers. Whenever the package restarts, the For Loop and Foreach Loop containers and their child containers are run again. In other words, even if a child container inside a loop ran successfully, that is not recorded in the checkpoint file; it is rerun.
There is a workaround for this, as listed on the Microsoft website: put the loop inside a Sequence container.
One big thing to remember is that checkpoints apply to control flow tasks only. If there is a failure inside a Data Flow task, the whole Data Flow task will rerun. For example, suppose a Data Flow task has transferred 20,000 of 50,000 records into the destination when a general error occurs and fails the task. When the package restarts, will the data flow continue with the remaining records or start over? The answer: the entire Data Flow task simply restarts from the beginning.


Q: The webcast mentions SSIS restartability, but not checkpointing. Why?

There are several reasons for this:
  • Checkpointing is not that straightforward to implement.
  • Checkpointing works well for only a limited number of scenarios. Looping or parallelism are difficult to handle.
  • Checkpoints are not that flexible.
  • Event handlers are problematic.
  • Checkpoints do not store variables of type Object.
SQL Server MVP Jamie Thomson explains his objections in the blog post Why I don't use SSIS checkpoint files (it's an old post, some links are broken). The discussion is also interesting to read.
Personally I prefer to create small modular packages. Each package only does a basic task that shouldn't take too long, for example loading a dimension. I won't create one single huge package that loads all dimensions of a data warehouse. Most of the time is usually spent in the data flow, which isn't covered by a checkpoint anyway. If I had an Execute SQL Task at the end of a package taking an hour to run, I would most likely put it in a separate package so it can easily be restarted.
If you have a valid use case for using checkpoint files, you can check out the tip Integration Services Checkpoints to restart package from failure.




Q. Would you recommend using “Check Points” in SSIS?
Ans:
As per my experience I would say "NO", as there are compatibility issues with various options, so using checkpoints may give unpredictable results. Checkpoints don't work properly when an SSIS package contains:
  • Complex logic
  • Iterations/Loops
  • Transactions Enabled
  • “Object” type variables
  • Parallel execution
Checkpoints work fine when the package has a straightforward control flow with a single thread.


Q: What are the most efficient ways to run a large aggregate SQL Server Integration Services Job? I split my data flows to do aggregation, then handle the results separately.

Not exactly a question, but rather a statement on my best practice that aggregations (and sorting) should be done in the SQL Server database, not in the SSIS data flow. The reason for this is that SSIS uses blocking components for these types of transformations, which means they need to read all of the rows into memory before they can output even a single row.
Separating the flow into separate streams has certain merit, since you're applying a "divide and conquer" strategy. At a high level this corresponds to the map-reduce approach of big data systems. However, this approach has two options: you either execute the different data flows in parallel, or you execute them one after another. The first option still has the issue that everything has to be loaded into memory at once. The second option has a performance impact, since each data flow has to wait for the previous one to finish. Both options still need sorted data. This might be resolved by reading from a table with an index supporting the sort order, since then the database engine can simply read the data in sorted order. You still have to mark the input as sorted in the advanced editor of the source component in the data flow; the tip Merge multiple data sources with SQL Server Integration Services shows how you can do this.
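
A hedged sketch of the "let the engine deliver sorted data" idea, using a hypothetical SalesOrder table and sort key:

-- An index that supports the sort order lets the engine stream the rows already sorted.
CREATE INDEX IX_SalesOrder_CustomerID ON dbo.SalesOrder (CustomerID)
    INCLUDE (OrderDate, Amount);

-- Use this as the OLE DB Source query, then set IsSorted = True on the source output
-- and SortKeyPosition = 1 on CustomerID in the advanced editor.
SELECT CustomerID, OrderDate, Amount
FROM dbo.SalesOrder
ORDER BY CustomerID;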


My opinion is that in most cases the database engine is best suited to handle large set-based operations such as aggregating and sorting data. The data flow can handle small data sets that fit in memory without issues, but an SSIS package becomes less scalable for larger data loads. SQL Server can scale reasonably well. If the aggregation is too much even for SQL Server to handle, alternative options such as the map-reduce method can be considered.
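
A minimal sketch of pushing the aggregation into the source query instead of using the fully blocking Aggregate transformation; the table and column names are hypothetical:

-- The engine does the grouping; the data flow then only moves the aggregated rows.
SELECT CustomerID,
       COUNT(*)    AS OrderCount,
       SUM(Amount) AS TotalAmount
FROM dbo.SalesOrder
GROUP BY CustomerID;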





Multicast vs. Conditional Split: both transformations accept one input and give us multiple outputs.
  • Multicast: directs each row of the source to every output.
  • Conditional Split: directs each row to a single output.



Derived Column vs. Copy Column: both transformations can add new columns.
  • Derived Column: can add new columns without any help from existing columns, can assign different transformations and data types to the new columns, and supports an error output.
  • Copy Column: can add new columns only by copying existing columns, has no such options, and does not support an error output.


Project parameters vs. package parameters:
  • A project parameter can be shared among all of the packages in an SSIS project.
  • A package parameter is exactly the same as a project parameter, except that its scope is the individual package it resides in.
  • Use a project parameter when, at run time, the value is the same for all packages (for instance, a single parameter containing a server name to be used by multiple packages).
  • Use a package parameter when, at run time, the value is different for each package.
  • Project parameters do *not* have an expressions property to define their value.


Note: Parameters can also pass values to all kinds of objects in SSIS – basically any property that allows an expression.
2. You can think of parameters as the replacement for the package configurations that were used in earlier versions of SSIS.
3. You can use a parameter anywhere in an expression by referring to it with the following syntax: @[$<Namespace>::<ParameterName>]. For example, a project parameter named ServerName is referenced as @[$Project::ServerName], and a package parameter as @[$Package::ServerName].
The evaluation order is similar to using a variable in an expression and then executing the package: the value gets evaluated and assigned to the property during the validation phase.


Variable vs. parameter:
  • Within a package, SSIS variables have an expression property; parameters do not.
  • A variable can be scoped to the package or to an object within the package (there is no such thing as a project variable, though).
  • Variables often consume values from parameters within their expressions. Variables can also communicate values to objects like Execute SQL Tasks, Foreach Loop containers, Send Mail Tasks, and so on.

Control flow: defines the tasks used in performing data management operations; it determines the order in which these tasks are executed and the conditions of their execution.

Data flow: a special control flow task used specifically for data movement operations and data transformations.

Lineage columns: used for auditing and never exposed to end users.


My words: implement staging tables and other objects in a separate database.

Usage of staging tables: temporarily store source data before cleansing it or merging it with data from other sources.

Staging tables serve as an intermediate layer between the source tables and the DW; the regular ETL process then starts from there.

DSA: the part of a DW containing the staging tables is called the data staging area (DSA).
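
A minimal sketch of a staging table kept in a separate staging database, with a simple lineage column; all names here are hypothetical:

CREATE TABLE StagingDB.dbo.stg_Customer
(
    CustomerBK    nvarchar(30)  NOT NULL,                        -- business key from the source system
    CustomerName  nvarchar(100) NULL,
    SourceSystem  nvarchar(20)  NOT NULL,                        -- which source the row came from
    LoadDateTime  datetime2(0)  NOT NULL DEFAULT SYSDATETIME()   -- when the row was staged
);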



Implementing fact tables: it is not necessary that all foreign keys together uniquely identify each row of a fact table.

Inferred member: a row added to a dimension during the fact table load is called an inferred member.

You can add lineage information to your dimensions and fact tables to audit changes to your DW at the row level (summary).
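
A hedged sketch of creating inferred members during the fact load: any business key in the staged fact data that has no dimension row yet gets a placeholder row (table and column names are hypothetical):

INSERT INTO dbo.DimCustomer (CustomerBK, CustomerName, IsInferred)
SELECT DISTINCT s.CustomerBK, N'Unknown', 1
FROM StagingDB.dbo.stg_FactSales AS s
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.DimCustomer AS d
                  WHERE d.CustomerBK = s.CustomerBK);   -- no dimension row yet, so add an inferred member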

Tasks representing logical units of work can be grouped in containers.

SSIS does not support receiving email messages.

Design-time validation can be skipped for any SSIS object by setting its DelayValidation property to True (validation is then delayed until the object executes).


 A failure constraint will allow the following task to begin if the preceding
task has encountered errors and failed.

The External Process tasks will be executed in sequence, and the process  will continue, even if one or more of them fail.

Use a Raw File destination if you have to temporarily store data to be used by SSIS

Use the Resolve References dialog box to solve mapping errors.

Which data flow transformation would you use if you had to combine data from two different database tables that exist on two different servers?


Variables and expressions can also be used to extend the elementary functionality of
precedence constraints based on information available at run time that usually is not available at design time.

Lookup: if more than one matching row exists in the referenced dataset, only the first one is retrieved.
Merge Join: more rows will be retrieved, because all matching data is returned.

The Merge transformation is similar to Union All, but with Merge, the sources have to be sorted and the sort position is preserved.

Is it possible to change the settings of an SSIS object at run time?

When are property expressions evaluated as a package is running?

Answer: Unlike parameters, which are read at the start of package execution, property expressions are updated when the property is accessed by the package during package execution.
A property expression can change the value of a property in the middle of package execution, so that the new value is read when the property is needed by the package.


The Recordset Destination is a handy component to use when you do not need (or want) to persist your data. One drawback of using the in-memory ADO recordset is that Integration Services does not provide any out-of-the-box method to get the row count in the dataset. This is when the Row Count transformation becomes invaluable to me.

Set-based updates: avoid row-by-row DELETE or UPDATE operations. This recommendation is mostly focused on large fact tables, even though dimension tables may also need a few updates or deletes.

Instead of the OLE DB Command (which works row by row), use a staging table plus an Execute SQL Task.
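
A minimal sketch of the staging + Execute SQL Task pattern, assuming hypothetical fact and staging tables: the data flow bulk-loads the changed rows into the staging table, and a single set-based statement then applies them.

-- Runs in an Execute SQL Task after the data flow has filled the staging table.
UPDATE f
SET    f.Amount     = s.Amount,
       f.UpdateDate = GETDATE()
FROM   dbo.FactSales AS f
JOIN   StagingDB.dbo.stg_FactSalesChanges AS s
       ON s.SalesKey = f.SalesKey;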

Entry-point package: in Solution Explorer, right-click a package and mark it as an entry-point package (a common SSIS package design pattern is to use a parent package to control the execution of several child packages).

SSIS 2012 improvements:

Flat File source: the flat file parser can now support ragged-right delimited files and embedded qualifiers.

Grouping (data flow): grouping in the data flow is like a Sequence container in the control flow, except that it exists only in the designer and does not affect the way the data flow actually runs. Groups can also be collapsed to make a complicated data flow easier to read.


Parameter configurations are stored within the Visual Studio project file (.dtproj). To save any changes made in the Manage Parameter Values dialog box, you must save the project.

1. You can change variable values during a package execution, but you cannot change parameter values.
2. From a usage perspective, you usually use parameters to define the contract between a package and its caller, whereas you use variables primarily for exchanging information inside a package.

Parameters:
A parent package and a child package each require some information to perform their job, such as the name of the dimension table that should be updated.

Package parameters are an explicit way to pass that information; you can easily see what information is required when you open the package.

It is also easy for SSIS to validate whether the parameters were assigned values at run time.
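
For packages deployed to the SSISDB catalog, parameter values can be supplied from T-SQL when starting an execution. A hedged sketch; the folder, project, package, and parameter names are hypothetical:

DECLARE @execution_id BIGINT;

EXEC SSISDB.catalog.create_execution
     @folder_name  = N'ETL',
     @project_name = N'DWLoad',
     @package_name = N'LoadDimCustomer.dtsx',
     @execution_id = @execution_id OUTPUT;

-- object_type 20 = project parameter, 30 = package parameter
EXEC SSISDB.catalog.set_execution_parameter_value
     @execution_id,
     @object_type     = 30,
     @parameter_name  = N'DimensionTableName',
     @parameter_value = N'dbo.DimCustomer';

EXEC SSISDB.catalog.start_execution @execution_id;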

What to configure:

1. The server name (dev vs. prod).
2. A control flow that is executed based on a conditional expression; the expression might depend on a variable value that needs to be provided before package execution.
3. A SQL statement (that needs to change) for an Execute SQL Task.

SSIS variables support the Char, DBNull, and Object data types; SSIS parameters do not.

Indirect configuration using environment variables: the problem is that an XML configuration file path or connection string breaks when packages are moved from system to system; the solution is to use an environment variable to configure those values.

Variable configuration: use the dtexec /SET option to override the property value of a variable.

My words: these are the settings the engine exposes while configuring, validating, and executing a package.

Integrate external processes: use an Execute Process Task or a Script Task with the appropriate business logic to control the execution of the external applications. To improve reusability and simplify deployment, a custom task could also be developed to replace the Script Task.

Project deployment model: with the introduction of project-scoped parameters and connection managers, it becomes possible for individual SSIS packages to share resources.

Variables: variables are not exposed outside of SSIS packages and cannot be accessed from other packages. This shortcoming is reduced with the introduction of parameters, which can be used to pass scalar values to child packages. On the other hand, variables never were the only method available for exchanging data between the elements of an SSIS solution. (Hint: consider files and staging tables.)


New today

The default buffer size (the DefaultBufferSize property) can be changed, but it cannot be made larger than the 100 MB MaxBufferSize limit.

Breakpoints are a powerful design-time troubleshooting option; you can set a breakpoint on any control flow task.

 Data viewers enable you to monitor data passing through the pipeline at design time.


Data taps are only available when you are executing a package deployed to the SSISDB catalog.
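
A hedged sketch of adding a data tap from T-SQL before starting a catalog execution; the folder, project, package, task path, data flow path ID, and file name are hypothetical and need to match your own package:

DECLARE @execution_id BIGINT;

EXEC SSISDB.catalog.create_execution
     @folder_name  = N'ETL',
     @project_name = N'DWLoad',
     @package_name = N'LoadDimCustomer.dtsx',
     @execution_id = @execution_id OUTPUT;

-- Dump the rows flowing through the named path to a CSV file in the DataDumps folder.
EXEC SSISDB.catalog.add_data_tap
     @execution_id            = @execution_id,
     @task_package_path       = N'\Package\Load Customers',                         -- PackagePath of the Data Flow Task
     @dataflow_path_id_string = N'Paths[OLE DB Source.OLE DB Source Output]',       -- IdentificationString of the path
     @data_filename           = N'CustomerRows.csv';

EXEC SSISDB.catalog.start_execution @execution_id;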



Wednesday, October 8, 2014

SSIS - Merge join vs Lookup transform Performance – Case Study

Hi ,
 
In this post I'll compare the two join components, the Merge Join and the Lookup, with the relational join, in order to determine which solution is the best way to join data from certain data sources. This post is based on SQL Server 2008 R2.

Merge Join
The merge join is used for joining data from two or more data sources. The more common ETL scenarios will require you to access two or more disparate data sources simultaneously and merge their results together into a single data source. An example could be a normalized source system whose normalized tables you want to merge into a denormalized table.
The merge join in SSIS allows you to perform an inner or outer join in a streaming fashion. This component behaves in a synchronous way. The component accepts two sorted input streams and outputs a single stream that combines the chosen columns into a single structure. It's not possible to configure a separate non-matched output.

Lookup
The lookup component in SQL Server Integration Services allows you to perform the equivalent of relational inner and outer hash joins, but it does not use the algorithms stored in the database engine. You would use this component within the context of an integration process, such as an ETL layer that populates a data warehouse from multiple dissimilar source systems.
The transform is written to behave in a mostly synchronous manner, in that it does not block the pipeline while it's doing its work; in certain cache modes, however, the component will initially block the package's execution for a period of time.


Conclusion: there's only one reason for using the merge join, and that is a lack of memory; otherwise, always use the lookup transform. The lookup transform received a lot of improvements in SQL Server 2008.


Case Study on Merge Join Vs Lookup Performance - I:


TheViewMaster wrote:
So here we go:
I'm running the tests on my workstation WinXP, 2.93GHz, 2.5gb ram.
The DB is accessed over the LAN.
Test1 (Lookup):
Source: multi-flat-file source (4 .txt's) with total of 175513 records and 88 columns
Lookup is a query accessing a table of 248250 records, pulling 61280 records and about 25 columns
2 outputs - Listing Found (56523 rows) and Error Listing Not Found (118990 rows)
Also, the lookup is in Full Cache mode and gives the warning: found duplicate key values.
Result:
Finished, 4:11:00 PM, Elapsed time: 00:00:15.437
Note: Memory usage of PC peaked at 1.8GB with CPU usage jumping to 100% once.

Test 2 (Merge Join):
1st Source: multi-flat-file source (4 .txt's) with total of 175513 records and 88 columns
2nd source: OLE DB Source with a query accessing a table of 248250 records, pulling 61280 records and about 25 columns with ORDER BY ID. The output is marked as sorted by the ID column.
1st source is sorted using a "Sort transform".
Then "Merge Joined" with the OLE DB source via a left outer join (sort on left)
Then "Conditional Split" based on ISNULL(oledbsource.ID)
Result:
Finished, 4:49:33 PM, Elapsed time: 00:01:14.235
Note: Memory usage of PC peaked at 2.6GB with CPU usage jumping to 100% twice.

Test3 (Script Transform) -
Source: multi-flat-file source (4 .txt's) with total of 175513 records and 88 columns
Script transform to do a lookup based on key column for each row in pipeline.
Result:
Cancelled after 30 minutes of processing - during which it had processed 11547 records (out of 175513)
Note: Memory usage was stable around 1GB and CPU near 5% usage

My Conclusion:
Although I was concerned with the performance of the lookup transform - for testing whether data is to be inserted or updated - it seems that's not the culprit; the root of evil seems to be the OLE DB update command and the OLE DB Destination (atm we r using a SQL 2000 db - upgrading to 2005 soon).
Although the Script transform consumed the least amount of machine resources, executing 100K+ SQL queries against the db will take too long.
Although the merge join elapsed time is not bad, the resource usage and 3 more steps than the lookup are negatives.
So I think next weekend's performance testing is how to make faster INSERTs/UPDATEs to the DB.

Tests 1 & 2 are based on a Jamie Thomson article –

Test 3 is based on a Greg Van Mullem article –
 

Case Study on Merge Join Vs Lookup Performance - II:


Screenshot #1 shows a few points that distinguish the Merge Join transformation from the Lookup transformation.


Regarding Lookup:


If you want to find rows matching in source 2 based on source 1 input, and you know there will be only one match for every input row, then I would suggest using the Lookup operation. An example would be an OrderDetails table where you want to find the matching Order Id and Customer Number; in that case, Lookup is the better option.


Regarding Merge Join:


If you want to perform joins like fetching all addresses (Home, Work, Other) from the Address table for a given customer in the Customer table, then you have to go with the Merge Join, because a customer can have one or more addresses associated with them.


An example to compare:


Here is a scenario to demonstrate the performance differences between Merge Join and Lookup. The data used here is a one-to-one join, which is the only scenario the two have in common to compare.


1.      I have three tables named dbo.ItemPriceInfo, dbo.ItemDiscountInfo and dbo.ItemAmount. Create scripts for these tables are provided under the SQL Scripts section.


2.      Tables dbo.ItemPriceInfo and dbo.ItemDiscountInfo both have 13,349,729 rows. Both tables have ItemNumber as the common column. ItemPriceInfo has price information and ItemDiscountInfo has discount information. Screenshot #2 shows the row count in each of these tables. Screenshot #3 shows the top 6 rows to give an idea about the data present in the tables.


3.      I created two SSIS packages to compare the performance of the Merge Join and Lookup transformations. Both packages have to take the information from tables dbo.ItemPriceInfo and dbo.ItemDiscountInfo, calculate the total amount, and save it to the table dbo.ItemAmount.


4.      The first package used the Merge Join transformation, and inside that it used an INNER JOIN to combine the data (a T-SQL equivalent of this join is sketched after this list). Screenshots #4 and #5 show the sample package execution and the execution duration. It took 05 minutes 14 seconds 719 milliseconds to execute the Merge Join transformation based package.


5.      The second package used the Lookup transformation with Full cache (which is the default setting). Screenshots #6 and #7 show the sample package execution and the execution duration. It took 11 minutes 03 seconds 610 milliseconds to execute the Lookup transformation based package. You might encounter the warning message Information: The buffer manager has allocated nnnnn bytes, even though the memory pressure has been detected and repeated attempts to swap buffers have failed. Here is a link that talks about how to calculate the lookup cache size. During this package execution, even though the Data Flow Task completed faster, the pipeline cleanup took a lot of time.


6.      This doesn't mean the Lookup transformation is bad. It just has to be used wisely. I use it quite often in my projects, but then again I don't deal with 10+ million rows for a lookup every day. Usually my jobs handle between 2 and 3 million rows, and for that the performance is really good. Up to 10 million rows, both performed equally well. Most of the time, what I have noticed is that the bottleneck turns out to be the destination component rather than the transformations. You can overcome that by having multiple destinations. Here is an example that shows the implementation of multiple destinations.


7.      Screenshot #8 shows the record count in all three tables. Screenshot #9 shows the top 6 records in each of the tables.
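
For reference, here is a hedged T-SQL sketch of the join that both packages implement, done entirely in the database engine and assuming the calculated amount is simply price minus discount (the case study does not spell out the formula):

INSERT INTO dbo.ItemAmount (ItemNumber, Price, Discount, CalculatedAmount)
SELECT p.ItemNumber,
       p.Price,
       d.Discount,
       p.Price - d.Discount    -- assumed formula for the calculated amount
FROM dbo.ItemPriceInfo AS p
INNER JOIN dbo.ItemDiscountInfo AS d
    ON d.ItemNumber = p.ItemNumber;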


Hope that helps.


SQL Scripts:


CREATE TABLE [dbo].[ItemAmount](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [ItemNumber] [nvarchar](30) NOT NULL,
    [Price] [numeric](18, 2) NOT NULL,
    [Discount] [numeric](18, 2) NOT NULL,
    [CalculatedAmount] [numeric](18, 2) NOT NULL,
CONSTRAINT [PK_ItemAmount] PRIMARY KEY CLUSTERED ([Id] ASC)) ON [PRIMARY]
GO

CREATE TABLE [dbo].[ItemDiscountInfo](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [ItemNumber] [nvarchar](30) NOT NULL,
    [Discount] [numeric](18, 2) NOT NULL,
CONSTRAINT [PK_ItemDiscountInfo] PRIMARY KEY CLUSTERED ([Id] ASC)) ON [PRIMARY]
GO

CREATE TABLE [dbo].[ItemPriceInfo](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [ItemNumber] [nvarchar](30) NOT NULL,
    [Price] [numeric](18, 2) NOT NULL,
CONSTRAINT [PK_ItemPriceInfo] PRIMARY KEY CLUSTERED ([Id] ASC)) ON [PRIMARY]
GO


Screenshot #1:

Screenshot #2:

Screenshot #3:

Screenshot #4:

Screenshot #5:

Screenshot #6:

Screenshot #7:

Screenshot #8:

Screenshot #9:


Source: