Monday, September 15, 2014

SSIS: Useful Notes

The dangers:
       Dirty data
       Complex or poorly defined ETL requirements
       Unexpected metadata changes
       Unstable sources/destinations
       Project managers

What to note?
       Success and failure of operations
       Row counts
       Run times
       Validation information

       Know what to expect
       Plan for growth
       Cover your assets
In the SSIS world
It's all about the log:
SSIS logging
SQL Server log
Custom logging (see the sketch below)
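A minimal sketch of the custom-logging approach: an audit table written to from Execute SQL Tasks at the start and end of each package run (all table, column, and package names here are hypothetical):

CREATE TABLE dbo.EtlAuditLog (
    AuditId     int IDENTITY(1,1) PRIMARY KEY,
    PackageName nvarchar(260) NOT NULL,
    StartTime   datetime2 NOT NULL DEFAULT SYSDATETIME(),
    EndTime     datetime2 NULL,
    RowsLoaded  int NULL,
    Outcome     varchar(20) NULL  -- 'Success' / 'Failure'
);

-- At package start (Execute SQL Task): log the run and capture its key into an SSIS variable
INSERT INTO dbo.EtlAuditLog (PackageName) VALUES (?);
SELECT CAST(SCOPE_IDENTITY() AS int) AS AuditId;

-- At package end: record the row count (fed by a Row Count transform) and the outcome
UPDATE dbo.EtlAuditLog
SET EndTime = SYSDATETIME(), RowsLoaded = ?, Outcome = ?
WHERE AuditId = ?;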

SSIS Package Logging
·         It’s already there
·         Easy to start
·         Flexible events and destinations
·         Can be unwieldy
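If you log to the SQL Server provider, package logging lands in the dbo.sysssislog table of the chosen database, so it can be queried like any other table. For example (the package name is hypothetical):

-- Recent log events for one package
SELECT TOP (50) event, source, starttime, endtime, message
FROM dbo.sysssislog
WHERE source = 'LoadOrders'
ORDER BY starttime DESC;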
SSIS Catalog Logging
·         SQL Server 2012 and later only
·         Easiest to configure
·         Design time or runtime
·         Least flexible
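Catalog logging is exposed through the SSISDB catalog views, which make it easy to pull execution history and messages. A sketch (the operation_id value is hypothetical):

-- Recent executions recorded in the SSIS catalog (2012+)
SELECT execution_id, folder_name, project_name, package_name, status, start_time, end_time
FROM SSISDB.catalog.executions
ORDER BY start_time DESC;

-- Messages for one execution
SELECT message_time, message_type, message
FROM SSISDB.catalog.event_messages
WHERE operation_id = 12345;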

Server/engine logging
  • SQL Engine error log
  • DMVs
  • Third party tools
  • Windows log
  • PerfMon
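On the engine side, the DMVs give a quick view of what a running load is doing. For example:

-- Requests currently executing on the server
SELECT session_id, status, command, wait_type, total_elapsed_time
FROM sys.dm_exec_requests
WHERE session_id <> @@SPID;
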
Perform at your Best
It’s not just SSIS
Proper query techniques for relational sources
Effective indexing for sources and destinations
Streamline your data flows
·         Transformations matter!
·         Know the blocking properties of transformations
Nonblocking transforms do not hold buffers
·         Derived Column
·         Conditional Split
·         Row Count
Partially blocking transforms will queue up buffers as needed
  • Merge Join
  • Lookup
  • Union All
Fully blocking transforms will not pass any data through until all of the data has been buffered at that transformation
  • Sort
  • Aggregate
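When the source is relational, a common way to avoid the fully blocking Sort and Aggregate transforms is to push that work into the source query and mark the source output as pre-sorted (the IsSorted and SortKeyPosition properties in the source's Advanced Editor). A sketch with hypothetical names:

-- Let the database engine sort and aggregate instead of the data flow
SELECT CustomerId,
       COUNT(*)    AS OrderCount,
       SUM(Amount) AS TotalAmount
FROM dbo.Orders
GROUP BY CustomerId
ORDER BY CustomerId;  -- pre-sorted, e.g. for a downstream Merge Join
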
Be aware of memory use!
  • LOB (large object) columns will always spool to disk rather than staying in memory.
  • Memory buffers may spill over to disk (PerfMon's SSIS Pipeline "Buffers spooled" counter shows when this happens)

Manage your sources
·         Don't use the table drop-down list – specify a query that includes only the necessary columns
·         Be mindful of indexes when writing data retrieval queries
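In practice that means replacing the table selection with an explicit query such as this one (names are hypothetical):

-- Pull only the columns the data flow needs, and filter at the source
SELECT OrderId, CustomerId, OrderDate, Amount
FROM dbo.Orders
WHERE OrderDate >= '20140101';
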
Manage your destinations
  • Use FAST LOAD for SQL Server destinations
  • Index management (consider disabling indexes before large loads and rebuilding afterwards – see the sketch below)
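For large loads it is often cheaper to disable nonclustered indexes on the destination, bulk load with FAST LOAD, and rebuild afterwards. A sketch with a hypothetical index:

-- Before the load
ALTER INDEX IX_Orders_CustomerId ON dbo.Orders DISABLE;

-- ... run the data flow with FAST LOAD ...

-- After the load
ALTER INDEX IX_Orders_CustomerId ON dbo.Orders REBUILD;
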
Go Parallel!
·         Parallel operations can yield faster data flows

Clean it UP
In ETL, the greatest dangers often lie in the small things
·         Like an infection, bad data can fester for a while until it’s too late
·         Caught early, problems with dirty data are more easily solved

What is dirty data?
Types of dirty data:
       Data type mismatches
       Domain violations
       Semantic violations
       Technical errors
       Simple inaccuracies
Data type mismatches
        Non-numeric data in numeric fields
        Decimal data in integer fields
        Incorrect precision / rounding
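On SQL Server 2012 and later, TRY_CAST (or TRY_CONVERT) makes it easy to find these rows before they break the load. A sketch against a hypothetical staging table:

-- Rows whose Amount text won't convert cleanly to decimal
SELECT StagingId, Amount
FROM stg.Orders
WHERE TRY_CAST(Amount AS decimal(18,2)) IS NULL
  AND Amount IS NOT NULL;
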
Domain violations
       Invalid dates
       Incorrect addresses
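Invalid dates can be flushed out the same way (again with hypothetical names):

-- Rows whose OrderDate text is not a valid date
SELECT StagingId, OrderDate
FROM stg.Orders
WHERE TRY_CONVERT(date, OrderDate) IS NULL
  AND OrderDate IS NOT NULL;
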
Semantic violations
       Data outside of a reasonable range (such as a person's age in the thousands of years)
       Inconsistent use of NULL, blanks, and zeroes
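A simple range check catches the first kind, and NULLIF can normalize blanks to NULL (hypothetical names):

-- Ages outside a plausible range
SELECT StagingId, Age
FROM stg.People
WHERE Age NOT BETWEEN 0 AND 120;

-- Treat empty strings the same as NULL
UPDATE stg.People
SET MiddleName = NULLIF(LTRIM(RTRIM(MiddleName)), '');
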
Technical errors
        Improperly formatted dates
        Out-of-alignment flat files
        Too many/too few delimiters
Simple inaccuracies
       Improper formatting (email addresses, phone numbers)
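A coarse LIKE pattern is often enough to flag badly formatted values for review, e.g. email addresses (hypothetical names):

-- Emails that don't match a minimal name@domain.tld shape
SELECT StagingId, Email
FROM stg.People
WHERE Email NOT LIKE '%_@_%._%';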

What causes dirty data?

Causes of dirty data:
       Unvalidated user input 
       Lack of proper database constraints and/or application logic
       Importing bad data from other systems
       ETL errors
Now What?
Clean it up
       Test your cleansing logic in stage/test/QA first
        Cleanse directly in production
        Don’t cleanse at all

What to do with unresolvable bad data?
       Update to NULL or unknown member
       Mark as suspect
       Write to triage (see the sketch below)
      Stop the ETL
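The triage option can be as simple as parking unresolvable rows with a reason code for later review, assuming a hypothetical triage table:

-- Move rows that failed numeric conversion into triage
INSERT INTO etl.TriageRows (SourceKey, FailureReason, RawValue, LoggedAt)
SELECT StagingId, 'Amount failed numeric conversion', Amount, SYSDATETIME()
FROM stg.Orders
WHERE TRY_CAST(Amount AS decimal(18,2)) IS NULL
  AND Amount IS NOT NULL;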

Row Delimiters
In Denali (SQL Server 2012), flat files whose rows end early – with fewer columns than the format defines – are parsed differently: by default, SSIS always looks for a row delimiter in unquoted data and starts a new row when one is seen, so such a file is parsed as one would expect.
This new behavior is on by default, but can be disabled at any time using the AlwaysCheckForRowDelimiters property on the Flat File connection manager.

Embedded Qualifiers

Another convention in delimited files is the use of a qualifier character to "escape", or embed, a qualifier character within a qualified string. For example, consider this data:
1148,Can't Buy a Thrill
1149,Echoes, Silence, Patience & Grace

To allow the literal commas to be part of the string, the use of qualifiers is typical in a flat file representing this data. However, if an apostrophe (') is used as the qualifier, the apostrophe in the first row must be escaped, and a typical way to accomplish this is by doubling the qualifier character:
1148,'Can''t Buy a Thrill'
1149,'Echoes, Silence, Patience & Grace'
Prior to Denali, such files would always fail to parse: though SSIS supported qualifiers, SSIS had no support for embedded qualifiers, and would treat the first qualifier character (in this case the first apostrophe after the “n” in “Can’t Buy a Thrill”) as the end of that field, and throw an error due to the lack of a delimiter immediately after the apostrophe.
In Denali, this is fixed: the data above is parsed properly, with each doubled qualifier collapsed back to a single apostrophe in the resulting column value.


SSIS: How to resolve the Excel import 255-character truncation issue?

So in our first scenario all went fine: the Excel driver scanned the first 8 rows and determined that the EmpComments column needed to hold at least 686 characters.

In our second example, we inserted a row in between, and as you can see the first 8 rows have at most 77 characters in the EmpComments column. The 9th row is 686 characters long, so the package failed to insert it and returned a truncation error.

So what is the solution? All we need to do is tell the Excel driver to scan more rows before guessing the data type and data length of the columns. Right?

Yes, that is exactly what we have to do. I went to the registry and changed the TypeGuessRows value to 0 (for the Jet provider it lives under HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\4.0\Engines\Excel). When TypeGuessRows is 0, the driver scans up to 16,384 rows before guessing the data type and data length of each column.

After changing the TypeGuessRows value, the SSIS package executed successfully and all 9 rows were imported from the Excel spreadsheet.
