The dangers:
• Dirty data
• Complex or poorly defined ETL requirements
• Unexpected metadata changes
• Unstable sources/destinations
• Project managers
What to note?
• Success and failure of operations
• Row counts
• Run times
• Validation information
• Warnings
Why?
• Know what to expect
• Plan for growth
• Cover your assets
In the SSIS world, it’s all about the log:
• SSIS logging
• SQL Server log
• Custom logging
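The custom-logging option usually means a small table of your own. A minimal sketch (table, column, and package names below are illustrative, not from the presentation) that captures success/failure, row counts, and run times, typically populated from Execute SQL Tasks at the start and end of a package:

    -- Minimal custom ETL log table; names are illustrative.
    CREATE TABLE dbo.EtlRunLog
    (
        RunLogID     INT IDENTITY(1,1) PRIMARY KEY,
        PackageName  NVARCHAR(260)  NOT NULL,
        StartTime    DATETIME2(0)   NOT NULL,
        EndTime      DATETIME2(0)   NULL,
        RowsLoaded   INT            NULL,
        Succeeded    BIT            NULL,
        ErrorMessage NVARCHAR(4000) NULL
    );

    -- At package start:
    INSERT INTO dbo.EtlRunLog (PackageName, StartTime)
    VALUES (N'LoadCustomers.dtsx', SYSDATETIME());

    -- At package end (same batch shown here for brevity; in a package,
    -- carry the RunLogID in a variable between Execute SQL Tasks):
    UPDATE dbo.EtlRunLog
    SET    EndTime    = SYSDATETIME(),
           RowsLoaded = 12345,   -- e.g. passed in from a Row Count variable
           Succeeded  = 1
    WHERE  RunLogID = SCOPE_IDENTITY();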
SSIS Package Logging
• It’s already there
• Easy to start
• Flexible events and destinations
• Can be unwieldy
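When a package uses the SSIS log provider for SQL Server, events are written to the dbo.sysssislog table in the configured database (in recent versions; SQL Server 2005 used sysdtslog90), so the log can be queried like any other table. A small example:

    -- Recent errors and warnings from the SQL Server log provider table.
    SELECT TOP (100)
           starttime,
           event,        -- OnError, OnWarning, OnPostExecute, ...
           source,       -- package or task name
           message
    FROM   dbo.sysssislog
    WHERE  event IN (N'OnError', N'OnWarning')
    ORDER BY starttime DESC;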
SSIS Catalog Logging
• Version 2012 only
• Easiest to configure
• Design time or runtime
• Least flexible
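Catalog logging in 2012 lands in the SSISDB database and is exposed through views such as catalog.executions and catalog.operation_messages. A hedged sketch of the kinds of queries involved (the execution_id value below is hypothetical):

    -- Recent package executions with status and timing.
    SELECT TOP (50)
           e.execution_id,
           e.folder_name,
           e.project_name,
           e.package_name,
           e.status,       -- 1=created, 2=running, 3=canceled, 4=failed, 7=succeeded
           e.start_time,
           e.end_time
    FROM   SSISDB.catalog.executions AS e
    ORDER BY e.start_time DESC;

    -- Messages (errors, warnings, information) for one execution.
    SELECT m.message_time,
           m.message_type, -- 120=error, 110=warning, 70=information
           m.message
    FROM   SSISDB.catalog.operation_messages AS m
    WHERE  m.operation_id = 12345   -- hypothetical execution_id
    ORDER BY m.message_time;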
Server/engine logging
- SQL Engine error log
- DMVs
- Third party tools
- Windows log
- PerfMon
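As one illustration of the DMV option, a hedged sketch that watches requests currently executing on the instance (any ETL-specific filtering is up to you):

    -- Currently executing requests and their statements.
    SELECT r.session_id,
           r.status,
           r.command,
           r.wait_type,
           r.cpu_time,
           r.total_elapsed_time,
           t.text AS sql_text
    FROM   sys.dm_exec_requests AS r
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
    WHERE  r.session_id <> @@SPID;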
Perform at your Best
It’s not just SSIS
Proper query techniques for relational sources
Effective indexing for sources and destinations
Using OPTION (FAST n)
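As a hedged illustration of the OPTION (FAST n) hint (table, columns, and the row count are illustrative): the hint asks the optimizer to favor a plan that returns the first n rows quickly, which can let the data flow start filling buffers sooner.

    -- Optimize for returning the first 10,000 rows quickly.
    SELECT CustomerID, FirstName, LastName, ModifiedDate
    FROM   dbo.Customer
    WHERE  ModifiedDate >= '20240101'
    OPTION (FAST 10000);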
Streamline your data flows
• Transformations matter!
• Know the blocking properties of transformations
Nonblocking transforms do not hold buffers:
• Derived Column
• Conditional Split
• Row Count
Partially blocking transforms will queue up buffers as needed:
- Merge Join
- Lookup
- Union All
Fully blocking transforms will not pass any data through until all of the data has been buffered at that transformation:
- Sort
- Aggregate
Be aware of memory use!
- LOB (large object) columns will always spool to disk rather than staying in memory.
- [N]VARCHAR(MAX)
- Memory buffers may spill over to disk
Manage your sources
• Don’t use the table drop-down list – specify your query, including only the necessary columns
• Be mindful of indexes when writing data retrieval queries
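A minimal illustration of the first point, with hypothetical table and column names: name only the columns the data flow needs instead of accepting the table drop-down (which behaves like SELECT *), and filter at the server.

    -- Source query for an OLE DB Source: only the needed columns, filtered at the server.
    SELECT OrderID,
           CustomerID,
           OrderDate,
           TotalDue
    FROM   dbo.SalesOrderHeader
    WHERE  OrderDate >= DATEADD(DAY, -1, CAST(SYSDATETIME() AS DATE));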
Manage your destinations
- Use FAST LOAD for SQL Server destinations
- Index management (consider dropping indexes before large loads and rebuilding afterward)
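A hedged sketch of the index-management idea (index and table names are illustrative): drop a nonclustered index before a large load and re-create it once the data flow finishes.

    -- Drop the nonclustered index before the load...
    DROP INDEX IX_FactSales_OrderDate ON dbo.FactSales;

    -- ...run the data flow with FAST LOAD into dbo.FactSales...

    -- ...then re-create the index once the load completes.
    CREATE NONCLUSTERED INDEX IX_FactSales_OrderDate
        ON dbo.FactSales (OrderDate)
        INCLUDE (CustomerID, TotalDue);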
Go Parallel!
• Parallel operations can yield faster data flows
Clean it UP
In ETL, the greatest dangers often lie in the small things.
• Like an infection, bad data can fester for a while until it’s too late
• Caught early, problems with dirty data are more easily solved
What is dirty data?
Types of dirty data:
• Data type mismatches
• Domain violations
• Semantic violations
• Technical errors
• Simple inaccuracies
Data type mismatches
• Non-numeric data in numeric fields
• Decimal data in integer fields
• Incorrect precision / rounding
• Truncation
Domain violations
• Invalid dates
• Incorrect addresses
Semantic violations
• Data outside of a reasonable range (such as a person’s age in the thousands of years)
• Inconsistent use of NULL, blanks, and zeroes
Technical errors
• Improperly formatted dates
• Out-of-alignment flat files
• Too many/too few delimiters
Simple inaccuracies
• Misspellings
• Duplications
• Improper formatting (email addresses, phone numbers)
• Case
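A few hedged profiling queries that catch several of the issues above (the staging table and columns are hypothetical; TRY_CONVERT requires SQL Server 2012 or later):

    -- Data type mismatches: non-numeric data in a column that should be numeric.
    SELECT *
    FROM   stg.Customer
    WHERE  Age IS NOT NULL
      AND  TRY_CONVERT(INT, Age) IS NULL;

    -- Semantic violations: values outside a reasonable range.
    SELECT *
    FROM   stg.Customer
    WHERE  TRY_CONVERT(INT, Age) NOT BETWEEN 0 AND 150;

    -- Simple inaccuracies: duplicate rows on the business key.
    SELECT Email, COUNT(*) AS DuplicateCount
    FROM   stg.Customer
    GROUP BY Email
    HAVING COUNT(*) > 1;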
What causes dirty data?
Causes of dirty data:
• Internal:
  • Unvalidated user input
  • Lack of proper database constraints and/or application logic
• External:
  • Bad data imported from other systems
  • ETL errors
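As a hedged illustration of the “lack of proper database constraints” cause, these are the kinds of constraints that keep bad values out at the source (table, column, and constraint names are illustrative):

    -- A range check and a referential constraint that stop bad values at the source.
    ALTER TABLE dbo.Customer
        ADD CONSTRAINT CK_Customer_Age CHECK (Age BETWEEN 0 AND 150);

    ALTER TABLE dbo.Customer
        ADD CONSTRAINT FK_Customer_Country
            FOREIGN KEY (CountryCode) REFERENCES dbo.Country (CountryCode);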
Now What?
Clean it up
• Test your cleansing logic in stage/test/QA first
• Cleanse directly in production
• Don’t cleanse at all
What to do with unresolvable bad data?
• Delete
• Update to NULL or unknown member
• Mark as suspect
• Write to triage
• Stop the ETL
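A hedged sketch of the “write to triage” option: a triage table (names are illustrative) that an error output or conditional split can redirect unresolvable rows into for later review.

    -- Rows that cannot be resolved are preserved here for later review.
    CREATE TABLE dbo.CustomerTriage
    (
        TriageID    INT IDENTITY(1,1) PRIMARY KEY,
        SourceRow   NVARCHAR(MAX) NULL,  -- raw row or the offending columns
        ErrorCode   INT           NULL,  -- SSIS ErrorCode from the error output
        ErrorColumn INT           NULL,  -- SSIS ErrorColumn lineage ID from the error output
        LoadedAt    DATETIME2(0)  NOT NULL DEFAULT (SYSDATETIME())
    );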
Uneven Number of Columns
In Denali, a flat file with an uneven number of columns per row is parsed differently – by default, SSIS will always look for a row delimiter in unquoted data and start a new row if one is seen, so such a file is parsed as one would expect.
This new behavior is on by default, but can be disabled at any time using the AlwaysCheckForRowDelimiters property on the connection manager.
Embedded Qualifiers
Another convention in delimited files is the use of a qualifier character to “escape” or embed a qualifier character into a qualified string, for example:

ID   | Title
1148 | Can’t Buy a Thrill
1149 | Echoes, Silence, Patience & Grace
To allow the literal commas to be part of the string, the use of qualifiers is typical in a flat file representing this data. However, if an apostrophe (') is used for qualifying, the apostrophe in the first row must be escaped, and a typical way to accomplish this is by doubling the qualifier character:

ID,Title
1148,'Can''t Buy a Thrill'
1149,'Echoes, Silence, Patience & Grace'

Prior to Denali, such files would always fail to parse: though SSIS supported qualifiers, SSIS had no support for embedded qualifiers, and would treat the first qualifier character (in this case the first apostrophe after the “n” in “Can’t Buy a Thrill”) as the end of that field, and throw an error due to the lack of a delimiter immediately after the apostrophe.
In Denali, this is fixed, and the data above is properly parsed as:

ID   | Title
1148 | Can’t Buy a Thrill
1149 | Echoes, Silence, Patience & Grace
SSIS: How to resolve Excel Import 255 character Truncation issue?
So in our first scenario all went fine: Ms-Excel scanned the 8 rows and determined that the length of the EmpComments column would be at least 686 characters.
In our second example, we inserted a row in between, and as you can see the first 8 rows have a maximum length of 77 characters in the EmpComments column. The 9th row is 686 characters long, so the package failed to insert it and returned a truncation error.
So what is the solution? All we need to do is tell Ms-Excel to scan some more rows before assuming the data type and data length of the columns. Right?
Yes, that is exactly what we have to do. I went to the registry and changed the TypeGuessRows value to 0. When the TypeGuessRows value is 0, it scans 16,384 rows before assuming the data type and data length of the column.
After changing the TypeGuessRows value, the SSIS package executed successfully. All 9 rows were imported from the Ms-Excel spreadsheet.