Chapter 1: Data Warehouse Logical Design
Lesson 1: Introducing Star and Snowflake Schemas
Lesson Summary
■■ The Star schema is the most common design for a DW.
■■ The Snowflake schema is more appropriate for POC projects.
■■ You should also determine the granularity of fact tables, as well as auditing and lineage needs.
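To make the Star schema concrete, here is a minimal T-SQL sketch. All object names are illustrative assumptions, not objects from this lesson; the point is the shape: a central fact table at a declared grain, referencing the surrounding dimensions through integer surrogate keys.

-- Two small dimensions with integer surrogate keys (hypothetical names).
CREATE TABLE dbo.DimCustomer
(
    CustomerKey  INT          NOT NULL PRIMARY KEY,  -- surrogate key
    CustomerName NVARCHAR(50) NOT NULL
);

CREATE TABLE dbo.DimDate
(
    DateKey  INT  NOT NULL PRIMARY KEY,              -- e.g., 20120301
    FullDate DATE NOT NULL
);

-- The central fact table; its grain is one row per customer per date.
CREATE TABLE dbo.FactSales
(
    CustomerKey INT   NOT NULL REFERENCES dbo.DimCustomer (CustomerKey),
    DateKey     INT   NOT NULL REFERENCES dbo.DimDate (DateKey),
    SalesAmount MONEY NOT NULL,                      -- additive measure
    PRIMARY KEY (CustomerKey, DateKey)
);

A report then needs only one join per dimension, which is why Star schema queries stay simple.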
Lesson Review
Answer the following questions to test your knowledge of the
information in this lesson. You
can find the answers to these questions and explanations of
why each answer choice is correct
or incorrect in the “Answers” section at the end of this
chapter.
1. Reporting from a Star schema is simpler than reporting
from a normalized online
transactional processing (OLTP) schema. What are the reasons
for wanting simpler
reporting? (Choose all that apply.)
A. A Star schema typically has fewer tables than a normalized schema. Therefore, queries are simpler because they require fewer joins.
B. A Star schema has better support for numeric data types
than a normalized relational
schema; therefore, it is easier to create aggregates.
C. There are specific Transact-SQL expressions that deal
with Star schemas.
D. A Star schema is standardized and narrative; you can find the information you need for a report quickly.
2. You are creating a quick POC project. Which schema is the
most suitable for this kind
of a project?
A. Star schema
B. Normalized schema
C. Snowflake schema
D. XML schema
3. A Star schema has two types of tables. What are those two
types? (Choose all that
apply.)
A. Lookup tables
B. Dimensions
C. Measures
D. Fact tables
Lesson 2: Designing Dimensions
Lesson Summary
■■ In a dimension, you have the following column types: keys, names, attributes, member properties, translations, and lineage.
■■ Some attributes form natural hierarchies.
■■ There are standard solutions for the Slowly Changing Dimensions (SCD) problem.
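As a quick illustration of the Type 2 SCD solution, here is a hedged T-SQL sketch using a Type 2 variant of the hypothetical DimCustomer from Lesson 1 (the customer and city values are invented). The surrogate key lets the same business key appear in several rows, and a flag denotes the current one.

-- A customer dimension prepared for Type 2 changes.
CREATE TABLE dbo.DimCustomer
(
    CustomerKey INT IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key
    CustomerId  INT          NOT NULL,                   -- business key; may repeat
    City        NVARCHAR(50) NOT NULL,                   -- attribute tracked as Type 2
    IsCurrent   BIT          NOT NULL DEFAULT (1)        -- denotes the current row
);

-- When the source reports a new city for customer 42, close the old row
-- and insert a new current row instead of updating in place.
UPDATE dbo.DimCustomer
SET IsCurrent = 0
WHERE CustomerId = 42 AND IsCurrent = 1;

INSERT INTO dbo.DimCustomer (CustomerId, City, IsCurrent)
VALUES (42, N'Vienna', 1);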
Lesson Review
1. You implement a Type 2 solution for an SCD problem for a
specific column. What do
you actually do when you get a changed value for the column
from the source system?
A. Add a column for the previous value to the table. Move the current value of the updated column to the new column. Update the current value with the new value from the source system.
B. Insert a new row for the same dimension member with the
new value for the
updated column. Use a surrogate key, because the business
key is now duplicated.
Add a flag that denotes which row is current for a member.
C. Do nothing, because in a DW, you maintain history, you do
not update dimension
data.
D. Update the value of the column just as it was updated in
the source system.
2. Which kind of a column is not a part of a dimension?
A. Attribute
B. Measure
C. Key
D. Member property
E. Name
3. How can you spot natural hierarchies in a Snowflake
schema?
A. You need to analyze the content of the attributes of each
dimension.
B. Lookup tables for each dimension provide natural
hierarchies.
C. A Snowflake schema does not support hierarchies.
D. You should convert the Snowflake schema to the Star
schema, and then you would
spot the natural hierarchies immediately.
Lesson 3: Designing Fact Tables
Lesson Summary
■■ Fact tables include measures, foreign keys, and possibly an additional primary key and lineage columns.
■■ Measures can be additive, non-additive, or semi-additive.
■■ For many-to-many relationships, you can introduce an additional intermediate dimension.
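Semi-additivity is easiest to see with a balance measure. In the hypothetical sketch below, UnitBalance can be summed over customers within a single date, but summing it over the Date dimension would double-count; there you would take, for example, the last known balance instead.

-- A hypothetical inventory fact table with a semi-additive measure.
CREATE TABLE dbo.FactInventory
(
    CustomerKey INT NOT NULL,
    DateKey     INT NOT NULL,
    UnitBalance INT NOT NULL,   -- semi-additive: not additive over Date
    PRIMARY KEY (CustomerKey, DateKey)
);

-- Valid: SUM over the Customer dimension within each date.
SELECT DateKey, SUM(UnitBalance) AS TotalBalance
FROM dbo.FactInventory
GROUP BY DateKey;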
Lesson Review
1. Over which dimension can you not use the SUM aggregate
function for semi-additive
measures?
A. Customer
B. Product
C. Date
D. Employee
2. Which measures would you expect to be non-additive?
(Choose all that apply.)
A. Price
B. Debit
C. SalesAmount
D. DiscountPct
E. UnitBalance
3. Which kind of a column is not part of a fact table?
A. Lineage
B. Measure
C. Key
D. Member property
Case Scenarios
In the following case scenarios, you apply what you’ve
learned about Star and Snowflake
schemas, dimensions, and the additivity of measures. You can
find the answers to these questions
in the “Answers” section at the end of this chapter.
Case Scenario 1: A Quick POC Project
You are hired to implement a quick POC data warehousing
project. You have to prepare the
schema for sales data. Your customer’s SME would like to
analyze sales data over customers,
products, and time. Before creating a DW and tables, you
need to make a couple of decisions
and answer a couple of questions:
1. What kind of schema would you use?
2. What would the dimensions of your schema be?
3. Do you expect additive measures only?
Case Scenario 2: Extending the POC Project
After you implemented the POC sales data warehouse in Case
Scenario 1, your customer
was very satisfied. In fact, the business would like to
extend the project to a real, long-term
data warehouse. However, when interviewing analysts, you
also discovered some points of
dissatisfaction.
Interviews
Here’s a list of company personnel who expressed some
dissatisfaction during their interviews,
along with their statements:
■■ Sales SME “I don’t see correct aggregates over regions for historical data.”
■■ DBA Who Creates Reports “My queries are still complicated, with many joins.”
You need to solve these issues.
Questions
1. How would you address the Sales SME issue?
2. What kind of schema would you implement for a long-term
DW?
3. How would you address the DBA’s issue?
Suggested Practices
To help you successfully master the exam objectives
presented in this chapter, complete the
following tasks.
Analyze the AdventureWorksDW2012 Database Thoroughly
To understand all kinds of dimensions and fact tables, you should analyze the AdventureWorksDW2012 sample database thoroughly. It contains cases for many data warehousing problems you might encounter.
■■ Practice 1 Check all fact tables. Find all semi-additive measures.
■■ Practice 2 Find all hierarchies possible for the DimCustomer dimension. Include attributes in the dimension and attributes in the lookup DimGeography table.
Check the SCD and Lineage in the AdventureWorksDW2012 Database
Although the AdventureWorksDW2012 database exemplifies many
cases for data warehousing,
not all possible problems are covered. You should check for
what is missing.
■■ Practice 1 Is there room for lineage information in all dimensions and fact tables? How would you accommodate this information?
■■ Practice 2 Are there some important dimensions, such as those representing customers and products, that are not prepared for a Type 2 SCD solution? How would you prepare those dimensions for a Type 2 SCD solution?
Answers
This section contains answers to the lesson review questions
and solutions to the case scenarios
in this chapter.
Lesson 1
1. Correct Answers: A and D
A. Correct: A Star schema typically has fewer tables than a normalized
schema.
B. Incorrect: The support for data types depends on the
database management
system, not on the schema.
C. Incorrect: There are no specific Transact-SQL expressions
or commands for Star
schemas. However, there are some specific optimizations for
Star schema queries.
D. Correct: The Star schema is a de facto standard for data
warehouses. It is narrative;
the central table—the fact table—holds the measures, and the
surrounding tables,
the dimensions, give context to those measures.
2. Correct Answer: C
A. Incorrect: The Star schema is more suitable for long-term
DW projects.
B. Incorrect: A normalized schema is appropriate for OLTP
LOB applications.
C. Correct: A Snowflake schema is appropriate for POC projects, because dimensions are normalized and thus closer to the source normalized schema.
D. Incorrect: An XML schema is used for validating XML
documents, not for a DW.
3. Correct Answers: B and D
A. Incorrect: Lookup tables are involved in both Snowflake
and normalized schemas.
B. Correct: Dimensions are part of a Star schema.
C. Incorrect: Measures are columns in a fact table, not
tables by themselves.
D. Correct: A fact table is the central table of a Star
schema.
Lesson 2
1. Correct Answer: B
A. Incorrect: This is Type 3 SCD management.
B. Correct: This is how you handle changes when you
implement a Type 2 SCD
solution.
C. Incorrect: Maintaining history does not mean that the
content of a DW is static.
D. Incorrect: This is Type 1 SCD management.
2. Correct Answer: B
A. Incorrect: Attributes are part of dimensions.
B. Correct: Measures are part of fact tables.
C. Incorrect: Keys are part of dimensions.
D. Incorrect: Member properties are part of dimensions.
E. Incorrect: Name columns are part of dimensions.
3. Correct Answer: B
A. Incorrect: You need to analyze the attribute names and
content in order to spot
the hierarchies in a Star schema.
B. Correct: Lookup tables for dimensions denote natural hierarchies in a Snowflake schema.
C. Incorrect: A Snowflake schema supports hierarchies.
D. Incorrect: You do not need to convert a Snowflake to a
Star schema to spot the
hierarchies.
Lesson 3
1. Correct Answer: C
A. Incorrect: You can use SUM aggregate functions for
semi-additive measures over
the Customer dimension.
B. Incorrect: You can use SUM aggregate functions for
semi-additive measures over
the Product dimension.
C. Correct: You cannot use SUM aggregate functions for
semi-additive measures
over the Date dimension.
D. Incorrect: You can use SUM aggregate functions for
semi-additive measures over
the Employee dimension.
2. Correct Answers: A and D
A. Correct: Prices are not additive measures.
B. Incorrect: Debit is an additive measure.
C. Incorrect: Amounts are additive measures.
D. Correct: Discount percentages are not additive measures.
E. Incorrect: Balances are semi-additive measures.
3. Correct Answer: D
A. Incorrect: Lineage columns can be part of a fact table.
B. Incorrect: Measures are included in a fact table.
C. Incorrect: A fact table includes key columns.
D. Correct: Member property is a type of column in a dimension.
Case Scenario 1
1. For a quick POC project, you should use the Snowflake
schema.
2. You would have customer, product, and date dimensions.
3. No, you should expect some non-additive measures as well.
For example, prices and
various percentages, such as discount percentage, are
non-additive.
Case Scenario 2
1. You should implement a Type 2 solution for the slowly
changing customer dimension.
2. For a long-term DW, you should choose a Star schema.
3. With Star schema design, you would address the DBA’s
issue automatically.
Chapter 2: Implementing a Data Warehouse
Lesson 1: Implementing Dimensions and Fact Tables
Lesson Summary
■■ In this lesson, you learned about implementing a data warehouse.
■■ For a data warehouse database, you should use the Simple recovery model.
■■ When creating a database, allocate enough space for data files and log files to prevent autogrowth of the files.
■■ Use surrogate keys in dimensions in which you expect SCD Type 2 changes.
■■ Use computed columns.
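A minimal sketch of these recommendations, assuming a DW database named MyDW and an invented product dimension:

-- Use the Simple recovery model for the data warehouse database.
ALTER DATABASE MyDW SET RECOVERY SIMPLE;

-- A dimension with an IDENTITY surrogate key and a computed column.
CREATE TABLE dbo.DimProduct
(
    ProductKey   INT IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key
    ProductId    INT   NOT NULL,                          -- business key
    ListPrice    MONEY NOT NULL,
    StandardCost MONEY NOT NULL,
    Margin AS (ListPrice - StandardCost)                  -- computed column
);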
Lesson Review
1. Which database objects and object properties can you use
for autonumbering?
(Choose all that apply.)
A. IDENTITY property
B. SEQUENCE object
C. PRIMARY KEY constraint
D. CHECK constraint
2. What columns do you add to a table to support Type 2 SCD
changes? (Choose all that
apply.)
A. Member properties
B. Current row flag
C. Lineage columns
D. Surrogate key
3. What is an inferred member?
A. A row in a fact table added during dimension load
B. A row with aggregated values
C. A row in a dimension added during fact table load
D. A computed column in a fact table
Lesson 2: Managing the Performance of a Data Warehouse
Lesson Summary
■■ In this lesson, you learned how to optimize data warehouse query performance.
■■ In a DW, you should not use many nonclustered indexes.
■■ Use small, integer surrogate columns for clustered primary keys.
■■ Use indexed views.
■■ Use columnstore indexes and exploit batch processing.
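The last two bullets can be sketched as follows, reusing the hypothetical dbo.FactSales table from Chapter 1 (all names are assumptions). Note that in SQL Server 2012, a nonclustered columnstore index makes its base table read-only until the index is dropped or disabled.

-- An indexed view that pre-aggregates sales by date.
CREATE VIEW dbo.SalesByDate
WITH SCHEMABINDING
AS
SELECT DateKey,
       SUM(SalesAmount) AS TotalSales,
       COUNT_BIG(*)     AS NumRows   -- required in an aggregated indexed view
FROM dbo.FactSales
GROUP BY DateKey;
GO
CREATE UNIQUE CLUSTERED INDEX UCI_SalesByDate ON dbo.SalesByDate (DateKey);
GO
-- A nonclustered columnstore index on the fact table.
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_FactSales
ON dbo.FactSales (CustomerKey, DateKey, SalesAmount);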
Lesson Review
1. Which types of data compression are supported by SQL
Server? (Choose all that apply.)
A. Bitmap
B. Unicode
C. Row
D. Page
2. Which operators can benefit from batch processing?
(Choose all that apply.)
A. Hash Join
B. Merge Join
C. Scan
D. Nested Loops Join
E. Filter
3. Why would you use indexed views? (Choose all that apply.)
A. To speed up queries that aggregate data
B. To speed up data load
C. To speed up selective queries
D. To speed up queries that involve multiple joins
Lesson 3: Loading and Auditing Loads
Lesson Summary
■■ Table partitioning is extremely useful for large fact tables with columnstore indexes.
■■ Partition switch is a metadata operation only if an index is aligned with its base table.
■■ You can add lineage information to your dimensions and fact tables to audit changes to your DW on a row level.
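A hedged sketch of the partitioning objects (boundary values and names are invented): the partition function defines the logical boundaries, the partition scheme maps the resulting partitions to filegroups, and an aligned staging table can then be switched in as a metadata-only operation.

-- Logical partitioning: yearly boundaries on an integer date key.
CREATE PARTITION FUNCTION PF_DateKey (INT)
AS RANGE LEFT FOR VALUES (20111231, 20121231);

-- Physical partitioning: map all partitions to one filegroup for simplicity.
CREATE PARTITION SCHEME PS_DateKey
AS PARTITION PF_DateKey ALL TO ([PRIMARY]);

-- If indexes are aligned and the staging table has a matching check
-- constraint, loading a partition is a metadata-only switch:
-- ALTER TABLE dbo.FactSalesStage SWITCH TO dbo.FactSales PARTITION 2;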
Lesson Review
Answer the following questions to test your knowledge of the
information in this lesson. You
can find the answers to these questions and explanations of
why each answer choice is correct
or incorrect in the “Answers” section at the end of this
chapter.
1. The database object that maps partitions of a table to
filegroups is called a(n)
A. Aligned index
B. Partition function
C. Partition column
D. Partition scheme
2. If you want to switch content from a nonpartitioned table
to a partition of a partitioned
table, what conditions must the nonpartitioned table meet?
(Choose all that apply.)
A. It must have the same constraints as the partitioned
table.
B. It must have the same compression as the partitioned
table.
C. It must be in a special PartitionedTables schema.
D. It must have a check constraint on the partitioning
column that guarantees that all
of the data goes to exactly one partition of the partitioned
table.
E. It must have the same indexes as the partitioned table.
3. Which of the following T-SQL functions is not very useful
for capturing lineage
information?
A. APP_NAME()
B. USER_NAME()
C. DEVICE_STATUS()
D. SUSER_SNAME()
Case Scenarios
In the following case scenarios, you apply what you’ve
learned about optimized querying and
securing a data warehouse. You can find the answers to these
questions in the “Answers” section
at the end of this chapter.
Case Scenario 1: Slow DW Reports
You have created a data warehouse and populated it. End
users have started using it for
reports. However, they have also begun to complain about the
performance of the reports.
Some of the very slow reports calculate running totals. You
need to answer the following
questions.
1. What changes can you implement in your DW to speed up the
reports?
2. Does it make sense to check the source queries of the
reports with running totals?
Case Scenario 2: DW Administration Problems
Your end users are happy with the DW reporting performance.
However, when talking with
a DBA, you were notified of potential problems. The DW
transaction log grows by more than
10 GB per night. In addition, end users have started to
create reports from staging tables, and
these reports show messy data. End users complain to the DBA
that they cannot trust your
DW if they get such messy data in a report.
1. How can you address the runaway log problem?
2. What can you do to prevent end users from using the
staging tables?
Suggested Practices
To help you successfully master the exam objectives
presented in this chapter, complete the
following tasks.
Test Different Indexing Methods
For some queries, indexed views could be the best performance
booster. For other queries,
columnstore indexes could be more appropriate. Still other
queries would benefit from nonclustered
indexes on foreign keys.
■■ Practice 1 Write an aggregate query for Internet sales in the AdventureWorksDW2012 sample database. Create an appropriate indexed view and run the aggregate query. Check the statistics IO and execution plan.
■■ Practice 2 Drop the indexed view and create a columnstore index. Run the query and check the statistics IO and execution plan again.
■■ Practice 3 Drop the columnstore index and create nonclustered indexes on all foreign keys of the fact table included in joins. Run the query and check the statistics IO and execution plan again.
■■ Practice 4 In the DimCustomer dimension of the AdventureWorksDW2012 sample database, there is a Suffix column. It is NULL for all rows but three. Create a filtered nonclustered index on this column and test queries that read data from the DimCustomer dimension using different filters. Check how the query performs when Suffix is NULL and when Suffix is known (is not NULL).
Test Table Partitioning
In order to understand table partitioning thoroughly, you
should test it with aligned and
nonaligned indexes.
■■ Practice 1 Partition the FactInternetSales table in the AdventureWorksDW2012 sample database. Create aligned nonclustered indexes on all foreign keys of the fact table included in joins of the query from the previous practice. Run the query and check the execution plan.
■■ Practice 2 Create nonaligned nonclustered indexes on all foreign keys of the fact table included in joins of the query from the previous practice. Run the query and check the execution plan again.
Answers
This section contains answers to the lesson review questions
and solutions to the case scenarios
in this chapter.
Lesson 1
1. Correct Answers: A and B
A. Correct: The IDENTITY property autonumbers rows.
B. Correct: You can use the new SQL Server 2012 SEQUENCE
object for
autonumbering.
C. Incorrect: Primary keys are used to uniquely identify
rows, not for
autonumbering.
D. Incorrect: Check constraints are used to enforce data
integrity, not for
autonumbering.
2. Correct Answers: B and D
A. Incorrect: Member properties are dimension columns used
for additional information
on reports only.
B. Correct: You need a current flag for denoting the current
row when you implement
Type 2 SCD changes.
C. Incorrect: Lineage columns are used, as their name
states, to track the lineage
information.
D. Correct: You need a new, surrogate key when you implement
Type 2 SCD changes.
3. Correct Answer: C
A. Incorrect: You do not add rows to a fact table during
dimension load.
B. Incorrect: You do not create rows with aggregated values.
C. Correct: A row in a dimension added during fact table
load is called an inferred
member.
D. Incorrect: A computed column is just a computed column,
not an inferred member.
Lesson 2
1. Correct Answers: B, C, and D
A. Incorrect: SQL Server does not support bitmap
compression.
B. Correct: SQL Server supports Unicode compression. It is
applied automatically
when you use either row or page compression.
C. Correct: SQL Server supports row compression.
D. Correct: SQL Server supports page compression.
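To illustrate the supported options on a hypothetical fact table: you can estimate the savings first and then rebuild with the chosen setting; Unicode compression is applied automatically with either one.

-- Estimate page compression savings.
EXEC sp_estimate_data_compression_savings
     @schema_name = 'dbo', @object_name = 'FactSales',
     @index_id = NULL, @partition_number = NULL, @data_compression = 'PAGE';

-- Rebuild the table with page compression (ROW is the lighter alternative).
ALTER TABLE dbo.FactSales REBUILD WITH (DATA_COMPRESSION = PAGE);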
2. Correct Answers: A, C, and E
A. Correct: Hash joins can use batch processing.
B. Incorrect: Merge joins do not use batch processing.
C. Correct: Scan operators can benefit from batch
processing.
D. Incorrect: Nested loops joins do not use batch
processing.
E. Correct: Filter operators use batch processing as well.
3. Correct Answers: A and D
A. Correct: Indexed views are especially useful for speeding
up queries that aggregate
data.
B. Incorrect: As with any indexes, indexed views only slow
down data load.
C. Incorrect: For selective queries, nonclustered indexes
are more appropriate.
D. Correct: Indexed views can also speed up queries that
perform multiple joins.
Lesson 3
1. Correct Answer: D
A. Incorrect: Aligned indexes are indexes with the same
partitioning as their base
table.
B. Incorrect: The partition function does logical
partitioning.
C. Incorrect: The partition column is the column used for
partitioning.
D. Correct: The partition scheme does physical partitioning.
2. Correct Answers: A, B, D, and E
A. Correct: It must have the same constraints as the
partitioned table.
B. Correct: It must have the same compression as the
partitioned table.
C. Incorrect: There is no special schema for partitioned
tables.
D. Correct: It must have a check constraint to guarantee
that all data goes to a single
partition.
E. Correct: It must have the same indexes as the partitioned
table.
3. Correct Answer: C
A. Incorrect: The APP_NAME() function can be useful for
capturing lineage
information.
B. Incorrect: The USER_NAME() function can be useful for
capturing lineage
information.
C. Correct: There is no DEVICE_STATUS() function in T-SQL.
D. Incorrect: The SUSER_SNAME() function can be useful for
capturing lineage
information.
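To see the three usable functions side by side, a single SELECT suffices; during a load, such values are typically written into lineage columns (the added SYSDATETIME() column is a common companion, not part of the question).

SELECT APP_NAME()    AS ApplicationName,  -- e.g., the loading application
       USER_NAME()   AS DatabaseUser,
       SUSER_SNAME() AS LoginName,
       SYSDATETIME() AS LoadDateTime;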
Case Scenario 1
1. You should consider using columnstore indexes, indexed
views, data compression, and
table partitioning.
2. Yes, it is definitely worth checking the queries of the
running totals reports. The queries
probably use joins or subqueries to calculate the running
totals. Consider using
window functions for these calculations.
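As a hedged sketch of that rewrite (table and column names assumed), a SQL Server 2012 window aggregate with a frame computes a running total in a single pass, with no self-joins or correlated subqueries:

SELECT CustomerKey, DateKey, SalesAmount,
       SUM(SalesAmount) OVER (PARTITION BY CustomerKey
                              ORDER BY DateKey
                              ROWS UNBOUNDED PRECEDING) AS RunningTotal
FROM dbo.FactSales;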
Case Scenario 2
1. You should check the DW database recovery model and
change it to Simple. In addition,
you could use the DBCC SHRINKFILE command to shrink the transaction
log to a
reasonable size.
2. End users apparently have permissions, at least the SELECT permission, on the staging tables. Advise the DBA to revoke permissions from end users on staging tables. In addition, to speed up the security administration, you should put all staging tables in a separate schema, thus allowing the DBA to administer them as a group.
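A minimal sketch of both fixes, assuming the database is named MyDW, its log file MyDW_log, the staging schema stg, and a ReportUsers role:

-- Stop the log growth and reclaim space.
ALTER DATABASE MyDW SET RECOVERY SIMPLE;
DBCC SHRINKFILE (N'MyDW_log', 1024);   -- target size in MB

-- With staging tables grouped in one schema, one statement removes access.
REVOKE SELECT ON SCHEMA::stg FROM ReportUsers;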
Chapter 3: Creating SSIS Packages
Lesson 1: Using the SQL Server Import and Export Wizard
Lesson Summary
■■ The SQL Server Import and Export Wizard can be used for simple data movement operations.
■■ The wizard allows you to create the destination database.
■■ Multiple objects can be transferred in the same operation.
■■ If the destination objects do not already exist, they can be created by the process.
■■ The SSIS package created by the wizard can be saved and reused.
Lesson Review
Answer the following questions to test your knowledge of the
information in this lesson. You
can find the answers to these questions and explanations of
why each answer choice is correct
or incorrect in the “Answers” section at the end of this
chapter.
1. You need to move data from a production database into a
testing database. You need
to extract the data from several objects in the source
database, but your manager
has asked you to only copy about 10 percent of the rows from
the largest production
tables. The testing database already exists, but without any
tables. How would you approach
this task?
A. Use the Import and Export Wizard, copy all tables from
the source database to the
empty destination database, and delete the excess rows from
the largest tables.
B. Use the Import and Export Wizard multiple times—once for
all the smaller tables,
and once for each large table, using the Write A Query To
Specify The Data To
Transfer option to restrict the rows.
C. Use the Import and Export Wizard, copy all tables from
the source database to the
empty destination database, and restrict the number of rows
for each large table
by using the Edit SQL option in the Column Mappings window.
D. Use the Import and Export Wizard, configure it to copy
all tables from the source
database to the empty destination database, save the SSIS
package, and then,
before executing it, edit it by using SSDT to restrict the
number of rows extracted
from the large tables.
2. You need to move data from an operational database into a
data warehouse for the
very first time. The data warehouse has already been set up,
and it already contains
some reference data. You have just finished preparing views
in the operational database
that correspond to the dimension and fact tables of the data
warehouse. How
would you approach this task?
A. Use the Import and Export Wizard and copy data from the
dimension and fact
views in the operational database into the tables in the
data warehouse, by using
the Drop And Re-create The Destination Table option in the
Column Mappings
window for every non-empty destination table.
B. Use the Import and Export Wizard, configure it to copy
data from the dimension
and fact views in the operational database into the tables
in the data warehouse,
save the SSIS package, and then edit it by using SSDT to add
appropriate data
merging functionalities for all destination tables.
C. Use the Import and Export Wizard and copy data from the
dimension and fact
views in the operational database into the tables in the
data warehouse, by using
the Merge Data Into The Destination Table option in the
Column Mappings window
for every non-empty destination table.
D. Use SSDT instead of the Import and Export Wizard, because
the wizard lacks appropriate
data transformation and merging capabilities.
3. When SSIS packages are saved to DTSX files, what format
is used to store the SSIS
package definitions?
A. They are stored as binary files.
B. They are stored as plain text files.
C. They are stored as XML files.
D. They are stored as special Microsoft Word documents.
Lesson 2: Developing SSIS Packages in SSDT
Lesson Summary
■■ SSIS projects are developed by using SSDT, a specialized version of Visual Studio.
■■ SSDT provides the complete integrated development environment (IDE) required for efficient development of SSIS packages.
■■ The SSIS toolbox is context-aware and will allow you access to either control flow tasks or data flow components, depending on whether you are designing the control flow or the data flow.
Lesson Review
Answer the following questions to test your knowledge of the
information in this lesson. You
can find the answers to these questions and explanations of
why each answer choice is correct
or incorrect in the “Answers” section at the end of this
chapter.
1. Which statements best describe SQL Server Development
Tools (SSDT)? (Choose all
that apply.)
A. SSDT is an extension of the SQL Server Management Studio
that can be used to
create SSIS packages by means of a special wizard.
B. SSDT is a special edition of the SQL Server Management
Studio, designed to provide
an improved user experience to developers who are not
particularly familiar
with database administration.
C. SSDT is a special edition of Visual Studio, distributed
with SQL Server 2012, providing
a rich database development tool set.
D. SSDT is a new service in SQL Server 2012 that can be used
to perform SQL Server
maintenance tasks, such as data movements and similar data
management processes.
2. Which of the following statements about simple and
complex data movements are
true? (Choose all that apply.)
A. Simple data movements only have a single data source and
a single data
destination.
B. Complex data movements require data to be transformed
before it can be stored
at the destination.
C. In simple data movements, data transformations are
limited to data type
conversion.
D. In complex data movements, additional programmatic logic
is required to merge
source data with destination data.
3. Which of the following statements are true? (Choose all
that apply.)
A. An SSIS package can contain one or more SSDT solutions,
each performing a
specific data management operation.
B. An SSIS project can contain one or more SSIS packages.
C. An SSIS project can contain exactly one SSIS package.
D. SSIS packages contain programmatic logic used in data
movements and data
transformation operations.
Lesson 3: Introducing Control Flow, Data Flow, and Connection Managers
Lesson Summary
■■ Existing SSIS packages can be added to SSIS projects in SQL Server Data Tools (SSDT).
■■ Control flows contain the definitions of data management operations.
■■ Control flows determine the order and the conditions of execution.
■■ SSIS package settings can be parameterized, which allows them to be changed without direct access to SSIS package definitions.
Lesson Review
Answer the following questions to test your knowledge of the
information in this lesson. You
can find the answers to these questions and explanations of
why each answer choice is correct
or incorrect in the “Answers” section at the end of this
chapter.
1. The Execute SQL Task allows you to execute SQL statements
and commands against
the data store. What tools do you have at your disposal when
developing SSIS packages
to develop and test a SQL command? Choose all that apply.
A. SQL Server Management Studio (SSMS)
B. SQL Server Data Tools (SSDT)
C. The Execute SQL Task Editor
D. SQL Server Enterprise Manager (SSEM)
2. You need to execute two data flow operations in parallel
after an Execute SQL Task has
been completed. How can you achieve that? (Choose all that
apply.)
A. There is no way for two data flow operations to be
executed in parallel in the same
SSIS package.
B. You can place both data flows inside the same data flow
task and create a precedence
constraint leading from the preceding Execute SQL Task to
the data flow task.
C. You can create two separate data flow tasks and create
two precedence constraints
leading from the preceding Execute SQL Task to each of the
two data flow tasks.
D. You can create two separate data flow tasks, place them
inside a third data flow
task, and create a precedence constraint leading from the
preceding Execute SQL
Task to the third data flow task.
3. Which precedence constraint can you use to allow Task B
to execute after Task A even
if Task A has failed?
A. The failure precedence constraint, leading from Task A to
Task B.
B. The success precedence constraint, leading from Task A to
Task B.
C. The completion precedence constraint, leading from Task A
to Task B.
D. Use two precedence constraints—a success precedence
constraint, and a failure
precedence constraint, both leading from Task A to Task B.
Case Scenarios
In the following case scenarios, you apply what you’ve
learned about creating SSIS packages.
You can find the answers to these questions in the “Answers”
section at the end of this
chapter.
Case Scenario 1: Copying Production Data to Development
Your IT solution has been deployed to production, version
one is complete, and it is now time
to start the work on the next version. To keep the data in
the development and testing environment
as up to date as possible, your manager has asked you to
design a data movement
solution to be used on a regular basis to copy a subset of
production data from the production
data store into the development data store.
1. What method would you use to transfer the data on demand?
2. How would you maximize the reusability of the solution?
Case Scenario 2: Connection Manager Parameterization
Data warehousing maintenance solutions have outgrown your
company’s existing infrastructure,
so new servers had to be installed, and this time all data
warehousing applications will
use a dedicated network. In phase 1, all of your SSIS
solutions must be redeployed to new
servers, and the system administrator has decided that SSIS
projects deserve more network
bandwidth, so in phase 2 you will be allowed to dedicate a
greater share of the network
bandwidth to your data movement processes.
1. How much additional development work will you have to do
to complete phase 1?
2. What will you have to do to reconfigure all of the
connection managers to use larger
network packets for phase 2?
Suggested Practices
To help you successfully master the exam objectives
presented in this chapter, complete the
following tasks.
Use the Right Tool
If you can combine the benefits of early deployment with the
benefits of providing ultimate
stability for the entire data warehouse solution, you might
be able to achieve “the best of
both worlds.” For instance, early deployment might allow
other members of your team to
begin their work sooner (for example, if you deploy an
initial version of the data warehouse
before the model has been finalized, this might allow report
development tasks to start early).
And stabilizing the entire data warehouse solution has
obvious benefits (for example, if you
implement changes to the data warehouse model in stages,
this might allow the iterations in
report development to be as reasonable as possible).
■■ Practice 1 Develop an initial data movement by using the Import and Export Wizard, using views in the source data store to emulate data transformations.
■■ Practice 2 Modify the initial data movement—add proper data transformation logic as well as appropriate logic to merge new or modified data with existing data.
Account for the Differences Between Development and Production Environments
After a final version of a data warehousing solution has
been deployed to production, any
additional work on the current version, even if these
development activities could in fact be
reduced to “tweaking,” will eventually cause delays in the
development of the next version.
With good parameterization, the burden of “tweaking” an
existing solution is lifted from the
shoulders of the developer and is placed on the shoulders of
the administrator.
■■ Practice 1 Review your existing data movement solutions, and create a list of settings that could be beneficial to their maintenance in production.
■■ Practice 2 Create a list of elements in your data movement solution that could be parameterized, but due to their possible effect on the operational characteristics of the solution probably should not be parameterized.
Answers
This section contains answers to the lesson review questions
and solutions to the case scenarios
in this chapter.
Lesson 1
1. Correct Answers: B and D
A. Incorrect: Even though this might seem like the quickest
solution, it might only be
quick to develop. Copying a large amount of data from the
production environment
to a testing environment should be avoided, especially if
most of the data is
just going to be discarded from the destination database
afterward.
B. Correct: It might appear cumbersome to design several
SSIS packages for a single
data movement operation, but this approach will solve the
principal problem while
also following good data management practices, such as
avoiding unnecessary
data movements.
C. Incorrect: The Edit SQL option in the Column Mappings
window of the Import and
Export Wizard cannot be used to modify the data retrieval
query, only the destination
table definition.
D. Correct: An SSIS package created by the Import and Export
Wizard can be edited
by using SSDT.
2. Correct Answers: B and D
A. Incorrect: Dropping and re-creating tables cannot be used
to merge data.
B. Correct: You can use SSDT to add data merging
capabilities to an SSIS package
created by the Import and Export Wizard.
C. Incorrect: No such option exists in the Import and Export
Wizard.
D. Correct: You can use SSDT to design pretty much any kind
of data movement
processes, especially when you want complete control over
the operations needed
by the process, but keep in mind that designing SSIS
packages “from scratch” may
not be as time efficient as possible.
3. Correct Answer: C
A. Incorrect: SSIS package files are not stored in binary
format.
B. Incorrect: SSIS package files might appear as if they are
saved as plain text files,
but they are actually well-formed XML files.
C. Correct: SSIS package files are stored in XML format; the
DTSX file extension is
used for distinction.
D. Incorrect: SSIS packages are not Microsoft Word
documents.
Lesson 2
1. Correct Answer: C
A. Incorrect: SSDT is not an extension of SSMS. It is a
stand-alone application.
B. Incorrect: SSDT is not a special edition of SSMS. It is a
special edition of Visual
Studio.
C. Correct: SSDT is a special edition of Visual Studio, with
a complete database development
tool set.
D. Incorrect: SSDT is not a service.
2. Correct Answers: B and D
A. Incorrect: Simple data movements can have as many data
sources and as many
data destinations as needed.
B. Correct: Data transformations are present in complex data
movements.
C. Incorrect: Typically, in simple data movements, no
transformations are needed,
because the data is transferred unchanged. However, it is
possible to transform the
data at the source—such as by making retrieval queries or by
using views or other
similar techniques.
D. Correct: Additional programmatic logic to merge source
data with destination
data is present in complex data movements.
3. Correct Answers: B and D
A. Incorrect: SSIS packages cannot contain SSDT solutions.
B. Correct: An SSIS project can contain as many SSIS
packages as needed.
C. Incorrect: An SSIS project can contain more than a single
SSIS package.
D. Correct: SSIS packages contain the programmatic logic
used in data management
operations, such as data movements and data transformations.
Lesson 3
1. Correct Answers: A and B
A. Correct: SSMS provides all the necessary functionalities
to develop and test
SQL code.
B. Correct: SSDT does provide a query designer; it is
available from the Data menu,
under Transact-SQL Editor/New Query Connection.
Alternatively, the query editor
can also be started from the SQL Server Object Explorer by
right-clicking a database
node, and selecting New Query... from the shortcut menu.
C. Incorrect: The Execute SQL Task Editor is just a text box
into which you can type or
paste a SQL statement.
D. Incorrect: The Enterprise Manager was replaced with SSMS
in SQL Server 2005.
2. Correct Answers: B and C
A. Incorrect: Parallel data flow execution is supported.
B. Correct: You can place multiple data flow operations
inside the same data flow task.
C. Correct: You can, of course, place data flows in separate
data flow tasks, and you
can create multiple precedence constraints leading from or
to the same task, as
long as any two tasks are connected to each other only once.
D. Incorrect: You cannot place a data flow task inside a
data flow task, because it
cannot contain tasks, only data flow components.
3. Correct Answer: C
A. Incorrect: The failure precedence constraint will allow
Task B to execute only if
Task A has failed.
B. Incorrect: The success precedence constraint will prevent
Task B from executing
if Task A fails.
C. Correct: The completion precedence constraint will allow
Task B to execute
regardless of whether Task A has succeeded or has failed.
D. Incorrect: Only a single precedence constraint can be
used to connect two
distinct tasks.
Case Scenario 1
1. An SSIS package stored in the file system, in the
database, or in an unscheduled SQL
Server Agent Job would be appropriate.
2. At the very least, the SSIS package would have to be
parameterized so that it can be
configured appropriately for the specific environment in
which it is going to be used.
Additionally, the programmatic logic should account for
merging new or modified
data with existing data.
Case Scenario 2
1. A properly parameterized SSIS package can be redeployed
and reconfigured as many
times as needed, without the need for any additional
development activities.
2. The network packet size property of OLE DB connections is not exposed to parameterization; therefore, the entire connection string would have to be parameterized.
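The parameter for phase 2 would therefore carry a complete OLE DB connection string; the Packet Size keyword is where the larger network packets are requested (server, catalog, and value below are illustrative):

Provider=SQLNCLI11.1;Data Source=NewDWServer;Initial Catalog=MyDW;Integrated Security=SSPI;Packet Size=32767;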
Chapter 4: Designing and Implementing Control Flow
Lesson 1: Connection Managers
Lesson Summary
■■ Connection managers are used to establish connections to data sources.
■■ Different data sources require different types of connection managers.
■■ The usability of a connection manager within an SSIS project or an SSIS package is determined by its scope.
Lesson Review
Answer the following questions to test your knowledge of the
information in this lesson. You
can find the answers to these questions and explanations of
why each answer choice is correct
or incorrect in the “Answers” section at the end of this
chapter.
1. You need to extract data from delimited text files. What
connection manager type
would you choose?
A. A Flat File connection manager
B. An OLE DB connection manager
C. An ADO.NET connection manager
D. A File connection manager
2. Some of the data your company processes is sent in from
partners via email.
How would you configure an SMTP connection manager to
extract files from email
messages?
A. In the SMTP connection manager, configure the
OperationMode setting to Send
And Receive.
B. It is not possible to use the SMTP connection manager in
this way, because it can
only be used by SSIS to send email messages.
C. The SMTP connection manager supports sending and
receiving email messages by
default, so no additional configuration is necessary.
D. It is not possible to use the SMTP connection manager for
this; use the IMAP (Internet
Message Access Protocol) connection manager instead.
3. You need to extract data from a table in a SQL Server
2012 database. What connection
manager types can you use? (Choose all that apply.)
A. An ODBC connection manager
B. An OLE DB connection manager
C. A File connection manager
D. An ADO.NET connection manager
Lesson 2: Control Flow Tasks and Containers
Lesson Summary
■■ A rich collection of tasks supporting the most common data management operations is provided by the SSIS design model.
■■ Control flow is defined by precedence constraints that determine the order and conditions of execution.
■■ Tasks representing logical units of work can be grouped in containers.
■■ Loop containers allow a unit of work to be executed repeatedly.
Lesson Review
Answer the following questions to test your knowledge of the
information in this lesson. You
can find the answers to these questions and explanations of
why each answer choice is correct
or incorrect in the “Answers” section at the end of this
chapter.
1. In your SSIS solution, you need to load a large set of
rows into the database as quickly
as possible. The rows are stored in a delimited text file,
and only one source column
needs its data type converted from String (used by the
source column) to Decimal
(used by the destination column). What control flow task
would be most suitable for
this operation?
A. The File System task would be perfect in this case,
because it can read data from
files and can be configured to handle data type conversions.
B. The Bulk Insert task would be the most appropriate,
because it is the quickest and
can handle data type conversions.
C. The data flow task would have to be used, because the
data needs to be transformed
before it can be loaded into the table.
D. No single control flow task can be used for this
operation, because the data needs
to be extracted from the source file, transformed, and then
loaded into the destination
table. At least three different tasks would have to be
used—the Bulk Insert
task to load the data into a staging database, a Data
Conversion task to convert
the data appropriately, and finally, an Execute SQL task to
merge the transformed
data with existing destination data.
2. A part of your data consolidation process involves extracting
data from Excel workbooks.
Occasionally, the data contains errors that cannot be
corrected automatically.
How can you handle this problem by using SSIS?
A. Redirect the failed data flow task to an External Process
task, open the problematic
Excel file in Excel, and prompt the user to correct the file
before continuing the
data consolidation process.
B. Redirect the failed data flow task to a File System task
that moves the erroneous
file to a dedicated location where an information worker can
correct it later.
C. If the error cannot be corrected automatically, there is
no way for SSIS to continue
with the automated data consolidation process.
D. None of the answers above are correct. Due to Excel’s
strict data validation rules,
an Excel file cannot ever contain erroneous data.
3. In your ETL process, a few values need to be retrieved
from a database at run time,
based on another value available at run time, and they
cannot be retrieved as part of
any data flow task. Which task can you use in this case?
A. The Execute T-SQL Statement task
B. The Execute SQL task
C. The Expression task
D. The Execute Process task
Lesson 3: Precedence Constraints
Lesson Summary
■■ Precedence constraints determine the order of execution and the conditions that must be met for the process to either continue or stop.
■■ Precedence constraints can even be used to allow the process to recover from failures.
Lesson Review
Answer the following questions to test your knowledge of the
information in this lesson. You
can find the answers to these questions and explanations of
why each answer choice is correct
or incorrect in the “Answers” section at the end of this
chapter.
1. How is the order of execution, or the sequence of
operations, defined in an SSIS
package?
A. The SSIS run time engine determines the order of
execution automatically, based
on the type of operations, the available software and
hardware resources, and the
size of data.
B. The sequence is defined by using precedence constraints.
C. The sequence is defined by using the Sequence container.
D. The sequence is defined at design time by using
precedence constraints and
Sequence containers, but at run time the SSIS engine
executes the operations in
the order set by the most appropriate execution plan for
maximum performance.
2. How does the Failure constraint affect the execution
order?
A. The following task will only execute after the preceding
task has failed.
B. The following task will only execute if the preceding
task has not failed.
C. The following task will never execute, because this
constraint is only used at
design time.
D. The following task will execute regardless of whether the
preceding task has failed,
but an error will be written to the SSIS log.
3. In your ETL process, there are three external processes
that need to be executed in
sequence, but you do not want to stop execution if any of
them fails. Can this be
achieved by using precedence constraints? If so, which
precedence constraints can
be used?
A. No, this cannot be achieved just by using precedence
constraints.
B. Yes, this can be achieved by using completion precedence
constraints between the
first and the second and between the second and the third
Execute Process tasks,
and by using a success precedence constraint between the
third Execute Process
task and the following task.
C. Yes, this can be achieved by using completion precedence
constraints between
the first and the second, between the second and the third,
and between the third
Execute Process task and the following task.
D. Yes, this can be achieved by using failure precedence
constraints between the first
and the second, and between the second and the third Execute
Process tasks, and
by using a completion precedence constraint between the
third Execute Process
task and the following task.
Case Scenarios
In the following case scenarios, you apply what you’ve
learned about designing and implementing
control flow. You can find the answers to these questions in
the “Answers” section at
the end of this chapter.
Case Scenario 1: Creating a Cleanup Process
In your data management system, there are two data
warehouses (that is, in addition to
the operational data store); the principal data warehouse
contains all the data, including all
historical data, and the secondary data warehouse is used by
web applications, exposing your
data to customers, and should therefore only contain current
data. Your data warehousing
solution is already in place, moving data into both data
warehouses.
You have been tasked with creating an additional process to
determine which data is no
longer current and must therefore be removed from the
secondary data warehouse.
1. How will you determine which rows need to be removed?
2. What technique would you use to perform the removal as
efficiently as possible?
Case Scenario 2: Integrating External Processes
In its data management scenarios, your company uses a mix of
proprietary solutions that your
team has developed and third-party, off-the-shelf solutions
that to you, a seasoned developer,
appear just like black boxes—you can trust these solutions
to work as expected without
even a faint clue about how this is actually done.
In your data warehousing solution, you need to consolidate
data from a solution of your
own and from two diverse black boxes, one of which has a
special data extraction tool (a
stand-alone application) that retrieves data from the
internal store and saves it to files in the
file system, while the other one exposes its API layer,
providing access to its internal data
retrieval functionalities.
1. What functionalities does SSIS provide that can be used
to integrate such diverse solutions
into a single SSIS process?
2. How would you use the SQL Server platform to solve this
problem?
Suggested Practices
To help you successfully master the exam objectives
presented in this chapter, complete the
following tasks.
A Complete Data Movement Solution
Think about all the processes involved in preparing data to
be consumed by the analytical
tools used by your managers when determining company
strategy. The data must be consolidated,
cleansed, loaded into the data warehouse (which includes
transforming it appropriately
and merging it with existing data), and finally, it must be loaded
into the multidimensional SQL Server Analysis Services database.
All of these processes need to be performed automatically,
at predefined times, and provide
current and coherent results in the end. Otherwise, the
operations and—eventually—
even the future of the company might be at stake.