Use SSIS for ETL from Hadoop

by Mark Kromer
May 08, 2013

HDInsight is the Microsoft version of Hadoop on Windows which provides most of the commonly-used aspects of the Apache Hadoop Big Data platform including the HDFS file system, sqoop for data import/export, Hive for SQL queries, the MapReduce distributed programming infrastructure and ODBC drivers to connect to your data in HDFS from tools like Excel and SQL Server. In this demo, I am going to show you how to use those ODBC drivers to extract data from Hadoop using HDInsight as the data store, transform it in SSIS, and load the data into SQL Server (extraction, transformation and loading or ETL).

First, you need to create a simple schema on top of your data in Hadoop. I have a simple text file that I've loaded into HDInsight (Microsoft's Hadoop on Windows) by generating a schema in Hive (using HQL) and loading that flat file into that schema. Also note that I am only loading a few simple rows into the Hive table from a flat file for this simple example:

create external table ext_sales
  lastname string,
  productid int,
  quantity int,
  sales_amount float
row format delimited fields terminated by ',' stored as textfile location '/user/makromer/hiveext/input';

LOAD DATA INPATH '/user/makromer/import/sales.csv' OVERWRITE INTO TABLE ext_sales;
Kromer  123     5       55
Smith   567     1       25
Jones   123     9       99
James   11      12      1
Johnson 456     2       2.5
Singh   456     1       3.25
Yu      123     1       11
We can view the data in Hadoop from that Hive table using the HDInsight Web user tool:

Now, in Visual Studio (SSDT, to be precise), I've created a new Integration Services project that will use that Hadoop-based data store as the source by using the ODBC driver for Hive as the source. Note that I first created a user DSN in Windows with that ODBC connector which I can than reference in any tool uses ODBC such as SSIS, Excel, etc:

I used that ODBC source to connect with HDInsight as a source, used a Derived Column tranformation to add a 7% sales tax each sales row and then add that to the total sales amount, creating 2 new fields which will be added to the data flow and inserted into the target SQL Server 2012 database, which is using the normal SQL Server destination in SSIS:

Mapping the fields, changing data types between unicode and DTS data types all are the same actions that you take in SSIS as if your were using a file or another RDBMS as a source. No different here in that SSIS just sees Hive as an ODBC source:

Now you can run the package and the rows are tranfered into SQL Server, including the new additional columns from my Derived Column transformation:

That's pretty much it. Pretty simple & straightforward. With this technique for ETL, you can use the power of Hadoop's distributed nodes and MapReduce to crunch very large complex and unstructured data and then ETL portions of that into SQL Server for other systems, analytics, etc. It also is a more natural way to import data into SQL Server from Hadoop without needing the learn or use sqoop, which should benefit data warehouse and business intelligence developers and architects who are already well-versed in tools like SSIS.

Discuss this Blog Entry 4

on Jun 14, 2013

Not sure why you have so small screen shots, could barely see anything. Wish you had bigger screen shots or at least link to each of them

on Jun 17, 2013

Thank your for the feedback. We're having some CMS image issues, so our sincere apologies!

on Jul 2, 2013

This is great stuff. The only thing I would want additionally is inclusion of (or a hyperlink to) a large sample set to show the strength of HDInsight with SSIS. Sometimes it is hard to see a solution's value when sample data is simple/small/clean.

on Sep 6, 2013

Thanks! If you download the on-prem HD Insight installer from Microsoft, you will see in the Getting Started samples, a couple of decent data sets with Web log output that you may find useful.

Please log in or register to post comments.