Azure Data Factory vs SSIS
- ADF is a cloud-based PaaS service (authored through the ADF editor in the Azure portal), so it requires no hardware or installation. SSIS is a desktop tool (authored through SSDT) that needs a good-sized server you must manage, plus a SQL Server installation with SSIS
- ADF uses JSON scripts for its orchestration (coding), while SSIS uses drag-and-drop tasks (no coding); a minimal pipeline definition is sketched after this list
- ADF is pay-as-you-go via an Azure subscription; SSIS is a license cost as part of SQL Server
- ADF can spin up HDInsight clusters and run Pig and Hive scripts. SSIS can do this too, via the Azure Feature Pack for Integration Services (SSIS)
- SSIS has a powerful GUI, IntelliSense, and debugging. ADF has a basic editor with no IntelliSense or debugging
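To make the JSON point concrete, here is a minimal sketch of what an ADF v2 pipeline definition looks like when expressed as JSON. The pipeline and activity names ("HelloPipeline", "WaitBriefly") are hypothetical; the Wait activity simply pauses for the given number of seconds:

```json
{
    "name": "HelloPipeline",
    "properties": {
        "activities": [
            {
                "name": "WaitBriefly",
                "type": "Wait",
                "typeProperties": {
                    "waitTimeInSeconds": 30
                }
            }
        ]
    }
}
```

Real pipelines chain activities (Copy, HDInsight, stored procedure, etc.) in the same "activities" array.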
Data Factory Resources
Mapping Data Flows – Develop graphical data transformation logic at scale without writing code.
Triggers – Automate pipeline runs by creating and scheduling triggers. Data Factory supports three types of triggers: schedule, tumbling window, and event-based (a schedule trigger is sketched after this list).
Control flows – Visually construct workflows to orchestrate data integration and data transformation processes at scale.
Predefined templates – Use the expanding library of templates for common tasks such as building pipelines, copying data from a database, executing SSIS packages in Azure, and ETL.
Wrangling Data Flows – Explore and prepare your data at scale without writing code. Wrangling Data Flows are currently in public preview.
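As an illustration of the first trigger type, here is a hedged sketch of a schedule trigger that runs a pipeline once a day. The trigger and pipeline names ("DailyTrigger", "CopyLogsPipeline") and the start time are hypothetical:

```json
{
    "name": "DailyTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2021-01-01T00:00:00Z",
                "timeZone": "UTC"
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "CopyLogsPipeline",
                    "type": "PipelineReference"
                }
            }
        ]
    }
}
```

Tumbling window and event-based triggers follow the same overall shape, with different "type" and "typeProperties" values.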
Advantages
One of ADF's great advantages is its integration with other Azure services. For example, if you have a use case where files are dropped into Blob storage and need to be moved into Azure SQL Data Warehouse or Azure SQL Database, setting up that connection is relatively easy.
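As a sketch of that scenario, a Copy activity between a Blob dataset and an Azure SQL dataset might look like the following. The dataset names ("BlobLogs", "SqlLogsTable") are hypothetical placeholders for datasets you would define separately against the corresponding linked services:

```json
{
    "name": "CopyBlobToSqlPipeline",
    "properties": {
        "activities": [
            {
                "name": "CopyBlobToSql",
                "type": "Copy",
                "inputs": [
                    { "referenceName": "BlobLogs", "type": "DatasetReference" }
                ],
                "outputs": [
                    { "referenceName": "SqlLogsTable", "type": "DatasetReference" }
                ],
                "typeProperties": {
                    "source": { "type": "BlobSource" },
                    "sink": { "type": "AzureSqlSink" }
                }
            }
        ]
    }
}
```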
Another advantage is that pipelines can be parameterized, so a single pipeline can copy specific files (from a group) based on which event trigger fired. This means one pipeline can be reused for many files, as in the sketch below.
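A minimal sketch of that parameterization, assuming a Blob dataset that exposes a "fileName" parameter (the names "CopyOneFilePipeline", "BlobFile", and "SqlTable" are all hypothetical). The trigger, or a manual run, supplies "sourceFileName", and the same pipeline copies whichever file was named:

```json
{
    "name": "CopyOneFilePipeline",
    "properties": {
        "parameters": {
            "sourceFileName": { "type": "string" }
        },
        "activities": [
            {
                "name": "CopyNamedFile",
                "type": "Copy",
                "inputs": [
                    {
                        "referenceName": "BlobFile",
                        "type": "DatasetReference",
                        "parameters": {
                            "fileName": "@pipeline().parameters.sourceFileName"
                        }
                    }
                ],
                "outputs": [
                    { "referenceName": "SqlTable", "type": "DatasetReference" }
                ],
                "typeProperties": {
                    "source": { "type": "BlobSource" },
                    "sink": { "type": "AzureSqlSink" }
                }
            }
        ]
    }
}
```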
Use Case
For example, imagine a gaming company that collects petabytes of game logs that are produced by games in the cloud. The company wants to analyse these logs to gain insights into customer preferences, demographics, and usage behaviour. It also wants to identify up-sell and cross-sell opportunities, develop compelling new features, drive business growth, and provide a better experience to its customers.
To analyse these logs, the company needs to use reference data such as customer information, game information, and marketing campaign information that is in an on-premises data store. The company wants to utilize this data from the on-premises data store, combining it with additional log data that it has in a cloud data store.
To extract insights, it hopes to process the joined data by using a Spark cluster in the cloud (Azure HDInsight) and publish the transformed data into a cloud data warehouse such as Azure SQL Data Warehouse, so that it can easily build reports on top of it. The company wants to automate this workflow, and to monitor and manage it on a daily schedule. It also wants to execute the workflow when files land in a blob store container.
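The "execute when files land" requirement maps to ADF's event-based trigger. Here is a hedged sketch, with hypothetical trigger, container, and pipeline names and angle-bracket placeholders for the storage account resource ID:

```json
{
    "name": "GameLogsLandedTrigger",
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "blobPathBeginsWith": "/gamelogs/blobs/",
            "events": [ "Microsoft.Storage.BlobCreated" ],
            "scope": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "ProcessGameLogsPipeline",
                    "type": "PipelineReference"
                }
            }
        ]
    }
}
```

The daily schedule would be handled by a separate schedule trigger pointing at the same pipeline, like the one sketched earlier.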