Friday, August 10, 2018

Best ETL / Data Warehousing Tools in 2018

With so many data warehousing tools on the market, it is difficult to select the right one for your project. The following is a curated list of the most popular open source and commercial ETL tools, with their key features.

1) QuerySurge

QuerySurge is an ETL testing solution developed by RTTS. It is built specifically to automate the testing of Data Warehouses & Big Data. It ensures that the data extracted from data sources remains intact in the target systems as well.
  • Improve data quality & data governance
  • Accelerate your data delivery cycles
  • Helps to automate manual testing effort
  • Provides testing across different platforms such as Oracle, Teradata, IBM, Amazon, Cloudera, etc.
  • Speeds up the testing process by up to 1,000x and provides up to 100% data coverage
  • It integrates an out-of-the-box DevOps solution for most Build, ETL & QA management software
  • Deliver shareable, automated email reports and data health dashboards

2) MarkLogic:

MarkLogic is a data warehousing solution that makes data integration easier and faster using an array of enterprise features. This tool helps to perform very complex search operations. It can query data including documents, relationships, and metadata.
  • The Optic API can perform joins and aggregates over documents, triples, and rows.
  • It allows specifying more complex security rules for all the elements within documents
  • Writing, reading, patching, and deleting documents in JSON, XML, text, or binary formats
  • Database Replication for Disaster Recovery
  • Specify Output Options on the App Server Configuration
  • Importing and Exporting Configuration Information

3) Oracle:

Oracle data warehouse software is a collection of data which is treated as a unit. The purpose of this database is to store and retrieve related information. It helps the server to reliably manage huge amounts of data so that multiple users can access the same data.
  • Distributes data in the same way across disks to offer uniform performance
  • Works for single-instance and real application clusters
  • Offers real application testing
  • Common architecture between any Private Cloud and Oracle's public cloud
  • High-speed connection to move large amounts of data
  • Works seamlessly with UNIX/Linux and Windows platforms
  • It provides support for virtualization
  • Allows connecting to the remote database, table, or view

4) Amazon RedShift:

Amazon Redshift is an easy to manage, simple, and cost-effective data warehouse tool. It can analyze almost every type of data using standard SQL.
  • No Up-Front Costs for its installation
  • It allows automating most of the common administrative tasks to monitor, manage, and scale your data warehouse
  • Possible to change the number or type of nodes
  • Helps to enhance the reliability of the data warehouse cluster
  • Every data center is fully equipped with climate control
  • Continuously monitors the health of the cluster. It automatically re-replicates data from failed drives and replaces nodes when needed

5) Domo:

Domo is a cloud-based Data warehouse management tool that easily integrates various types of data sources, including spreadsheets, databases, social media and almost all cloud-based or on-premise Data warehouse solutions.
  • Help you to build your dream dashboard
  • Stay connected anywhere you go
  • Integrates all existing business data
  • Helps you to get true insights into your business data
  • Connects all of your existing business data
  • Easy Communication & messaging platform
  • It provides support for ad-hoc queries using SQL
  • It can handle many concurrent users running multiple complex queries

6) Teradata Corporation:

The Teradata Database is a commercially available shared-nothing, Massively Parallel Processing (MPP) data warehousing tool. It is one of the best data warehousing tools for viewing and managing large amounts of data.
  • Simple and cost-effective solutions
  • Suitable for organizations of any size
  • Quick and most insightful analytics
  • Get the same Database on multiple deployment options
  • It allows multiple concurrent users to ask complex questions related to data
  • It is entirely built on a parallel architecture
  • Offers High performance, diverse queries, and sophisticated workload management

7) SAP:

SAP is an integrated data management platform that maps all business processes of an organization. It is an enterprise-level application suite for open client/server systems. It has set new standards for providing the best business information management solutions.
  • It provides highly flexible and most transparent business solutions
  • The application developed using SAP can integrate with any system
  • It follows a modular concept for easy setup and space utilization
  • You can create a database system that combines analytics and transactions. These next-generation databases can be deployed on any device
  • Provide support for On-premise or cloud deployment
  • Simplified data warehouse architecture
  • Integration with SAP and non-SAP applications

8) SAS:

SAS is a leading data warehousing tool that allows accessing data across multiple sources. It can perform sophisticated analyses and deliver information across the organization.
  • Activities are managed from central locations, so users can access applications remotely via the Internet
  • Application delivery is typically closer to a one-to-many model than a one-to-one model
  • Centralized feature updating allows users to download patches and upgrades
  • Allows viewing raw data files in external databases
  • Manage data using tools for data entry, formatting, and conversion
  • Display data using reports and statistical graphics

9) IBM – DataStage:

IBM DataStage is a business intelligence tool for integrating trusted data across various enterprise systems. It leverages a high-performance parallel framework either in the cloud or on-premise. This data warehousing tool supports extended metadata management and universal business connectivity.
  • Support for Big Data and Hadoop
  • Additional storage or services can be accessed without the need to install new software and hardware
  • Real time data integration
  • Provide trusted ETL data anytime, anywhere
  • Solve complex big data challenges
  • Optimize hardware utilization and prioritize mission-critical tasks
  • Deploy on-premises or in the cloud

10) Informatica:

Informatica PowerCenter is a data integration tool developed by Informatica Corporation. The tool offers the capability to connect to and fetch data from different sources.
  • It has a centralized error logging system which facilitates logging errors and rejecting data into relational tables
  • Built-in intelligence to improve performance
  • Limit the Session Log
  • Ability to Scale up Data Integration
  • Foundation for Data Architecture Modernization
  • Better designs with enforced best practices on code development
  • Code integration with external Software Configuration tools
  • Synchronization amongst geographically distributed team members

11) MS SSIS:

SQL Server Integration Services (SSIS) is a data warehousing tool used to perform ETL operations, i.e. extract, transform and load data. SSIS also includes a rich set of built-in tasks.
  • Tightly integrated with Microsoft Visual Studio and SQL Server
  • Easier maintenance and package configuration
  • Allows removing network as a bottleneck for insertion of data
  • Data can be loaded in parallel and various locations
  • It can handle data from different data sources in the same package
  • SSIS can consume data from sources that are difficult to handle, such as FTP, HTTP, MSMQ, and Analysis Services
  • Data can be loaded in parallel to many varied destinations

12) Talend Open Studio:

Open Studio is an open source data warehousing tool developed by Talend. It is designed to convert, combine and update data in various locations. It provides an intuitive set of tools which make dealing with data a lot easier. It also allows big data integration, data quality, and master data management.
  • It supports extensive data integration transformations and complex process workflows
  • Offers seamless connectivity for more than 900 different databases, files, and applications
  • It can manage the design, creation, testing, deployment, etc of integration processes
  • Synchronize metadata across database platforms
  • Managing and monitoring tools to deploy and supervise the jobs

13) The Ab Initio software:

Ab Initio is a GUI-based parallel processing data warehousing tool for data analysis and batch processing. It is commonly used to extract, transform and load data.
  • Metadata management
  • Business and Process Metadata management
  • Ability to run, debug Ab Initio jobs and trace execution logs
  • Manage and run graphs and control the ETL processes
  • Components can execute simultaneously on various branches of a graph

14) Dundas:

Dundas is an enterprise-ready Business Intelligence platform. It is used for building and viewing interactive dashboards, reports, scorecards and more. It is possible to deploy Dundas BI as the central data portal for the organization or integrate it into an existing website as a custom BI solution.
  • Data warehousing tool for Business Users and IT Professionals
  • Easy access through web browser
  • Allows the use of sample or Excel data
  • Server application with full product functionality
  • Integrate and access all kinds of data sources
  • Ad hoc reporting tools
  • Customizable data visualizations
  • Smart drag and drop tools
  • Visualize data through maps
  • Predictive and advanced data analytics

15) Sisense:

Sisense is a business intelligence tool which analyses and visualizes both big and disparate datasets, in real-time. It is an ideal tool for preparing complex data for creating dashboards with a wide variety of visualizations.
  • Unify unrelated data into one centralized place
  • Create a single version of truth with seamless data
  • Build interactive dashboards with no technical skills
  • Query big data at very high speed
  • Access dashboards on mobile devices as well
  • Drag-and-drop user interface
  • Eye-grabbing visualization
  • Delivers interactive terabyte-scale analytics
  • Exports data to Excel, CSV, PDF, images and other formats
  • Ad-hoc analysis of high-volume data
  • Handles data at scale on a single commodity server
  • Identifies critical metrics using filtering and calculations

16) Tableau:

Tableau is an online data warehousing tool available in 3 versions: Desktop, Server, and Online. It is a secure, shareable and mobile-friendly data warehouse solution.
  • Connect to any data source securely on-premise or in the cloud
  • Ideal tool for flexible deployment
  • Big data, live or in-memory
  • Designed with a mobile-first approach
  • Securely share and collaborate on data
  • Centrally manage metadata and security rules
  • Powerful management and monitoring
  • Connect to any data anywhere
  • Get maximum value from your data with this business analytics platform
  • Share and collaborate in the cloud
  • Tableau seamlessly integrates with existing security protocols

17) MicroStrategy:

MicroStrategy is enterprise business intelligence application software. The platform supports interactive dashboards, scorecards, highly formatted reports, ad hoc queries and automated report distribution.
  • Unmatched speed, performance, and scalability
  • Maximize the value of investment made by enterprises
  • Eliminating the need to rely on multiple tools
  • Support for advanced analytics and big data
  • Get insight into complex business processes for strengthening organizational security
  • Powerful security and administration features

18) Pentaho

Pentaho is a Data Warehousing and Business Analytics Platform. The tool has a simplified and interactive approach which empowers business users to access, discover and merge all types and sizes of data.
  • Enterprise platform to accelerate the data pipeline
  • Community Dashboard Editor allows fast and efficient development and deployment
  • Big data integration without a need for coding
  • Simplified embedded analytics
  • Visualize data with custom dashboards
  • Ease of use with the power to integrate all data
  • Operational reporting for MongoDB

19) BigQuery:

Google's BigQuery is an enterprise-level data warehousing tool. It reduces the time for storing and querying massive datasets by enabling super-fast SQL queries. It also controls access to the project and to the data, including who may view or query it.
  • Offers flexible Data Ingestion
  • Read and write data via Cloud Dataflow, Hadoop, and Spark
  • Automatic Data Transfer Service
  • Full control over access to the data stored
  • BigQuery provides cost control mechanisms

20) Numetric:

Numetric is a fast and easy BI tool. It offers business intelligence solutions ranging from data centralization and cleaning to analysis and publishing. It is powerful, yet simple enough for anyone to use. This data warehousing tool helps to measure and improve productivity.
  • Data benchmarking
  • Budgeting & forecasting
  • Data chart visualizations
  • Data analysis
  • Data mapping & dictionary
  • Key performance indicators

21) Solver BI360 Suite:

Solver BI360 is a comprehensive business intelligence tool. It gives 360° insights into any data, using reporting, data warehousing, and interactive dashboards. BI360 drives effective, data-based productivity.
  • Excel-based reporting with predefined templates
  • Currency conversion and inter-company transactions elimination can be automated
  • User-friendly budgeting and forecasting feature
  • It reduces the time spent preparing reports and planning
  • Easy configuration with User-friendly interface
  • Automated data loading
  • Combine Financial and Operational Data
  • Allows viewing data in Data Explorer
  • Easily add modules and dimensions
  • Unlimited Trees on any dimension
  • Support for Microsoft SQL Server/SQL Azure

ETL Tools by Category

ETL Tools: A Modern List

We take a look at how ETL tools have evolved over the years to incorporate the cloud and open source. See which tools work best in which situations.

Extract, Transform, and Load (ETL) tools enable organizations to make their data accessible, meaningful, and usable across disparate data systems. When it comes to choosing the right ETL tool, there are many options to choose from. So, where should you start?
We've prepared a list that is simple to digest, organized into four categories to better help you find the best solution for your needs.
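
To make the three ETL stages concrete, here is a minimal sketch in Python using only the standard library. The sample CSV, the `payments` table, and the cleaning rules are all invented for illustration; a real ETL tool adds scheduling, connectors, and error handling on top of this basic shape.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would be a file, an API, or a database.
SOURCE_CSV = """id,name,amount
1, alice ,10.50
2,BOB,3.25
3,carol,
"""

def extract(text):
    """Extract: parse raw CSV text into one dictionary per row."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalize names, convert types, drop rows with no amount."""
    cleaned = []
    for row in rows:
        if not row["amount"].strip():
            continue  # skip incomplete records (an assumed cleaning rule)
        cleaned.append(
            (int(row["id"]), row["name"].strip().title(), float(row["amount"]))
        )
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments "
        "(id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # in-memory stand-in for a data warehouse
load(transform(extract(SOURCE_CSV)), conn)
result = conn.execute("SELECT id, name, amount FROM payments ORDER BY id").fetchall()
print(result)
```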

Incumbent Batch ETL Tools

Until recently, most of the world’s ETL tools were on-prem and based on batch processing. Historically, most organizations used to utilize their free compute and database resources, during off-hours, to perform nightly batches of ETL jobs and data consolidation. This is why, for example, you used to see your bank account updated only a day after you made any financial transaction.

Cloud Native ETL Tools

With IT moving to the cloud, more and more cloud-based ETL services started to emerge. Some of them keep the same basic batch model of the legacy platforms, while others start to offer real-time support, intelligent schema detection, and more.

Open Source ETL Tools

As in other areas of software infrastructure, ETL has had its own surge of open source tools and projects. Most of them were created as a modern management layer for scheduled workflows and batch processes. For example, Apache Airflow was developed by the engineering team at Airbnb, and Apache NiFi by the US National Security Agency (NSA).

Real-Time ETL Tools

Doing your ETL in batches makes sense only if you do not need your data in real-time. It might be good enough for salary reporting or tax calculations. However, most modern applications require real-time access to data from different sources. When you upload a picture to your Facebook account, you want your friends to see it immediately, not a day later.
This shift to real-time led to a profound change in architecture: from a model based on batch processing to a model based on distributed message queues and stream processing. Apache Kafka has emerged as the leading distributed message queue for modern data applications, and companies like Alooma and others are building modern ETL solutions on top of it, either as a SaaS platform or an on-prem solution.
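
The difference between the two models can be sketched in a few lines of Python. This is only an illustration: the in-process `queue.Queue` stands in for a distributed message queue such as Kafka, and the event fields (`user`, `amount`) are made up.

```python
from queue import Queue

# Hypothetical events, e.g. payments arriving over time.
events = [
    {"user": "a", "amount": 5},
    {"user": "b", "amount": 7},
    {"user": "a", "amount": 3},
]

def run_batch(collected):
    """Batch model: wait until all events are collected, then process them once."""
    totals = {}
    for event in collected:
        totals[event["user"]] = totals.get(event["user"], 0) + event["amount"]
    return totals

def run_stream(queue):
    """Streaming model: update state per event and emit a fresh result each time."""
    totals = {}
    while not queue.empty():
        event = queue.get()
        totals[event["user"]] = totals.get(event["user"], 0) + event["amount"]
        yield dict(totals)  # a snapshot is available after every single event

q = Queue()
for event in events:
    q.put(event)

batch_result = run_batch(events)
stream_snapshots = list(run_stream(q))
print(batch_result)
print(stream_snapshots)
```

Both approaches end with the same totals, but the streaming version has a usable answer after every event instead of only at the end of the batch window.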


Now that you know how ETL tooling has advanced over the years and which options work best in which scenarios, let's take a more visual look at how those tools have evolved:
[Image: how ETL tools have evolved]
This post contains some representative examples for each category to help you make the choice that meets your needs.

How to select the right ETL tool

First things first, if you don't think you need real-time updates or if you aren't handling data from streaming sources, you can get away with using a tool from any of the categories above.
That said, if you're dealing with streaming data, or very large amounts of data, or if you would rather build your own solution based on open source technology, you're going to want an ETL tool or platform that can keep up with your specific requirements.
If you want to work with your existing vendors, use on-prem technology, and don't rely on real-time processing, consider an incumbent batch tool.
If you prefer to use tools built and delivered via the cloud, or if you want to avoid the overhead of equipment and maintenance costs as your data needs expand, consider a cloud-based solution.
If you want to build the solution yourself and/or if you're comfortable administering, maintaining, and operating open source tools, look into open source offerings.
If your business depends on real-time processing of events, especially large volume data sources and streams, you're going to want a modern ETL platform designed with modern needs in mind.
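
The guidelines above can be condensed into a small decision helper. This is a rough sketch, not a definitive rule: the function name, the parameter names, and in particular the order in which the conditions are checked are assumptions made for illustration.

```python
def suggest_etl_category(needs_real_time, wants_self_managed_open_source, prefers_cloud):
    """Map a few yes/no requirements to one of the four categories in this post.

    The precedence (real-time first, then open source, then cloud) is an
    assumption; adjust it to your own priorities.
    """
    if needs_real_time:
        return "real-time"
    if wants_self_managed_open_source:
        return "open source"
    if prefers_cloud:
        return "cloud native"
    return "incumbent batch"

# Example: an on-prem shop with nightly reporting and existing vendors.
print(suggest_etl_category(False, False, False))
# Example: a team processing high-volume event streams.
print(suggest_etl_category(True, False, True))
```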

Wednesday, July 11, 2018

When to use the different log levels?

There are different levels at which to log messages, in decreasing order of severity:
  1. FATAL
  2. ERROR
  3. WARN
  4. INFO
  5. DEBUG
  6. TRACE
  • Trace - Only when I would be "tracing" the code and trying to find one part of a function specifically.
  • Debug - Information that is diagnostically helpful to people more than just developers (IT, sysadmins, etc.).
  • Info - Generally useful information to log (service start/stop, configuration assumptions, etc). Info I want to always have available but usually don't care about under normal circumstances. This is my out-of-the-box config level.
  • Warn - Anything that can potentially cause application oddities, but for which I am automatically recovering. (Such as switching from a primary to backup server, retrying an operation, missing secondary data, etc.)
  • Error - Any error which is fatal to the operation, but not the service or application (can't open a required file, missing data, etc.). These errors will force user (administrator, or direct user) intervention. These are usually reserved (in my apps) for incorrect connection strings, missing services, etc.
  • Fatal - Any error that is forcing a shutdown of the service or application to prevent data loss (or further data loss). I reserve these only for the most heinous errors and situations where there is guaranteed to have been data corruption or loss.
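
As a rough illustration of how these levels behave in practice, here is a sketch using Python's standard `logging` module. Python has no built-in TRACE level, so the value 5 registered below is an assumption, and Python reports FATAL under the name CRITICAL. The logger name and the messages are invented.

```python
import io
import logging

TRACE = 5  # assumed numeric value, below DEBUG (10); not part of the standard library
logging.addLevelName(TRACE, "TRACE")

stream = io.StringIO()  # capture output in memory so it can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

log = logging.getLogger("payments-service")  # hypothetical service name
log.addHandler(handler)
log.propagate = False
log.setLevel(logging.INFO)  # INFO as the out-of-the-box level

log.log(TRACE, "entering parse_row()")                    # dropped: below INFO
log.debug("cache size=%d", 128)                           # dropped: below INFO
log.info("service started")
log.warning("primary unreachable, switched to backup")    # recovered automatically
log.error("cannot open required file %s", "input.csv")    # fatal to the operation
log.critical("data corruption detected, shutting down")   # fatal to the service

output = stream.getvalue().splitlines()
print("\n".join(output))
```

With the threshold at INFO, the TRACE and DEBUG messages are filtered out; lowering the level with `log.setLevel(TRACE)` would show everything.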

Wednesday, June 20, 2018

Hadoop Vs. MongoDB: Which Platform is Better for Handling Big Data?


NoSQL Database

What is NoSQL?

NoSQL is an approach to database design that can accommodate a wide variety of data models, including key-value, document, columnar and graph formats. NoSQL, which stands for "not only SQL," is an alternative to traditional relational databases, in which data is placed in tables and the data schema is carefully designed before the database is built. NoSQL databases are especially useful for working with large sets of distributed data.

The Benefits of NoSQL

When compared to relational databases, NoSQL databases are more scalable and provide superior performance, and their data model addresses several issues that the relational model is not designed to address:
  • Large volumes of rapidly changing structured, semi-structured, and unstructured data
  • Agile sprints, quick schema iteration, and frequent code pushes
  • Object-oriented programming that is easy to use and flexible
  • Geographically distributed scale-out architecture instead of expensive, monolithic architecture

NoSQL Database Types

  • Document databases pair each key with a complex data structure known as a document. Documents can contain many different key-value pairs, or key-array pairs, or even nested documents.
  • Graph stores are used to store information about networks of data, such as social connections. Graph stores include Neo4J and Giraph.
  • Key-value stores are the simplest NoSQL databases. Every single item in the database is stored as an attribute name (or 'key'), together with its value. Examples of key-value stores are Riak and Berkeley DB. Some key-value stores, such as Redis, allow each value to have a type, such as 'integer', which adds functionality.
  • Wide-column stores such as Cassandra and HBase are optimized for que
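
The four data models listed above can be sketched with plain Python structures. This is only a conceptual illustration, not how any of the named databases store data internally; every key, name, and value is invented, and the wide-column layout (rows with differing sets of columns) reflects the general model rather than anything stated here.

```python
# Document store: each key is paired with a nested document.
documents = {
    "user:1": {
        "name": "Ada",
        "emails": ["ada@example.com"],   # key-array pair
        "address": {"city": "London"},   # nested document
    },
}

# Key-value store: an attribute name (key) together with its value.
key_value = {"session:42": "user:1", "page_views": 1}
key_value["page_views"] += 1  # a typed (integer) value allows operations like increment

# Graph store: who is connected to whom, here as an adjacency mapping.
graph = {"Ada": {"Grace"}, "Grace": {"Ada", "Alan"}, "Alan": {"Grace"}}

def friends_of_friends(g, person):
    """Return people two hops away who are not already direct connections."""
    direct = g.get(person, set())
    reachable = set()
    for friend in direct:
        reachable |= g.get(friend, set())
    return reachable - direct - {person}

# Wide-column store: row key -> columns, where rows need not share the same columns.
wide_column = {
    "row1": {"name": "Ada", "city": "London"},
    "row2": {"name": "Grace"},
}

print(documents["user:1"]["address"]["city"])
print(key_value["page_views"])
print(friends_of_friends(graph, "Ada"))
print(sorted(wide_column["row1"]), sorted(wide_column["row2"]))
```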

Tuesday, March 3, 2015

Talend Integration Best Practices

1. Start the project with a Business Model, and attach the updated document to it whenever the Business Model changes.

2. Talend workspace path should not contain any spaces.

3. Have a document for important components

4. Maintain job versions and documentation; do not forget to record every change in the document.

5. Create Repository Metadata for DB connections and retrieve database table schema for DB tables. 

6. Use Repository Schema for Files/DB and DB connections.

7. Create the database connection using the t<Vendor>Connection component and reuse this connection in the Job. Do not open a new connection in every component.

8. Create a Repository Document corresponding to every Talend job including revision history.

9. Provide a Sub Job title for every sub job to describe its purpose/objective.

10. Avoid hard coding in Talend Job components. Use Talend context variables instead.

11. Create Context Groups in the Repository.

12. Use a file to provide the values for context variables using tContextLoad.

13. Create Variables in tMap and use the variables to assign the values to target fields.

14. Create user routines/functions for common transformations and validations.

15. Always rename Main Flows in a Talend Job to meaningful names.

Open Source ETL tools vs Commercial ETL tools

The ETL tools are evaluated on the following criteria:

  • Platforms supported
  • Debugging facilities
  • Data quality / profiling
  • Performance
  • Future prospects
  • Reusability
  • Scalability
  • Batch vs real-time
  • Native connectivity

Figure 1: Simple schematic for a data warehous... 

Pentaho Kettle vs Talend

  1. Pentaho is a commercial open-source BI suite that has a product called Kettle for data integration.
  2. It uses an innovative meta-driven approach and has a strong and very easy-to-use GUI.
  3. The company started around 2001 (2002 was when Kettle was integrated into it).
  4. It has a strong community of 13,500 registered users.
  5. It has a stand-alone Java engine that processes the jobs and tasks for moving data between many different databases and files.
  6. It can schedule tasks (but you need a scheduler such as cron for that).
  7. It can run remote jobs on "slave servers" on other machines.
  8. It has data quality features: through its own GUI, or by writing customised SQL queries, JavaScript and regular expressions.

  1. Talend is an open-source data integration tool (not a full BI suite).
  2. It uses a code-generating approach. It uses a GUI, but within Eclipse RCP.
  3. It started around October 2006.
  4. It has a much smaller community than Pentaho but has 2 finance companies supporting it.
  5. It generates Java or Perl code which you later run on your server.
  6. It can schedule tasks (also by using schedulers like cron).
  7. It has data quality features: through its own GUI, or by writing customised SQL queries and Java.

Comparison - (from my understanding)
  • Pentaho is faster (maybe twice as fast) than Talend.
  • Pentaho's GUI is easier to use than Talend's GUI and takes less time to learn.

My impression
Pentaho is easier to use because of its GUI.
Talend is more of a tool for people who are already writing a Java program and want to save a lot of time with a tool that generates code for them.

Assuming Pentaho made it to the next round....

Pentaho Kettle vs Informatica

  1. Informatica is a very good commercial data integration suite.
  2. It was founded in 1993.
  3. It is the market share leader in data integration (Gartner Dataquest).
  4. It has 2,600 customers, including Fortune 100 companies, companies on the Dow Jones, and government organizations.
  5. The company's sole focus is data integration.
  6. It has quite a big package for enterprises to integrate their systems, cleanse their data and can connect to a vast number of current and legacy systems.
  7. It is very expensive, will require training some of your staff to use it, and will probably require hiring consultants as well. (I hear Informatica consultants are well paid.)
  8. It is very fast and can scale for large systems. It has "Pushdown Optimization", an ELT approach that uses the source database to do the transforming, like Oracle Warehouse Builder.
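
To illustrate the idea behind pushdown, here is a small sketch using Python's built-in `sqlite3` module. It is not Informatica's implementation; the `orders` table and its contents are invented. The first function pulls every row out and aggregates in the tool (classic ETL), while the second lets the database do the work and fetches only the result (the ELT or pushdown style).

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the source database
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("eu", 10.0), ("us", 20.0), ("eu", 5.0)],
)

def totals_in_tool(conn):
    """ETL style: move all rows to the integration tool, then aggregate there."""
    totals = {}
    for region, amount in conn.execute("SELECT region, amount FROM orders"):
        totals[region] = totals.get(region, 0.0) + amount
    return totals

def totals_pushed_down(conn):
    """Pushdown style: the database aggregates; only the summary crosses the wire."""
    query = "SELECT region, SUM(amount) FROM orders GROUP BY region"
    return dict(conn.execute(query).fetchall())

in_tool = totals_in_tool(conn)
pushed = totals_pushed_down(conn)
print(in_tool)
print(pushed)
```

Both functions return the same totals; the pushdown version transfers one row per region instead of one row per order, which is where the speed-up on large tables comes from.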

  • Pentaho's JavaScript is very powerful when writing transformation tasks.
  • Informatica has many more enterprise features, for example, load balancing between database servers.
  • Pentaho's GUI requires less training than Informatica's.
  • Pentaho doesn't require huge upfront costs as Informatica does. (That part you saw coming, I'm sure.)
  • (edited) Informatica is faster than Pentaho. Informatica has Pushdown Optimization, but with some tweaking to Pentaho and some knowledge of the source database, you can improve the speed of Pentaho. (Also see the line below.)
  • (new) You can place Pentaho Kettle on many different servers (as many as you like, it's free) and use it as a cluster.
  • Informatica has much better monitoring tools than Pentaho.