Open source

Friday, August 10, 2018

Best ETL / Data Warehousing Tools in 2018

With so many data warehousing tools on the market, it is difficult to select the best tool for your project. The following is a curated list of the most popular open source and commercial ETL tools, with key features and download links.

1) QuerySurge

QuerySurge is an ETL testing solution developed by RTTS. It is built specifically to automate the testing of data warehouses and big data. It verifies that data extracted from the data sources remains intact in the target systems.

Features:

Improve data quality & data governance
Accelerate your data delivery cycles
Helps automate manual testing effort
Provides testing across different platforms such as Oracle, Teradata, IBM, Amazon, Cloudera, etc.
Speeds up the testing process by up to 1,000x and provides up to 100% data coverage
It integrates an out-of-the-box DevOps solution for most Build, ETL & QA management software
Deliver shareable, automated email reports and data health dashboards

2) MarkLogic:

MarkLogic is a data warehousing solution that makes data integration easier and faster using an array of enterprise features. This tool helps to perform very complex search operations. It can query data including documents, relationships, and metadata.

Features:

The Optic API can perform joins and aggregates over documents, triples, and rows.
It allows specifying more complex security rules for all the elements within documents
Writing, reading, patching, and deleting documents in JSON, XML, text, or binary formats
Database Replication for Disaster Recovery
Specify Output Options on the App Server Configuration
Importing and Exporting Configuration Information

Download Link: https://developer.marklogic.com/products

3) Oracle:

Oracle data warehouse software is a collection of data which is treated as a unit. The purpose of this database is to store and retrieve related information. It helps the server to reliably manage huge amounts of data so that multiple users can access the same data.

Features:

Distributes data in the same way across disks to offer uniform performance
Works for single-instance and real application clusters
Offers real application testing
Common architecture between any Private Cloud and Oracle's public cloud
High-speed connection for moving large volumes of data
Works seamlessly with UNIX/Linux and Windows platforms
It provides support for virtualization
Allows connecting to the remote database, table, or view

Download Link: https://www.oracle.com/downloads/index.html

4) Amazon Redshift:

Amazon Redshift is an easy to manage, simple, and cost-effective data warehouse tool. It can analyze almost every type of data using standard SQL.

Features:

No Up-Front Costs for its installation
It allows automating most of the common administrative tasks to monitor, manage, and scale your data warehouse
Possible to change the number or type of nodes
Helps to enhance the reliability of the data warehouse cluster
Every data center is fully equipped with climate control
Continuously monitors the health of the cluster. It automatically re-replicates data from failed drives and replaces nodes when needed

Download Link: https://aws.amazon.com/redshift/

5) Domo:

Domo is a cloud-based Data warehouse management tool that easily integrates various types of data sources, including spreadsheets, databases, social media and almost all cloud-based or on-premise Data warehouse solutions.

Features:

Helps you build your dream dashboard
Stay connected anywhere you go
Integrates all existing business data
Helps you to get true insights into your business data
Connects all of your existing business data
Easy Communication & messaging platform
It provides support for ad-hoc queries using SQL
It can handle many concurrent users running multiple complex queries

Download Link: https://www.domo.com/product

6) Teradata Corporation:

The Teradata Database is a commercially available shared-nothing, Massively Parallel Processing (MPP) data warehousing tool. It is one of the best data warehousing tools for viewing and managing large amounts of data.

Features:

Simple and Cost Effective solutions
Suitable for organizations of any size
Quick and most insightful analytics
Get the same Database on multiple deployment options
It allows multiple concurrent users to ask complex questions related to data
It is entirely built on a parallel architecture
Offers High performance, diverse queries, and sophisticated workload management

Download Link: https://downloads.teradata.com/

7) SAP:

SAP is an integrated data management platform that maps all business processes of an organization. It is an enterprise-level application suite for open client/server systems. It has set new standards for business information management solutions.

Features:

It provides highly flexible and most transparent business solutions
The application developed using SAP can integrate with any system
It follows a modular concept for easy setup and efficient space utilization
You can create a database system that combines analytics and transactions. These next-generation databases can be deployed on any device
Provide support for On-premise or cloud deployment
Simplified data warehouse architecture
Integration with SAP and non-SAP applications

Download Link: https://support.sap.com/en/my-support/software-downloads.html

8) SAS:

SAS is a leading data warehousing tool that allows accessing data across multiple sources. It can perform sophisticated analyses and deliver information across the organization.

Features:

Activities are managed from central locations, so users can access applications remotely via the Internet
Application delivery is typically closer to a one-to-many model than a one-to-one model
Centralized feature updating allows users to download patches and upgrades
Allows viewing raw data files in external databases
Manage data using tools for data entry, formatting, and conversion
Display data using reports and statistical graphics

Download Link: https://www.sas.com/en_in/home.html

9) IBM – DataStage:

IBM DataStage is a business intelligence tool for integrating trusted data across various enterprise systems. It leverages a high-performance parallel framework, either in the cloud or on-premise. This data warehousing tool supports extended metadata management and universal business connectivity.

Features:

Support for Big Data and Hadoop
Additional storage or services can be accessed without the need to install new software and hardware
Real time data integration
Provide trusted ETL data anytime, anywhere
Solve complex big data challenges
Optimize hardware utilization and prioritize mission-critical tasks
Deploy on-premises or in the cloud

Download Link: http://www-01.ibm.com/support/docview.wss?uid=swg24037518

10) Informatica:

Informatica PowerCenter is a data integration tool developed by Informatica Corporation. The tool offers the capability to connect to and fetch data from different sources.

Features:

It has a centralized error logging system which facilitates logging errors and rejecting data into relational tables
Built-in intelligence to improve performance
Limit the Session Log
Ability to Scale up Data Integration
Foundation for Data Architecture Modernization
Better designs with enforced best practices on code development
Code integration with external Software Configuration tools
Synchronization amongst geographically distributed team members

Download link: https://informatica.com/

11) MS SSIS:

SQL Server Integration Services (SSIS) is a data warehousing tool used to perform ETL operations, i.e. to extract, transform, and load data. SQL Server Integration Services also includes a rich set of built-in tasks.

Features:

Tightly integrated with Microsoft Visual Studio and SQL Server
Easier to maintain and package configuration
Allows removing network as a bottleneck for insertion of data
Data can be loaded in parallel and various locations
It can handle data from different data sources in the same package
SSIS can consume data from sources that are difficult to handle, such as FTP, HTTP, MSMQ, and Analysis Services
Data can be loaded in parallel to many varied destinations

Download link: https://www.microsoft.com/en-us/download/details.aspx?id=39931

12) Talend Open Studio:

Open Studio is an open source data warehousing tool developed by Talend. It is designed to convert, combine, and update data in various locations. It provides an intuitive set of tools that make dealing with data a lot easier. It also supports big data integration, data quality, and master data management.

Features:

It supports extensive data integration transformations and complex process workflows
Offers seamless connectivity for more than 900 different databases, files, and applications
It can manage the design, creation, testing, deployment, etc of integration processes
Synchronize metadata across database platforms
Managing and monitoring tools to deploy and supervise the jobs

Download Link: https://www.talend.com/download/

13) The Ab Initio software:

Ab Initio is a GUI-based data warehousing tool for data analysis, batch processing, and parallel processing. It is commonly used to extract, transform, and load data.

Features:

Metadata management
Business and Process Metadata management
Ability to run, debug Ab Initio jobs and trace execution logs
Manage and run graphs and control the ETL processes
Components can execute simultaneously on various branches of a graph

Download Link: https://www.abinitio.com/en/

14) Dundas:

Dundas is an enterprise-ready Business Intelligence platform. It is used for building and viewing interactive dashboards, reports, scorecards and more. It is possible to deploy Dundas BI as the central data portal for the organization or integrate it into an existing website as a custom BI solution.

Features:

Data warehousing tool for Business Users and IT Professionals
Easy access through web browser
Allows using sample or Excel data
Server application with full product functionality
Integrate and access all kinds of data sources
Ad hoc reporting tools
Customizable data visualizations
Smart drag and drop tools
Visualize data through maps
Predictive and advanced data analytics

Download link: http://www.dundas.com/support/dundas-bi-free-trial

15) Sisense:

Sisense is a business intelligence tool which analyses and visualizes both big and disparate datasets, in real-time. It is an ideal tool for preparing complex data for creating dashboards with a wide variety of visualizations.

Features:

Unify unrelated data into one centralized place
Create a single version of truth with seamless data
Allows building interactive dashboards with no technical skills
Query big data at very high speed
Dashboards can be accessed even on mobile devices
Drag-and-drop user interface
Eye-grabbing visualization
Enables delivery of interactive terabyte-scale analytics
Exports data to Excel, CSV, PDF, images, and other formats
Ad-hoc analysis of high-volume data
Handles data at scale on a single commodity server
Identifies critical metrics using filtering and calculations

Download Link: https://www.sisense.com/get/watch-demo/

16) Tableau:

Tableau is an online data warehousing tool available in three versions: Desktop, Server, and Online. It is a secure, shareable, and mobile-friendly data warehouse solution.

Features:

Connect to any data source securely on-premise or in the cloud
Ideal tool for flexible deployment
Big data, live or in-memory
Designed for mobile-first approach
Securely Sharing and collaborating Data
Centrally manage metadata and security rules
Powerful management and monitoring
Connect to any data anywhere
Get maximum value from your data with this business analytics platform
Share and collaborate in the cloud
Tableau seamlessly integrates with existing security protocols

Download Link: https://public.tableau.com/en-us/s/download

17) MicroStrategy:

MicroStrategy is enterprise business intelligence software. The platform supports interactive dashboards, scorecards, highly formatted reports, ad hoc queries, and automated report distribution.

Features:

Unmatched speed, performance, and scalability
Maximize the value of investment made by enterprises
Eliminating the need to rely on multiple tools
Support for advanced analytics and big data
Get insight into complex business processes for strengthening organizational security
Powerful security and administration features

Download link: https://www.microstrategy.com/us/get-started

18) Pentaho

Pentaho is a Data Warehousing and Business Analytics Platform. The tool has a simplified and interactive approach which empowers business users to access, discover and merge all types and sizes of data.

Features:

Enterprise platform to accelerate the data pipeline
Community Dashboard Editor allows the fast and efficient development and deployment
Big data integration without a need for coding
Simplified embedded analytics
Visualize data with custom dashboards
Ease of use with the power to integrate all data
Operational reporting for MongoDB
Platform to accelerate the data pipeline

Download now: http://www.pentaho.com/testdrive

19) BigQuery:

Google's BigQuery is an enterprise-level data warehousing tool. It reduces the time needed to store and query massive datasets by enabling very fast SQL queries. It also lets you control access to both the project and the data, including who may view or query the data.

Features:

Offers flexible Data Ingestion
Read and write data via Cloud Dataflow, Hadoop, and Spark
Automatic Data Transfer Service
Full control over access to the data stored
Easy to read and write data in BigQuery via Cloud Dataflow, Spark, and Hadoop
BigQuery provides cost control mechanisms

Download now: https://cloud.google.com/bigquery/

20) Numetric:

Numetric is a fast and easy BI tool. It offers business intelligence solutions ranging from data centralization and cleaning to analysis and publishing. It is powerful enough for anyone to use. This data warehousing tool helps to measure and improve productivity.

Features:

Data benchmarking
Budgeting & forecasting
Data chart visualizations
Data analysis
Data mapping & dictionary
Key performance indicators

Download Link: https://www.numetric.com/schedule-a-demo/

21) Solver BI360 Suite:

Solver BI360 is a comprehensive business intelligence suite. It gives 360º insight into any data, using reporting, data warehousing, and interactive dashboards. BI360 drives effective, data-based productivity.

Features:

Excel-based reporting with predefined templates
Currency conversion and inter-company transactions elimination can be automated
User-friendly budgeting and forecasting feature
It reduces the amount of time spent for the preparation of reports and planning
Easy configuration with User-friendly interface
Automated data loading
Combine Financial and Operational Data
Allows viewing data in Data Explorer
Easily add modules and dimensions
Unlimited Trees on any dimension
Support for Microsoft SQL Server/SQL Azure

Download link: http://www.solverglobal.com/products/

ETL Tools by Category

Extract, Transform, and Load (ETL) tools enable organizations to make their data accessible, meaningful, and usable across disparate data systems. When it comes to choosing the right ETL tool, there are many options to choose from. So, where should you start?

We've prepared a list that is simple to digest, organized into four categories to better help you find the best solution for your needs.

Incumbent Batch ETL Tools

Until recently, most of the world’s ETL tools were on-prem and based on batch processing. Historically, most organizations used to utilize their free compute and database resources, during off-hours, to perform nightly batches of ETL jobs and data consolidation. This is why, for example, you used to see your bank account updated only a day after you made any financial transaction.
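A nightly batch job of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the export format, the `balances` table, and all function names are hypothetical, and an in-memory SQLite database stands in for the real target system.

```python
import csv
import io
import sqlite3

# Hypothetical one-day export from a source system (an assumption for this sketch).
RAW_EXPORT = """account,amount
A-1,100.50
A-2,20.00
A-1,-30.25
"""

def extract(text):
    """Read every row of the export in one go (the 'batch')."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Consolidate the day's transactions into one net amount per account."""
    totals = {}
    for row in rows:
        totals[row["account"]] = totals.get(row["account"], 0.0) + float(row["amount"])
    return totals

def load(conn, totals):
    """Apply the consolidated amounts to the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS balances (account TEXT PRIMARY KEY, balance REAL)"
    )
    for account, delta in totals.items():
        # Make sure the account exists, then add the day's net change.
        conn.execute("INSERT OR IGNORE INTO balances VALUES (?, 0.0)", (account,))
        conn.execute(
            "UPDATE balances SET balance = balance + ? WHERE account = ?",
            (delta, account),
        )
    conn.commit()

def run_nightly_batch(conn, export_text):
    """Run extract, transform, and load once over the whole export."""
    load(conn, transform(extract(export_text)))
```

Because the balances change only when `run_nightly_batch` runs, a transaction made during the day is not visible until the next batch, which is the one-day delay described above.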

Cloud Native ETL Tools

With IT moving to the cloud, more and more cloud-based ETL services started to emerge. Some of them keep the same basic batch model of the legacy platforms, while others start to offer real-time support, intelligent schema detection, and more.

Open Source ETL Tools

As in other areas of software infrastructure, ETL has had its own surge of open source tools and projects. Most of them were created as a modern management layer for scheduled workflows and batch processes. For example, Apache Airflow was developed by the engineering team at Airbnb, and Apache NiFi by the US National Security Agency (NSA).

Real-Time ETL Tools

Doing your ETL in batches makes sense only if you do not need your data in real time. It might be good enough for salary reporting or tax calculations. However, most modern applications require real-time access to data from different sources. When you upload a picture to your Facebook account, you want your friends to see it immediately, not a day later.

This shift to real-time led to a profound change in architecture: from a model based on batch processing to a model based on distributed message queues and stream processing. Apache Kafka has emerged as the leading distributed message queue for modern data applications, and companies like Alooma and others are building modern ETL solutions on top of it, either as a SaaS platform or an on-prem solution.
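The queue-plus-stream-processor model can be sketched with Python's standard library. This is not the Kafka API: an in-process `queue.Queue` stands in for the distributed message queue, and the event fields and function names are assumptions made for illustration.

```python
import queue
import threading

def producer(events, q):
    """Publish each event as soon as it happens, then signal the end of the stream."""
    for event in events:
        q.put(event)
    q.put(None)  # sentinel: no more events

def consumer(q, sink):
    """Transform and load each event as it arrives instead of waiting for a batch."""
    while True:
        event = q.get()
        if event is None:
            break
        sink.append({"user": event["user"], "action": event["action"].upper()})

def run_stream(events):
    """Wire a producer and a consumer together through the queue."""
    q = queue.Queue()
    sink = []
    worker = threading.Thread(target=consumer, args=(q, sink))
    worker.start()
    producer(events, q)
    worker.join()
    return sink
```

The difference from a batch job is that the consumer handles each event as it is published, so results become available continuously rather than once per scheduled run.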

Conclusion

You now know how ETL tooling has advanced over the years and which options work best in which scenarios.

This post contains some representative examples for each category to help you make the choice that meets your needs.

How to select the right ETL tool

First things first, if you don't think you need real-time updates or if you aren't handling data from streaming sources, you can get away with using a tool from any of the categories above.

That said, if you're dealing with streaming data, or very large amounts of data, or if you would rather build your own solution based on open source technology, you're going to want an ETL tool or platform that can keep up with your specific requirements.

If you want to work with your existing vendors, use on-prem technology, and don't rely on real-time processing, consider an incumbent batch tool.

If you prefer to use tools built and delivered via the cloud, or if you want to avoid the overhead of equipment and maintenance costs as your data needs expand, consider a cloud-based solution.

If you want to build the solution yourself and/or if you're comfortable administering, maintaining, and operating open source tools, look into open source offerings.

If your business depends on real-time processing of events, especially large volume data sources and streams, you're going to want a modern ETL platform designed with modern needs in mind.

Wednesday, July 11, 2018

When to use the different log levels?

There are different levels at which to log messages, in decreasing order of severity:

FATAL
ERROR
WARN
INFO
DEBUG
TRACE

Trace - Only when I would be "tracing" the code and trying to find one part of a function specifically.
Debug - Information that is diagnostically helpful to people beyond just developers (IT, sysadmins, etc.).
Info - Generally useful information to log (service start/stop, configuration assumptions, etc). Info I want to always have available but usually don't care about under normal circumstances. This is my out-of-the-box config level.
Warn - Anything that can potentially cause application oddities, but for which I am automatically recovering. (Such as switching from a primary to backup server, retrying an operation, missing secondary data, etc.)
Error - Any error which is fatal to the operation, but not the service or application (can't open a required file, missing data, etc.). These errors will force user (administrator, or direct user) intervention. These are usually reserved (in my apps) for incorrect connection strings, missing services, etc.
Fatal - Any error that is forcing a shutdown of the service or application to prevent data loss (or further data loss). I reserve these only for the most heinous errors and situations where there is guaranteed to have been data corruption or loss.
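As a rough illustration, the levels above map onto Python's standard `logging` module as follows. Two caveats: `logging` has no built-in TRACE level, so the sketch registers a custom one (the value 5 is an assumption, chosen only to sit below DEBUG), and FATAL corresponds to `logging.CRITICAL`. The logger name and messages are made up.

```python
import io
import logging

TRACE = 5  # custom level below DEBUG (10); not part of the standard library
logging.addLevelName(TRACE, "TRACE")

def build_logger(name, level, stream):
    """Create a logger that writes 'LEVEL message' lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger

def emit_examples(logger):
    """Log one message per level, from least to most severe."""
    logger.log(TRACE, "entering parse_header()")
    logger.debug("cache hit ratio 0.93")
    logger.info("service started")
    logger.warning("primary server unreachable, switched to backup")
    logger.error("cannot open required file")
    logger.critical("data corruption detected, shutting down")
```

With the level set to `logging.INFO`, the out-of-the-box level suggested above, the TRACE and DEBUG messages are dropped and only the four more severe ones are written.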

Wednesday, June 20, 2018

Hadoop Vs. MongoDB: Which Platform is Better for Handling Big Data?

https://www.aptude.com/blog/entry/hadoop-vs-mongodb-which-platform-is-better-for-handling-big-data

NoSQL Database

What is NoSQL?

NoSQL is an approach to database design that can accommodate a wide variety of data models, including key-value, document, columnar and graph formats. NoSQL, which stands for "not only SQL," is an alternative to traditional relational databases in which data is placed in tables and data schema is carefully designed before the database is built. NoSQL databases are especially useful for working with large sets of distributed data.
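To make the four data models concrete, here is the same small piece of information shaped each way using plain Python structures. This is only a sketch of the shapes involved, not the API of any particular database, and every name and value in it is invented.

```python
# Key-value: every item is an opaque value looked up by a single key.
key_value = {"user:42:name": "Ada", "user:42:city": "London"}

# Document: one key maps to a nested structure that can hold lists and sub-documents.
document = {
    "_id": 42,
    "name": "Ada",
    "address": {"city": "London"},
    "tags": ["admin", "dev"],
}

# Columnar (wide-column): row key -> column family -> individual columns.
wide_column = {42: {"profile": {"name": "Ada"}, "location": {"city": "London"}}}

# Graph: nodes plus typed edges that record relationships between them.
nodes = {42: {"name": "Ada"}, 7: {"name": "Grace"}}
edges = [(42, "FOLLOWS", 7)]

def followers_of(node_id):
    """Return the names of all nodes with a FOLLOWS edge pointing at node_id."""
    return [
        nodes[src]["name"]
        for src, rel, dst in edges
        if rel == "FOLLOWS" and dst == node_id
    ]
```

The relationship query in `followers_of` is natural in the graph shape, while the other three shapes favor looking a record up by its key.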

The Benefits of NoSQL

Compared to relational databases, NoSQL databases are often easier to scale out and can provide better performance for the workloads below, and their data model addresses several issues that the relational model is not designed to address:

Large volumes of rapidly changing structured, semi-structured, and unstructured data
Agile sprints, quick schema iteration, and frequent code pushes
Object-oriented programming that is easy to use and flexible
Geographically distributed scale-out architecture instead of expensive, monolithic architecture

NoSQL Database Types

Document databases pair each key with a complex data structure known as a document. Documents can contain many different key-value pairs, or key-array pairs, or even nested documents.
Graph stores are used to store information about networks of data, such as social connections. Graph stores include Neo4j and Giraph.
Key-value stores are the simplest NoSQL databases. Every single item in the database is stored as an attribute name (or 'key'), together with its value. Examples of key-value stores are Riak and Berkeley DB. Some key-value stores, such as Redis, allow each value to have a type, such as 'integer', which adds functionality.
Wide-column stores such as Cassandra and HBase are optimized for que

Tuesday, March 3, 2015

Talend Integration Best Practices

1. Start the project with a Business Model, and attach the updated document to it whenever the Business Model changes.

2. The Talend workspace path should not contain any spaces.

3. Keep documentation for important components.

4. Maintain the job version and its documentation; record every change in the document.

5. Create Repository Metadata for DB connections and retrieve the database table schema for DB tables.

6. Use Repository Schema for Files/DB and DB connections.

7. Create the database connection using a t<Vendor>Connection component and reuse this connection throughout the Job. Do not open a new connection in every component.

8. Create a Repository Document corresponding to every Talend job, including its revision history.

9. Provide a Sub Job title for every sub job to describe its purpose/objective.

10. Avoid hard-coding values in Talend Job components. Use Talend context variables instead.

11. Create Context Groups in the Repository.

12. Use a Talend.properties file to provide the values for context variables using tContextLoad.

13. Create variables in tMap and use them to assign values to target fields.

14. Create user routines/functions for common transformations and validations.

15. Always rename Main Flows in a Talend Job to meaningful names.
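The idea behind practices 10 to 12 (keep environment-specific values out of the job and load them from a properties file) can be sketched outside Talend in a few lines of Python. This is not Talend code and does not reproduce how tContextLoad works internally; the property keys and the JDBC URL shape are assumptions chosen for illustration.

```python
def load_context(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    context = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        context[key.strip()] = value.strip()
    return context

def connection_url(context):
    """Build the connection string from context values instead of hard-coded ones."""
    return "jdbc:mysql://{db_host}:{db_port}/{db_name}".format(**context)
```

Moving a job from development to production then means swapping the properties file, not editing the job itself.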

Open Source ETL tools vs Commercial ETL tools

The ETL tools are evaluated on the following criteria:

Infrastructure
Platforms supported
Performance
Scalability
Functionality
Debugging facilities
Future prospects
Batch vs Real-time
Usability
Data Quality / profiling
Reusability
Native connectivity

Figure 1: Simple schematic for a data warehous...

Pentaho Kettle vs Talend

Pentaho

Pentaho is a commercial open-source BI suite that has a product called Kettle for data integration.
It uses an innovative meta-driven approach and has a strong and very easy-to-use GUI.
The company started around 2001 (2002 was when Kettle was integrated into it).
It has a strong community of 13,500 registered users.
It has a stand-alone Java engine that processes the jobs and tasks for moving data between many different databases and files.
It can schedule tasks (but you need a scheduler such as cron for that).
It can run remote jobs on "slave servers" on other machines.
It has data quality features: from its own GUI, writing more customised SQL queries, JavaScript and regular expressions.

Talend

Talend is an open-source data integration tool (not a full BI suite).
It uses a code-generating approach. It uses a GUI, but within Eclipse RCP.
It started around October 2006.
It has a much smaller community than Pentaho but has 2 finance companies supporting it.
It generates Java or Perl code which you later run on your server.
It can schedule tasks (also by using schedulers like cron).
It has data quality features: from its own GUI, writing more customised SQL queries and Java.

Comparison - (from my understanding)

Pentaho is faster (twice as fast maybe) than Talend.
Pentaho's GUI is easier to use than Talend's GUI and takes less time to learn.

My impression
Pentaho is easier to use because of its GUI.
Talend is more a tool for people who are already writing a Java program and want to save lots and lots of time with a tool that generates code for them.

Assuming Pentaho made it to the next round....

Pentaho Kettle vs Informatica

Informatica

Informatica is a very good commercial data integration suite.
It was founded in 1993
It is the market share leader in data integration (Gartner Dataquest)
It has 2600 customers. Among them are Fortune 100 companies, companies on the Dow Jones, and government organizations.
The company's sole focus is data integration.
It has quite a big package for enterprises to integrate their systems, cleanse their data and can connect to a vast number of current and legacy systems.
It's very expensive, will require training some of your staff to use it, and will probably require hiring consultants as well. (I hear Informatica consultants are well paid.)
It's very fast and can scale for large systems. It has "Pushdown Optimization", which uses an ELT approach that uses the source database to do the transforming, like Oracle Warehouse Builder.
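The difference between transforming inside the ETL tool and pushing the transformation down to the database can be sketched with Python and SQLite. This only illustrates the general ELT idea and is not how Informatica implements Pushdown Optimization; the `orders` table and the function names are hypothetical.

```python
import sqlite3

def setup():
    """Create a small in-memory source table to aggregate."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("east", 10.0), ("west", 5.0), ("east", 2.5)],
    )
    return conn

def totals_in_tool(conn):
    """ETL style: pull every row out of the database and aggregate in the tool."""
    totals = {}
    for region, amount in conn.execute("SELECT region, amount FROM orders"):
        totals[region] = totals.get(region, 0.0) + amount
    return totals

def totals_pushed_down(conn):
    """ELT style: let the database aggregate and return only the result rows."""
    return dict(conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region"))
```

Both functions produce the same totals, but the pushed-down version moves only one row per region out of the database, which is where the speed-up comes from on large tables.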

Comparison

Pentaho's JavaScript is very powerful when writing transformation tasks.
Informatica has many more enterprise features, for example, load balancing between database servers.
Pentaho's GUI requires less training than Informatica's.
Pentaho doesn't require huge upfront costs as Informatica does. (that part you saw coming, I'm sure)
(edited) Informatica is faster than Pentaho. Informatica has Pushdown Optimization, but with some tweaking to Pentaho and some knowledge of the source database, you can improve the speed of Pentaho. (also see line below)
(new) You can place Pentaho Kettle on many different servers (as many as you like, it's free) and use it as a cluster.
Informatica has much better monitoring tools than Pentaho.
