Big SQL - Architecture and Tutorial (1 of 5)
Part 1: Introduction to Big SQL
When considering SQL-on-Hadoop, the most fundamental question is: what is the right tool for the job? For interactive queries that require a few seconds (or even milliseconds) of response time, MapReduce (MR) is the wrong choice. On the other hand, for queries that require massive scale and runtime fault tolerance, an MR framework works well. MR was built for large-scale processing on big data, viewed mostly as "batch" processing.
As enterprises start using Apache Hadoop as a central data repository for all data (originating from sources as varied as operational systems, sensors, smart devices, metadata and internal applications), SQL processing becomes an optimal choice. A fundamental reason is that most enterprise data management and analytical tools rely on SQL.
As a tool for interactive query execution, SQL processing of relational data benefits from decades of research, usage experience and optimization. The pool of SQL practitioners also far exceeds that of MR developers and data scientists. As a general-purpose processing framework, MR may still be appropriate for ad hoc analytics, but that is as far as it can go with current technology.
The first version of Big SQL from IBM (an SQL interface to IBM InfoSphere® BigInsights™ software, a Hadoop-based platform) took an SQL query sent to Hadoop and decomposed it into a series of MR jobs to be processed by the cluster. For smaller, interactive queries, a built-in optimizer rewrote the query as a local job to help minimize latencies.
Big SQL benefited from Hadoop's dynamic scheduling and fault tolerance. It supported the ANSI SQL:2011 standard and introduced Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) client drivers.
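As a concrete illustration, the sketch below connects to Big SQL through the standard JDBC API. It assumes the DB2-compatible JDBC driver that ships with BigInsights is on the classpath; the host name, port, database name and credentials are placeholders to be replaced with your cluster's actual values.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class BigSqlJdbcExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details for a Big SQL head node;
        // consult your BigInsights installation for the real ones.
        String url = "jdbc:db2://bigsql-head.example.com:51000/bigsql";
        try (Connection conn = DriverManager.getConnection(url, "biadmin", "password");
             Statement stmt = conn.createStatement();
             // List a few catalog tables to verify the connection works.
             ResultSet rs = stmt.executeQuery(
                     "SELECT TABNAME FROM SYSCAT.TABLES FETCH FIRST 5 ROWS ONLY")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}
```

Because the drivers are standard JDBC and ODBC, existing reporting and analytical tools can point at Big SQL the same way they point at a conventional relational database.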
Big SQL 3.0 from IBM
represents an important leap forward. It replaces MR with a
massively parallel processing (MPP) SQL engine. The MPP engine
deploys directly on the physical Hadoop Distributed File
System (HDFS) cluster. A fundamental difference from other MPP
offerings on Hadoop is that this engine actually pushes
processing down to the same nodes that hold the data. Because it
natively operates in a shared-nothing environment, it does not
suffer from limitations common to shared-disk architectures
(e.g., poor scalability and networking caused by the need to move
“shared” data around).
Big SQL 3.0 introduces:
- a "beyond MR" low-latency parallel execution infrastructure that can access Hadoop data natively for reading and writing.
- extended SQL:2011 language support with broad relational data type coverage, including support for stored procedures. This focus on comprehensive SQL support translates into industry-leading application transparency and portability (see the sketch after this list).
- a design for concurrency, with automatic memory management and a rich set of workload management tools. Parallelism scales out to hundreds of data processing nodes and scales up to dozens of cores, and security capabilities are on par with those of traditional relational data warehouses.
- the ability to access and join with data originating from heterogeneous federated sources.
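To make that SQL surface more tangible, here is a hedged sketch of defining a Hadoop-backed table and querying it over the same JDBC connection. The CREATE HADOOP TABLE statement reflects Big SQL's syntax for tables stored as delimited files on HDFS; the table name, columns, delimiter clause and connection details are illustrative assumptions, not taken from the article.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class BigSqlHadoopTableExample {
    public static void main(String[] args) throws Exception {
        // Placeholder host, port, database and credentials.
        String url = "jdbc:db2://bigsql-head.example.com:51000/bigsql";
        try (Connection conn = DriverManager.getConnection(url, "biadmin", "password");
             Statement stmt = conn.createStatement()) {

            // The table's data lives as comma-delimited files on HDFS;
            // Big SQL reads it in place rather than importing it.
            stmt.execute(
                "CREATE HADOOP TABLE sales ("
              + "  order_id INT, region VARCHAR(20), amount DECIMAL(10,2)"
              + ") ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");

            // A standard SQL aggregate query runs against the HDFS-resident data.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")) {
                while (rs.next()) {
                    System.out.println(rs.getString("region") + " -> "
                            + rs.getBigDecimal("total"));
                }
            }
        }
    }
}
```

Because the MPP engine pushes work down to the nodes that hold the data, an aggregate like the GROUP BY above executes in parallel across the cluster instead of funnelling raw rows through a single node.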
With the introduction of Big SQL 3.0, users have an excellent opportunity to jump into the world of big data and Hadoop. In terms of ease of adoption and transition for existing analytic workloads, it delivers uniquely powerful capabilities.