This is part III of Big Data Overview Blogs for developers:
In part I, I covered the basics of big data and started with Ambari introduction and in part II, I talked about Hadoop technologies used for ingestion like Sqoop, Flume and Atlas. If you have not already read those articles, it would be helpful to read them before proceeding to reading this to understand the flow.
In this article I will talk about all technologies in Hadoop ecosystem that can be used to access, transform, or analyze data. In short, technologies encompassing the ‘data’ in big data.
Hadoop can store data from multiple sources and in both structured and unstructured form. Hive is used to query this data using SQL queries. Hive creates table similar to RDBMS tables for the data in HDFS and user or analysts can query these tables to understand and explore the data. Metadata for these hive tables is stored in metadata table. Hive data in form of tables are stored as corresponding HDFS directories within one database directory. Each of these table directories contain files containing the data. If the data is partitioned, there are subdirectories within this table directory and each partition directory has its files. Data within partitions can further be divided into buckets.
Hbase is NoSQL column-oriented distributed database which runs on top of HDFS. It is modelled after Google’s BigTable. It provides real time read/write access to large datasets stored in Hadoop. Hbase is well suited for multi-structured or sparse datasets and can scale linearly to handle table worth billions of rows. Hbase data is stored as tables of rows and columns. Each table must have an element defined as Primary key which is used for all access calls made to this table to retrieve data.
Pig is a platform to analyze the large datasets in Hadoop. It consists of two components: one is the programming language called PigLatin and other is the runtime environment where PigLatin scripts are executed. Pig excels at describing data analysis problems as data flows. Pig can ingest data from files, streams or other sources using the User Defined Functions (UDF). Once it has the data, it can perform select, iteration, and other transforms over the data. Again the UDF feature allows passing the data to more complex algorithms for the transform. Finally Pig can store the results into the Hadoop Data File System. Pig translates scripts written in PigLatin into a series of MapReduce jobs that are run on the Apache Hadoop cluster.
Fast, reliable, fault-tolerable publish-subscribe messaging system. Main components of Kafka are topics, producers, consumers and brokers. Topics are the categories to which messages are published. Producers publish messages to one or more topics. Consumers subscribe to one or more topics and consume message in sequential order from within a partition. Topics can contain one or more partitions and writes to partitions are sequential. Brokers are servers which track messages and manage the persistence and replication of messages. Kafka consumer can consume messages from an earlier point in time as well since Kafka retains messages on disk and for a configurable amount of time.
Storm is a framework that provides real time processing of streaming data. Storm is extremely fast (can process millions of records per second per node in a cluster of moderate size) and is scalable, fault-tolerant, reliable and easy to operate. I storm, data is passed as streams of tuples originating from spouts, hopping multiple bolts and producing output stream. This entire network of spouts and bolts in a storm system is called a topology. Storm users define topologies and data is processed through spouts and bolts based on defined topology. Example use cases of storm are preventing credit card fraud in real time and sending real time offers to customers based on their location or usage.
Spark is a in-memory data processing engine which can either run either inside Hadoop (on YARN), or in Mesos, standalone, or in cloud. It can access data from multiple sources like HDFS, Cassandra, HBase, and S3. It runs faster than MapReduce as it has DAG (Directed Acyclic graph) execution engine that supports acyclic data flows and in-memory computing. It offers connectors to write applications in languages like Java, Scala, Python and R. Spark provides libraries for handling streaming (Spark Streaming), machine learning (MLLib), SQL capabilities (Spark SQL), and processing graphs (GraphX).
Tez is designed to build application frameworks which allow for processing complex DAG of tasks in short time. It is built on top of YARN. It maintains the scalability of MapReduce while improving the speed dramatically compared to MapReduce. This is the reason other projects like Hive and Pig use Tez as the execution engine. Data processing in Tez is modeled as data flow graph with vertices representing the tasks and edges representing the flow of data. Each vertex running data processing logic is composed on Inputs, Processors and Outputs.