Data Ingestion in Hadoop


This is Part II of the Big Data Overview blog series:
1. Part I: What is Big Data, What is Hadoop and the Hadoop Ecosystem, managing a Hadoop cluster.
2. Part II: Data Ingestion in Hadoop
3. Part III: Data Access and Analysis in Hadoop

Data ingestion is best understood by first looking at the data management layer of Hadoop, which is formed by the Hadoop Distributed File System (HDFS) and YARN. HDFS is a Java-based file system that spans a large number of commodity servers and provides scalable, fault-tolerant, reliable and cost-efficient storage for big data. YARN (Yet Another Resource Negotiator) is the architectural core of Hadoop: it handles cluster resource management and is the central platform on top of which batch, interactive and other data access technologies run.
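To make this concrete, here is a minimal sketch of landing a local file in HDFS with the standard hdfs dfs shell; the directory and file names are illustrative placeholders, not values from this article.

    # Create a landing directory in HDFS (placeholder path)
    hdfs dfs -mkdir -p /data/raw

    # Copy a local file into HDFS; it is split into blocks and replicated across DataNodes
    hdfs dfs -put sales.csv /data/raw/

    # List the directory and check its space usage
    hdfs dfs -ls /data/raw
    hdfs dfs -du -h /data/raw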

Apache Sqoop

Sqoop is used to transfer bulk data between Hadoop and structured data stores such as relational databases (RDBMS), and this transfer is coordinated by YARN. Sqoop works in both directions: it can import data into Hadoop and export data from Hadoop back to an RDBMS. It can transfer an entire datastore, selected tables, selected rows or columns, or just the output of a conditional query. Users can also write jobs to schedule these transfers and define the format in which the data is imported: SequenceFile, text file (the default), ORC file or Avro data file. Sqoop performs parallel data transfers and hence copies data very fast.
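As a rough illustration (the connection string, credentials, table name and paths below are placeholders, not values from this article), a typical Sqoop import from a MySQL database into HDFS might look like this:

    # Import the "orders" table from MySQL into HDFS as Avro data files,
    # using 4 parallel map tasks (host, database and paths are placeholders)
    sqoop import \
      --connect jdbc:mysql://db.example.com/sales \
      --username etl_user -P \
      --table orders \
      --columns "order_id,customer_id,total" \
      --where "order_date >= '2017-01-01'" \
      --as-avrodatafile \
      --target-dir /data/raw/orders \
      --num-mappers 4

The reverse direction uses the sqoop export tool, with --export-dir pointing at the HDFS data to be written back into the database table.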

Apache Flume

Flume is used to ingest large amounts of streaming data into Hadoop in a fast, reliable and fault-tolerant manner. Streaming data can be anything from social media feeds and application logs to geo-location, IoT and machine data. Flume scales horizontally, as more nodes can be added to the cluster to ingest more data, and it provides guaranteed data delivery through channel-based communication. A Flume deployment comprises one or more agents, where each agent is a pipeline of a source, a channel and a sink. Flume consumes events from an external source and the data is transferred through agents into a destination such as HDFS. An event is removed from one agent's channel only after it has been received by the channel of the next agent in the flow, which is what gives Flume its delivery guarantees. Flume also allows users to design complex flows with multiple hops, fan-in and fan-out, contextual routing and backup routes for failover; a single-agent configuration is sketched below.
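As a hedged sketch of the source-channel-sink model, the following agent configuration tails an application log into HDFS; the agent name, log path and HDFS path are assumptions for illustration.

    # Name the source, channel and sink of an agent called "agent1"
    agent1.sources  = tail-src
    agent1.channels = file-ch
    agent1.sinks    = hdfs-sink

    # Source: tail an application log (placeholder path)
    agent1.sources.tail-src.type = exec
    agent1.sources.tail-src.command = tail -F /var/log/app/app.log
    agent1.sources.tail-src.channels = file-ch

    # Channel: a durable file channel backing the delivery guarantee
    agent1.channels.file-ch.type = file

    # Sink: write events into HDFS, partitioned by date (placeholder path)
    agent1.sinks.hdfs-sink.type = hdfs
    agent1.sinks.hdfs-sink.hdfs.path = /data/logs/%Y-%m-%d
    agent1.sinks.hdfs-sink.hdfs.fileType = DataStream
    agent1.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
    agent1.sinks.hdfs-sink.channel = file-ch

The agent would then be started with the flume-ng launcher, for example: flume-ng agent --conf conf --conf-file agent1.conf --name agent1.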

Apache Atlas

Atlas is a data governance and metadata platform for Hadoop. It enables the exchange of metadata with tools and processes inside and outside of the Hadoop stack, and provides a set of core foundational services that allow enterprises to meet their compliance requirements. Atlas supports data classification through the ability to define, annotate and automatically capture relationships between data sets and their underlying elements. It also serves as a security and policy engine: users can define policies that prevent data derivation based on classification. In addition, Atlas provides central auditing of operational and security access information for every application, process and interaction with data, and users can explore data classification and audit information through its search and lineage features.
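As a non-authoritative sketch of how classification and lineage can be queried, assuming an Atlas server on localhost:21000 and a classification named PII (both hypothetical here), requests against Atlas's v2 REST API might look roughly like this:

    # Find all entities tagged with the "PII" classification (placeholder host and credentials)
    curl -u admin:admin \
      "http://localhost:21000/api/atlas/v2/search/basic?classification=PII"

    # Fetch the lineage graph for one entity by its GUID (placeholder GUID)
    curl -u admin:admin \
      "http://localhost:21000/api/atlas/v2/lineage/<entity-guid>"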
