This is Part I of a series of Big Data overview blogs for developers:
- Part I : What is Big Data, What is Hadoop and the Hadoop Ecosystem, managing a Hadoop cluster.
- Part II : Data Ingestion in Hadoop
- Part III : Data Access And Analysis in Hadoop
What is big data?
Technology usage has grown exponentially over the last two decades, and the amount of information collected from that usage has grown with it. The sources of this data span websites, apps, mobile devices, and software tools to drones, chips, automobiles, and virtually everything on the planet that uses technology in any way. As technology usage keeps rising, this trend will only continue upwards. This forms the basis of the definition of big data: data that is too diverse (Variety), too fast (Velocity), and too massive (Volume) for traditional technologies, infrastructure, and skills to process efficiently. Making this 3V (Variety, Velocity, Volume) data useful requires new skills and resources. Big data is an emerging field, and below is an overview of some technologies popular in big data circles.
What is Apache Hadoop?
Apache Hadoop is an open-source software platform for distributed storage and distributed processing of big data on clusters of computers. These clusters use a distributed file system called the Hadoop Distributed File System (HDFS) for data storage, and a programming model called MapReduce for distributed data processing. Hadoop also includes a component called YARN (an acronym for Yet Another Resource Negotiator), which manages cluster resources and schedules applications in the cluster, and a piece called Hadoop Common, which consists of common libraries and utilities used by the other modules. These four components, HDFS, MapReduce, YARN, and Common, form the base Apache Hadoop framework.
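To make the MapReduce model concrete, here is a minimal, single-machine sketch in Python of the classic word-count job that Hadoop would normally distribute across a cluster. The map, shuffle, and reduce phases are simulated as plain functions; this is purely illustrative and is not Hadoop's actual (Java-based) API.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Shuffle: group values by key, as Hadoop does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: sum the counts collected for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

documents = ["big data is big", "hadoop processes big data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle_phase(pairs))
print(counts)  # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}
```

In a real Hadoop job, each map task runs on the node that holds its input split, and the shuffle moves data over the network between nodes, which is what makes the model scale.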
What is the Hadoop Ecosystem?
The Hadoop ecosystem refers to technologies that utilize the capabilities of one or more core components of Hadoop and provide a comprehensive system for data ingestion, processing, or analysis. One example is Apache Sqoop, which provides utilities to import data into and export data out of HDFS. In this blog we will cover Ambari; other ecosystem technologies will be covered in upcoming blogs in this series.
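As a quick illustration of what Sqoop usage looks like, the snippet below assembles a typical `sqoop import` command line in Python. The JDBC URL, credentials, table name, and HDFS target directory are hypothetical examples; actually running the command requires Sqoop installed on a node with access to a Hadoop cluster.

```python
# Build a typical Sqoop import command; the JDBC URL, username, table
# name, and HDFS target directory below are hypothetical examples.
sqoop_import = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://db.example.com/sales",
    "--username", "etl_user",
    "--table", "orders",
    "--target-dir", "/user/hadoop/orders",
    "--num-mappers", "4",  # number of parallel map tasks for the import
]
print(" ".join(sqoop_import))
# On a real cluster you would run it with, e.g.:
#   subprocess.run(sqoop_import, check=True)
```

Under the hood, Sqoop turns this into a MapReduce job whose map tasks read slices of the source table in parallel and write them to the HDFS target directory.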
What is Apache Ambari?
Ambari is an open-source web tool for provisioning, managing, and monitoring Hadoop clusters. It communicates with the services installed in a cluster through REST APIs, and those same APIs can be integrated with other applications to use Ambari's capabilities. It provides features such as:
- Installing Hadoop services across a large number of nodes in a cluster (Provisioning)
- Managing per-node configuration for these services (Provisioning/Managing)
- Starting and stopping services on a single node or across the entire cluster (Managing)
- Collecting metrics and alerting through Ambari's Metrics System and Alerts Framework (Monitoring)
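To show how Ambari's REST APIs can be driven from another application, the sketch below constructs a request against Ambari's service endpoint using only the Python standard library. The host, cluster name, and credentials are hypothetical, and the request is built but not sent; setting a service's desired state to INSTALLED is how Ambari's API expresses "stop this service".

```python
import base64
import json
import urllib.request

# Hypothetical Ambari server and cluster name, for illustration only.
AMBARI_HOST = "http://ambari.example.com:8080"
CLUSTER = "demo_cluster"

def build_stop_service_request(service, user="admin", password="admin"):
    # PUT .../services/<service> with state INSTALLED asks Ambari to stop it.
    url = f"{AMBARI_HOST}/api/v1/clusters/{CLUSTER}/services/{service}"
    body = json.dumps({
        "RequestInfo": {"context": f"Stop {service} via REST"},
        "Body": {"ServiceInfo": {"state": "INSTALLED"}},
    }).encode()
    request = urllib.request.Request(url, data=body, method="PUT")
    # Ambari expects this header on write operations, plus basic auth.
    request.add_header("X-Requested-By", "ambari")
    creds = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {creds}")
    return request

req = build_stop_service_request("HDFS")
print(req.get_method(), req.full_url)
# Sending it requires a live Ambari server:
#   urllib.request.urlopen(req)
```

The same URL with a GET (and no body) returns the service's current state as JSON, which is how monitoring dashboards poll Ambari.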