Hadoop is not just another technology you learn and then offer services around; it is a platform on which you can build splendid applications. Since the rise of Hadoop, dozens of software communities have been developing modules that address a variety of problem specs and meet different needs.
In this article we take a brief tour of the world of Hadoop and its ecosystem. Around Hadoop there are tons of commercial as well as open source products that make it accessible to laymen and more usable. Here we are going to discuss some specific components which give Hadoop its core functionality and speed.
HDFS – Hadoop Distributed File System
When file sizes grow from GBs to TBs, we need a concrete system that can manage such files. Just as a database can be distributed over multiple servers, when files grow so large that a single machine cannot process them, we need a distributed file system (one that manages storage across a network). HDFS is specially designed to store huge files in a write-once, read-many-times pattern. For low-latency data access HDFS is not a good option, nor is it suited to workloads involving lots of small files or frequent updates.
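The idea of splitting one huge file into blocks spread over many machines can be sketched in a few lines. This is only a toy in-memory illustration: the block size, replication factor and node names below are made-up stand-ins (real HDFS defaults to 128 MB blocks and 3 replicas).

```python
# Toy sketch of how a distributed file system such as HDFS splits a large
# file into fixed-size blocks and places each block on several data nodes.
# BLOCK_SIZE, REPLICATION and NODES are illustrative assumptions only.

BLOCK_SIZE = 8          # real HDFS uses 128 MB blocks by default
REPLICATION = 2         # real HDFS defaults to 3 replicas
NODES = ["node1", "node2", "node3"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    # Cut the byte stream into fixed-size chunks.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes=NODES, replication=REPLICATION):
    # Round-robin placement: each block lands on `replication` distinct nodes,
    # so losing one machine does not lose the data.
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks("a very large file, far bigger than one machine")
print(len(blocks), place_blocks(blocks)[0])  # 6 ['node1', 'node2']
```

The NameNode in real HDFS keeps exactly this kind of block-to-node mapping as metadata, while the DataNodes hold the block contents.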
Map Reduce
MapReduce is used to process highly distributable problems across many computers working together (known as a cluster). In terms of programming, there are two functions at the heart of MapReduce:
1) Map
2) Reduce
In the map step, the master node takes the input (the large data set to process), divides it into smaller parts and distributes them to the worker nodes (other machines in the cluster). Each worker node solves its own small problem and returns the answer to the master node.
In the reduce step, the master node combines all the answers coming from the worker nodes and forms them into the output, which is the answer to our big distributed problem.
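The two steps above can be sketched with the classic word-count example. This is a single-process, in-memory imitation of the pattern, not real Hadoop code; on a cluster the map calls would run on different worker nodes and the grouping would happen in a shuffle phase.

```python
from collections import defaultdict

# In-memory sketch of the MapReduce word-count pattern.

def map_phase(chunk):
    # Each "worker" emits (key, value) pairs from its slice of the input.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # The reduce step combines all values sharing a key into a final answer.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# The "master" has already divided the input into smaller parts:
chunks = ["hadoop stores data", "hadoop processes data"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
result = reduce_phase(mapped)
print(result)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

Note that each chunk can be mapped independently of the others, which is exactly what lets Hadoop spread the map step across a cluster.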
Hive
If you have gone through the limitations or disadvantages of Hadoop, you might have found that working with Hadoop is not easy for end users, especially those who are not familiar with the MapReduce framework. Hive was developed specifically for people who have a strong command of SQL queries: it provides an SQL-like language called HiveQL.
The main building blocks of Hive are –
- Metastore – To store metadata about columns, partitions and the system catalogue.
- Driver – To manage the lifecycle of a HiveQL statement.
- Query Compiler – To compile HiveQL into a directed acyclic graph of map/reduce tasks.
- Execution Engine – To execute, in the proper order, the tasks produced by the compiler.
- HiveServer – To provide a Thrift interface and a JDBC/ODBC server.
When all these components work together, they make Hadoop very user friendly.
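To see what Hive buys you, compare a HiveQL statement with the aggregation it expresses. The query below is a hypothetical example (the `visits` table and `page` column are made up), and the Python code is only a plain sketch of the same GROUP BY logic; in reality Hive compiles the statement into map/reduce tasks over files in HDFS.

```python
from collections import defaultdict

# Hypothetical HiveQL:
#
#   SELECT page, COUNT(*) FROM visits GROUP BY page;
#
# The same aggregation, sketched in plain Python over made-up rows:

visits = [("home",), ("about",), ("home",), ("home",)]

counts = defaultdict(int)
for (page,) in visits:
    counts[page] += 1

print(dict(counts))  # {'home': 3, 'about': 1}
```

A SQL user can write the one-line HiveQL version without ever thinking about mappers and reducers, which is exactly Hive's point.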
HBase
As said earlier, HDFS (Hadoop Distributed File System) works on a write-once, read-many-times pattern, but this isn’t always the case. We may require real-time read/write random access to a huge dataset; this is where HBase comes into the picture. HBase is a distributed, column-oriented database built on top of HDFS. HBase is not relational and doesn’t support SQL, but it’s specially designed to do what an RDBMS can’t, such as hosting very large tables over a cluster.
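HBase's data model is essentially a sorted map of row key → column family → column qualifier → value. The toy sketch below imitates that shape with nested dictionaries; the table, row key and column names are made up for illustration, and real HBase of course persists to HDFS rather than a Python dict.

```python
# Toy in-memory imitation of HBase's data model:
# row key -> column family -> column qualifier -> value.

table = {}

def put(row, family, qualifier, value):
    # Writes go to a specific cell, addressed by row/family/qualifier.
    table.setdefault(row, {}).setdefault(family, {})[qualifier] = value

def get(row, family, qualifier):
    # Random read access by row key -- the pattern HDFS alone is not built for.
    return table.get(row, {}).get(family, {}).get(qualifier)

put("user#42", "info", "name", "Alice")
put("user#42", "info", "city", "Ahmedabad")
print(get("user#42", "info", "name"))  # Alice
```

Grouping related columns into families is what makes HBase column-oriented: each family is stored and tuned together, so sparse rows cost nothing for the columns they lack.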
Mahout
The success of big data depends on how fast and efficiently you can convert a vast amount of information into actionable insight. Whether it is analyzing all the Cyberoam logs (visited web pages) of a particular company, such as Azilen Technologies, to know which site is opened most frequently, or processing thousands of personal email messages to generate analytics and then organize and extract information, Mahout is what gets used here.
If you have seen the TV series “Person of Interest”, you may recall that in it a person builds a machine that collects data on the people of New York City to identify terrorists, and the machine separates those people into two categories, relevant and irrelevant. Relevant people are those who could be a threat to national security; the others are irrelevant. In the same way, Mahout’s recommender engines try to identify items as yet unknown to a user from their area of interest and present them in a more categorized way.
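The recommender idea can be sketched with a tiny co-occurrence approach: score items a user has not seen by how much the users who have them overlap with that user. This is only a minimal sketch in the spirit of Mahout's recommender engines, with made-up user histories; Mahout's actual algorithms run at scale on Hadoop.

```python
from collections import Counter

# Made-up user -> set-of-items histories for illustration.
histories = {
    "u1": {"hadoop", "hive", "pig"},
    "u2": {"hadoop", "hbase"},
    "u3": {"hive", "pig", "mahout"},
}

def recommend(user):
    seen = histories[user]
    scores = Counter()
    for other, items in histories.items():
        if other == user:
            continue
        overlap = len(seen & items)      # how similar the other user is
        for item in items - seen:        # items unknown to this user
            scores[item] += overlap      # weighted by that similarity
    return [item for item, _ in scores.most_common()]

print(recommend("u1"))  # ['mahout', 'hbase']
```

"u3" shares two interests with "u1" while "u2" shares only one, so "mahout" outranks "hbase" in the recommendation list.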
Sqoop
To understand Sqoop, let’s say we have released one version of a piece of software which is now in production; we then implement some new features and release a new version, which requires migrating the old data to the new format. In the same way, even if you love the new concepts of big data, your existing data still lives in SQL over distributed servers. Now, how do we convert/migrate this data into a form that is useful to big data tools? This is the scenario where Sqoop comes into the picture.
Loading bulk data from a production system and accessing it in a MapReduce application can be a challenging task, and transferring data using SQL or ad-hoc scripts is inefficient and time-consuming. Sqoop, similar in spirit to SQL Server Integration Services, allows easy import and export of data between Hadoop and structured data stores such as relational databases, enterprise data warehouses, and NoSQL systems. Sqoop also slices the data into partitions and launches a map-only job in which each mapper is responsible for transferring one slice of the dataset.
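The slicing step can be sketched as follows. Sqoop typically picks a split column (often the primary key), asks the database for its minimum and maximum, and divides that range into one slice per mapper; each slice then becomes one mapper's WHERE clause. The bounds and mapper count below are made-up example values.

```python
# Sketch of partitioning an import by primary-key range, one slice per mapper.

def make_splits(min_id, max_id, num_mappers):
    # Divide [min_id, max_id] into num_mappers contiguous, non-overlapping ranges.
    span = (max_id - min_id + 1) / num_mappers
    splits = []
    for m in range(num_mappers):
        lo = min_id + round(m * span)
        hi = min_id + round((m + 1) * span) - 1
        splits.append((lo, hi))
    return splits

# Each tuple would drive one mapper's query, e.g. "... WHERE id BETWEEN 1 AND 250".
print(make_splits(1, 1000, 4))  # [(1, 250), (251, 500), (501, 750), (751, 1000)]
```

Because the ranges do not overlap, the mappers can pull their slices from the database in parallel without coordinating with each other.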
Pig
Analyzing and querying a huge data set is not easy with any ordinary language. When we are talking about analyzing petabytes of data, an extensive level of parallelism is required.
Pig is used to analyze and fire queries on large data sets. It consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating those programs. Pig Latin has three main key properties:
- Extensibility
- Optimization opportunities
- Ease of programming
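To get a feel for the style, here is a hypothetical Pig Latin script (shown in the comment; the field names and data are invented) next to a plain-Python sketch of the same dataflow. Pig would translate the script into parallel map/reduce jobs rather than running it in one process like this.

```python
# Hypothetical Pig Latin:
#
#   logs  = LOAD 'access.log' AS (user:chararray, bytes:int);
#   big   = FILTER logs BY bytes > 100;
#   byusr = GROUP big BY user;
#   out   = FOREACH byusr GENERATE group, SUM(big.bytes);
#
# The same load -> filter -> group -> aggregate pipeline in plain Python:

logs = [("alice", 50), ("bob", 200), ("alice", 300), ("bob", 150)]

big = [(u, b) for u, b in logs if b > 100]   # FILTER
totals = {}
for user, nbytes in big:                     # GROUP ... GENERATE SUM
    totals[user] = totals.get(user, 0) + nbytes

print(totals)  # {'bob': 350, 'alice': 300}
```

Each Pig Latin statement names an intermediate relation, which is what gives the optimizer room to reorder and parallelize the steps.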
Flume
In the real world there are many common problems which can’t be solved by the components above. If I have a well-defined structured database and ask some software engineers for help, each one can implement their own logic and do analytics for me. But data is not always structured. Take one example: assume Google keeps logs. Google surely has a very strategic way of doing this, but suppose, purely as an assumption, that every keyword typed into the Google search box each day is written to a file. Now do analytics for Google. I guess you are getting my point: logs, whether event-based or tied to specific actions, are among the most difficult data to collect and analyze. That is where Flume comes in.
Flume is a utility for harvesting, aggregating and moving huge amounts of log data or text files into and out of Hadoop. Its sources can be almost anything – Avro events (Avro is another Hadoop component), files, system logs – and its sinks include HDFS and HBase. Flume also has a query processing engine, so it’s easy to transform each new batch of data before it is shuttled to the intended sink.
So far, so good
The components discussed above are just a few well-known ones; each is developed by a different software community and each is designed for a very specific purpose, so I can say that Hadoop is not a single person’s or company’s idea. The most interesting thing I found about Hadoop and its ecosystem is the name of each component: every component has a unique name, and every name has its own story. Hadoop is powerful because it is extensible and easy to integrate with any component.