Big-Data - Hadoop Multi-Node Cluster over AWS using Ansible

An Apache framework for big-data computing and storage through a distributed approach. Large files are striped into blocks and stored across different data-storage nodes using the HDFS protocol. The benefit of the tool is that the I/O process becomes faster…

Akanksha Singh
9 min read · Apr 21, 2021
Created by Akanksha

The modern world runs on Big Data — Google, Facebook, Amazon, Flipkart, Zomato, Instagram and all the other big brands we reach for many times a day. These companies deliver their services through automated online payment and delivery systems, and behind the scenes they have to deal with zettabytes of data received from their customers. As we know, input/output operations depend on file size. Hence retrieving data from storage, searching for a client's records and serving the request all become easier when we use a technique that stores data in small stripes rather than managing the whole thing in a single store. The big-data problem is plain to see: the more data we have, the more resources are consumed and the more time is wasted traversing it.

🤔 What Exactly Big Data is?

Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.

To know more about Big Data and how companies use it to benefit their business, read this — one of my blogs!!

The problems (the five Vs) that contribute to big data are:

1. Volume -> the huge amount of data produced each day by companies.
2. Variety -> the diversity of data types and data sources.
3. Velocity -> the speed with which data is generated, analyzed and reprocessed.
4. Validity -> the guarantee of data quality or, alternatively, Veracity: the authenticity and credibility of the data.
5. Value -> the added value for companies. Many companies have recently established their own data platforms, filled their data pools and invested a lot of money in infrastructure.

🤔 How to solve this Big Data Management Challenge?

Think of the simple concept of storing an array: to traverse it, we load the whole thing into RAM. Working with Big Data we would need to do the same, but because of the five key problems mentioned above we cannot. Hence the NFS concept gets replaced with the HDFS concept. — Sounds techie, right!!

Let's start with this blog's main tool, the one around which the solution to the problem revolves — Hadoop.

Hadoop

Hadoop is an Apache framework for big-data computing and storage through a distributed approach. Large files are striped into blocks and stored on different data-storage nodes using the HDFS protocol. The benefit of the tool is that the I/O process becomes faster. Data is reliable because it is replicated internally, and using the MapReduce concept we can easily analyse and process the data.

HDFS — Hadoop Distributed File System:

HDFS is a distributed file system that handles large data sets running on commodity hardware. It is used to scale a single Apache Hadoop cluster to hundreds (and even thousands) of nodes. HDFS is one of the major components of Apache Hadoop, the others being MapReduce and YARN.

Mainly it works by striping files into blocks and sending them to the master for storage in the Hadoop cluster. Behind the scenes these blocks are stored on the data nodes, while the metadata is handed to the client so that, at retrieval time, the client fetches the data directly from the slaves. This reduces I/O time and increases parallel operations, giving the client low latency and high data availability.

The benefits of HDFS are summarised in the figure below.

source: IBM

Let me direct you to the practical part now.

We have three instances — one master and two slaves — to create the Hadoop cluster, plus one client system that joins the master node and then stores data over the cluster. All of this is done over AWS.

We have four roles in total: the first for instance creation over AWS EC2, and three more for the Hadoop-Master (aka NameNode), Hadoop-Slave (aka DataNode) and Hadoop-Client configurations.

My Local Machine Configurations:
> RHEL 8 VM on top of VBox Manager with 2 CPU, 4GB RAM.
> Ansible version 2.10.4 installed.
> Proper Network connectivity using Bridge Adapter.

Step 1: Create Ansible Configuration file:

Ansible, being an agentless automation tool, needs its configuration and inventory at the controller node, which here is our local system. The configuration file can either be created globally on the controller node at /etc/ansible/ansible.cfg, or locally in the workspace where we are going to run our playbooks/roles.

Create Workspace for this Project:

# mkdir hadoop-ws
# cd hadoop-ws
# mkdir roles

Configuration file:

ansible.cfg
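The exact configuration file was shared as an image; as a rough sketch (the inventory file name and the remote user are assumptions for an Amazon Linux 2 setup), it could look like this:

[defaults]
# local inventory file inside the workspace (name is an assumption)
inventory = ./ip.txt
# SSH user and key used for the EC2 instances
remote_user = ec2-user
private_key_file = ./hadoop_instance.pem
host_key_checking = False
roles_path = ./roles

[privilege_escalation]
become = true
become_method = sudo
become_user = root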

For explanations of the above file, visit this link!

Step 2: Create two files

  • First:

An Ansible Vault file named cred.yml, which contains the IAM access key and secret key for authentication to your AWS account. In your directory 'hadoop-ws', create it using ansible-vault create cred.yml (give a password)

file format:
access_key: GUJGWDUYGUEWVVFEWGVFUYV
secret_key: huadub7635897^%&hdfqt57gvhg
  • Second

Create a file named hadoop_instance.pem, which is the key pair we use to create the EC2 instances remotely over our AWS account.

Steps:

1. Go to AWS Management Console

2. EC2 dashboard

3. Key-pairs

4. Create new Key-Pair

5. Give name as hadoop_instance

6. Select ‘.PEM’ file format

7. Download the key to your local system

8. Transfer the key to the Ansible controller node, into the same directory as your workspace, hadoop-ws

Step 3: Creating Ansible Roles

Next we will create four main Roles i.e.
◼ AWS-EC2 instance Creation and dynamic IP retrieval Role,
◼ Hadoop-Master Configuration and Starting Namenode Service Role,
◼ Hadoop-Slave Configuration and Starting Datanode Service Role,
◼ Hadoop-Client Configuration and Registration to Cluster.

# cd roles
# ansible-galaxy init ec2
# ansible-galaxy init hadoop_master
# ansible-galaxy init hadoop_slave
# ansible-galaxy init hadoop_client

Now we have all the tasks, templates, vars files and other Ansible artifacts pre-created in the known role file structure; we just need to write the declaration/description (in YAML) of everything we need, using modules and Jinja2 attribute annotations.

To know more about roles visit!

Step 4: Writing EC2 role:

The EC2 module of Ansible provides the means to launch and provision instances over the AWS cloud. We have preferred t2.micro as the instance_type and an Amazon Linux 2 image as the AMI. We also have a security group allowing all traffic from anywhere; instead of that, you may restrict it to the HDFS protocol, SSH and a few inbound/outbound rules for ports such as 9001, 50070 and a few more.

# cd roles/ec2/tasks
# vim main.yml
task.yml
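The original tasks file was shared as an image. The sketch below shows what an ec2 role along these lines might look like, using the variable names listed in the next paragraph; the loop over instance tags and the add_host grouping are my assumptions, not necessarily the author's exact code:

# roles/ec2/tasks/main.yml -- illustrative sketch only
- name: Install the Python libraries needed by the ec2 module
  pip:
    name: "{{ python_pkg }}"

- name: Launch the Hadoop nodes on AWS
  ec2:
    aws_access_key: "{{ access_key }}"
    aws_secret_key: "{{ secret_key }}"
    region: "{{ region_name }}"
    image: "{{ ami_id }}"
    instance_type: "{{ instance_flavour }}"
    key_name: "{{ key_pair }}"
    vpc_subnet_id: "{{ subnet_name }}"
    group: "{{ sg_name }}"
    assign_public_ip: yes
    wait: yes
    count: 1
    instance_tags:
      Name: "{{ item }}"
  loop: "{{ instance_tags }}"
  register: ec2_out

- name: Add the new public IPs to in-memory inventory groups
  add_host:
    name: "{{ item.instances[0].public_ip }}"
    groups: "{{ item.item }}"
  loop: "{{ ec2_out.results }}"

The add_host task is what builds the dynamic inventory groups (hadoop_master, hadoop_slave, hadoop_client) that the later plays target.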

We have included some variables and files in the templates and vars folders, such as instance_tags, python_pkg, sg_name, region_name, subnet_name, ami_id, key_pair and instance_flavour. These variables can be called directly in the ec2 role's tasks/main.yml file, as per the Ansible artifact structure.

# cd roles/ec2/vars
# vim main.yml
var.yml
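Again, the real values were shown as an image; a sample vars file (all values here are placeholders — replace the subnet, AMI and security group with your own) might look like:

# roles/ec2/vars/main.yml -- sample placeholder values
instance_tags:
  - hadoop_master
  - hadoop_slave
  - hadoop_slave
  - hadoop_client
python_pkg:
  - boto
  - boto3
sg_name: hadoop-sg
region_name: ap-south-1
subnet_name: subnet-xxxxxxxxxxxxxxxxx
ami_id: ami-xxxxxxxxxxxxxxxxx
key_pair: hadoop_instance
instance_flavour: t2.micro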

Step 5: Writing Hadoop_master role:

The Hadoop master holds the main metadata of the cluster, such as the cluster ID and ports, written in the core-site.xml file in XML format. The critical thing we need to perform on this master node is formatting the master's shared folder, which needs to be done only once. Further, we have to give the port through which the master communicates with its slaves and clients, i.e. port 9001.

Consider the following code, which needs to be written inside main.yml of the hadoop_master/tasks folder.

task.yml
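The author's tasks file was an image; a minimal sketch of a hadoop_master/tasks/main.yml following the steps described above (the package names, install paths and the Hadoop 1.x daemon command are assumptions) could be:

# roles/hadoop_master/tasks/main.yml -- illustrative sketch only
- name: Copy the JDK and Hadoop rpm packages to the master
  copy:
    src: "{{ item }}"
    dest: /root/
  loop: "{{ pkgs_name }}"

- name: Install the JDK and Hadoop
  command: "rpm -ivh /root/{{ item }} --force"
  loop: "{{ pkgs_name }}"

- name: Create the NameNode directory
  file:
    path: "{{ hadoop_folder }}"
    state: directory

- name: Copy core-site.xml from the files folder
  copy:
    src: core-site.xml
    dest: /etc/hadoop/core-site.xml

- name: Render hdfs-site.xml from the Jinja2 template
  template:
    src: hdfs-site.xml.j2
    dest: /etc/hadoop/hdfs-site.xml

- name: Format the NameNode folder (runs only once)
  shell: echo Y | hadoop namenode -format
  args:
    creates: "{{ hadoop_folder }}/current"

- name: Start the NameNode daemon
  shell: hadoop-daemon.sh start namenode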

We have a variable file containing the values that are called/substituted at role-execution time. These variables include pkgs_name and hadoop_folder.

var.yml
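A corresponding vars file might be (the rpm file names are assumptions, based on a typical JDK 8 + Hadoop 1.2.1 setup):

# roles/hadoop_master/vars/main.yml -- sample values
pkgs_name:
  - jdk-8u171-linux-x64.rpm
  - hadoop-1.2.1-1.x86_64.rpm
hadoop_folder: /nn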

There are two files in Hadoop that we need to configure: core-site.xml and hdfs-site.xml. To configure both files through Ansible we have two strategies: copy them as-is using the copy module, or copy them with some changes using the template module. Usually we use template when data needs processing at copy time; such template files are written in the Jinja2 format.

What you need to do is put the core-site.xml file in the files folder of our role. So just go to hadoop_master/files and put the following file.

core-site.xml
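For reference, a core-site.xml for the master along these lines (binding to all interfaces on port 9001, as described above) would look roughly like:

<!-- roles/hadoop_master/files/core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <!-- the master listens for slaves and clients on port 9001 -->
    <value>hdfs://0.0.0.0:9001</value>
  </property>
</configuration>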

The Jinja2-format file that we include inside the templates folder is hdfs-site.xml, which the master renders at configuration time:

hdfs-site.xml
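A sketch of that template, with the directory taken from the role variable, could be:

<!-- roles/hadoop_master/templates/hdfs-site.xml.j2 -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <!-- hadoop_folder comes from vars/main.yml, e.g. /nn -->
    <value>{{ hadoop_folder }}</value>
  </property>
</configuration>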

Step 6: Writing Hadoop_slave role:

Now we have to configure the Hadoop slaves; there can be as many slaves as we want, depending on how much distributed storage and computing power we want to gain. Following is the tasks/main.yml file where all the steps are declared in YAML format, starting with installing the dependencies and ending with the final task of running the DataNode service.

task.yml
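As with the master, the actual file was an image; a sketch of the slave tasks (same installation steps, but rendering both templates and starting the DataNode) might be:

# roles/hadoop_slave/tasks/main.yml -- illustrative sketch only
- name: Copy the JDK and Hadoop rpm packages to the slave
  copy:
    src: "{{ item }}"
    dest: /root/
  loop: "{{ pkgs_name }}"

- name: Install the JDK and Hadoop
  command: "rpm -ivh /root/{{ item }} --force"
  loop: "{{ pkgs_name }}"

- name: Create the DataNode directory
  file:
    path: "{{ hadoop_folder }}"
    state: directory

- name: Render core-site.xml and hdfs-site.xml from templates
  template:
    src: "{{ item }}.j2"
    dest: "/etc/hadoop/{{ item }}"
  loop:
    - core-site.xml
    - hdfs-site.xml

- name: Start the DataNode daemon
  shell: hadoop-daemon.sh start datanode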

Following is the vars/main.yml file from our hadoop_slave role, where we have two variables, pkgs_name and hadoop_folder, that are called for their respective values.

var.yml
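It mirrors the master's vars file; only the Hadoop folder differs (a /dn directory for the DataNode is my assumption):

# roles/hadoop_slave/vars/main.yml -- sample values
pkgs_name:
  - jdk-8u171-linux-x64.rpm
  - hadoop-1.2.1-1.x86_64.rpm
hadoop_folder: /dn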

We use the template module in tasks/main.yml to copy both files to the slave node after processing them, i.e. making the required changes. The files are core-site.xml and hdfs-site.xml, kept inside the templates folder of the role in Jinja2 format:

core-site.xml
hdfs-site.xml
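Sketches of those two templates, assuming the master's address is resolved from the dynamic inventory group created by the ec2 role (the group name hadoop_master is my assumption):

<!-- roles/hadoop_slave/templates/core-site.xml.j2 -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <!-- master IP taken from the in-memory inventory built by the ec2 role -->
    <value>hdfs://{{ groups['hadoop_master'][0] }}:9001</value>
  </property>
</configuration>

<!-- roles/hadoop_slave/templates/hdfs-site.xml.j2 -->
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>{{ hadoop_folder }}</value>
  </property>
</configuration>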

Step 7: Writing Hadoop_client role:

The client's goal is to reach the cluster using the master IP and then put/read/write/edit/delete — any operation needed to store large files in the desired block size, with replication that increases the availability of those files. Following is the tasks/main.yml file of our role, where the descriptive code for the client configuration is given:

task.yml
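The original was an image; a minimal client tasks sketch (the verification step at the end is my addition, purely for illustration) could be:

# roles/hadoop_client/tasks/main.yml -- illustrative sketch only
- name: Copy the JDK and Hadoop rpm packages to the client
  copy:
    src: "{{ item }}"
    dest: /root/
  loop: "{{ pkgs_name }}"

- name: Install the JDK and Hadoop
  command: "rpm -ivh /root/{{ item }} --force"
  loop: "{{ pkgs_name }}"

- name: Point the client at the master via core-site.xml
  template:
    src: core-site.xml.j2
    dest: /etc/hadoop/core-site.xml

- name: Verify that the client can reach the cluster
  command: hadoop dfsadmin -report
  register: report

- name: Show the cluster report
  debug:
    var: report.stdout_lines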

The variables are mentioned in the vars/main.yml file, which includes pkgs_name only.

var.yml
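Only the package list is needed here (same assumed rpm names as before):

# roles/hadoop_client/vars/main.yml -- sample values
pkgs_name:
  - jdk-8u171-linux-x64.rpm
  - hadoop-1.2.1-1.x86_64.rpm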

As we only need to mention the master node while configuring the client, the only file we need to change is core-site.xml in Jinja2 format, which we put in the templates folder of our role:

core-site.xml
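A sketch of that template, again assuming the master's IP comes from the inventory group built by the ec2 role:

<!-- roles/hadoop_client/templates/core-site.xml.j2 -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <!-- master IP and the port it listens on -->
    <value>hdfs://{{ groups['hadoop_master'][0] }}:9001</value>
  </property>
</configuration>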

Here we retrieve the master IP through the hostvars facts of our dynamic inventory, obtained from the master provisioning output, along with the port on which the master listens/responds, i.e. 9001.

Finally, we are done with creating all the roles; now we just need to run them. For that we create a playbook that includes the roles one after another in a logical order: first the nodes are launched over AWS, then the master configuration, followed by the slave configuration, and finally the client registration.

Step 8: Create Setup Playbook to run all the roles and Create the Cluster over AWS:

Following is the playbook file — setup.yml — that needs to be inside our working directory, hadoop-ws.

setup.yml
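A sketch of such a playbook, assuming the group names created by the ec2 role's add_host task, might be:

# hadoop-ws/setup.yml -- illustrative sketch only
- name: Provision the EC2 instances
  hosts: localhost
  connection: local
  vars_files:
    - cred.yml
  roles:
    - ec2

- name: Configure the Hadoop master (NameNode)
  hosts: hadoop_master
  roles:
    - hadoop_master

- name: Configure the Hadoop slaves (DataNodes)
  hosts: hadoop_slave
  roles:
    - hadoop_slave

- name: Configure and register the Hadoop client
  hosts: hadoop_client
  roles:
    - hadoop_client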

Step 9: Running the Playbook Setup.yml

Run the playbook using the ansible-playbook command, giving the vault password to authenticate and log in to the AWS account.

# ansible-playbook setup.yml --ask-vault-pass

Hence we have achieved our target of “Hadoop Multi-Node Cluster Over AWS”.

You can find this project on my GitHub; just fork it and let's make the project more system-independent and reliable for customers.

To contribute to the project, or for further queries or opinions, you may connect with me over LinkedIn:

Thanks for reading. Hope this blog has given you some valuable inputs!!


Akanksha Singh

Platform Engineer | Kubernetes | Docker | Terraform | Helm | AWS | Azure | Groovy | Jenkins | Git, GitHub | Sonar | NMAP and other Scan and Monitoring tool