imWiki/hadoop_cust_ecosystem

Hadoop Env Setup with Sample ETL & Spark Program

Prerequisites

  1. A web browser, to view Zeppelin and the other service UIs
  2. Git (optional; used to download the contents of this repository)
  3. Vagrant: download the installer for your platform and run the setup
  4. VirtualBox, to run the virtual Ubuntu machine that hosts the Hadoop stack

Services

The virtual machine will be running the following services:

  • HDFS NameNode + DataNode
  • YARN ResourceManager/NodeManager + JobHistoryServer + ProxyServer
  • Hive Metastore and HiveServer2
  • Spark history server
  • Zeppelin Server

At the end of provisioning, a batch script downloads a CSV file and loads it into Hive. PySpark then populates the output table with the result dataset, which is visualized in Apache Zeppelin.

Getting Started

  1. Download and install VirtualBox and Vagrant (see the prerequisites above).
  2. Clone this repo: git clone git@github.com:imWiki/hadoop_cust_ecosystem.git
  3. In your terminal/cmd, change into the project directory (i.e. cd hadoop_cust_ecosystem).
  4. Run vagrant up --provider=virtualbox to create the VM using VirtualBox as the provider. (NOTE: This takes a while the first time because many dependencies are downloaded. Subsequent deployments are quicker because the dependencies are cached in the resources directory.)
  5. When the command completes, the data from the given URL has already been loaded into Hive. The end of the output should look something like the following, i.e. Result Set Loaded into ASSESSMENT_RESULTS Target table:

picture

  1. Execute vagrant ssh to log in to the VM.
  2. Execute beeline -u 'jdbc:hive2://vigneshm:10000/default;' --color=true -n vagrant -p vagrant to connect to Hive and see the tables that were created and loaded with the requested data.

picture
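If you want to launch the same Beeline session from a script, the command line above can be assembled programmatically. This is a minimal sketch, not part of this repository: the helper names hive_jdbc_url and beeline_command are hypothetical, and the host, port, database, and credentials are simply the ones shown in the step above.

```python
import shlex
from typing import List


def hive_jdbc_url(host: str, port: int = 10000, database: str = "default") -> str:
    """Build a HiveServer2 JDBC URL in the same form as the README command."""
    return f"jdbc:hive2://{host}:{port}/{database};"


def beeline_command(host: str, user: str, password: str) -> List[str]:
    """Return the beeline argument list (hypothetical helper, for illustration)."""
    return [
        "beeline",
        "-u", hive_jdbc_url(host),
        "--color=true",
        "-n", user,
        "-p", password,
    ]


# Values taken from the README step; run the result inside the VM.
cmd = beeline_command("vigneshm", "vagrant", "vagrant")
print(shlex.join(cmd))  # shell-quoted form, safe to paste into a terminal
```

The list form can be passed directly to subprocess.run inside the VM, which avoids shell quoting problems with the semicolon in the JDBC URL.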

  1. The main ETL functionality is implemented in the shell script data_proc.sh in the scripts directory, and the PySpark job is written in asmt_results.py.
  2. Navigate to http://vigneshm:8080/#/notebook/2E4H2MXP3 for a simple visualization of the dataset built in Zeppelin. It should look like this:

picture

  1. If you run into issues running %spark.sql in the Zeppelin dashboards, the most likely cause is a configuration glitch. Restart the service with the commands below after logging in to the Vagrant virtual machine:
cd /home/ubuntu/zeppelin-0.8.0-bin-netinst/bin/
sudo -sE
./zeppelin-daemon.sh restart

picture
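After a restart, one way to confirm that Zeppelin is accepting connections again is to probe its TCP port. The sketch below is an illustration only: port_is_open is a hypothetical helper, and port 8080 is assumed from the notebook URL used elsewhere in this README.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and hostname lookup failures.
        return False


# Example (from the host machine, once the hosts entry for vigneshm exists):
#   port_is_open("vigneshm", 8080)
```

An open port only shows that the daemon is listening; it does not prove that the %spark.sql interpreter itself is healthy, so re-run the notebook paragraph to be sure.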

Work out the IP address of the VirtualBox VM

The IP address of the VirtualBox VM is 10.211.55.101. Add an entry for this address to the hosts file on your machine so that you can reach the services by hostname (vigneshm) instead of by IP in the browser, as shown below:

picture
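The hosts entry can also be prepared in code. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the functions work on the hosts file content as a string rather than editing the file in place, and the IP and hostname are the ones given in this section. Writing the result back to the real hosts file normally needs administrator rights and is left out on purpose.

```python
def hosts_entry(ip: str, hostname: str) -> str:
    """Format a single hosts-file line mapping ip to hostname."""
    return f"{ip}\t{hostname}"


def has_mapping(hosts_text: str, ip: str, hostname: str) -> bool:
    """Return True if an active (non-commented) line maps ip to hostname."""
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        if fields[0] == ip and hostname in fields[1:]:
            return True
    return False


def with_mapping(hosts_text: str, ip: str, hostname: str) -> str:
    """Return hosts_text with the mapping appended, unless already present."""
    if has_mapping(hosts_text, ip, hostname):
        return hosts_text
    sep = "" if not hosts_text or hosts_text.endswith("\n") else "\n"
    return hosts_text + sep + hosts_entry(ip, hostname) + "\n"


# Example with the values from this README:
print(with_mapping("127.0.0.1 localhost\n", "10.211.55.101", "vigneshm"))
```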

Web user interfaces

Here are some URLs for the various service UIs:

Substitute the IP address of the container or VirtualBox VM for vigneshm if necessary.
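If you have not added the hosts entry, the substitution can be done mechanically. This sketch assumes nothing beyond the Python standard library; with_host is a hypothetical helper, and the example URL is the Zeppelin notebook link from this README.

```python
from urllib.parse import urlsplit, urlunsplit


def with_host(url: str, new_host: str) -> str:
    """Return url with its hostname replaced by new_host, keeping any port.

    Note: user:password@ credentials in the URL, if any, are dropped.
    """
    parts = urlsplit(url)
    netloc = new_host if parts.port is None else f"{new_host}:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))


# Swap the hostname for the VM's IP address given in this README.
print(with_host("http://vigneshm:8080/#/notebook/2E4H2MXP3", "10.211.55.101"))
```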

Shared Folder

Vagrant automatically mounts the folder containing the Vagrantfile on the host machine into the guest machine as /vagrant.
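To find where a host file appears inside the VM, take its path relative to the project directory and append it to /vagrant. The sketch below illustrates that mapping; guest_path is a hypothetical helper and the host directory in the example is made up.

```python
from pathlib import PurePath, PurePosixPath


def guest_path(project_root: str, host_path: str) -> str:
    """Map a host path under project_root to its location under /vagrant.

    Raises ValueError if host_path is not inside project_root.
    """
    rel = PurePath(host_path).relative_to(PurePath(project_root))
    return str(PurePosixPath("/vagrant").joinpath(*rel.parts))


# Hypothetical host checkout location; scripts/data_proc.sh is from this repo.
print(guest_path("/home/me/hadoop_cust_ecosystem",
                 "/home/me/hadoop_cust_ecosystem/scripts/data_proc.sh"))
```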

Management of the Vagrant VM

To stop the VM and preserve all setup and data within it:

vagrant halt

or

vagrant suspend

Issue vagrant up again to restart the VM from where you left off.

To completely wipe the VM so that vagrant up gives you a fresh machine:

vagrant destroy

picture
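The lifecycle commands used in this README (up, halt, suspend, destroy) can be wrapped in a small helper if you want to drive the VM from a script. This is a sketch, not part of the repository: vagrant_command and run_vagrant are hypothetical names, and only the subcommands and the --provider option shown in this README are used.

```python
import subprocess
from typing import List, Optional

# Only the subcommands that appear in this README.
ACTIONS = {"up", "halt", "suspend", "destroy"}


def vagrant_command(action: str, provider: Optional[str] = None) -> List[str]:
    """Build the argument list for one of the vagrant lifecycle commands."""
    if action not in ACTIONS:
        raise ValueError(f"unsupported vagrant action: {action!r}")
    cmd = ["vagrant", action]
    if action == "up" and provider:
        cmd.append(f"--provider={provider}")
    return cmd


def run_vagrant(action: str, project_dir: str = ".",
                provider: Optional[str] = None) -> int:
    """Run the command in project_dir (where the Vagrantfile lives).

    Returns vagrant's exit code. stdin is inherited, so the confirmation
    prompt of `vagrant destroy` still reaches the terminal.
    """
    return subprocess.run(vagrant_command(action, provider),
                          cwd=project_dir).returncode


# Example: run_vagrant("up", "hadoop_cust_ecosystem", provider="virtualbox")
```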
