Apache Hadoop

Application description

The Apache Hadoop project provides an open source framework for massively distributed computing that scales both compute and data storage. Hadoop clusters are built on top of the Hadoop Distributed File System (HDFS), which gives each node localised access to shared data without requiring locking and serialisation across the complete dataset. As a result, local storage and CPU performance are important factors in overall system performance.

Infrastructure Environment tested

The Apache Hadoop cluster application has been tested on the following:

| Resource | Value |
| --- | --- |
| Host OS | Ubuntu 18.04 |
| Kernel version | 4.15.0-51-generic |
| Package manager | apt |
| Application version | 3.1.2 |
| Environments tested | AWS r5.2xlarge, Sunlight r5.2xlarge, VMware r5.2xlarge |

Configuration and Setup description

Test data sizes are 200/800/800 GB. The benchmark is TestDFSIO, Hadoop's built-in stress test utility.

| Resource | Value |
| --- | --- |
| Approx. package installation time | 5 mins |
| Approx. test execution time | 60 mins |

bin/hdfs namenode -format
bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.1.2-tests.jar TestDFSIO -write -nrFiles 8 -fileSize 100GB
bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.1.2-tests.jar TestDFSIO -read -nrFiles 8 -fileSize 100GB
bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-3.1.2-tests.jar TestDFSIO -clean
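The three TestDFSIO phases differ only in their mode flag, so they can be generated from a few variables. The following is a minimal sketch, not the installation script itself: the helper name `dfsio_cmd` and the variable names are hypothetical, and the jar path, file count and file size are simply copied from the commands above. It only prints the command lines, so it can be inspected safely before anything is run against a cluster.

```shell
#!/bin/sh
# Assumed values, copied from the benchmark commands; adjust for your install.
HADOOP_VERSION="3.1.2"
TESTS_JAR="/usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-${HADOOP_VERSION}-tests.jar"
NR_FILES=8
FILE_SIZE="100GB"

# dfsio_cmd (hypothetical helper) prints the TestDFSIO command line for the
# given mode (write, read or clean) without executing it.
dfsio_cmd() {
    case "$1" in
        write|read)
            echo "bin/hadoop jar ${TESTS_JAR} TestDFSIO -$1 -nrFiles ${NR_FILES} -fileSize ${FILE_SIZE}" ;;
        clean)
            echo "bin/hadoop jar ${TESTS_JAR} TestDFSIO -clean" ;;
        *)
            echo "unknown mode: $1" >&2
            return 1 ;;
    esac
}

# Print the three phases in the order used for the benchmark.
for mode in write read clean; do
    dfsio_cmd "$mode"
done
```

With 8 files of 100GB each, a write run produces 800 GB of data. To execute the phases rather than print them, the output can be piped to `sh` from the Hadoop installation directory.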

Download the full installation script for Ubuntu 18.04 here.

Data results table

Test completion time

| Flavour | Write | Read | Combined RW (50:50) |
| --- | --- | --- | --- |
| AWS r5.2xlarge + guaranteed IOPS | 2126.11 | 3837.57 | 2981.84 |
| Sunlight r5.2xlarge | 1533.36 | 1325.39 | 1429.375 |

Throughput test

| Flavour | Write | Read | Combined RW (50:50) |
| --- | --- | --- | --- |
| AWS r5d.2xlarge | 385.77 | 213.59 | 299.68 |
| Sunlight r5d.2xlarge | 535.17 | 619.3 | 577.235 |

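On the combined read/write figures, the equivalent test on Sunlight executes around 2x faster than on AWS: completion time is 1429.375 against 2981.84, and throughput is 577.235 against 299.68. The gap is largest for reads and smaller for writes.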
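The comparison can be reproduced from the two results tables. The sketch below assumes that the Combined column is the plain mean of the Write and Read columns (which matches the published values) and that lower is better for completion time while higher is better for throughput. The helper names `mean2` and `ratio` are hypothetical, and `awk` is used only because POSIX shell has no floating-point arithmetic.

```shell
#!/bin/sh
# mean2 A B: mean of two numbers, printed to three decimal places.
mean2() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.3f\n", (a + b) / 2 }'; }

# ratio A B: A divided by B, printed to two decimal places.
ratio() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'; }

# Combined completion time (mean of write and read).
aws_time=$(mean2 2126.11 3837.57)        # 2981.840
sunlight_time=$(mean2 1533.36 1325.39)   # 1429.375

# Combined throughput (mean of write and read).
aws_tput=$(mean2 385.77 213.59)          # 299.680
sunlight_tput=$(mean2 535.17 619.3)      # 577.235

# Speedup of Sunlight over AWS: time is AWS/Sunlight, throughput is Sunlight/AWS.
echo "completion time speedup: $(ratio "$aws_time" "$sunlight_time")x"   # 2.09x
echo "throughput speedup:      $(ratio "$sunlight_tput" "$aws_tput")x"   # 1.93x
```

Both ratios land close to 2, which is consistent with the summary. Running `ratio` on the Write and Read columns separately gives roughly 1.39 for writes and 2.90 for reads on either metric.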

Performance graphs

Hadoop Read and Write test completion time

Hadoop RW throughput