The Apache Hadoop community announced Hadoop 3.0 GA in December, 2017 and Hadoop 3.1 in April, 2018 loaded with great features and improvements.
One of the biggest challenges while upgrading to a new major release of a software platform is its compatibility. Apache Hadoop community has focused on ensuring wire and binary compatibility for Hadoop 2 clients and workloads.
There are some challenges to be addressed by admins while upgrading to a major release of Hadoop. Users running workloads on Hadoop 2 should ideally be able to seamlessly run or migrate their workloads onto Hadoop 3. This blog talks about the motivation for upgrading to Hadoop 3 and provide a cluster upgrade guide for admins and workload migration guide for users of Hadoop.
Hadoop 3 has been an eagerly awaited major release with great features and improvements in both YARN and HDFS. Below, we have highlighted only a few of the newly added features. There are tons of exciting features and improvements that were part of Hadoop 3 which will delight customers and increase Hadoop adoption without costing significantly more.
HDFS introduced Erasure coding in Hadoop 3 which enables significant storage cost savings and reduction in storage overhead from 200% to 50%.
HDFS Intra-Data Node disk Balancer is a new feature added in Hadoop 3 which adds the capability to balance data across disks within a single Datanode.
HDFS Federation with support for namespaces for better scalability is GA in Hadoop 3.
YARN scheduler added a bunch of improvements particularly Capacity Scheduler with changes to scheduler’s core for faster(global) scheduling and native support for new resource types like GPUs.
Support for Containerized applications on YARN ( via Docker) is GA in Hadoop 3 along with support for long running services and service discovery.
Admins are recommended to Express Upgrade clusters from Hadoop 2 to 3
With this being a major version upgrade, there are few challenges and issues in supporting Rolling Upgrades. Some of the open issues are
Hadoop 3 doesn’t preserve full API level compatibility due to the following changes
Some of the major configuration/script changes introduced in Hadoop 3 are listed below
For more details on changes in Hadoop 3 and how they would affect upgrades or workloads post upgrade, refer to the slides for the talk we gave recently at Dataworks Summit
Recommend users to upgrade to at least Hadoop 2.8.4 before migrating to Hadoop 3.
Hadoop 3 requires Java 8 and above to be deployed on the cluster
Minimum supported Docker version on Hadoop 3 is 1.12.5
Minimum version required is Bash V3. POSIX shells are NOT supported
Hadoop 3 requires additional YARN components in the cluster for running YARN Service applications
Apache Hadoop upgrade is fairly involved with elaborate steps for pre-upgrade, upgrade process and post-upgrade. Please refer to the slides for the talk we gave recently at Dataworks Summit for further details.
MapReduce is fully binary compatible and workloads should run as is without any changes required.
Apache Slider is retiring from Apache Incubator and users are required to port their Slider applications to YARN Services. Please refer to our recent blog on Apache Hive LLAP as a YARN Service on how easy it is to port Slider apps to YARN Services.
The following Apache versions already support Hadoop 3 or are in the process of supporting Hadoop 3 in the community.
Express upgrades are the recommended way to go for upgrading from Apache Hadoop 2 to Hadoop 3.
Apache Hadoop community recently released Hadoop 3.1.1 with some features and fixes for bugs identified since Apache Hadoop 3.1.0. Recommend users to upgrade to Apache Hadoop 3.1.1 from Apache Hadoop 2.8.4. For end users, applications on HDFS and YARN should run mostly as is !