The Bangalore Apache Hadoop Meetup group, with over 3,400 members who share an interest in the Hadoop ecosystem, brings together a community of practitioners and developers in Bangalore. The talks at this meetup cover a variety of topics related to the Hadoop ecosystem, such as data science workloads, big-data-driven applications, SQL on Hadoop, and YARN workloads; the list is endless.
The most recent meetup of the group was held on July 28, 2018 at LinkedIn, Bangalore. Over a hundred enthusiastic participants attended this meetup on a busy weekend. The Twitter feed about the meetup was also busy with plenty of tweets; you can see them here:
Talk 1 Ozone: Object Store in Apache Hadoop
The meetup kicked off with an overview of Ozone. Mukul Kumar Singh and Nandakumar from Hortonworks shared a quick overview of the new Ozone file system, its underlying Hadoop Distributed Data Store (HDDS) layer, and their capabilities.
One of the major challenges with HDFS is managing small files, which also limits scalability. Ozone, which belongs to the Apache Hadoop ecosystem, is an object store that helps address the scalability issues in HDFS. Ozone is also HDFS-compatible, so downstream projects can use it without any client modifications.
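To see why small files hurt HDFS scalability, consider the commonly cited rule of thumb that each namespace object (an inode or a block) consumes on the order of 150 bytes of NameNode heap. The numbers below are a back-of-the-envelope illustration, not exact figures:

```python
# Rough NameNode heap estimate for small vs. large files.
# Rule of thumb: ~150 bytes of heap per namespace object (inode or
# block). The constant is illustrative and varies across versions.

BYTES_PER_OBJECT = 150

def namenode_heap_gb(num_files: int, blocks_per_file: int = 1) -> float:
    """Approximate NameNode heap (GB) needed to track the files."""
    objects = num_files * (1 + blocks_per_file)  # one inode + its blocks
    return objects * BYTES_PER_OBJECT / 1024**3

# ~100 TB stored as 100 million 1 MB files (one block each) versus
# the same data as 100,000 1 GB files (eight 128 MB blocks each):
small = namenode_heap_gb(100_000_000, blocks_per_file=1)
large = namenode_heap_gb(100_000, blocks_per_file=8)
print(f"small files: ~{small:.1f} GB heap, large files: ~{large:.2f} GB heap")
```

The same data volume needs hundreds of times more NameNode memory when stored as small files, which is exactly the pressure an object store like Ozone relieves.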
Talk 2 Sorcerer – Myntra’s Self-Serve Data Ingestion Platform
Deepak Batra from Myntra (one of the largest online shopping platforms for fashion and lifestyle in India) presented an overview of Sorcerer, the data ingestion platform now running on Myntra's production clusters.
Sorcerer uses Apache Gobblin as the core of its data ingestion framework, running on the Apache Hadoop YARN resource management platform. It also uses Debezium, an open-source distributed platform for change data capture, to capture changes from MySQL at a rate of over 20 million changes an hour. Sorcerer also uses the Hive Metastore for data discovery, together with its compaction and snapshot features.
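Conceptually, a change-data-capture pipeline like the one described above replays a stream of row-level change events onto a snapshot to keep a copy of the source table current. A toy Python sketch of that idea (the event shape here is hypothetical, not Debezium's actual format):

```python
# Toy change-data-capture (CDC) applier: replays row-level change
# events onto an in-memory snapshot of a table, keyed by primary key.
# The event dictionaries are a made-up shape for illustration only.

def apply_events(snapshot: dict, events: list) -> dict:
    table = dict(snapshot)
    for ev in events:
        op, key = ev["op"], ev["key"]
        if op in ("insert", "update"):
            table[key] = ev["row"]          # upsert the new row image
        elif op == "delete":
            table.pop(key, None)            # drop the deleted row
    return table

snapshot = {1: {"sku": "A", "qty": 5}}
events = [
    {"op": "insert", "key": 2, "row": {"sku": "B", "qty": 1}},
    {"op": "update", "key": 1, "row": {"sku": "A", "qty": 7}},
    {"op": "delete", "key": 2},
]
print(apply_events(snapshot, events))  # {1: {'sku': 'A', 'qty': 7}}
```

A production pipeline adds ordering, batching, and compaction on top of this basic replay loop, which is where features like Sorcerer's Hive Metastore compaction and snapshots come in.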
All of Myntra's data ingestion today is done through Sorcerer, and more features are planned for its querying capabilities in the near future.
Talk 3 Scaling and Managing Capacity for the LinkedIn Grid Ecosystem
LinkedIn has one of the largest Hadoop clusters in production. Rahul Jain from LinkedIn gave an impressive overview of LinkedIn's clusters and how they power various cool LinkedIn features such as People You May Know and LinkedIn Learning.
Rahul introduced us to the cluster that runs Azkaban, a batch workflow job scheduler created at LinkedIn for Hadoop jobs on the YARN platform.
Rahul shared use cases that are essential for a cluster administrator and showcased a new user interface that extracts complex metrics from a Hadoop cluster. This UI collects various cluster metrics from components such as YARN, the History Server, and HDFS, and correlates them on a dashboard. The dashboard, named GridView, provides an intuitive user experience that helps cluster administrators understand how their clusters are running at any given point in time and find relevant answers to pressing questions such as “Why is my job running slow today?”
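The core of such a correlation is aligning per-component metrics on a shared timeline so one dashboard row shows the whole cluster's state at a given moment. A minimal sketch of that idea (all names and fields are hypothetical; this is not LinkedIn's GridView code):

```python
# Illustrative sketch of correlating cluster metrics from several
# sources (e.g. YARN, History Server, HDFS) onto one timeline, in
# the spirit of a dashboard like GridView. Field names are made up.

def correlate(*sources):
    """Merge dicts of {timestamp: {metric: value}} into a single
    time-ordered {timestamp: merged-metrics} view."""
    timeline = {}
    for source in sources:
        for ts, metrics in source.items():
            timeline.setdefault(ts, {}).update(metrics)
    return dict(sorted(timeline.items()))

yarn = {1000: {"pending_apps": 42}, 1060: {"pending_apps": 57}}
hdfs = {1000: {"missing_blocks": 0}, 1060: {"missing_blocks": 3}}
view = correlate(yarn, hdfs)
print(view[1060])  # one merged dashboard row for one timestamp
```

Seeing, say, pending applications and missing blocks side by side for the same minute is what lets an administrator answer “Why is my job running slow today?” without querying each component separately.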
Talk 4 Implementation and Performance Impact of Join Order, Dynamic Filter, and Cost Estimation of Queries in Presto
Rajat Venkatesh from Qubole presented an interesting talk on the performance impact of various join statements in Presto.
In this talk, Rajat covered various types of joins and how to optimize them using dynamic filtering. Dynamic filtering led to roughly a 30% improvement in performance: join-key values collected at run time from one table are used to filter out rows of the related table that cannot match, before the join is executed.
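The intuition can be shown with a toy example (plain Python illustrating the idea, not Presto's internals): keys gathered from the small build side of a join are pushed down into the scan of the large probe side, so non-matching rows are dropped at the source instead of flowing into the join.

```python
# Toy sketch of dynamic filtering in a join. The join keys collected
# from the small (build) side are pushed into the scan of the large
# (probe) side, pruning rows before they reach the join operator.
# Tables and column names below are invented for illustration.

def scan(table, key, dynamic_filter=None):
    """Table scan; an optional dynamic filter prunes rows early."""
    return [row for row in table
            if dynamic_filter is None or row[key] in dynamic_filter]

orders = [{"cust_id": i, "amount": i * 10} for i in range(1, 7)]
vip = [{"cust_id": 2, "tier": "gold"}, {"cust_id": 5, "tier": "gold"}]

build_keys = {row["cust_id"] for row in vip}   # collected at run time
without = scan(orders, "cust_id")              # ordinary full scan
with_df = scan(orders, "cust_id", build_keys)  # dynamically filtered

print(len(without), "rows reach the join without dynamic filtering")
print(len(with_df), "rows reach the join with dynamic filtering")
```

In a real engine the savings come from skipping whole partitions or splits of the probe-side table, which is where improvements like the reported ~30% show up.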
Talk 5 Apache Hadoop 3 Insights and Migrating your Clusters from Hadoop 2 to Hadoop 3 by Sunil Govindan (@sunilgovind) and Rohith Sharma K S (@rohithsharmaks) from Hortonworks
We presented a detailed overview of Hadoop 3 features that are available in the Hadoop 3.1 release. We also provided an informative preview of the upcoming features in YARN.
In this session, we also covered migrating from Hadoop 2 clusters to Hadoop 3, which is intended to help users who plan to move their clusters to Hadoop 3 and use the latest features. To address the challenges associated with platform migration, we presented a detailed upgrade plan covering the configuration, shell script, and command changes that simplify the upgrade process.
We recommend Express Upgrade to migrate to Hadoop 3.
All the sessions we attended were informative and diverse. There were very good industry-wide discussions, with speakers and participants sharing their experiences of running Hadoop clusters in production environments for various workloads and surfacing more use cases to solve in the future.
After a delicious lunch arranged by the LinkedIn Bangalore team, we said goodbye to each other until the next Hadoop Meetup in Bangalore!