To make Big Data cloud-native, Hortonworks has unveiled the Open Hybrid Architecture Initiative. In our previous blogs, we have talked about the vision, key tenets/concepts, real-world use case, and the new storage environment of O3. We often get asked by our customers, partners, and analysts: “Hortonworks has been in the middle of the data revolution for years. How do you plan to bring the new Kubernetes cloud-native architecture to Big Data?”
We have collaborated with many of our customers on their container journey for many years now, leading to one of our major recent launches (Hortonworks Data Platform 3.x). Kubernetes started as a container orchestrator for stateless applications such as web applications and is now on its way to supporting data-intensive applications, but running the Big Data stack on Kubernetes introduces many new challenges and opportunities.
A data-intensive microservice such as a Data Science solution comprises both stateless and stateful components. The stateless components, such as Spark and TensorFlow, require bursty compute and memory to process deep learning or statistical models. Then there are stateful services such as PostgreSQL or MySQL to store persistent state, or even file shares to store data science notebooks. To enable the full gamut of microservices, we are investing in Apache Hadoop Ozone (O3). O3 provides a persistent storage layer at Big Data scale with multiple access paths (Hadoop API, S3 API, iSCSI block, NFS file share). It is decoupled from the compute layer and exposes a Container Storage Interface (CSI) so that it integrates seamlessly into the Kubernetes ecosystem.
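To make the CSI integration a little more concrete, here is a minimal sketch of how a stateful service might request storage through a standard Kubernetes PersistentVolumeClaim. The claim fields are the regular Kubernetes ones; the storage class name `ozone-csi`, the claim name `notebook-store`, and the helper `build_pvc` are hypothetical and used only for illustration, not names taken from O3 itself.

```python
import json


def build_pvc(name: str, size_gi: int, storage_class: str) -> dict:
    """Return a Kubernetes PersistentVolumeClaim manifest as a plain dict."""
    if size_gi <= 0:
        raise ValueError("size_gi must be positive")
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # One node mounts the volume read-write, e.g. a single database pod.
            "accessModes": ["ReadWriteOnce"],
            # Assumption: a StorageClass backed by a CSI driver has been
            # registered under this (hypothetical) name.
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": f"{size_gi}Gi"}},
        },
    }


if __name__ == "__main__":
    # Hypothetical claim for a data science notebook store.
    pvc = build_pvc("notebook-store", 50, "ozone-csi")
    print(json.dumps(pvc, indent=2))
```

The printed JSON could be saved to a file and applied with `kubectl apply -f`, assuming a storage class with that name exists in the cluster. The point of the sketch is that the stateful component only names a storage class, so the compute layer stays decoupled from whatever storage system sits behind it.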
In the Big Data world, a single Hadoop cluster typically supports thousands of tenants at scale, almost in an inverted pyramid. A handful of administrators operationalize the Hadoop clusters; a dozen data engineers extract data from multiple sources into the cluster for analytics at scale; a dozen data scientists apply GPU-intensive deep learning or statistical models to one or more business problems; hundreds of business analysts run interactive or batch queries for report generation via BI tools. Battle-hardened Apache YARN, with its advanced capabilities, handles those diverse workloads, from real-time interactive queries to batch workloads at scale, in a very elastic manner. We see an opportunity to extend the prowess of YARN as a powerful job scheduler into the hybrid environment.
Then there is the shift of the security model from bare metal to a containerized world, against the backdrop of multi-cloud adoption alongside on-premises. When it comes to networking, containers now need an overlay network to communicate across physically separate servers. So, we want to invest in the Container Network Interface (CNI), so that our customers can use a familiar software-defined networking vendor such as Calico to enforce IP firewall policies at container granularity. There are many more areas to consider as we scan through the twenty-plus (and growing) components of the Big Data stack (did I mention Apache Oozie and Apache Airflow?). Kubernetes has been a community effort based on collaboration across a wide variety of technologies, and that aligns with Hortonworks’ commitment to the open source community. We want to participate in the Cloud Native Computing Foundation (CNCF) and jointly work towards a common Cloud Native Big Data Architecture.
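As an illustration of the container-level firewall policy mentioned above, here is a minimal sketch that builds a standard Kubernetes NetworkPolicy, the kind of object that a policy-capable CNI plugin such as Calico enforces. Note that Kubernetes applies these policies per pod rather than per individual container. The labels, namespace, port, and the helper `build_ingress_policy` are hypothetical examples, not part of any Hortonworks product.

```python
import json


def build_ingress_policy(name, namespace, target_labels, allowed_labels, port):
    """Return a NetworkPolicy manifest that only lets pods matching
    allowed_labels reach pods matching target_labels on one TCP port."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Pods this policy protects.
            "podSelector": {"matchLabels": dict(target_labels)},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    # Only pods with these labels (same namespace) may connect.
                    "from": [{"podSelector": {"matchLabels": dict(allowed_labels)}}],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical example: only Spark driver pods may reach the metadata DB.
    policy = build_ingress_policy(
        name="allow-spark-to-db",
        namespace="data-science",
        target_labels={"app": "postgresql"},
        allowed_labels={"app": "spark-driver"},
        port=5432,
    )
    print(json.dumps(policy, indent=2))
```

Because the policy is expressed with labels rather than IP addresses, the same definition can follow a workload between on-premises and cloud clusters, provided the CNI plugin in each cluster supports NetworkPolicy.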
We have collaborated with our customers, analysts, and partners, and this is how we now view the Cloud Native Big Data architecture that is consistent across multi-cloud and on-premises. This architecture has decoupled storage and compute environments, and the compute environment is tuned for Big Data workloads with a strong workload/job scheduler. Our diverse workloads are containerized, and the same architecture can run both on-premises and in the cloud in an open format that our customers are used to, whether it is Enterprise Data Warehouse or Data Science and Engineering workloads. We have a shared security and governance service across on-premises and multi-cloud. Last but not least, all of this is managed via a single pane of glass, with persona-focused user experiences, so that we can support hundreds to thousands of diverse tenants in the inverted pyramid. Please drop by our booth to get a copy of the Cloud Native Big Data Journey trail map.
We want to demonstrate an end-to-end working instantiation of our Cloud Native Big Data Architecture with four demos. In these demos, a fictitious company named Hortonworks AirFreight wants to build a hybrid cloud encompassing on-premises and multi-cloud.
We encourage you to visit us at Booth P2 at KubeCon Seattle from Dec. 11th to 13th.