The document discusses deploying Apache Spark on Kubernetes. It provides an overview of Kubernetes and Spark architectures, and describes how to configure Spark applications to run on Kubernetes, including using DaemonSets for the shuffle service, StatefulSets for HDFS, and a staging server for resources. Examples are given of SparkPi and GroupByTest submissions using Kubernetes. Challenges of running HDFS on Kubernetes are also mentioned.
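As a rough illustration of the kind of SparkPi submission the slides mention, the sketch below assembles a `spark-submit` command line for Kubernetes cluster mode. It uses the property names from upstream Spark's Kubernetes support (`spark.kubernetes.container.image`, `spark.executor.instances`), which may differ from the ones used in the deck. The API server URL, image name, and jar path are placeholders, not values from the original.

```python
from typing import List


def build_spark_pi_submit(api_server: str, image: str, executors: int = 2) -> List[str]:
    """Return a spark-submit argument list for SparkPi on Kubernetes.

    All argument values passed in are illustrative placeholders (assumptions).
    """
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--name", "spark-pi",
        "--class", "org.apache.spark.examples.SparkPi",
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        # Hypothetical jar location inside the container image.
        "local:///opt/spark/examples/jars/spark-examples.jar",
    ]


# Placeholder cluster address and image name, for illustration only.
command = build_spark_pi_submit("https://k8s.example.com:6443", "example/spark:latest")
print(" ".join(command))
```

Building the command as a list rather than a single string keeps each flag and value separate, so it can be handed to `subprocess.run` without shell quoting concerns.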
This document discusses using Docker containers to run Cassandra clusters at Walmart. It proposes transforming existing Cassandra hardware into containers to better utilize unused compute. It also suggests building new Cassandra clusters in containers and migrating old clusters to double capacity on existing hardware and save costs. Benchmark results show Docker containers outperforming virtual machines on OpenStack and Azure in terms of reads, writes, throughput and latency for an in-house application.
This document summarizes the Scylla Operator for Kubernetes, including its developers, features, releases, and roadmap. Key points include: - The Scylla Operator manages and automates tasks for Scylla clusters on Kubernetes. - Features include seedless mode, security enhancements, performance tuning, and improved stability. - It follows a rapid 6-week release cycle and supports the latest two releases. - Future plans include additional performance optimizations, persistent storage support, TLS encryption, and multi-datacenter capabilities.
Organizations commonly use Apache Spark to gain actionable insight from their large amounts of data. Often, these analytics take the form of data processing pipelines: a series of processing stages in which each stage performs a particular function and the output of one stage is the input of the next. Examples of such pipelines include log processing, IoT pipelines, and machine learning. The common attribute among different pipelines is the sharing of data between stages. It is also common for Spark pipelines to process data stored in the public cloud, such as Amazon S3, Microsoft Azure Blob Storage, or Google Cloud Storage. The global availability and cost effectiveness of these public cloud storage services make them the preferred storage for data. However, running pipeline jobs while sharing data via cloud storage can be expensive in terms of increased network traffic, slower data sharing, and longer job completion times. Using Alluxio, a memory-speed virtual distributed storage system, enables sharing data between different stages or jobs at memory speed. By reading and writing data in Alluxio, the data can stay in memory for the next stage of the pipeline, and this results in great performance gains. In this talk, we discuss how Alluxio can be deployed and used with a Spark data processing pipeline in the cloud. We show how pipeline stages can share data with Alluxio memory for improved performance, and how Alluxio can improve completion times and reduce performance variability for Spark pipelines in the cloud.
Automate and streamline your build-test-release cycle for reliable, continuous delivery of your product
This document provides an overview of Apache Mesos and how to run Apache Spark on a Mesos cluster. It describes Mesos as a distributed systems kernel that allows sharing compute resources across applications. It then gives step-by-step instructions for launching a Mesos cluster in AWS, configuring and running Spark jobs on the cluster, and where to find example Spark jobs and further Mesos resources.
What tools are available today to manage a Hadoop cluster and its ecosystem? There are two tools ready today: Cloudera Manager and Ambari from Hortonworks. In this presentation I explain what they do and why to use them, as well as their pros and cons.
This presentation gives an overview of the most relevant updates and features going from Java 9 to Java 11 from a developer's perspective.
- Nomad is a cluster scheduler that makes deploying Spark clusters easy for developers and operationally simple. It allows Spark jobs to be deployed across multiple datacenters and regions. - Currently, Nomad allows running Spark in production environments without compromising functionality. It enables shared clusters for batch and streaming workloads with higher efficiency. It also integrates with Vault for secure secrets management. - Future enhancements may include preempting lower priority Spark executors, implementing quotas and chargebacks, enabling GPU acceleration, and allowing over-subscription of resources to improve cluster utilization. Nomad aims to make deploying and running Spark easier and more cost effective at scale.
In this talk we will walk through how Apache Kafka and Apache Accumulo can be used together to orchestrate a decoupled, real-time, distributed and reactive request/response system at massive scale. Multiple data pipelines can perform complex operations for each message in parallel at high volumes with low latencies. The final result will be in line with the initiating call. The architecture gains are immense. They allow the requesting system to receive a response without the need for direct integration with the data pipeline(s) that messages must go through. By utilizing Apache Kafka and Apache Accumulo, these gains are sustained at scale and allow complex operations of different messages to be applied to each response in real time.
This document discusses Kubernetes usage at VMware SAAS. It covers dynamic provisioning of applications on Kubernetes, monitoring tools used like DataDog and Log Insight, and best practices for upgrading Kubernetes clusters. Key points include using stateless applications where possible, service discovery using Kubernetes services, dynamic provisioning using an onboarding service, and performing rolling upgrades for stateful applications to minimize downtime.
This is a talk from the Austin OpenStack summit. It demonstrates how a resilient, elastic and load-balanced cluster can be deployed using senlin, heat, ceilometer, lbaas v2, nova.
You’ve heard all of the hype, but how can SMACK work for you? In this all-star lineup, you will learn how to create a reactive, scaling, resilient and performant data processing powerhouse. Bringing Akka, Kafka and Mesos together provides a foundation to develop and operate an elastically scalable actor system. We will go through the basics of Akka, Kafka and Mesos and then deep dive into putting them together in an end-to-end (and back again) distributed transaction. Distributed transactions mean producers waiting for one or more consumers to respond. We'll also go through automated ways to induce failures in these systems (using LinkedIn's Simoorg) and trace them from start to stop through each component (using Twitter's Zipkin). Finally, you will see how Apache Cassandra and Spark can be combined to add the highly scalable storage and data analysis needed in fast data pipelines. With these technologies as a foundation, you have the assurance that scale is never a problem and uptime is the default.
An experience report on the OpenStack deployment at Suning.com, a large online retailer in China. The talk presents the challenges and opportunities in orchestrating enterprise workloads using Heat.
This document discusses using ZooKeeper to automatically handle Redis failover. ZooKeeper is an open-source tool that provides primitives for building distributed applications and handles tasks like leader election and quorum management. The presenter describes how his redis_failover Ruby gem uses ZooKeeper to monitor Redis servers, detect failures, and automatically inform clients so they reconnect to the new master, preventing downtime during a failover. Several companies already use this approach with redis_failover to make their Redis infrastructure more robust and fault-tolerant.
The document discusses Hadoop on Mesos, beginning with a short history of distributed computing. It describes how Mesos provides an operating system for clusters that allows applications like Hadoop to run as distributed frameworks. The document outlines challenges in running Hadoop on Mesos and how these were addressed, including using Mesos schedulers and reservations. It also presents a case study of how Airbnb migrated its Hadoop infrastructure from Amazon EMR to run on Mesos, improving availability, performance, and customer satisfaction.
Apache Hadoop YARN is a modern resource-management platform that handles resource scheduling, isolation and multi-tenancy for a variety of data processing engines that can co-exist and share a single data-center in a cost-effective manner. In the first half of the talk, we are going to give a brief look into some of the big efforts cooking in the Apache Hadoop YARN community. We will then dig deeper into one of the efforts - supporting Docker runtime in YARN. Docker is an application container engine that enables developers and sysadmins to build, deploy and run containerized applications. In this half, we'll discuss container runtimes in YARN, with a focus on using the DockerContainerRuntime to run various docker applications under YARN. Support for container runtimes (including the docker container runtime) was recently added to the Linux Container Executor (YARN-3611 and its sub-tasks). We’ll walk through various aspects of running docker containers under YARN - resource isolation, some security aspects (for example container capabilities, privileged containers, user namespaces) and other work in progress features like image localization and support for different networking modes. Speakers: Vinod Kumar Vavilapalli is the Hadoop YARN and MapReduce guy at Hortonworks. He is a long-term Hadoop contributor at Apache, Hadoop committer and a member of the Apache Hadoop PMC. He has a Bachelor's degree from Indian Institute of Technology Roorkee in Computer Science and Engineering. He has been working on Hadoop for nearly 9 years and he still has fun doing it. Straight out of college, he joined the Hadoop team at Yahoo! Bangalore, before Hortonworks happened. He is passionate about using computers to change the world for better, bit by bit. Sidharta Seethana is a software engineer at Hortonworks. He works on the YARN team, focussing on bringing new kinds of workloads to YARN. Prior to joining Hortonworks, Sidharta spent 10 years at Yahoo! Inc., working on a variety of large scale distributed systems for core platforms/web services, search and marketplace properties, developer network and personalization.
(Nina Hanzlikova, Zalando) Kafka Summit SF 2018 My team at Zalando fell in love with KStreams and their programming model straight out of the gate. However, as a small team of developers, building out and supporting our infrastructure while still trying to deliver solutions for our business has not always resulted in a smooth journey. Can a small team of a couple of developers run their own Kafka infrastructure confidently and still spend most of their time developing code? In this talk, we will dive into some of the problems we experienced while running Kafka brokers and Kafka Streams applications, as well as the consultations we had with other teams around this matter. We will outline some of the pragmatic decisions we made regarding backups, monitoring and operations to minimize our time spent administering our Kafka brokers and various stream applications.
Mario-Leander Reimer presented on building cloud-native .NET microservices with Kubernetes. He discussed key principles of cloud native applications including designing for distribution, performance, automation, resiliency and elasticity. He also covered containerization with Docker, composing services with Kubernetes and common concepts like deployments, services and probes. Reimer provided examples of Dockerfiles, Kubernetes definitions and using tools like Steeltoe and docker-compose to develop cloud native applications.
In 2020, two significant IT platforms converge. On the one hand, Spark 3 becomes available with the support of Kubernetes as a scheduler
Kubernetes is designed to be an extensible system. But what is the vision for Kubernetes Extensibility? Do you know the difference between webhooks and cloud providers, or between CRI, CSI, and CNI? In this talk we will explore what extension points exist, how they have evolved, and how to use them to make the system do new and interesting things. We’ll give our vision for how they will probably evolve in the future, and talk about the sorts of things we expect the broader Kubernetes ecosystem to build with them.
DevOps, continuous delivery and modern architectural trends can incredibly speed up the software development process. Big Data applications cannot be an exception and need to keep the same pace.
In this one-day workshop, we will introduce Spark at a high level. Writing Spark applications is fundamentally different from writing MapReduce jobs, so no prior Hadoop experience is needed. You will learn how to interact with Spark on the command line and conduct rapid in-memory data analyses. We will then work on writing Spark applications to perform large cluster-based analyses including SQL-like aggregations, machine learning applications, and graph algorithms. The course will be conducted in Python using PySpark.
This document discusses Red Hat's OpenStack platform. It provides an overview of OpenStack and what it is used for. It then discusses why Red Hat is well suited to provide an OpenStack platform, including that it is optimized to run on Red Hat Enterprise Linux and benefits from Red Hat's engineering resources and long term support. Key features of Red Hat's OpenStack platform are also summarized, such as performance, availability, security and manageability.
Kubernetes is exploding in popularity right now and has all the buzz and cargo-culting that Docker enjoyed just a few years ago. But what even is Kubernetes? How do I run my PHP apps in it? Should I run my PHP apps in it?
Linux-Stammtisch, July 2019, Munich: Talk by Mario-Leander Reimer (@LeanderReimer, Principal Software Architect at QAware) === Please download slides if blurred! === Abstract: Only a few years ago the move towards microservice architecture was the first big disruption in software engineering: instead of running monoliths, systems were now built, composed and run as autonomous services. But this came at the price of added development and infrastructure complexity. Serverless and FaaS seem to be the next disruption; they are the logical evolution, trying to address some of the inherent technology complexity we currently face when building cloud native apps. FaaS frameworks are currently popping up like mushrooms: Knative, Kubeless, OpenFn, Fission, OpenFaaS and OpenWhisk, to name just a few. But which one of these is safe to pick and use in your next project? Let's find out. This session will start off by briefly explaining the essence of Serverless application architecture. We will then define a criteria catalog for FaaS frameworks and continue by comparing and showcasing the most promising ones.
This document discusses Apache Spark, an open-source cluster computing framework. It provides an overview of Spark, including its main concepts like RDDs (Resilient Distributed Datasets) and transformations. Spark is presented as a faster alternative to Hadoop for iterative jobs and machine learning through its ability to keep data in-memory. Example code is shown for Spark's programming model in Scala and Python. The document concludes that Spark offers a rich API to make data analytics fast, achieving speedups of up to 100x over Hadoop in real applications.
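The idea of lazy transformations on an RDD can be sketched without a cluster. The class below is a toy, single-process stand-in written for illustration only; it is not Spark's API, and the name `ToyRDD` is hypothetical. It records `map` and `filter` calls as lineage and only computes when the `collect` action is called, which is the property that lets a system like Spark rebuild lost data from its lineage.

```python
from typing import Any, Callable, Iterable, List, Tuple


class ToyRDD:
    """A toy, single-process stand-in for an RDD (not Spark's API)."""

    def __init__(self, source: Iterable[Any],
                 lineage: Tuple[Tuple[str, Callable], ...] = ()):
        self._source = list(source)
        self._lineage = lineage  # recorded transformations, not yet applied

    def map(self, fn: Callable[[Any], Any]) -> "ToyRDD":
        # Transformation: only extends the lineage, computes nothing.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, pred: Callable[[Any], bool]) -> "ToyRDD":
        # Transformation: only extends the lineage, computes nothing.
        return ToyRDD(self._source, self._lineage + (("filter", pred),))

    def collect(self) -> List[Any]:
        """Action: replay the recorded lineage over the source data."""
        data = self._source
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data


numbers = ToyRDD(range(1, 6))
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(even_squares.collect())  # prints [4, 16]
```

Because each transformation returns a new object, `numbers` is unchanged and can be reused as the starting point for other chains.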
Docker containers provide an ideal foundation for running Kafka-as-a-Service on-premises or in the public cloud. However, using Docker containers in production environments for Big Data workloads using Kafka poses some challenges – including container management, scheduling, network configuration and security, and performance. In this session at Kafka Summit in August 2017, Nanda Vijyaydev of BlueData shared lessons learned from implementing Kafka-as-a-Service with Docker containers. https://kafka-summit.org/sessions/kafka-service-docker-containers
Combining the awesomeness of Apache Spark and Azure DocumentDB together for real-time machine learning on globally-distributed data.
Once you're familiar with Kubernetes and Helm, and still need more integration, what are your options? Can Java integrate well with Kubernetes?
Google and Intel speak on NFV and SFC service delivery. The slides are as presented at the meetup "Out of Box Network Developers", sponsored by Intel Networking Developer Zone. Agenda of the slides: how DPDK, RDT and gRPC fit into SDI/SDN, NFV and OpenStack; key platform requirements for SDI; SDI platform ingredients: DPDK, Intel® RDT; gRPC service framework; Intel® RDT and gRPC service framework.
The overall evolution towards microservices has caused a lot of IT leaders to radically rethink architectures and platforms. One can hardly keep up with the rapid onslaught of new distributed technologies. The same people who just yesterday asked "how can we deploy Docker containers?" are now asking "how can we operate Kubernetes-as-a-Service on-premise?", and are about to start asking "how can we operate the open source frameworks of our choice, such as Spark, TensorFlow, HDFS, and more, as a service across hybrid clouds?". This session will discuss: Challenges of orchestrating and operating
Kubernetes is an open-source containerization framework that makes it easy to manage applications in isolated environments at scale. In Apache Spark 2.3, Spark introduced support for native integration with Kubernetes. Palantir has been deeply involved with the development of Spark’s Kubernetes integration from the beginning, and our largest production deployment now runs an average of ~5 million Spark pods per day, as part of tens of thousands of Spark applications. Over the course of our adventures in migrating deployments from YARN to Kubernetes, we have overcome a number of performance, cost, & reliability hurdles: differences in shuffle performance due to smaller filesystem caches in containers; Kubernetes CPU limits causing inadvertent throttling of containers that run many Java threads; and lack of support for dynamic allocation leading to resource wastage. We intend to briefly describe our story of developing & deploying Spark-on-Kubernetes, as well as lessons learned from deploying containerized Spark applications in production. We will also describe our recently open-sourced extension (https://github.com/palantir/k8s-spark-scheduler) to the Kubernetes scheduler to better support Spark workloads & facilitate Spark-aware cluster autoscaling; our limited implementation of dynamic allocation on Kubernetes; and ongoing work that is required to support dynamic resource management & stable performance at scale (i.e., our work with the community on a pluggable external shuffle service API). Our hope is that our lessons learned and ongoing work will help other community members who want to use Spark on Kubernetes for their own workloads.
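The throttling issue mentioned above can be made concrete with a little arithmetic. Kubernetes CPU limits are enforced through the Linux CFS bandwidth controller, which grants a container `limit × period` of CPU time per scheduling period (100 ms by default). The sketch below is a simplified model, not Palantir's analysis: it assumes every thread is CPU-bound for the whole period and that the node has enough free cores to run them all at once. The function name is hypothetical.

```python
def throttled_fraction(cpu_limit: float, runnable_threads: int,
                       period_ms: float = 100.0) -> float:
    """Fraction of each CFS period a container spends throttled.

    Simplifying assumptions: every thread is CPU-bound for the whole
    period and the node has at least `runnable_threads` free cores.
    """
    quota_ms = cpu_limit * period_ms          # CPU time allowed per period
    demand_ms = runnable_threads * period_ms  # CPU time the threads would use
    if demand_ms <= quota_ms:
        return 0.0
    # The quota is consumed `runnable_threads` times faster than wall-clock time.
    running_wall_ms = quota_ms / runnable_threads
    return 1.0 - running_wall_ms / period_ms


# A 4-CPU limit with 16 busy threads: 400 ms of quota is used up in 25 ms
# of wall-clock time, so the container is paused for the remaining 75 ms.
print(throttled_fraction(cpu_limit=4, runnable_threads=16))  # prints 0.75
```

Under this model, a JVM that briefly runs many threads in parallel can stall for most of each period even though its average CPU use looks modest.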
betterCode, 24.06.2021, Online: Workshop by Mario-Leander Reimer (@LeanderReimer, Principal Software Architect at QAware) & Markus Zimmermann (@markus_zm, Senior Software Engineer at QAware) === Please download slides in case they are blurred! === Use the right tool and language for the job! Especially in the DevOps tooling area, Go has established itself as a simple, reliable and efficient programming language. In this workshop, we learned about suitable application areas and implemented quite a few tools.
MongoDB Ops Manager is an enterprise-grade end-to-end database management, monitoring, and backup solution. Kubernetes has clearly won the orchestration-platform "wars". In this session we'll take a deep dive on how you can leverage both these technologies to host your MongoDB deployments within your Kubernetes infrastructure whether that's OpenShift, PKS, Azure AKS, or just upstream. This talk will review the core technologies, such as containers, Kubernetes, and MongoDB Ops Manager. You'll also have a chance to see real-live demos of MongoDB running on Kubernetes and managed with MongoDB Ops Manager with the MongoDB Enterprise Kubernetes Operator.
Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. It groups containers that make up an application into logical units for easy management and discovery. Kubernetes services expose these units to enable dynamic load balancing while maintaining session affinity. It also provides self-healing capabilities by restarting containers that fail, replacing them, and killing containers that don't respond to their health check.
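The self-healing behaviour described above can be pictured as a reconciliation pass over container state. The following is a toy model for illustration, not Kubernetes' actual implementation: the `Container` type, the `reconcile` function, and the action strings are all hypothetical, and the threshold of three consecutive failed checks is an assumed value.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Container:
    name: str
    running: bool
    failed_checks: int  # consecutive failed health checks


def reconcile(containers: List[Container], failure_threshold: int = 3) -> List[str]:
    """Return the actions one pass of a toy self-healing loop would take."""
    actions = []
    for c in containers:
        if not c.running:
            # A container that has exited is restarted.
            actions.append(f"restart {c.name}")
        elif c.failed_checks >= failure_threshold:
            # A running but unresponsive container is killed; the next
            # pass would then see it as not running and restart it.
            actions.append(f"kill {c.name}")
    return actions


state = [
    Container("web", running=True, failed_checks=0),
    Container("worker", running=False, failed_checks=0),
    Container("cache", running=True, failed_checks=3),
]
print(reconcile(state))  # prints ['restart worker', 'kill cache']
```

Returning a list of actions instead of performing them keeps the decision logic easy to test separately from whatever would carry the actions out.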