{"componentChunkName":"component---src-templates-blog-post-js","path":"/post/opensource-data-lakes-for-the-hybrid-cloud-designing-an-oss-datalake","result":{"data":{"headerImage":{"childImageSharp":{"fluid":{"aspectRatio":3.3992537313432836,"src":"/static/b72d38f0a9a131a445c0798c8f11b233/85c19/blog-post-intro.png","srcSet":"/static/b72d38f0a9a131a445c0798c8f11b233/c95ef/blog-post-intro.png 911w,\n/static/b72d38f0a9a131a445c0798c8f11b233/6d938/blog-post-intro.png 1822w,\n/static/b72d38f0a9a131a445c0798c8f11b233/85c19/blog-post-intro.png 3635w","srcWebp":"/static/b72d38f0a9a131a445c0798c8f11b233/bbedc/blog-post-intro.webp","srcSetWebp":"/static/b72d38f0a9a131a445c0798c8f11b233/8f106/blog-post-intro.webp 911w,\n/static/b72d38f0a9a131a445c0798c8f11b233/4b1a2/blog-post-intro.webp 1822w,\n/static/b72d38f0a9a131a445c0798c8f11b233/bbedc/blog-post-intro.webp 3635w","sizes":"(max-width: 3635px) 100vw, 3635px"}}},"relatedPosts":{"nodes":[{"fields":{"slug":"/blog-aws-kubernetes/"},"frontmatter":{"url":"aws-kubernetes/part-1","title":"The State of Kubernetes in AWS: Persistent Data Storage, Application Engineering and More","description":"When it comes to orchestrating containerized workloads, there are several options in the market, with [Kubernetes](https://kubernetes.io) being the most adopted and sought-after solution.","tags":["AWS","Kubernetes"],"date":"2022-12-20T16:44:23.317Z","image":{"childImageSharp":{"fluid":{"aspectRatio":1.5,"src":"/static/eb8228db77951dd583fd607fb3b3d3bd/836e2/kubernetes-and-aws.jpg","srcSet":"/static/eb8228db77951dd583fd607fb3b3d3bd/6e81a/kubernetes-and-aws.jpg 120w,\n/static/eb8228db77951dd583fd607fb3b3d3bd/fbe0e/kubernetes-and-aws.jpg 240w,\n/static/eb8228db77951dd583fd607fb3b3d3bd/836e2/kubernetes-and-aws.jpg 480w,\n/static/eb8228db77951dd583fd607fb3b3d3bd/94285/kubernetes-and-aws.jpg 720w,\n/static/eb8228db77951dd583fd607fb3b3d3bd/b1cc5/kubernetes-and-aws.jpg 960w,\n/static/eb8228db77951dd583fd607fb3b3d3bd/097fa/kubernetes-and-aws.jpg 
1920w","srcWebp":"/static/eb8228db77951dd583fd607fb3b3d3bd/35871/kubernetes-and-aws.webp","srcSetWebp":"/static/eb8228db77951dd583fd607fb3b3d3bd/83552/kubernetes-and-aws.webp 120w,\n/static/eb8228db77951dd583fd607fb3b3d3bd/2b5a3/kubernetes-and-aws.webp 240w,\n/static/eb8228db77951dd583fd607fb3b3d3bd/35871/kubernetes-and-aws.webp 480w,\n/static/eb8228db77951dd583fd607fb3b3d3bd/9754a/kubernetes-and-aws.webp 720w,\n/static/eb8228db77951dd583fd607fb3b3d3bd/fcc10/kubernetes-and-aws.webp 960w,\n/static/eb8228db77951dd583fd607fb3b3d3bd/30cf3/kubernetes-and-aws.webp 1920w","sizes":"(max-width: 480px) 100vw, 480px"}}}}},{"fields":{"slug":"/kubernetes-node-management/"},"frontmatter":{"url":"karpenter","title":"Karpenter - A New Way to Manage Kubernetes Node Groups","description":"One of the most common discussions that happen when adopting Kubernetes is around autoscaling. You can autoscale your workloads horizontally or vertically, but the main challenge has always been the nodes.\n","tags":["Kubernetes","AWS"],"date":"2022-01-20T00:00:00.000Z","image":{"childImageSharp":{"fluid":{"aspectRatio":1.9047619047619047,"src":"/static/e0d4e328e64d982af16b722b7165263b/b460a/aws-karpenter.png","srcSet":"/static/e0d4e328e64d982af16b722b7165263b/d966b/aws-karpenter.png 120w,\n/static/e0d4e328e64d982af16b722b7165263b/67196/aws-karpenter.png 240w,\n/static/e0d4e328e64d982af16b722b7165263b/b460a/aws-karpenter.png 480w,\n/static/e0d4e328e64d982af16b722b7165263b/9a8d7/aws-karpenter.png 720w,\n/static/e0d4e328e64d982af16b722b7165263b/6e898/aws-karpenter.png 960w,\n/static/e0d4e328e64d982af16b722b7165263b/6050d/aws-karpenter.png 1200w","srcWebp":"/static/e0d4e328e64d982af16b722b7165263b/35871/aws-karpenter.webp","srcSetWebp":"/static/e0d4e328e64d982af16b722b7165263b/83552/aws-karpenter.webp 120w,\n/static/e0d4e328e64d982af16b722b7165263b/2b5a3/aws-karpenter.webp 240w,\n/static/e0d4e328e64d982af16b722b7165263b/35871/aws-karpenter.webp 
480w,\n/static/e0d4e328e64d982af16b722b7165263b/9754a/aws-karpenter.webp 720w,\n/static/e0d4e328e64d982af16b722b7165263b/fcc10/aws-karpenter.webp 960w,\n/static/e0d4e328e64d982af16b722b7165263b/9000d/aws-karpenter.webp 1200w","sizes":"(max-width: 480px) 100vw, 480px"}}}}},{"fields":{"slug":"/aws-kubernetes-part-2/"},"frontmatter":{"url":"aws-kubernetes/part-2","title":"The Current State of Kubernetes on AWS: Kubernetes Security, Scalability, Performance Engineering & More, Part 2","description":"In the first part of our two-part post on the current state of Kubernetes in AWS, we discussed how Kubernetes can help you handle stateful workloads with persistent data storage and standardize your application and data engineering approaches.","tags":["AWS","Kubernetes"],"date":"2021-12-09T08:30:41.061Z","image":{"childImageSharp":{"fluid":{"aspectRatio":1.5,"src":"/static/dddeb31efb8e1c04a57b32e10aa14653/836e2/kubernetes-security.jpg","srcSet":"/static/dddeb31efb8e1c04a57b32e10aa14653/6e81a/kubernetes-security.jpg 120w,\n/static/dddeb31efb8e1c04a57b32e10aa14653/fbe0e/kubernetes-security.jpg 240w,\n/static/dddeb31efb8e1c04a57b32e10aa14653/836e2/kubernetes-security.jpg 480w,\n/static/dddeb31efb8e1c04a57b32e10aa14653/94285/kubernetes-security.jpg 720w,\n/static/dddeb31efb8e1c04a57b32e10aa14653/b1cc5/kubernetes-security.jpg 960w,\n/static/dddeb31efb8e1c04a57b32e10aa14653/097fa/kubernetes-security.jpg 1920w","srcWebp":"/static/dddeb31efb8e1c04a57b32e10aa14653/35871/kubernetes-security.webp","srcSetWebp":"/static/dddeb31efb8e1c04a57b32e10aa14653/83552/kubernetes-security.webp 120w,\n/static/dddeb31efb8e1c04a57b32e10aa14653/2b5a3/kubernetes-security.webp 240w,\n/static/dddeb31efb8e1c04a57b32e10aa14653/35871/kubernetes-security.webp 480w,\n/static/dddeb31efb8e1c04a57b32e10aa14653/9754a/kubernetes-security.webp 720w,\n/static/dddeb31efb8e1c04a57b32e10aa14653/fcc10/kubernetes-security.webp 960w,\n/static/dddeb31efb8e1c04a57b32e10aa14653/30cf3/kubernetes-security.webp 
1920w","sizes":"(max-width: 480px) 100vw, 480px"}}}}},{"fields":{"slug":"/gitops-why-is-it-relevant-now/"},"frontmatter":{"url":"gitops-why-is-it-relevant-now","title":"GitOps - Why is it Relevant Now?","description":"There seems to have been a lot of talk about GitOps just recently. This impression is certainly reinforced by the sessions and booths during KubeCon San Diego late 2019. Regardless of the discipline or services, GitOps was the keyword that was constantly repeated.","tags":["Kubernetes"],"date":"2020-01-21T17:00:00.000Z","image":{"childImageSharp":{"fluid":{"aspectRatio":1.3333333333333333,"src":"/static/602b397bd0ef200acbf6007f11c2f3f5/836e2/shutterstock_1019460151-1-.jpg","srcSet":"/static/602b397bd0ef200acbf6007f11c2f3f5/6e81a/shutterstock_1019460151-1-.jpg 120w,\n/static/602b397bd0ef200acbf6007f11c2f3f5/fbe0e/shutterstock_1019460151-1-.jpg 240w,\n/static/602b397bd0ef200acbf6007f11c2f3f5/836e2/shutterstock_1019460151-1-.jpg 480w,\n/static/602b397bd0ef200acbf6007f11c2f3f5/94285/shutterstock_1019460151-1-.jpg 720w,\n/static/602b397bd0ef200acbf6007f11c2f3f5/b1cc5/shutterstock_1019460151-1-.jpg 960w,\n/static/602b397bd0ef200acbf6007f11c2f3f5/405f0/shutterstock_1019460151-1-.jpg 4856w","srcWebp":"/static/602b397bd0ef200acbf6007f11c2f3f5/35871/shutterstock_1019460151-1-.webp","srcSetWebp":"/static/602b397bd0ef200acbf6007f11c2f3f5/83552/shutterstock_1019460151-1-.webp 120w,\n/static/602b397bd0ef200acbf6007f11c2f3f5/2b5a3/shutterstock_1019460151-1-.webp 240w,\n/static/602b397bd0ef200acbf6007f11c2f3f5/35871/shutterstock_1019460151-1-.webp 480w,\n/static/602b397bd0ef200acbf6007f11c2f3f5/9754a/shutterstock_1019460151-1-.webp 720w,\n/static/602b397bd0ef200acbf6007f11c2f3f5/fcc10/shutterstock_1019460151-1-.webp 960w,\n/static/602b397bd0ef200acbf6007f11c2f3f5/cdeed/shutterstock_1019460151-1-.webp 4856w","sizes":"(max-width: 480px) 100vw, 
480px"}}}}},{"fields":{"slug":"/setting-up-a-multi-tenant-aws-eks-cluster/"},"frontmatter":{"url":"setting-up-a-multi-tenant-aws-eks-cluster","title":"Setting up a Multi-tenant Amazon EKS cluster: a few things to consider","description":"MyOps prides itself in heavy use of cloud-native technology, and Kubernetes is often the primary platform of choice to run containerized workloads. In this blog we discuss using name space, network policies, Integrating AWS IAM to EKS cluster/workloads, isolation techniques and much more.","tags":["Kubernetes","AWS"],"date":"2019-12-12T17:00:00.000Z","image":{"childImageSharp":{"fluid":{"aspectRatio":1.7647058823529411,"src":"/static/242e9209b664bee2a7dc6b090d3a07e1/836e2/setting-up-multi-tenant-aws-eks-cluster.jpg","srcSet":"/static/242e9209b664bee2a7dc6b090d3a07e1/6e81a/setting-up-multi-tenant-aws-eks-cluster.jpg 120w,\n/static/242e9209b664bee2a7dc6b090d3a07e1/fbe0e/setting-up-multi-tenant-aws-eks-cluster.jpg 240w,\n/static/242e9209b664bee2a7dc6b090d3a07e1/836e2/setting-up-multi-tenant-aws-eks-cluster.jpg 480w,\n/static/242e9209b664bee2a7dc6b090d3a07e1/94285/setting-up-multi-tenant-aws-eks-cluster.jpg 720w,\n/static/242e9209b664bee2a7dc6b090d3a07e1/b1cc5/setting-up-multi-tenant-aws-eks-cluster.jpg 960w,\n/static/242e9209b664bee2a7dc6b090d3a07e1/e147c/setting-up-multi-tenant-aws-eks-cluster.jpg 5760w","srcWebp":"/static/242e9209b664bee2a7dc6b090d3a07e1/35871/setting-up-multi-tenant-aws-eks-cluster.webp","srcSetWebp":"/static/242e9209b664bee2a7dc6b090d3a07e1/83552/setting-up-multi-tenant-aws-eks-cluster.webp 120w,\n/static/242e9209b664bee2a7dc6b090d3a07e1/2b5a3/setting-up-multi-tenant-aws-eks-cluster.webp 240w,\n/static/242e9209b664bee2a7dc6b090d3a07e1/35871/setting-up-multi-tenant-aws-eks-cluster.webp 480w,\n/static/242e9209b664bee2a7dc6b090d3a07e1/9754a/setting-up-multi-tenant-aws-eks-cluster.webp 720w,\n/static/242e9209b664bee2a7dc6b090d3a07e1/fcc10/setting-up-multi-tenant-aws-eks-cluster.webp 
960w,\n/static/242e9209b664bee2a7dc6b090d3a07e1/b4d70/setting-up-multi-tenant-aws-eks-cluster.webp 5760w","sizes":"(max-width: 480px) 100vw, 480px"}}}}},{"fields":{"slug":"/walkthrough-ecs-local/"},"frontmatter":{"url":"walkthrough-ecs-local","title":"Walkthrough - ECS Local: Bringing ECS to your local environment","description":"As someone who works with AWS on a day-to-day basis, It's important to stay up to date with all the changes and new features of the different services on the platform. That's how one recent announcement caught my eye - The new capability of local testing of ECS.","tags":["Kubernetes","AWS"],"date":"2019-09-17T16:00:00.000Z","image":{"childImageSharp":{"fluid":{"aspectRatio":2.142857142857143,"src":"/static/12224681f2fd40bf0749423e29cf8d0c/836e2/technology-education-information-handover.jpg","srcSet":"/static/12224681f2fd40bf0749423e29cf8d0c/6e81a/technology-education-information-handover.jpg 120w,\n/static/12224681f2fd40bf0749423e29cf8d0c/fbe0e/technology-education-information-handover.jpg 240w,\n/static/12224681f2fd40bf0749423e29cf8d0c/836e2/technology-education-information-handover.jpg 480w,\n/static/12224681f2fd40bf0749423e29cf8d0c/94285/technology-education-information-handover.jpg 720w,\n/static/12224681f2fd40bf0749423e29cf8d0c/b1cc5/technology-education-information-handover.jpg 960w,\n/static/12224681f2fd40bf0749423e29cf8d0c/0ff54/technology-education-information-handover.jpg 1200w","srcWebp":"/static/12224681f2fd40bf0749423e29cf8d0c/35871/technology-education-information-handover.webp","srcSetWebp":"/static/12224681f2fd40bf0749423e29cf8d0c/83552/technology-education-information-handover.webp 120w,\n/static/12224681f2fd40bf0749423e29cf8d0c/2b5a3/technology-education-information-handover.webp 240w,\n/static/12224681f2fd40bf0749423e29cf8d0c/35871/technology-education-information-handover.webp 480w,\n/static/12224681f2fd40bf0749423e29cf8d0c/9754a/technology-education-information-handover.webp 
720w,\n/static/12224681f2fd40bf0749423e29cf8d0c/fcc10/technology-education-information-handover.webp 960w,\n/static/12224681f2fd40bf0749423e29cf8d0c/9000d/technology-education-information-handover.webp 1200w","sizes":"(max-width: 480px) 100vw, 480px"}}}}},{"fields":{"slug":"/opensource-data-lakes-for-the-hybrid-cloud-designing-an-oss-datalake/"},"frontmatter":{"url":"opensource-data-lakes-for-the-hybrid-cloud-designing-an-oss-datalake","title":"OpenSource Data Lake for the Hybrid Cloud - Part 2: Designing an OSS DataLake","description":"In part 1 of this series, we answered the question of WHY Open Source components are often an attractive option when building a data lake of any significant size. In this second installment, we describe HOW to cost-effectively build a data lake out of Open Source components.","tags":["Kubernetes","Big Data"],"date":"2019-08-27T16:00:00.000Z","image":{"childImageSharp":{"fluid":{"aspectRatio":1.6,"src":"/static/107087aec2d3327919bcfb2ab38201da/836e2/datalake-p2.jpg","srcSet":"/static/107087aec2d3327919bcfb2ab38201da/6e81a/datalake-p2.jpg 120w,\n/static/107087aec2d3327919bcfb2ab38201da/fbe0e/datalake-p2.jpg 240w,\n/static/107087aec2d3327919bcfb2ab38201da/836e2/datalake-p2.jpg 480w,\n/static/107087aec2d3327919bcfb2ab38201da/94285/datalake-p2.jpg 720w,\n/static/107087aec2d3327919bcfb2ab38201da/b1cc5/datalake-p2.jpg 960w,\n/static/107087aec2d3327919bcfb2ab38201da/32638/datalake-p2.jpg 6399w","srcWebp":"/static/107087aec2d3327919bcfb2ab38201da/35871/datalake-p2.webp","srcSetWebp":"/static/107087aec2d3327919bcfb2ab38201da/83552/datalake-p2.webp 120w,\n/static/107087aec2d3327919bcfb2ab38201da/2b5a3/datalake-p2.webp 240w,\n/static/107087aec2d3327919bcfb2ab38201da/35871/datalake-p2.webp 480w,\n/static/107087aec2d3327919bcfb2ab38201da/9754a/datalake-p2.webp 720w,\n/static/107087aec2d3327919bcfb2ab38201da/fcc10/datalake-p2.webp 960w,\n/static/107087aec2d3327919bcfb2ab38201da/85285/datalake-p2.webp 6399w","sizes":"(max-width: 480px) 100vw, 
480px"}}}}},{"fields":{"slug":"/opensource-data-lake-for-the-hybrid-cloud/"},"frontmatter":{"url":"opensource-data-lake-for-the-hybrid-cloud","title":"OpenSource Data Lake for the Hybrid Cloud - Part 1","description":"Data lakes have become the de-facto standard for Enterprises and Corporations looking to take advantage of their existing Data.\n","tags":["Kubernetes","Big Data"],"date":"2019-06-17T16:00:00.000Z","image":{"childImageSharp":{"fluid":{"aspectRatio":1.5,"src":"/static/8640602d41c9ebdbd88a4281c37bcae9/836e2/myops-data-lake-blog-profile-1-.jpg","srcSet":"/static/8640602d41c9ebdbd88a4281c37bcae9/6e81a/myops-data-lake-blog-profile-1-.jpg 120w,\n/static/8640602d41c9ebdbd88a4281c37bcae9/fbe0e/myops-data-lake-blog-profile-1-.jpg 240w,\n/static/8640602d41c9ebdbd88a4281c37bcae9/836e2/myops-data-lake-blog-profile-1-.jpg 480w,\n/static/8640602d41c9ebdbd88a4281c37bcae9/94285/myops-data-lake-blog-profile-1-.jpg 720w,\n/static/8640602d41c9ebdbd88a4281c37bcae9/b1cc5/myops-data-lake-blog-profile-1-.jpg 960w,\n/static/8640602d41c9ebdbd88a4281c37bcae9/724c8/myops-data-lake-blog-profile-1-.jpg 1000w","srcWebp":"/static/8640602d41c9ebdbd88a4281c37bcae9/35871/myops-data-lake-blog-profile-1-.webp","srcSetWebp":"/static/8640602d41c9ebdbd88a4281c37bcae9/83552/myops-data-lake-blog-profile-1-.webp 120w,\n/static/8640602d41c9ebdbd88a4281c37bcae9/2b5a3/myops-data-lake-blog-profile-1-.webp 240w,\n/static/8640602d41c9ebdbd88a4281c37bcae9/35871/myops-data-lake-blog-profile-1-.webp 480w,\n/static/8640602d41c9ebdbd88a4281c37bcae9/9754a/myops-data-lake-blog-profile-1-.webp 720w,\n/static/8640602d41c9ebdbd88a4281c37bcae9/fcc10/myops-data-lake-blog-profile-1-.webp 960w,\n/static/8640602d41c9ebdbd88a4281c37bcae9/36ebb/myops-data-lake-blog-profile-1-.webp 1000w","sizes":"(max-width: 480px) 100vw, 480px"}}}}},{"fields":{"slug":"/top-10-misconceptions-around-migrating-hadoop/"},"frontmatter":{"url":"top-10-misconceptions-around-migrating-hadoop","title":"Top 10 Misconceptions around 
Migrating Hadoop to the Cloud","description":"Lots of mid-size companies and Enterprises want to leverage the Cloud for their Data Processing requirements. But in reality migrating a production, Petabyte scale, multi-component Data Processing pipeline from on-prem to the Cloud can be a nightmare.","tags":["Big Data"],"date":"2018-11-26T17:00:00.000Z","image":{"childImageSharp":{"fluid":{"aspectRatio":1.5,"src":"/static/d5db71f736f36c26e2d3007f65b0dd52/836e2/cloud-elephant.jpg","srcSet":"/static/d5db71f736f36c26e2d3007f65b0dd52/6e81a/cloud-elephant.jpg 120w,\n/static/d5db71f736f36c26e2d3007f65b0dd52/fbe0e/cloud-elephant.jpg 240w,\n/static/d5db71f736f36c26e2d3007f65b0dd52/836e2/cloud-elephant.jpg 480w,\n/static/d5db71f736f36c26e2d3007f65b0dd52/94285/cloud-elephant.jpg 720w,\n/static/d5db71f736f36c26e2d3007f65b0dd52/b1cc5/cloud-elephant.jpg 960w,\n/static/d5db71f736f36c26e2d3007f65b0dd52/0ff54/cloud-elephant.jpg 1200w","srcWebp":"/static/d5db71f736f36c26e2d3007f65b0dd52/35871/cloud-elephant.webp","srcSetWebp":"/static/d5db71f736f36c26e2d3007f65b0dd52/83552/cloud-elephant.webp 120w,\n/static/d5db71f736f36c26e2d3007f65b0dd52/2b5a3/cloud-elephant.webp 240w,\n/static/d5db71f736f36c26e2d3007f65b0dd52/35871/cloud-elephant.webp 480w,\n/static/d5db71f736f36c26e2d3007f65b0dd52/9754a/cloud-elephant.webp 720w,\n/static/d5db71f736f36c26e2d3007f65b0dd52/fcc10/cloud-elephant.webp 960w,\n/static/d5db71f736f36c26e2d3007f65b0dd52/9000d/cloud-elephant.webp 1200w","sizes":"(max-width: 480px) 100vw, 480px"}}}}},{"fields":{"slug":"/securing-kubernetes-secrets-how-to-efficiently-secure-access-to-etcd-and-protect-your-secrets/"},"frontmatter":{"url":"securing-kubernetes-secrets-how-to-efficiently-secure-access-to-etcd-and-protect-your-secrets","title":"Securing Kubernetes secrets: How to efficiently secure access to etcd and protect your secrets","description":"Etcd is a distributed, consistent and highly-available key value store used as the Kubernetes backing store for all cluster data, 
making it a core component of every K8s deployment. Due to its central role etcd may contain sensitive information related to access of the deployed services and their associated components,","tags":["Kubernetes","Security"],"date":"2018-06-20T16:00:00.000Z","image":{"childImageSharp":{"fluid":{"aspectRatio":0.7407407407407407,"src":"/static/62bd016a89ce5970467a24df70a52cf0/836e2/close-up-door-golden-67537.jpg","srcSet":"/static/62bd016a89ce5970467a24df70a52cf0/6e81a/close-up-door-golden-67537.jpg 120w,\n/static/62bd016a89ce5970467a24df70a52cf0/fbe0e/close-up-door-golden-67537.jpg 240w,\n/static/62bd016a89ce5970467a24df70a52cf0/836e2/close-up-door-golden-67537.jpg 480w,\n/static/62bd016a89ce5970467a24df70a52cf0/94285/close-up-door-golden-67537.jpg 720w,\n/static/62bd016a89ce5970467a24df70a52cf0/b1cc5/close-up-door-golden-67537.jpg 960w,\n/static/62bd016a89ce5970467a24df70a52cf0/fb46d/close-up-door-golden-67537.jpg 2820w","srcWebp":"/static/62bd016a89ce5970467a24df70a52cf0/35871/close-up-door-golden-67537.webp","srcSetWebp":"/static/62bd016a89ce5970467a24df70a52cf0/83552/close-up-door-golden-67537.webp 120w,\n/static/62bd016a89ce5970467a24df70a52cf0/2b5a3/close-up-door-golden-67537.webp 240w,\n/static/62bd016a89ce5970467a24df70a52cf0/35871/close-up-door-golden-67537.webp 480w,\n/static/62bd016a89ce5970467a24df70a52cf0/9754a/close-up-door-golden-67537.webp 720w,\n/static/62bd016a89ce5970467a24df70a52cf0/fcc10/close-up-door-golden-67537.webp 960w,\n/static/62bd016a89ce5970467a24df70a52cf0/d0805/close-up-door-golden-67537.webp 2820w","sizes":"(max-width: 480px) 100vw, 480px"}}}}}]},"socials":{"frontmatter":{"socials":{"linkedin":"https://www.linkedin.com/company/myops-yael","github":"https://github.com/opsguru-israel"}}},"markdownRemark":{"html":"<p>In <a href=\"/post/opensource-data-lake-for-the-hybrid-cloud\">part 1</a> of this series, we answered the question of WHY open source components are often an attractive option when building a data lake of any significant 
size. In this second installment, we describe HOW to cost-effectively build a data lake out of open source components. We will share common architectural patterns as well as critical implementation details for the key components.</p>\n<h2>Designing an Open Source Data Lake</h2>\n<h3>Data Flow Design</h3>\n<p>A typical data lake's logical flow is composed of these functional blocks:</p>\n<ul>\n<li>Data Sources</li>\n<li>Data Ingestion</li>\n<li>Storage Tier</li>\n<li>Data Processing &#x26; Enrichment</li>\n<li>Data Analysis &#x26; Exploration</li>\n</ul>\n<p>In this context, the <strong>data sources</strong> are generally streams or collections of raw, event-driven data (e.g. logs, clicks, IoT telemetry, transactions). A key characteristic of these data sources is that the data is far from clean - often due to constraints of time or compute power during collection. Noise in this data usually consists of duplicate or incomplete records with redundant or erroneous fields.</p>\n<p><img src=\"/img/datalake-dataflow.png\" alt=\"A Data Lake data flow infographic.\"></p>\n<p>Given one or more data sources, we consume the raw data via an ingestion phase. The ingestion mechanism is most often implemented as one or more distributed message queues with a lightweight computational component responsible for initial data sanitization and persistence. In order to build an efficient, scalable, and coherent <strong>data lake</strong>, we strongly recommend a clear distinction between simple data sanitization and more complex enrichment tasks. One rule of thumb is that sanitization tasks should only require data from a single source and within a reasonable sliding window.</p>\n<p>For example, a deduplication task that is bounded to only consider keys from events that are received within 60 seconds of each other from the same data source would be a typical sanitization task. 
On the other hand, a task that aggregates data from multiple data sources and/or across a relatively long time span (e.g. the last 24 hours) would probably be better suited for the batch analytics enrichment phase (which we will talk about below).</p>\n<p>Once data has been ingested and sanitized, it is persisted into an object store/distributed file system to ensure resilience from any subsequent component failure. The data is normally written in a columnar format such as Parquet or ORC and is usually compressed via a fast codec such as Snappy for maximum storage efficiency and query performance.</p>\n<p>When new data is written into the storage tier, a data catalog (which hosts the schema and the underlying metadata) can be dynamically updated using a serverless function crawler. The execution of such a crawler is normally event-driven (triggered by the arrival of a new file at a specific location in the object store). Data stores are normally integrated with the data catalog in order to infer the underlying schema and make the data queryable.</p>\n<p>The data usually lands in a location (or a \"zone\") dedicated to the <strong>golden data</strong>. The data is called golden for a reason: it is still raw, semi-structured or unprocessed, and it is your business logic's primary \"source of truth\". From here on, the data is ready to be further enriched by the subsequent data pipelines.</p>\n<p>During the enrichment process, the data is further modified and distilled according to the business logic. It is eventually stored in a structured format in one of the data stores (e.g. a document store, an RDBMS or an object store) that may be dedicated to on-line serving, BI analytics, data warehousing or model training.</p>\n<p>Lastly, <strong>analysis and data exploration</strong> is where the data consumption occurs. This is where the distilled data is transformed into business insights through visualizations, BI dashboards, reports and views. 
It is also a source of ML predictions, the outcome of which helps drive better business decisions.</p>\n<h2>Platform Components</h2>\n<p>A <strong>hybrid cloud</strong> data lake architecture requires a reliable and unified core abstraction layer that will allow us to deploy, coordinate, and run our workloads without being constrained by vendor APIs and resource primitives. Kubernetes is a great tool for this job since it allows us to deploy, orchestrate and run various data lake services and workloads in a reliable and cost-efficient manner while exposing a unified API, whether it is running on-premises or on any public or private cloud. We will dive deeper into the Kubernetes implementation details in a future post.</p>\n<p><img src=\"/img/high-level-overview.png\"></p>\n<p>From a platform perspective, the <strong>foundation layer</strong> is where we deploy Kubernetes or an equivalent. The same foundation can be used to handle workloads beyond the data lake. A future-proof foundation layer incorporates cloud vendor best practices (functional and organizational account segregation, logging and auditing, minimal access design, vulnerability scanning and reporting, network architecture, IAM architecture, etc.) in order to achieve the necessary levels of security and compliance.</p>\n<p>Above the foundation layer, there are two additional layers - the <strong>data lake</strong> and the <strong>data value</strong> derivation layers.</p>\n<p>These two layers are mainly responsible for the core business logic as well as the data platform pipelines. 
While there are many ways to host these two layers, Kubernetes is once again a good option because of its flexibility to support different workloads, both stateless and stateful.</p>\n<p>The data lake layer typically includes all the necessary services that are responsible for ingestion (Kafka, Kafka Connect), filtering, enrichment and processing (Flink and Spark) and workflow management (Airflow), as well as <strong>data stores</strong> such as distributed file systems (HDFS), RDBMS and NoSQL databases.</p>\n<p>The uppermost layer, <strong>data value derivation</strong>, is essentially the \"consumer\" layer and includes components such as visualisation tools for BI insights and data-science notebooks (Jupyter) for ad-hoc data exploration. Another important process that takes place in this layer is ML model training, leveraging data sets residing in the data lake.</p>\n<p>It is important to mention that an integral part of every production-grade data lake is the full adoption of common DevOps best practices such as <strong>infrastructure as code, observability, audit and security.</strong> These play a critical role in the solution and should be applied at every single layer in order to enable the necessary level of compliance, security and operational excellence.</p>\n<p>--</p>\n<p>Now, let's dive deeper into the data lake architecture and review some of the core technologies involved in the process of ingesting, filtering, processing and storing our data. A good guiding principle for choosing open source solutions for any of the data lake stages is to look for a track record of wide industry adoption, comprehensive documentation and, of course, extensive community support.</p>\n<p>A Kafka cluster receives the raw, unfiltered messages and functions as the data lake's ingestion tier thanks to its reliable message persistence and its ability to support very high message throughput in a robust way. 
The cluster typically contains several topics: raw data, processed data (for stream processing) and dead-letter (for malformed messages). For maximum security, the broker endpoints can terminate SSL, while encryption is enabled on the persistent volumes.</p>\n<p>From that point, a Flink job consumes the messages from Kafka's raw data topic and performs the required filtering and, when needed, initial enrichment. The data is then produced back to Kafka (into a separate topic dedicated to filtered/enriched data). In the event of failure, or when the business logic changes, these messages can be replayed from the beginning of the log since they are persisted in Kafka. This is very common in streaming pipelines.</p>\n<p>Meanwhile, any malformed or invalid messages are written by Flink into the dead-letter topic for further analysis.</p>\n<p>Using a Kafka Connect fleet backed by storage connectors, we are then able to persist the data into the relevant data store backends, such as a golden zone on HDFS. 
In the event of a traffic spike, the Kafka Connect deployment can easily scale out to support a higher degree of parallelism, resulting in higher ingestion throughput:</p>\n<p><img src=\"/img/datalake-article.png\"></p>\n<p>While writing into HDFS from Kafka Connect, it is usually a good idea to perform content- (topic-) and date-based partitioning for query efficiency (less data to scan means less I/O), for example:</p>\n<p><code class=\"language-text\">hdfs://datalake-vol/golden/topic_name/2019/09/01/01/datafoo.snappy.parquet</code></p>\n<p><code class=\"language-text\">hdfs://datalake-vol/golden/topic_name/2019/09/01/02/databar.snappy.parquet</code></p>\n<p>Once the data has been written to HDFS, a periodically scheduled serverless function (running on a platform such as OpenWhisk or Knative) updates the metastore (which contains the metadata and schema settings) with the updated structure of the schema, so that the data can be queried via SQL-like interfaces such as Hive or Presto.</p>\n<p>For the subsequent data flows and ETL coordination, we can leverage Apache Airflow, which allows users to launch multi-step data pipelines using a Directed Acyclic Graph (DAG) defined as a simple Python object. A user can define dependencies, programmatically construct complex workflows, and monitor scheduled jobs in an expressive UI.</p>\n<p><img src=\"/img/data-lake-hive.png\"></p>\n<p>Airflow can also potentially handle the data pipeline for all things external to the cloud provider (e.g. 
pulling in records from an external API and storing them in the persistence tier).</p>\n<p>Orchestrated by Airflow via a dedicated operator plugin, Spark can then periodically enrich the filtered raw data further according to the business logic and prepare the data for consumption and exploration by your data scientists, business analysts and BI teams.</p>\n<p>The data science team will be able to leverage JupyterHub to serve Jupyter Notebooks, thereby enabling an effective multi-user, collaborative notebook interface to your data, with the Spark execution engine performing aggregations and analysis.</p>\n<p><img src=\"/img/data-lake-airflow.png\" alt=\"A data lake airflow infographic guide.\"></p>\n<p>The team can also leverage frameworks such as Kubeflow for production-grade ML model training that takes advantage of the scalability of Kubernetes. The resultant machine learning models can later be fed back into the serving layer.</p>\n<p>Gluing all the pieces of the puzzle together, the final architecture will look something like this:</p>\n<p><img src=\"/img/datalake-article-oss-datalake-1-.png\"></p>\n<h3>Operational excellence</h3>\n<p>We've already mentioned that DevOps and DevSecOps principles are core components of every data lake and should never be overlooked. With great power comes great responsibility, especially when your business has structured and unstructured data now residing in one place.</p>\n<p>One of the recommended approaches is to allow access only to specific services (via appropriate IAM service roles) and block any direct user access so that data cannot be manually altered by your team members. Also, a full audit trail, using the relevant services, is essential for monitoring and safeguarding the data.</p>\n<p>Data encryption is another important mechanism to protect your data. 
Encryption at rest can be achieved by using KMS services to encrypt the persistent volumes of your stateful sets and the object store, while encryption in transit can be achieved by using certificates on all UIs as well as on service endpoints such as Kafka and ElasticSearch.</p>\n<p>We recommend a serverless scanner for resources that aren't compliant with your policies, making it easy to discover issues such as untagged resources or overly permissive security groups.</p>\n<p>We discourage any manual, ad-hoc deployments for any component of the data lake; every change should originate in version control and go through a series of CI tests (regression, smoke tests, etc.) before getting deployed into the production data lake environment.</p>\n<h2>Cloud Native Data Lake - Epilogue</h2>\n<p>In this series of blog posts we've demonstrated the rationale behind and the architectural design of an open source data lake. As in most cases in IT, the choice of whether or not to adopt such an approach is not always obvious and can be dictated by a wide array of business and compliance requirements, budget and time constraints.</p>\n<p>It is important to understand that the real cost benefit is usually observed when the solution is deployed at scale, since there is an initial investment in a platform that is made to support this flexible model of operation (as we have demonstrated in <a href=\"/post/opensource-data-lake-for-the-hybrid-cloud\">part 1</a>).</p>\n<p>Going with a cloud native data lake platform (whether it is a hybrid or a fully cloud native solution) is clearly a growing trend in the industry, given the sheer number of benefits this model offers. It provides a high level of flexibility and protects against increasing vendor lock-in. 
In the next installment we are going to drill down into a Kubernetes abstraction that enables hybrid data lake implementation.</p>\n<p>Written by: MyOps Team</p>","frontmatter":{"url":"opensource-data-lakes-for-the-hybrid-cloud-designing-an-oss-datalake","seo":{"title":"OpenSource Data Lake for the Hybrid Cloud - Part 2: Designing an OSS DataLake","description":"In part 1 of this series, we answered the question of WHY Open Source components are often an attractive option when building a data lake of any significant size. In this second installment, we describe HOW to cost-effectively build a data lake out of Open Source components.","canonical":null,"image":{"childImageSharp":{"fluid":{"aspectRatio":1.6025641025641026,"src":"/static/107087aec2d3327919bcfb2ab38201da/724c8/datalake-p2.jpg","srcSet":"/static/107087aec2d3327919bcfb2ab38201da/84d81/datalake-p2.jpg 250w,\n/static/107087aec2d3327919bcfb2ab38201da/f0719/datalake-p2.jpg 500w,\n/static/107087aec2d3327919bcfb2ab38201da/724c8/datalake-p2.jpg 1000w,\n/static/107087aec2d3327919bcfb2ab38201da/d79bd/datalake-p2.jpg 1500w,\n/static/107087aec2d3327919bcfb2ab38201da/a66ad/datalake-p2.jpg 2000w,\n/static/107087aec2d3327919bcfb2ab38201da/32638/datalake-p2.jpg 6399w","srcWebp":"/static/107087aec2d3327919bcfb2ab38201da/36ebb/datalake-p2.webp","srcSetWebp":"/static/107087aec2d3327919bcfb2ab38201da/1d872/datalake-p2.webp 250w,\n/static/107087aec2d3327919bcfb2ab38201da/4e6d4/datalake-p2.webp 500w,\n/static/107087aec2d3327919bcfb2ab38201da/36ebb/datalake-p2.webp 1000w,\n/static/107087aec2d3327919bcfb2ab38201da/fd45d/datalake-p2.webp 1500w,\n/static/107087aec2d3327919bcfb2ab38201da/6e77b/datalake-p2.webp 2000w,\n/static/107087aec2d3327919bcfb2ab38201da/85285/datalake-p2.webp 6399w","sizes":"(max-width: 1000px) 100vw, 1000px","maxHeight":625,"maxWidth":1000}}}},"title":"OpenSource Data Lake for the Hybrid Cloud - Part 2: Designing an OSS DataLake","date":"2019-08-27T16:00:00.000Z","tags":["Kubernetes","Big 
Data"],"author":{"name":"MyOps","photo":{"extension":"png","publicURL":"/static/3ff870573bc56665ee67e3cf3f5fc163/logo-small.png","childImageSharp":{"fluid":{"aspectRatio":0.8759124087591241,"src":"/static/3ff870573bc56665ee67e3cf3f5fc163/b460a/logo-small.png","srcSet":"/static/3ff870573bc56665ee67e3cf3f5fc163/d966b/logo-small.png 120w,\n/static/3ff870573bc56665ee67e3cf3f5fc163/67196/logo-small.png 240w,\n/static/3ff870573bc56665ee67e3cf3f5fc163/b460a/logo-small.png 480w,\n/static/3ff870573bc56665ee67e3cf3f5fc163/eec14/logo-small.png 596w","srcWebp":"/static/3ff870573bc56665ee67e3cf3f5fc163/35871/logo-small.webp","srcSetWebp":"/static/3ff870573bc56665ee67e3cf3f5fc163/83552/logo-small.webp 120w,\n/static/3ff870573bc56665ee67e3cf3f5fc163/2b5a3/logo-small.webp 240w,\n/static/3ff870573bc56665ee67e3cf3f5fc163/35871/logo-small.webp 480w,\n/static/3ff870573bc56665ee67e3cf3f5fc163/c0cb3/logo-small.webp 596w","sizes":"(max-width: 480px) 100vw, 480px"}}}},"image":{"childImageSharp":{"fluid":{"aspectRatio":1.5957446808510638,"src":"/static/107087aec2d3327919bcfb2ab38201da/8c3c2/datalake-p2.jpg","srcSet":"/static/107087aec2d3327919bcfb2ab38201da/15aed/datalake-p2.jpg 300w,\n/static/107087aec2d3327919bcfb2ab38201da/a07a5/datalake-p2.jpg 600w,\n/static/107087aec2d3327919bcfb2ab38201da/8c3c2/datalake-p2.jpg 1200w,\n/static/107087aec2d3327919bcfb2ab38201da/cd33f/datalake-p2.jpg 1800w,\n/static/107087aec2d3327919bcfb2ab38201da/1c8c6/datalake-p2.jpg 2400w,\n/static/107087aec2d3327919bcfb2ab38201da/4ad79/datalake-p2.jpg 6399w","srcWebp":"/static/107087aec2d3327919bcfb2ab38201da/e7405/datalake-p2.webp","srcSetWebp":"/static/107087aec2d3327919bcfb2ab38201da/4fec1/datalake-p2.webp 300w,\n/static/107087aec2d3327919bcfb2ab38201da/483a3/datalake-p2.webp 600w,\n/static/107087aec2d3327919bcfb2ab38201da/e7405/datalake-p2.webp 1200w,\n/static/107087aec2d3327919bcfb2ab38201da/7f800/datalake-p2.webp 1800w,\n/static/107087aec2d3327919bcfb2ab38201da/7acea/datalake-p2.webp 
2400w,\n/static/107087aec2d3327919bcfb2ab38201da/ecce4/datalake-p2.webp 6399w","sizes":"(max-width: 1200px) 100vw, 1200px"}}}}}},"pageContext":{"id":"a6e959df-f018-5b7f-80fa-264d9fe56f6e","categories":["Kubernetes","Big Data"]}},"staticQueryHashes":["2022990323","639612397"]}