How to build a data lake on-premises


Like Dave said, I'm a senior solutions manager for analytics and AI at Pure Storage. Today, we are going to talk about how to build a cloud data lake, or a cloud-like data lake, on premises. And again, go check out that Field Day session by Brian Gold from Pure Storage, where he explains how we built this ground-up architecture to scale.

All of these new data sources that are growing exponentially are machine generated [00:13:00], and people are doing AI and ML on them. Forward-thinking companies say that in ten years, most of the code generated will be AI and ML code. You want faster time to insight; you want to build these pipelines to create business value, right?

And the second thing is that people want to do more predictive, more real-time use cases, where you're not just getting insights and putting them on a dashboard, but you have a piece of code that gets an insight and then takes an action. For example, you get an insight and then take an action, like placing a stock trade [00:14:30], or you have to respond to a security threat, where the software takes the action. For you to be able to take immediate action on the data, you need real-time data, and if the response is going to be automated, then you had better have real-time data, and it needs to be predictive as we move towards machine learning.

[00:26:30] And finally, you can support all these multi-tenant applications at scale and make everything self-service. That's our vision: to make analytics and AI scalable, self-service, and automated, so you can bring compute to whatever you need, rather than allocating specific compute silos.

Performance, in terms of throughput and in terms of bandwidth, goes up as you add more and more blades and more and more capacity, so there's no complexity; it just scales. It's not like you can take a bunch of SSDs, put them together in a box, and it becomes a FlashBlade, right? You cannot do that and just manage it with one or two people, and forget storage, right? So I think we should [inaudible 00:36:14], and I did post the link there; you might just go check out Slack to see if anybody else is there, but at this point, I think we should just wrap it up.

So let's start with an architecture, or [marketecture 00:10:19], diagram, which shows the various layers. On top, you have your applications, right? Okay. Below the Kubernetes layer, you're going to have a layer for data management services for Kubernetes: as a container is spun up or spun down, the data management services layer is going to provide the storage to the Kubernetes layer. And then you're going to have a layer [00:11:30] which is your modern data lake layer, which is based on open data formats, and this software layer is going to be built on top of block or object store, or it could be more legacy systems; it's going to be built on a [inaudible 00:11:49]. And finally, complexity: management complexity. It can be used for building, automating, and protecting your cloud-native applications, with modules for core storage, backup, disaster recovery, application data migration, security, and infrastructure automation; all of that is taken care of [00:21:00] with this hundred-percent software solution, a Kubernetes data services platform.
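As a rough illustration of that bottom layer, here is a minimal sketch of an application reading open-format (Parquet) data directly from an S3-compatible object store such as FlashBlade. The endpoint, bucket, and credentials are placeholders, not values from the talk, and it assumes the hadoop-aws (s3a) connector is on the Spark classpath.

```python
from pyspark.sql import SparkSession

# Hypothetical on-prem S3 endpoint and credentials; replace with your own.
spark = (
    SparkSession.builder.appName("open-data-lake-read")
    # Point the s3a connector at the on-prem object store instead of AWS.
    .config("spark.hadoop.fs.s3a.endpoint", "https://flashblade.example.internal")
    .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
    # On-prem stores typically use path-style addressing, not virtual hosts.
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Open data format (Parquet) on object storage: any engine can read it.
events = spark.read.parquet("s3a://datalake/events/")
events.groupBy("event_type").count().show()
```

The point of the sketch is that compute (the Spark session) is brought to the data over the S3 protocol, rather than being pinned to a storage silo.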
So [00:00:30] we're going to start with some challenges with legacy data architecture today, and how a modern data architecture solves some of these problems; then we're going to talk about some requirements for modern infrastructure to create these cloud data lakes on premises; then we'll show you how to accelerate data insights at your organization; and finally, we'll conclude with some pointers to more technical, [00:01:00] in-depth resources about what I'm talking about, to show you some of the proof points and some examples of how other people have done it. And finally, there's each piece of analytics software you have in your pipeline, whether it's Spark or Splunk [00:07:00] or Elastic or, it may be, Dremio. So this is what most data teams want, and we know that, but what are the infrastructure challenges that are preventing us from getting there? You can do this over time, making sure that you're [00:23:30] leveraging the latest S3 protocols and, at the same time, keeping your users happy with zero downtime. And it needs to be reliable and always available, even while you're doing upgrades: if you want to add capacity, you don't want to take the storage down. It needs to be always available no matter what you're doing, upgrades or patches, and the data needs to be protected against ransomware attacks and against any kind of failure scenario. You want code that's going through a CI/CD process and that's ready for production at any time.
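Since so much of this rides on the S3 interface, here is a small, hypothetical sketch of what cloud-style access to an on-prem object store looks like in practice; the endpoint URL, bucket name, and credentials below are placeholders.

```python
import boto3

# Hypothetical endpoint and credentials for an on-prem, S3-compatible store.
s3 = boto3.client(
    "s3",
    endpoint_url="https://flashblade.example.internal",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# The same S3 API calls work unchanged whether they target AWS or the
# on-prem store, which is what lets applications move over gradually.
resp = s3.list_objects_v2(Bucket="datalake", Prefix="events/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```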

Good evening, wherever you may be. I don't know what BOC stands for. Naveen: We'll try and get that [00:32:30] over Slack, maybe. Dave: Yeah. Actually, if you go to YouTube and just search for FlashBlade, there's something about Field Day with Brian Gold, where he actually walks through all of the design that he's gone through. You can put a bunch of SSDs together and make it work for a few terabytes of data. FlashBlade is completely managed from the cloud, so as you want to add capacity, you just keep slipping in new blades and it just adds capacity with no downtime, and it's super simple; there's no need for tuning. Yeah. Okay. Let's go ahead and open it up for Q&A. All right, so that's all the questions we have time for in this session. For those of you who are still here, it looks like there are still 20 or so people here, just in the chat.

Let's say you have a certain analytics cluster. You can add higher-capacity nodes, [00:06:00] and when you add higher-capacity nodes, what's going to happen is that when one of those nodes fails, it's going to cause a huge amount of rebalancing in your cluster, especially with direct-attached storage, right? You had these hyper-converged nodes, and you'd given a certain number of nodes to a particular application, whether it's Hadoop or Spark or whatever [00:08:00] application that may be, and you had these nodes that you just [inaudible 00:08:06] to scale to hundreds of nodes, 200 nodes, 300 nodes. And from 2015 to 2020, we moved into this cloud data warehouse world, where you were in a cloud with separation of compute and storage, so the whole storage layer became a sort of cloud S3 layer, and you had cloud data warehouses which would [00:08:30] separate compute and bring compute to a query; in the cloud, it would bring unlimited compute to a particular query for a few minutes and then spin it down when you don't need it.

[00:10:00] So let's talk about the infrastructure underlying these cloud-like data lakes on premises, see how we build those, and give you some actual examples of how to build them, right? Here is Portworx. [00:24:00] You build an open data lake on top of a Pure FlashBlade, with whatever metastore and open data format you choose, with the tables or files, Parquet files, and data lake tables on top, and we've tested this and we've seen that this works very, very well. A lot of this content was developed by Joshua Robinson, who's a chief [00:27:00] technologist at Pure, and he's written a very detailed blog describing it. And I know DataOps is a very buzzy word right now.
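To sketch what "a metastore plus Parquet files on FlashBlade" might look like, here is a minimal, hypothetical example of registering a table over Parquet data that already lives on the object store; the table name and bucket path are made up, and it assumes Hive support and s3a connectivity are configured as in the earlier sketch.

```python
from pyspark.sql import SparkSession

# Assumes a Hive metastore plus the s3a endpoint/credentials set up earlier.
spark = (
    SparkSession.builder.appName("open-data-lake-tables")
    .enableHiveSupport()
    .getOrCreate()
)

# Register a table over Parquet files already on the object store; the schema
# is inferred from the files, and no data is copied.
spark.sql("""
    CREATE TABLE IF NOT EXISTS analytics_events
    USING parquet
    LOCATION 's3a://datalake/events/'
""")

# Any engine sharing the same metastore (Spark, Hive, Dremio) now sees it.
spark.sql(
    "SELECT event_type, count(*) AS n FROM analytics_events GROUP BY event_type"
).show()
```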
So let's get started. You're not creating pipelines for the sake of creating data pipelines, and you may encounter new tools; you want to use [00:03:00] the latest and greatest tools, newer tools, and you want to allocate the right amount of resources to the right project at the right time, right? So let's double-click into that storage layer. I'm from Pure Storage, so obviously I'm going to double-click into that storage layer and find out what some [00:12:00] of the requirements of that storage layer are in this modern data analytics world, and what some of the key market drivers are for this layer, for data today. Actually, not just the storage layer: what are the key market drivers for modern data delivery today?

So MinIO is similar to FlashBlade, except FlashBlade is [inaudible 00:29:53] software. So thanks again, Naveen, for sticking around a little extra, and thanks for your talk. Naveen: Thank you, guys.

You may be on object storage, but you may have legacy software that is using the NFS protocol, or even current software using NFS or SMB, so whatever protocol your application is using to access that data, that protocol should be available. And it should also be native to that platform, so the performance is good no matter what protocol the application is using [00:18:30] to access data.
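As an illustration of that multi-protocol point, here is a small, hypothetical sketch reading the same Parquet file once over an NFS mount and once over S3; the mount path, bucket, endpoint, and credentials are all placeholders (the S3 read requires the s3fs package).

```python
import pandas as pd

# Hypothetical paths: the same dataset exposed two ways by a multi-protocol
# array (an NFS mount for legacy apps, S3 for cloud-style apps).
NFS_PATH = "/mnt/datalake/events/2024-01-01.parquet"
S3_PATH = "s3://datalake/events/2024-01-01.parquet"

# Legacy application: plain file access over the NFS mount.
df_nfs = pd.read_parquet(NFS_PATH)

# Modern application: the S3 protocol against the same data.
df_s3 = pd.read_parquet(
    S3_PATH,
    storage_options={
        "key": "ACCESS_KEY",
        "secret": "SECRET_KEY",
        "client_kwargs": {"endpoint_url": "https://flashblade.example.internal"},
    },
)

assert df_nfs.equals(df_s3)  # same bytes, two protocols
```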

Good afternoon. Different applications use different protocols: cloud-like applications [00:18:00] use S3; it'll be a [inaudible 00:18:03] protocol, right? Finally, from an organizational perspective, from an environmental perspective, you're seeing security become a big concern, because that data is now the new oil: it is your IP, and you have to protect [00:14:00] it. There are ransomware attacks everywhere, locking up data and demanding ransom, and so you want to keep it safe. And if you want to learn more about this, there are many customers doing this today, and we've got a very technical document written by Joshua. You were able to do this with a combination of a Hive source of [00:23:00] data, Dremio, and FlashBlade.

Next, it needs to be cloud-ready: even if you're on premises, it needs to be an agile infrastructure which [00:16:30] gives you the flexibility to bring compute to the data, with disaggregated compute and storage, and also provides you consumption choices that are cloud-like, right? We want to take that cloud-like approach where the storage and the compute [00:29:00] are disaggregated, as opposed to direct-attached storage, where you have a hyper-converged infrastructure, where you have nodes. So [00:33:00] all right, let me copy the link. So let's talk about how, in the context of Dremio, these applications and this architecture are going to help you.

FlashBlade is built from the ground up to be a very reliable and [00:30:00] performant object [inaudible 00:30:05] store. You'd have to have people who are extremely technically smart to manage petabytes and petabytes [00:35:00] of data storage, and with Pure you can literally set it and forget it: it will just exist, and you don't have to manage it or tune its performance; it's just going to keep delivering that simplicity, performance, and scale. Simply, that's it. That [00:29:30] answers the question. Dave: Okay.

You can start sharing your table definitions with Hive, share your data across existing applications, and start migrating tables, users, and queries slowly to the S3 interface and Dremio at your own pace; you don't have the luxury of doing a forklift upgrade over a weekend to just migrate all your data into this modern architecture.
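Here is a minimal, hypothetical sketch of that table-at-a-time migration; the database, table, and bucket names are made up, and it assumes the Hive metastore and s3a connectivity from the earlier sketches.

```python
from pyspark.sql import SparkSession

# Assumes Hive metastore access plus s3a endpoint/credentials as set earlier.
spark = (
    SparkSession.builder.appName("gradual-migration")
    .enableHiveSupport()
    .getOrCreate()
)

# 1. Copy one existing Hive table to Parquet on the object store.
spark.table("legacy_db.orders").write.mode("overwrite").parquet(
    "s3a://datalake/migrated/orders/"
)

# 2. Register the copied data as a new table over the S3 location. The old
#    table keeps working, so users and queries can move at their own pace.
spark.sql("""
    CREATE TABLE IF NOT EXISTS legacy_db.orders_s3
    USING parquet
    LOCATION 's3a://datalake/migrated/orders/'
""")
```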
If we didn't get to your questions (I think we got them all, but if we didn't), you can hit up Naveen in Slack, [00:31:00] but before you go, we would appreciate it if you would please fill out the super-short Slido session survey, which you'll find in the chat. The next session is coming up; I think we have a panel actually, or a keynote, a fireside chat, I believe.

[00:19:30] The other layer, on top of this storage: we spoke about building that open data platform, and we also need software to manage storage for your containers. As you spin containers up and down, you want that to be automatic, with storage allocated as each container is spun up. And when there's a failure scenario, [00:20:00] when there's a need for backup, a need to migrate data, a need to create a dev/test environment, a need to encrypt the data, or when one container fails and Kubernetes takes action to create another container to cover failure or scaling scenarios, all of those storage needs have to be addressed, and you need a [00:20:30] Kubernetes data services platform to address all those requirements. Pure Storage acquired a company called Portworx, which is the industry's leading data services platform available for that.
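To make that dynamic-provisioning idea concrete, here is a minimal sketch, using the official Kubernetes Python client, of requesting a volume through a Portworx-backed StorageClass; the StorageClass name, claim name, namespace, and size are assumptions, not values from the talk.

```python
from kubernetes import client, config

config.load_kube_config()  # assumes local kubeconfig access to the cluster

# Plain-dict manifest, equivalent to the YAML you would kubectl apply.
# "portworx-sc" is a placeholder StorageClass name backed by Portworx.
pvc_manifest = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "analytics-scratch"},
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "storageClassName": "portworx-sc",
        "resources": {"requests": {"storage": "100Gi"}},
    },
}

# When a pod mounts this claim, the data services layer provisions the volume
# on demand; if the pod is rescheduled after a failure, the volume follows it.
client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="default", body=pvc_manifest
)
```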