Teja Manakame, Senior Director-Data Intelligence, Dell IT
In an era of ever-exploding data volumes, and with growing recognition of the value this data can bring to the business, IT organizations face a daunting challenge: capturing and storing all forms of data so that insights can be derived from them. Faced with enormous volumes and heterogeneous types of data, organizations need more than a traditional data management system or data warehouse. They need something innovative that offers better agility and flexibility to manage their Big Data.
With exposure to cloud-based technologies, businesses are uniquely positioned to be more informed than ever before; they want to make better data-driven decisions and in turn expect the latest analytic technologies to be available at their fingertips. The super-connected network of people, processes, data and tools is disrupting both the implementation and the consumption of traditional data management and analytics. IT now needs to re-position itself toward a more efficient, cost-effective, self-service model to meet these demands.
A Data Lake is a relatively new and increasingly popular way to store and analyze data that addresses many of these challenges. A Data Lake is a pool of unstructured and structured data coming from different sources, stored as-is, without a specific purpose in mind, that can be "built on multiple technologies such as Hadoop, NoSQL, any Simple Storage Service, a relational database, or various combinations thereof," typically saved on low-cost commodity hardware.
With the growing popularity of Hadoop as the Big Data analytics platform, this solution helps speed time to insights from multiple dimensions of data and reduces the risks and costs associated with deploying new systems or extending existing ones as business needs change. One of the basic tenets of Hadoop and distributed computing is the notion of moving the compute to the data, rather than the reverse. The Hadoop-based data lake is gaining in popularity because it can capture the volume of big data and the other new sources that enterprises want to leverage via analytics, and it does so at low cost and with good interoperability with other platforms in the data warehousing world. In this sense, Hadoop and data lakes add value to the Data Warehouse and its environment without ripping out and replacing mature investments.
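The "move the compute to the data" principle can be illustrated with a toy, single-machine sketch of the MapReduce model that Hadoop popularized: the small map and reduce functions (the compute) are shipped to each data partition, and only compact intermediate results travel over the network. The partitions, input lines, and function names below are illustrative, not drawn from any specific Hadoop job.

```python
# Toy sketch of the MapReduce word-count pattern. In a real Hadoop
# cluster each partition would be an HDFS block on a different node,
# and the map function would run locally on that node.
from collections import defaultdict

def map_phase(partition):
    """Emit (word, 1) pairs for every word in one data partition."""
    for line in partition:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Sum the counts for each word across all mapper outputs."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Two "partitions", standing in for blocks stored on different nodes.
partitions = [
    ["big data needs big storage"],
    ["big compute moves to the data"],
]

# Run the map step on each partition, then combine in the reducer.
mapped = [pair for part in partitions for pair in map_phase(part)]
word_counts = reduce_phase(mapped)
print(word_counts["big"])   # 3
print(word_counts["data"])  # 2
```

The design point is that the mappers only ever read their local partition; only the small `(word, count)` pairs are shuffled to the reducer, which is what makes the pattern economical at petabyte scale.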
The data lake and the enterprise data warehouse must each do what they do best and work together as components of a logical data warehouse. The logical data warehouse, made up of an enterprise data warehouse, a data lake, and a discovery platform to facilitate analytics across the architecture, will determine what data and what analytics to use to answer business needs. The Hadoop ecosystem also offers tools such as Apache Sqoop, which makes loading data from relational databases into Hive easy and fast. Deep data modeling skills are not required up front, and the data can be queried through Hive's SQL-like interface without writing any low-level MapReduce code.
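A typical Sqoop-to-Hive load looks like the sketch below. The JDBC URL, credentials, database, and table names are placeholders for illustration, not a real environment; the flags shown are standard Sqoop import options.

```shell
# Hypothetical example: import a relational table into a Hive table
# with Apache Sqoop. Connection details and names are placeholders.
sqoop import \
  --connect jdbc:mysql://db.example.com:3306/sales \
  --username etl_user -P \
  --table orders \
  --hive-import \
  --hive-table sales.orders \
  --num-mappers 4
```

Once imported, the data is immediately queryable in Hive with familiar SQL-like syntax, e.g. `SELECT COUNT(*) FROM sales.orders;`, which is what lowers the barrier for analysts coming from the data warehousing world.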
The ability to capture and process this ever-growing business data is now possible because of the growth of inexpensive storage and virtually limitless compute, along with new technologies that enable real-time analysis and a direct connection to action through new applications and products. EMC Isilon is one such example, offering multi-protocol scale-out file storage for data lake applications.
The new data lake 2.0 strategy expands the data lake to extend from the data center to enterprise edge locations and to your choice of public or private cloud options. With Isilon CloudPools software, the data lake can be extended to provide virtually limitless capacity without adding any complexity to storing or managing the data.
The Hadoop data lake isn't without its challenges. Even experienced Hadoop data lake users say that a successful implementation requires a strong architecture, security gates, and disciplined data governance policies; without those things, they warn, data lake systems can become out-of-control dumping grounds for exploding data.
In conclusion, in the era of Data-Driven Innovation, the emergence of the data lake comes from the need to manage and exploit new forms of data. By leveraging the data lake, many companies can operate at the cutting edge of Big Data analytics in the enterprise. More importantly, it provides the foundation and tools to use data and analytics to create sustainable, long-term competitive differentiation.
The shape of your data lake is determined by what you need to do but cannot with your current data processing architecture. The right data lake can only be created through experimentation. Together, the data lake and the enterprise data warehouse provide a synergy of capabilities that delivers accelerating returns, allowing people to do more with data faster and driving business results. It is a game-changer not because it saves IT a whole bunch of money, but because it can help the business make huge money.