Cloud Object Storage for Data Lakes
Data growth is exploding. Growing fleets of mobile devices, intelligent sensors and smart endpoints are generating data of ever-increasing variety, volume and velocity. IDC forecasts annual global data generation to grow from 33 zettabytes (ZB) in 2018 to 175 ZB in 2025 as connected devices and smart systems proliferate. (1 ZB = 1 trillion GB)
By turning this sea of raw data into meaningful and actionable insights, companies can accelerate the pace of business, increase worker productivity and streamline operations. Corporations can optimize business processes and fine-tune sales, marketing and advertising campaigns. Municipalities and utilities can enhance public safety and services, optimize transportation and energy systems, and reduce expense and waste. And researchers and scientists can improve our understanding of the universe, accelerate cures for diseases, and improve weather forecasting and climate modeling.
Big Data has the potential to fundamentally transform entire industries. But antiquated and costly data storage solutions stand in the way. The fact of the matter is that most organizations can’t afford to maintain massive datasets for extended periods using conventional on-premises storage solutions or first-generation cloud storage services from AWS, Microsoft Azure or Google Cloud Platform. In practice, most enterprises store only the essential data required to support primary business applications and regulatory requirements. Historical data containing valuable insights into customer behavior and market trends is often discarded.
But all of that is about to change. A new generation of cloud storage has arrived, bringing utility pricing and simplicity. With Cloud Storage 2.0 you can cost-effectively store any type of data, for any purpose, for any length of time in Wasabi’s hot cloud storage. And you no longer have to make agonizing decisions about which data to collect, where to store it and how long to retain it.
This next generation of cloud storage is ideal for building data lakes—vast storage repositories where you can collect massive volumes of raw data, for any purpose. In a TDWI survey of over 250 data management professionals, nearly half of respondents said they already have a data lake in production (23%) or plan to have one in production within 12 months (24%).
What is a Data Lake?
A data lake is an enterprise-wide system for securely storing disparate forms of data in native format. A data lake includes a wide variety of data not found in a conventional structured data store (e.g. sensor data, click-stream data, social media data, location data, log data from servers and network devices) as well as traditional structured and semi-structured data. Data lakes break down traditional corporate information silos by bringing all of an enterprise’s data into a single repository for analysis, without the historical restrictions and hassles of schema or data transformation.
Data lakes lay the foundation for advanced analytics, machine learning and new data-driven business practices. Data scientists, business analysts and technical professionals can run analytics in place using the commercial or open-source data analysis, visualization and business intelligence tools of their choice. Dozens of vendors offer standards-based tools, from self-service data exploration tools for non-technical business users to advanced data mining platforms for data scientists, that help enterprises monetize data lake investments and transform raw data into business value.
The diagram below depicts a data lake in an Internet of Things implementation. Edge compute devices process and analyze local data before sending it to the data lake. For example, edge servers might perform real-time analytics, execute local business logic and filter out data that has no intrinsic historic or global value.
Data Warehouse vs Data Mart vs Data Lake
The terms data lake and data warehouse are often confused and sometimes used interchangeably. In fact, while both are used to store massive datasets, data lakes and data warehouses are different (and can be complementary).
A data lake is a massive pool of data that can contain any type of data—structured, semi-structured or unstructured.
A data warehouse is a repository for structured, filtered data that has already been processed for a specific purpose. In other words, a data warehouse is well-organized and contains well-defined data.
A data mart is a subset of a data warehouse, used by a specific enterprise business unit for a specific purpose such as a supply chain management application.
James Dixon, the originator of the data lake term, explains the differences by way of analogy: “If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
A data lake can be used in conjunction with a data warehouse. For example, you can use a data lake as a landing and staging repository for a data warehouse. You can use the data lake to curate or cleanse data before feeding it into a data warehouse or other data structures.
Data lakes that are not curated risk becoming data swamps, with no governance or quality decisions applied to the data. “Muddying” data of mixed quality together radically decreases the value of collecting it, because it becomes difficult to rely on the validity of decisions made from the collected data.
The diagram below depicts a typical data lake technology stack. The data lake includes scalable storage and compute resources; data processing tools for managing data; analytics and reporting tools for data scientists, business users and technical personnel; and common data governance, security and operations systems.
Data Lake Technology Stack
You can implement a data lake in an enterprise data center or in the cloud. Many early adopters deployed data lakes on-premises. As data lakes become more prevalent, many mainstream adopters are looking to cloud-based data lakes to accelerate time-to-value, reduce TCO and improve business agility.
On-Premises Data Lakes are CAPEX and OPEX Intensive
You can implement a data lake in an enterprise data center using commodity servers and local (internal) storage. Today most on-premises data lakes use a commercial or open-source version of Hadoop, a popular distributed computing framework, as a data platform. (In the TDWI survey, 53% of respondents are using Hadoop as their data platform, while only 6% are using a relational database management system.)
You can combine hundreds or thousands of servers to create a scalable and resilient Hadoop cluster, capable of storing and processing massive datasets. The diagram below depicts a technology stack for an on-premises data lake on Apache Hadoop.
On-Premises Data Lake on Hadoop Example
The technology stack includes:
Hadoop MapReduce
A software framework for easily writing applications that process vast amounts of data in-parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner.
Hadoop YARN
A framework for job scheduling and cluster resource management.
Hadoop Distributed File System (HDFS)
A high-throughput distributed file system specifically designed to run on low-cost servers with inexpensive internal disk drives.
On-premises data lakes provide high performance and strong security, but they are notoriously expensive and complicated to deploy, administer, maintain and scale. Disadvantages of an on-premises data lake include:
Long deployment times
Building your own data lake takes significant time, effort and money. You have to design and architect the system; define and institute security and administrative systems and best practices; procure, stand up and test the compute, storage and networking infrastructure; and identify, install and configure all the software components. It usually takes many months (often over a year) to get an on-prem data lake up and running in production.
High upfront capital expenses
Substantial upfront equipment outlays lead to lopsided business models with poor ROIs and long paybacks. Servers, disks and network infrastructure are all over-engineered to meet peak traffic demands and future capacity requirements, so you’re always paying for idle compute resources and unused storage and network capacity.
High equipment operations expenses
Recurring power, cooling and rack space expenses; monthly hardware maintenance and software support fees; and ongoing hardware administration costs all add up quickly.
Expensive business continuity and disaster recovery
Ensuring business continuity (replicating live data to a secondary data center) is an expensive proposition beyond the reach of most enterprises. Many enterprises back up data to tape or disk. In the event of a catastrophe it can take days or even weeks to rebuild systems and restore operations.
Complex system administration
Running an on-premises data lake is a resource-intensive proposition that diverts valuable (and expensive) IT personnel from more strategic endeavors.
Cloud Data Lakes Eliminate Equipment Cost and Complexity
You can implement a data lake in a public cloud to avoid equipment expenses and hassles and accelerate big data initiatives. The general advantages of a cloud-based data lake include:
Faster time-to-value
You can slash rollout times from months to weeks by eliminating infrastructure design efforts and hardware procurement, installation and turn-up tasks.
No upfront capital expenses
You can avoid upfront capital outlays, better align expenses with business requirements and free up capital budget for other programs.
No equipment operating expenses
You can eliminate ongoing equipment operations expenses (power, cooling, real estate), annual hardware maintenance fees and recurring system administration costs.
Instant and infinite scalability
You can add compute and storage capacity on-demand to meet rapidly evolving business requirements and improve customer satisfaction (respond quickly to line-of-business requirements).
Independent scaling of compute and storage
Unlike with an on-premises Hadoop implementation that relies on servers with internal storage, with a cloud implementation you can scale compute and storage capacity independently to optimize costs and make maximum use of resources.
Improved business continuity and disaster recovery
You can replicate data across regions to improve resiliency and ensure continuous availability in the event of a catastrophe.
Simplified system administration
You can free up IT staff to focus on strategic tasks to support the business (the cloud provider manages the physical infrastructure).
First-Gen Cloud Storage Services are too Costly and Complex for Data Lakes
Compared to an on-premises data lake, a cloud-based data lake is far easier and less expensive to deploy, scale and operate. That said, first-generation cloud object storage services like AWS S3, Microsoft Azure Blob Storage and Google Cloud Platform Storage are inherently costly (in many cases just as expensive as on-premises storage solutions) and complicated. Many enterprises are seeking simpler, more affordable storage services for data lake initiatives. Limitations of first-generation cloud object storage services include:
Expensive and confusing service tiers
Legacy cloud vendors sell several different types (tiers) of storage services. Each tier is intended for a distinct purpose, e.g., primary storage for active data, active archival storage for disaster recovery, or inactive archival storage for long-term data retention. Each has unique performance and resiliency characteristics, SLAs and pricing schedules. Complicated fee structures with multiple pricing variables make it difficult to make educated choices, forecast costs and manage budgets.
Vendor lock-in
Each service provider supports a unique API. Switching services is an expensive and time-consuming proposition—you have to rewrite or swap out your existing storage management tools and apps. Worse still, legacy vendors charge excessive data transfer (egress) fees to move data out of their clouds, making it expensive to switch providers or leverage a mix of providers.
Beware of Tiered Storage Services
First-generation cloud storage providers offer confusing tiered storage services. Each storage tier is intended for a specific type of data, and has distinct performance characteristics, SLAs and pricing plans (with complex fee structures).
While each vendor’s portfolio is slightly different, these tiered services are generally optimized for three distinct classes of data.
Active data
Live data that is readily accessible by the operating system, an application or users. Active data is frequently accessed and has stringent read/write performance requirements.
Active archive data
Occasionally accessed data that is available instantly online (not restored and rehydrated from an offline or remote source). Examples include backup data for rapid disaster recovery or large video files that might be accessed from time-to-time on short notice.
Inactive archive data
Infrequently accessed data. Examples include data maintained long-term for regulatory compliance. Historically, inactive data is archived to tape and stored offsite.
Identifying the best storage class (and best value) for a particular application can be a real challenge with a legacy cloud storage provider. Microsoft Azure, for example, offers four distinct object storage options: General Purpose v1, General Purpose v2, Blob Storage and Premium Blob Storage. Each option has unique pricing and performance characteristics. And some (but not all) of the options support three distinct storage tiers, with distinct SLAs and fees: hot storage (for frequently accessed data), cool storage (for infrequently accessed data) and archive storage (for rarely accessed data). With so many choices and pricing variables, it is nearly impossible to make a well-informed decision and to accurately budget expenses.
At Wasabi, we believe cloud storage should be simple. Unlike legacy cloud storage services with confusing storage tiers and convoluted pricing schemes, we provide a single product—with predictable, affordable and straightforward pricing—that satisfies any cloud storage requirement. You can use Wasabi for any data storage class: active data, active archive and inactive archive.
Wasabi Hot Cloud Storage for Data Lakes
Wasabi hot cloud storage is extremely economical, fast and reliable cloud object storage for any purpose. Unlike first-generation cloud storage services with confusing storage tiers and complex pricing schemes, Wasabi is easy to understand and extremely cost-effective to scale. Wasabi is ideal for storing massive volumes of raw data.
Wasabi’s key advantages for data lakes include:
Disruptively low, predictable pricing
Wasabi hot cloud storage costs a flat $.0059/GB/month. Compare that to $.023/GB/month for Amazon S3 Standard, $.026/GB/month for Google Multi-Regional and $.046/GB/month for Azure RA-GRS Hot.
No egress or API request fees
Unlike AWS, Microsoft Azure and Google Cloud Platform, we don’t impose extra fees to retrieve data from storage (egress fees). And we don’t charge extra fees for PUT, GET, DELETE or other API calls.
Fast performance
Wasabi’s parallelized system architecture delivers faster read/write performance than first-generation cloud storage services, with significantly faster time-to-first-byte speeds.
Robust data durability and protection
Wasabi hot cloud storage is engineered to deliver extreme data durability, integrity and security. An optional data immutability capability prevents accidental deletions and administrative mishaps; protects against malware, bugs and viruses; and improves regulatory compliance.
Read our Strong Security tech brief
Read our Data Immutability tech brief
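To put the flat per-GB price in perspective, the sketch below computes the monthly bill for storing 100 TB at each published rate quoted above. The figures cover storage only; request and egress fees, which the legacy services charge and Wasabi does not, are excluded, and 100 TB is taken as 102,400 GB.

```python
# Monthly cost of storing 100 TB (102,400 GB) at each provider's
# published per-GB rate, as quoted in the comparison above.
RATES_PER_GB = {
    "Wasabi hot cloud storage": 0.0059,
    "Amazon S3 Standard": 0.023,
    "Google Multi-Regional": 0.026,
    "Azure RA-GRS Hot": 0.046,
}

def monthly_cost(gigabytes: float, rate_per_gb: float) -> float:
    """Storage cost for one month, excluding request and egress fees."""
    return gigabytes * rate_per_gb

capacity_gb = 100 * 1024  # 100 TB expressed in GB
for provider, rate in RATES_PER_GB.items():
    print(f"{provider}: ${monthly_cost(capacity_gb, rate):,.2f}/month")
```

At these rates, 100 TB costs roughly $604 per month on Wasabi versus roughly $2,355 on Amazon S3 Standard, before any request or egress charges.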
Wasabi Hot Cloud Storage for Apache Hadoop Data Lakes
If you run your data lake on Apache Hadoop, you can use Wasabi hot cloud storage as an affordable alternative to HDFS, as shown in the diagram below. Wasabi hot cloud storage is fully compatible with the AWS S3 API. You can use the Hadoop Amazon S3A connector, part of the open-source Apache Hadoop distribution, to integrate S3 and S3-compatible storage clouds like Wasabi into various MapReduce flows.
Hadoop Data Lake with Wasabi Hot Cloud Storage Example
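As a sketch, pointing the S3A connector at Wasabi is typically a matter of Hadoop configuration, along the lines of the core-site.xml fragment below. The property names are standard S3A settings, the credentials are placeholders, and the endpoint shown is Wasabi's published service endpoint at the time of writing.

```xml
<!-- core-site.xml: route s3a:// paths to Wasabi's S3-compatible endpoint -->
<configuration>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>s3.wasabisys.com</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_WASABI_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_WASABI_SECRET_KEY</value>
  </property>
</configuration>
```

With this configuration in place, Hadoop jobs can read and write s3a:// bucket paths in place of hdfs:// paths.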
You can use Wasabi hot cloud storage as part of a multi-cloud data lake implementation to improve choice and avoid vendor lock-in. A multi-cloud approach lets you scale data lake compute and storage resources independently, using best-of-breed providers.
Wasabi offers direct, high-speed connectivity to a variety of cloud compute services through partnerships with leading colocation, carrier hotel and exchange providers like Equinix, Flexential and Limelight Networks. These private network connections avoid internet latency and bottlenecks, providing fast and predictable performance. You can also connect your private cloud directly to Wasabi. Unlike with first-generation cloud storage providers, with Wasabi Direct Connect you never pay data transfer (egress) fees. In other words, you can freely move data out of Wasabi.
Economical Business Continuity and Disaster Recovery
Wasabi is hosted in multiple, geographically distributed data centers for resiliency and high availability. You can replicate data across Wasabi regions for business continuity, disaster recovery and data protection, as shown below.
For example, you could replicate data across three different Wasabi data centers (regions) using:
- Wasabi Data Center 1 for active data storage (primary storage).
- Wasabi Data Center 2 as an active archive for backup and recovery (hot standby in the event Data Center 1 is unreachable).
- Wasabi Data Center 3 as an immutable data store (to protect data against administrative mishaps, accidental deletions and ransomware). An immutable data object cannot be deleted or modified by anyone, including Wasabi.
What if you could store ALL of your data in the cloud affordably?
NOW YOU CAN. Wasabi is here to guide you through your migration to the enterprise cloud and to work with you to determine which cloud storage strategy is right for your organization.