So, here you are, faced with the fundamental question for a data engineer/data scientist: How do I provide the secure, available, scalable, flexible, accessible, reliable, and cost-effective data ingestion, storage, transformation, reporting, and analytic environment my organization needs to compete in today’s acceleration economy?
Wait . . . go back and read that again, slowly. Look at all those requirements! How the heck do you deliver on all those — sometimes conflicting — demands?
Way back when I was a data engineer, the answer was to license and implement (and patch and upgrade and support and train people to use) a plethora of specialized tools that collectively provided the needed features (except for "flexible" and "cost-effective," in most cases). And, of course, most of those tools ran on-premises, requiring lots of additional work and cost.
Today, your solution might be to implement a software-as-a-service (SaaS) data lakehouse that combines most, if not all, of the above features. Data lakehouse products can be licensed as a stand-alone toolset, as is the case with Acceleration Economy's Cloud Wars Top 10 vendor Snowflake, or as part of an analytics toolset, like the products offered by Data Modernization Top 10 Shortlist vendor Qlik. Whether you license the data lakehouse separately from the analytics product depends mostly on the scale and complexity of your needs.
OK, full disclosure: This class of product has a number of stock-keeping units (SKUs) that can be licensed separately, depending on your needs, so you'll spend some time working through your configuration. And many of the products have "application stores" that allow customers to license additional capabilities from affiliated vendors. Pay close attention to your needs versus your wants, or "cost-effective" can go out the window . . . but a SaaS data lakehouse suite is still far cleaner than any multi-vendor tool amalgam can be.
A Proper Data Lakehouse Implementation
Since a data lakehouse combines the features of a data lake with those of a data warehouse — with an analytics and reporting capability, perhaps — its use can dramatically speed decision-making by improving data access and analytics. A proper data lakehouse implementation depends on a set of thoughtful decisions (including, but not limited to):
- Data Governance. What data are you collecting? How long do you need to store it? Who should have access, and what kind of access should they have (this covers both role-based controls and "data classification," i.e., sorting data into categories like "internal use only")? Who can grant access to each data element, and what kind of audit trails are needed? How should data be described so people can find what they need (which gets into taxonomy and metadata)? A minimal sketch of how roles and classification labels work together appears after this list.
One important part of data governance is data lineage (or data provenance), which means demonstrating (to auditors and perhaps regulators) where data originated and how it was copied and transformed into its end products (reports, dashboards, etc.). Some data must be pristine: It's in scope for Sarbanes-Oxley Act (SOX) audits and for external financial reporting. But data quality comes at a cost — especially for "big data" — so not all data needs to be perfect (see "Data Engineering" below). A simple lineage-tracking sketch also appears after this list.
- Data Security/Availability. This is your next decision. Start with encryption (in today's world the answer is "Yes, encrypt" — don't overthink this). Then layer in data access controls to implement the governance decisions made above; if you do it right, you'll find this is where data security intersects with zero-trust principles. Finally, ask what level of redundancy is needed.
- Data Engineering. Here you'll face another set of decisions, closely tied to the governance and security choices above. Data engineering is largely about cost and the trade-offs needed to balance cost against every other objective. FYI, users always desire three things from an information technology (IT) system: that it be free, instant, and all-encompassing. What kind of performance is needed: real-time data ingestion for Internet of Things (IoT) applications, low-latency analytics for trading-floor applications and industrial controls, or archival storage for historical comparisons? How much will the desired redundancy cost in license fees, bandwidth, latency (for dual-commit transactions), and full-time equivalents (FTEs)? A back-of-the-envelope cost model follows this list.
- Tool Access. This used to be easy; IT got to access IT tools, and end-users consumed the output of the tools. Then the pendulum swung — too far, in my opinion — and shadow IT flourished as users got access to powerful tools and huge datasets without necessarily being subject to, or even being aware of, data governance and security controls. This tool/control mismatch created many problems for organizations as multiple sources of truth were created and maintained — with needless cost and inadequate security. As a CIO, I’ve spent years stamping out most shadow IT, but data lakehouse products finally allow IT to embed security and governance controls right in the lakehouse, thereby making it easier to enforce important organizational standards and harder for users to inadvertently cause problems. Data lakehouse tools can also bridge the gap between “citizen developers” (users with cool tools) and “pro developers” (IT specialists with cooler tools) and thus facilitate collaboration among groups that heretofore used different tools and had different controls. Effective tool deployment, governance, and training aren’t automatic. Data lakehouse tools should operate within the organization’s overarching data security and governance frameworks and be deployed following best practices that make it easier for all users to “do the right thing” with data and analytics (which means getting rid of spreadsheets almost everywhere).
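To make the governance questions above a bit more concrete, here is a minimal sketch in plain Python. The roles, classification labels, and dataset names are hypothetical and the logic is deliberately simplified; in a real lakehouse these rules would be expressed as grants and policies inside the platform rather than in application code.

```python
# Hypothetical, simplified governance model: every dataset carries a
# classification label, and every role is granted a ceiling -- the most
# sensitive label it is allowed to read.
from dataclasses import dataclass

# Ordered from least to most sensitive (illustrative labels only).
CLASSIFICATION_ORDER = ["public", "internal use only", "confidential", "restricted"]

# Hypothetical role grants.
ROLE_CEILING = {
    "analyst": "internal use only",
    "finance_auditor": "restricted",
    "marketing": "public",
}

@dataclass
class Dataset:
    name: str
    classification: str
    owner: str          # who can grant further access to this data element
    description: str    # metadata/taxonomy so people can find what they need

def can_read(role: str, dataset: Dataset) -> bool:
    """True if the role's ceiling is at or above the dataset's classification."""
    ceiling = ROLE_CEILING.get(role, "public")
    return (CLASSIFICATION_ORDER.index(dataset.classification)
            <= CLASSIFICATION_ORDER.index(ceiling))

orders = Dataset("orders_2024", "internal use only", "sales_ops", "Raw order events")
print(can_read("analyst", orders))    # True
print(can_read("marketing", orders))  # False -- and denials belong in the audit trail
```

The point isn't the code; it's that classification levels, role ceilings, owners, and audit expectations are decisions someone has to write down before the first byte lands in the lakehouse.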
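Data lineage can be sketched just as simply. The record structure below is hypothetical (a real deployment would persist this in a data catalog or use the lakehouse's built-in lineage features), but it captures what an auditor ultimately asks: where did this number come from, and what happened to it along the way?

```python
# Hypothetical lineage log: each derived dataset records its inputs, the
# transformation that produced it, and a timestamp for auditors.
from datetime import datetime, timezone

lineage_log: list[dict] = []

def record_lineage(output: str, inputs: list[str], transformation: str) -> None:
    """Append one lineage entry; a real system would persist this in a catalog."""
    lineage_log.append({
        "output": output,
        "inputs": inputs,
        "transformation": transformation,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    })

def upstream_of(output: str) -> list[str]:
    """List every source that directly or indirectly feeds an output."""
    sources: list[str] = []
    for entry in lineage_log:
        if entry["output"] == output:
            for inp in entry["inputs"]:
                sources.append(inp)
                sources.extend(upstream_of(inp))
    return sources

record_lineage("daily_revenue", ["orders_raw", "refunds_raw"], "sum by day, net of refunds")
record_lineage("board_dashboard", ["daily_revenue"], "rolling 90-day revenue view")
print(upstream_of("board_dashboard"))  # ['daily_revenue', 'orders_raw', 'refunds_raw']
```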
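Finally, the Data Engineering trade-offs become much easier to discuss when you put even rough numbers on them. The storage tiers and per-terabyte rates below are placeholders, not any vendor's actual pricing; the takeaway is simply that "keep everything hot forever" is a decision with a monthly bill attached.

```python
# Back-of-the-envelope storage cost model. The rates are PLACEHOLDERS (USD per
# TB per month); real pricing varies by vendor, region, and contract.
TIER_RATE = {"hot": 23.0, "warm": 10.0, "cold": 1.0}

def monthly_cost(tb_by_tier: dict[str, float]) -> float:
    """Sum the monthly storage bill across tiers."""
    return sum(TIER_RATE[tier] * tb for tier, tb in tb_by_tier.items())

everything_hot = {"hot": 500.0}                        # keep all 500 TB instantly queryable
tiered = {"hot": 50.0, "warm": 150.0, "cold": 300.0}   # age data out as it cools

print(f"All hot: ${monthly_cost(everything_hot):,.0f}/month")
print(f"Tiered:  ${monthly_cost(tiered):,.0f}/month")
```

Swap in your own rates, then add bandwidth, latency, and FTE costs, and the "free, instant, and all-encompassing" conversation gets a lot more grounded.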
Conclusion
Data lakehouse technology combines powerful tools with access to treasure troves of data. Proper implementation and use of a data lakehouse and its associated analytics tools empowers everyone from top executives to customer-facing employees to make decisions faster and more accurately than ever before. Making smart decisions when designing and implementing the data lakehouse is critical to maximizing return on the organization’s big investment in technology, and its even bigger investment in generating and acquiring data.