Stop Thinking About Data Lakes and Start Thinking About Data Gardens
The prevailing analogy for a collection of data from multiple sources and in multiple formats is that of a lake. While this is more useful than saying ‘store your data on a bunch of virtual machines’, I have come to prefer a different reference point, that of the garden. I mean a conceptual data garden, rather than a literal one.
That’s fine, you might say, but so what?
Because I think the choice of analogy matters. In the following paragraphs I will outline why the garden is a better conceptual starting point for planning the technical strategy and implementation of storing and managing collections of data.
The Data Garden
Who is this place designed for?
To find something hidden beneath the waters of a real lake, specialist equipment and training are required: perhaps a boat, or scuba gear, maybe powerful sonar. In fact, the whole environment is hostile to humans.
In contrast, a garden is by definition designed with human experience in mind. While experts are surely involved in creating a garden, and there might be both private zones and areas open to the public, once a person is invited in they can find what they seek in relative comfort.
Strong foundations matter
Any farmer or gardener will tell you that things like soil quality and access to sunlight matter more than the seeds used. Further, every garden must be designed to complement or benefit from the environment it is in, and while it is possible to start a garden in controlled conditions, more often a garden must be built with an understanding of the local environment.
These conceptual constraints make sense in an enterprise data environment as well, where data assets must be built around existing technology, people and process. Flexibility is critical, because…
Seasonality and timing are important, nothing is forever
If you think of data stores and assets as plants, you will not be surprised to realise they require maintenance and care. Some datasets are best left to grow; others require significant levels of monitoring and curation.
“A garden requires patient labor and attention. Plants do not grow merely to satisfy ambitions or to fulfil good intentions. They thrive because someone expended effort on them.” Liberty Hyde Bailey
All gardens are unique
There are around 25 species of mint and around 150 species of rose. The orchid and sunflower families each contain more than 20,000 species. So it is with enterprise data: yes, there are similarities, but every organisation has its own people data, financial data and operational data, and is interested in different combinations of technical and external data.
Gardening relies on principles which are to be applied with judgement by people…
We need gardeners
A gardener supports, understands and maintains their plot. They do not throw a bunch of seeds in the ground, organise a few regular deliveries, and wait. They turn up every day to prune, cut back weeds, plant things, or show someone around. Rarely, they may sit and enjoy the results of their work.
A gardener is always busy, and so too must be the people tasked with maintaining your data garden.
This analogy is not perfect. It is hard to include aspects of data access and privacy, or automation and efficiencies of scale, elegantly, although the data lake concept does not handle these easily either. Introducing this terminology may also present a barrier to education: the data lake is commonly referred to and written about, so it is an OK starting point for an organisation that has not yet taken any steps to consolidate and understand its data universe.
A final note: much has been written on the negatives of file-based data lakes (including a good piece by Jeremiah Hansen), but I believe there is a place for almost any technical architecture as long as it is managed in the right way, i.e. ‘like a garden’.