Big Data In Clouds – Build Them And They Will Come….

So, the two watchwords of the moment (aside from the usual virtualisation tranche that gets bounced from pillar to post) are ‘Big’ and ‘Data’. In themselves, not particularly new or exciting, but together they form a whole new set of questions, problems, issues and answers for system designers and integrators to worry about.

What exactly is ‘Big Data’? Essentially, in the modern age, the footprint of data sets is becoming massive, and managing it has become an equally big challenge in current computing trends. And when I say massive, I mean it – gone are the days when multi-terabyte datasets counted as large. Big Data steps past the terabyte into the petabyte and exabyte range of storage requirements. When data is this big, the usual rules, tools and even timescales stretch into the unreasonable – raising the need for new rules, tools and technology practices that let data consumers access the data and make it usable within realistic timescales. After all, what’s the use of having data if it’s not accessible and consumable?

So, how can virtualisation and more exactly cloud platforms help to make this easier? First, we need to look at the major limitations with Big Data footprints.

  • Data Curation – cataloging and filing the data for access.
  • Data Provenance – how data is stored and how changes are tracked over time.
  • Data Access – simply getting access to the data in the first place.

Curation and provenance of big data stores are specialist subjects in the field of Big Data, and they are challenges of actual data management. Where virtualisation and cloud computing come in is the access portion of the problem. Even then, cloud models don’t solve the problem in the strict sense – rather, they offer an option to turn current access trends on their head. But how?

The current trend for data access is what I call the ‘you to me’ model, and it’s the typical model under which the majority of the internet currently functions. In this model, data is stored centrally and accessed down a broadband or LAN connection to a local PC and local storage, where it is consumed (either visually, as web pages, or for post-processing in other data formats). This model is what gave rise to ADSL lines for domestic broadband – the nature of the traffic is from you (webservers) to me (the consumer).

With Big Data, this model no longer works. If the data is too large to send to a local PC for consumption over broadband or LAN connections – how can it be accessed with current [limited] technology?
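To put a number on why the ‘you to me’ model breaks down, here is a back-of-the-envelope calculation (the link speed is an assumption for illustration, not a quoted figure):

```python
# Rough transfer time for a petabyte-scale dataset over a domestic link.
# The 100 Mb/s link speed is an assumed, fairly generous broadband figure.
PETABYTE_BITS = 1e15 * 8          # 1 PB expressed in bits
LINK_BPS = 100e6                  # assumed 100 Mb/s downstream connection

seconds = PETABYTE_BITS / LINK_BPS
years = seconds / (3600 * 24 * 365)
print(f"~{years:.1f} years to pull 1 PB down a 100 Mb/s link")
```

Even before accounting for contention or local storage, moving the data to the consumer takes years, not hours – which is exactly why the model has to be inverted.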

Cloud Infrastructure + Big Data + VDI = Big Data Access!

Hold on a minute – VDI – that’s virtual desktops right? Correct! (VDI = Virtual Desktop Infrastructure) It’s the VDI model that, when combined with cloud architecture in the current technology climate, solves the Big Data access problem.

Let’s think about that a little more. With a VDI model, data is held and processed centrally, with thin clients or remote ‘dumb’ terminals accessing compute and storage instances held in another central location. By streamlining the data that flows over the broadband or LAN connection, big data sets are no longer the issue; managing the connectivity to the data becomes the issue. VDI solves this because the model is no longer ‘you to me’ but instead ‘me to you’. The data doesn’t come to me – I go to the data. It’s the same as pulling up a chair at the console in the DC and working directly on the storage and compute architecture hosting the big data footprint.
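A quick back-of-the-envelope comparison shows why the ‘me to you’ model is so much cheaper on the wire (the display bitrate and slice size below are assumptions for illustration only):

```python
# Compare the wire cost of a remote display session against shipping
# even a small slice of a big dataset. All figures are assumed examples.
DISPLAY_STREAM_MBPS = 5           # assumed remote-display protocol bitrate
SESSION_HOURS = 8                 # one working day at the 'console'

session_gb = DISPLAY_STREAM_MBPS * SESSION_HOURS * 3600 / 8 / 1000
print(f"VDI session traffic: ~{session_gb:.0f} GB per working day")

slice_gb = 10_000                 # a 10 TB slice of a petabyte dataset
print(f"Shipping that slice costs ~{slice_gb / session_gb:.0f}x one session")
```

The display stream costs roughly the same whether the dataset behind it is a gigabyte or an exabyte – that is the inversion that makes the access problem tractable.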

Taking this a step further, cloud modelling and infrastructure lends itself more to this model than first meets the eye. Cloud provisioning allows us to provision elastic compute instances capable of bursting several metrics for big data consumption in a controllable and manageable manner. With cloud architecture, we can:

  • Burst CPU, RAM and compute storage when needed.
  • Dynamically scale infrastructure up and down (at the application, virtual machine or infrastructure level) to handle sine-wave workload requests.
  • Audit and control compute instances to police infrastructure consumption, through virtual limits or financial means.

So, let’s look at an example of how this might work. A multi-petabyte dataset sits in a DC in a green-field but well-connected location. Users of the service come from many worldwide locations, with connections ranging from multi-gigabit synchronous links to domestic internet services. Under the ‘you to me’ model, only the best-connected users could realistically work with the data. With a VDI front end on the cloud platform, every user works on the dataset in situ: the connection carries only the display stream, so the domestic-broadband user gets effectively the same access as the user on a multi-gigabit link.

Why Is This Important?

Converging access to big data footprints is important for several reasons.

  1. Technology marches on. As the volume of data increases and traditional architectures in connectivity to big data stores lags behind (for whatever reason, be it commercial or political) – some technology needs to bridge the gap until such a time that one technology stream catches up with the other.
  2. Capacity marches on. With every technological enhancement or evolution comes a leap in capacity. Keeping up with the capacity of tomorrow’s big data is the essential core of the big data problem.
  3. Access means accelerated research. We see it all the time: groups of researchers reveal their eureka moments to the world in the news and print media. Yet the volume of data now available to researchers can take years to access, download and digest into consumable chunks. With a central access methodology for big datasets, the access and download stages are compressed significantly. Net result – more eureka moments in the news, as research is accelerated by reaching more data through cloud deployments.

Big data problems are here to stay. The question for cloud service providers is not ‘how’ or ‘why’ should we look at the big data question – it is ‘how quickly’ can we provide big data solutions to the communities that require central access to giant data sources.

So – to service providers of cloud services: Big Data, Build Your Clouds and They Will Come.

Jeremy loves all things technology! He has been in IT for years, loves Macs (but doesn't preach to others about their virtues), loves virtualisation (and does shout about its virtues), and sometimes skis, bikes and directs amateur plays!
