Archived website

This online community was active in conjunction with the Digital Agenda Assembly 2012 and is now archived and available for institutional memory. You can now join the discussion at https://ec.europa.eu/digital-agenda/en/community

Data and the Three/Four/Five Vs

Submitted by paul.miller on Sun, 2012-04-22 19:42

Data discussions these days, especially in the arena of 'Big Data,' tend to get obsessed with the three Vs. These are Volume, Velocity, and Variety. In other words, how much data is there, how fast does it come in, and how much variation is there in data type and form.

Sometimes we add a fourth V, 'Value' (how much benefit does analysis deliver). Given enthusiasm elsewhere in this forum (and more generally) for Open Data, we clearly need to find a European language in which 'Open' or 'Free' or 'Unencumbered by nasty licences' translates into a word that begins with a 'V'!

Of the original three Vs, most attention is typically given to the first. "I have 10 Petabytes of data" is somehow more interesting than "I must deal with 10 readings every nanosecond" or "I have ten different types of data all coming at me for analysis."

Although the undue emphasis on volume is understandable (the *BIG* in "Big Data" is an assessment of size, after all), it's also unfortunate.

What can we do, in the DAA discussions and their outputs, to ensure that Velocity and Variety receive the consideration they are due?


Comments

Submitted by miguel.gonzalez... on Sun, 2012-04-22 22:19

Fascinating!
Data parameters: volume, variety, velocity, value... (I like this series of 'v's, easy to remember) and other parameters like format, cost (different from value), licensing conditions (what you're authorised to do with the data), (potential) links to other sets of data. Other parameters?
On velocity: what does that depend on? Format, structure, another technical or semantic variable? What can be done to improve velocity?
On value: this is in principle a subjective notion, thus depending on market valuation (I use market here in the broadest sense). So is there something that can possibly be done, from a policy perspective, about data value, apart from encouraging data openness as much as possible?

Submitted by paul.miller on Mon, 2012-04-23 10:31

Velocity is most often considered with respect to streaming data from sensors, and the flood of data from social networks such as Twitter. Data can be any format and any structure (although each individual data point tends to be reasonably simple in structure). The challenge arises because of the rate at which data is arriving, and often because of the rate at which it has to be acted upon. Systems which are perfectly capable of dealing with 100 data points, or 1,000, or 1,000,000 can begin to struggle when *all* of those data points arrive within a second or two... or when all of those data points need to be processed, analysed, and used in developing a response to a situation in the environment around the sensor. A collision detection system in a car which boasts of recording 1,000 variables related to the road, cars on the road, speed, direction etc is technically impressive but practically useless if it requires 5 seconds to decide that the car 1 second ahead of you has applied its brakes and stopped.
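
To put rough numbers on that, here is a minimal sketch in Python. The function names and all of the figures are hypothetical, chosen only to mirror the examples above; this is not a model of any real system. It assumes a steady arrival rate and a fixed processing capacity, which real streams rarely offer.

```python
def backlog_after(arrivals_per_sec: int, capacity_per_sec: int, duration_sec: int) -> int:
    """Data points still waiting after duration_sec of sustained arrivals.

    Assumes a constant arrival rate and a constant processing capacity.
    """
    surplus_per_sec = arrivals_per_sec - capacity_per_sec
    return max(0, surplus_per_sec * duration_sec)


def decision_arrives_in_time(processing_delay_sec: float, time_to_hazard_sec: float) -> bool:
    """A response is only useful if it is ready before the event it reacts to."""
    return processing_delay_sec < time_to_hazard_sec


# Hypothetical: 1,000,000 points/second arriving, 100,000 points/second processed.
print(backlog_after(1_000_000, 100_000, 2))

# The collision example: 5 seconds to decide, 1 second until impact.
print(decision_arrives_in_time(5.0, 1.0))
```

The first call shows how quickly a queue builds when arrivals outpace processing; the second is the car example reduced to a single comparison.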

European companies like Datasift (which is one of only two internationally that are licensed to stream and process the full 'firehose' of data from Twitter) are amongst those working to understand the range of Velocity-related issues, and the challenges that they present to traditional data processing environments.

On Value, I agree that it is principally a subjective attribute of the data. It's important to stress from the beginning that this isn't necessarily about monetary value, or about what can be charged for data (although that may be part of the consideration here). The biggest policy benefit here, I think, would probably be to demonstrate the wide range of possible downstream 'values' that may arise from data - the lives that may be touched, the processes that may be improved, the new businesses that may rise from the "data exhaust" (a topic that's probably worth its own post...), the money that may be saved. A data set's value is not simple to capture. It's not a single transactional relationship between dataset provider and dataset consumer. Instead, it's more like a stone thrown into water... the ripples spread out from the dataset, and may have an immediate and obvious impact upon some, and a delayed and far less clearly related (but potentially more significant) impact upon others at third, fourth, fifth or even sixth hand...

Policy-driven initiatives associated with something like the DAA may well be exactly the sort of environment within which to undertake work that clearly demonstrates how far those ripples spread, and how valuable they are to us all. Without that, we keep falling back on the superficial but easily observed 'benefits' that occur close to the data set. Like an iceberg (ok, too many analogies in this comment already!), most of the 'value' probably lurks out of sight...

Submitted by mgarrigap on Sun, 2012-04-29 19:36

Thanks (once again!) Paul.

This comment has a lot of value; it's very interesting.

Relating to new Digital Agenda policies, I suggest you start a new discussion with a more specific list of themes.

For instance, going by your comment, I think the new Digital Agenda should have a specific section to raise awareness of the importance of the Big Data area.

But, what else?

This is the main goal of this forum: to "catch" the experts' ideas in order to improve our Digital Agenda.

So, please, would you like to draw up a (first) list of topics about BigData to add to the new Digital Agenda?

Thanks!

Submitted by mgarrigap on Sun, 2012-04-22 23:33

There is another 'V': Vanity.

But, in this case, not in a good sense.

A month ago I was in Budapest at a LAPSI conference, where I presented an initiative to have a single OpenData license for the whole EU (I will write about it in another post).

One man at the conference told me that European Union governments have a lot of OpenData licenses because of vanity: they all want to have their own specific license.

In fact, most of these licenses are very similar, but with specific differences (starting with their names) that create enough of a gap to make it impossible to reuse open data from different OpenData portals in the same service.

Thanks.

Submitted by paul.miller on Mon, 2012-04-23 10:51

Good point - vanity may well be one of the principal drivers for unnecessary licence inflation, and we need to dissuade people from pursuing that path.

I'm less sure that a "single European license" is a good idea, as I discussed here - http://cloudofdata.com/2012/02/open-is-good-but-encouragement-better-tha...

Some of the reasons for licence inflation are less insidious than vanity, but just as damaging to data reuse.

The worst example we saw in the UK's NOF projects (£50M/ €60M worth of museum/library/archive content digitisation, back around 2000) was a misplaced desire to 'clearly' express terms and conditions. Over 100 project web sites, all *wanting* to encourage use and reuse of their content, all asking lawyers for advice, and all ending up with words that basically said "use this stuff for educational purposes." The problem was that they all had their own slightly different ideas about the best words to use in saying that. Driven by good intentions, all of those project websites ended up creating a mess of words that made it almost impossible for a third party to come in and remix content from two or more of the sites. That was no one's intention, but it was definitely the result.

We have a growing collection of reasonably widely used licences on the global stage. We have the Creative Commons licenses for copyrightable creative works, and we have both the Open Data Commons(*) and Creative Commons CC0 licenses for data/facts/databases.

Through DAA, we need to do what we can to dissuade people from unnecessarily creating new or (worse, actually) slightly modified licenses. We might offer *model* licenses for them to adopt. But we need to recognise the areas in which different requirements are valid and justified, and ensure that these are respected.

Personally, I would be worried about any attempt to mandate a Europe-wide license at this stage...

(*) - Disclaimer... I was closely involved with the Open Data Commons license in its earlier form as the Talis Community Licence, and whilst in a previous role funded the initial work that transformed the TCL into the ODC...

Submitted by mgarrigap on Sun, 2012-04-29 19:12

Thanks Paul.

Perhaps you're right; perhaps it is premature to create a single open data license for the whole European Union.

But what do you lose by trying? I think the current situation of the European Open Data sector is bad, very bad, because it's an "artificial" situation.

So, we need to do "something" in order to save our Open Data's future.

I think having a single Open Data license is a good action; perhaps it isn't an ideal action, but I think it is a good one.

----
I have started a new discussion on this topic: http://daa.ec.europa.eu/content/single-opendata-license-all-eu-great-act...

Submitted by paul.miller on Sun, 2012-04-29 19:37

Weeeellll.... I worry that by doing "something" that is too prescriptive, too wide-ranging, too early, and too untried, we actually end up setting the cause of open data *back* by years or even decades.

I worry that we get too bogged down in haggling over the minutiae of a single all-encompassing licence, rather than just getting on and dealing with things on local, national, sectoral and other vectors that may be more achievable - and more useful in the short term.

Going the other way, I worry that we implement something too fast, without fully understanding the implications, and damage prospects for either justifiable revenue generation by European businesses or truly effective use and reuse of public data by Europeans.

Neither would be good.

And I'm not convinced that the benefits of a "single European licence" outweigh those risks.

But I look forward to being persuaded otherwise!

And now I, too, will move the conversation to your new thread.

Submitted by Oscar Wijsman on Tue, 2012-04-24 00:49

I guess it depends on the kind of data which V is more important. We have been using the 3 V's since Doug Laney introduced them in 2001. It took a lot of years before we were able to get substantial value out of the ever growing amount of data we just ignored, now commonly referred to as big data. So the 4th V of Value is now getting more commonly used (please update Wikipedia...). Next comes the 5th V of Virtue, often discarded. What good can it bring to us? Often Value is used in a commercial way but there is a lot more to gain. Perhaps the 6th V would be something like Viewable: if we cannot see it, we cannot use it.

Submitted by Oscar Wijsman on Wed, 2012-05-09 00:53

Doug, nice to see you are getting involved in the EC discussion too. And yes, you are right, many people forget where the V's came from so good to have the link to the original article.

I have one question for you. In my opinion data quantity now beats quality, even if you are using algorithms that are not smart at all. Still many researchers stick to their belief that they must filter first and then start the results on small datasets they can work on in a traditional way using ("big") spreadsheets (having thrown away most of the data they consider waste). What's your opinion?

Submitted by mgarrigap on Wed, 2012-05-02 13:15

Thanks Doug for sharing here your (old) post.

In your opinion, what must the Digital Agenda contain in order for the benefits of BigData to reach European people?
