A comparative study on data mining and other data processing methods.
A few months ago I developed a diagram that presented data mining in a comparative way alongside several other methods and concepts, such as “data analysis”, “pattern recognition”, “machine learning”, “artificial intelligence” and “deep packet inspection”. The diagram was useful for clarifying these concepts, their interdependence and, especially, their role in data processing.
But while interpreting the new data mining copyright exceptions, I found that the public didn’t know exactly what “automated analytical technique” and “data generation” really meant, even though the correct interpretation of the new copyright provisions depends on these terms, and despite the fact that the same public is constantly exposed to a flood of information about the processing methods mentioned above.
I realized that the cause of the problem lay in incorrectly understanding “mining” as being limited to the technology of the same name, although the European legislator also included in the definition of data mining “any automated analytical technique”. This is the exact wording of Art. 2(2) of the DSM Directive, and the use of the term ANY clearly points to any other automated technique that fulfils the conditions of the European definition, namely to “analyse text and data in digital form in order to generate information which includes, but is not limited to patterns, trends and correlations”.
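To make the two stages of that definition concrete, here is a deliberately toy sketch (not any real TDM system) of an “automated analytical technique”: text in digital form is analysed, and new information, here a word-frequency pattern, is generated from it.

```python
from collections import Counter
import re

# Toy illustration of the two stages named in Art. 2(2) DSM Directive:
# (1) analyse text in digital form, (2) generate information
# (here, a frequency "pattern" that did not exist in the input as such).
def mine_word_pattern(text, top_n=3):
    tokens = re.findall(r"[a-z]+", text.lower())  # analysis stage: tokenise the input
    counts = Counter(tokens)                      # analysis stage: count occurrences
    return counts.most_common(top_n)              # generation stage: output a "pattern"

sample = "Data mining analyses data; data analysis generates information."
print(mine_word_pattern(sample))  # -> [('data', 3), ...]
```

However trivial, the sketch satisfies both conditions of the definition, which is exactly why the wording covers far more than one specific technology.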
Although this aspect has been noted in some of the existing legal analyses, those “other automated techniques” have never been concretely identified.
The peculiar thing, however, is something else: most of those OTHER technologies (methods) covered by the European provisions were already known to the public, and some of them are already regulated, such as “profiling” or the much-publicized “upload filter”. Others, such as machine learning or artificial intelligence, are buzzwords nearly everyone has heard of. What matters in the copyright context is that all these technologies have “analysis” and “data generation” as their common ground. These features are the subject of the newly developed diagram, precisely because they explain what kinds of technologies are covered by Art. 2(2). The diagram also explains why the new mining exceptions should be understood as genuinely open norms that can be adapted and applied to many methods and technologies.
1. More details on the new diagram
(Please note that the picture below is not the original diagram but only a glimpse of it. Since Medium doesn’t support embedded forms, please follow the link to see the whole diagram.)
The new diagram was developed around those two common grounds, namely “analysis” and “data generation”, a.k.a. “useful information extraction”. Technologies such as “reverse engineering” and “content filtering” solutions were added to the initial diagram, which dealt only with the data mining technique, machine learning, KDD, artificial intelligence, pattern recognition and data analysis.
2. Short presentation:
The concepts revolve around data mining, mostly because this technique has become especially relevant due to its recent regulation by the new Copyright Directive. Beside each concept, several definitions are displayed, together with a link to the online source. This lets you compare different definitions of the same concept. You can also compare the different concepts with one another, to see that some of them are interdependent or can be used simultaneously in different technology assemblies/systems. Many definitions differ because of each author’s perspective, but what is important to note is that every one of these technologies involves both “the analysis of data” and “the generation of data” (the methods supporting a specific analysis differ, as do those involved in generating/extracting data; the generated data can also differ).
The terms are different. How can that be explained? “Data mining”, “artificial intelligence”, “machine learning”, “deep packet inspection”, “reverse engineering”: we obviously cannot put an equals sign between these concepts, even if the terms are sometimes used interchangeably and even the industry treats some of them as umbrella terms. Most of them make use of different kinds of algorithms, and sometimes their analysis involves different kinds of data (as in the case of deep packet inspection, which analyses traffic data).
Moreover, as I mentioned earlier, the definitions differ. The term “analysis” cannot be found as such in every explanation of a given technology. Sometimes the “analysis” is implied, or the definition uses synonyms for it, such as “investigation”, “monitoring”, “inspection”, “study” or “examination”, but in all cases the methods involve some kind of analysis. Some data are viewed, investigated, verified or analysed in order to be classified, for example.
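That “examination leading to classification” pattern can be sketched in a few lines. This is a hypothetical, rule-based example (the keyword list is invented for illustration): the input is examined, and the output is newly generated data, here a label.

```python
# Hypothetical keyword list, invented purely for this illustration.
SUSPECT_WORDS = {"free", "winner", "prize"}

def classify_message(text):
    words = set(text.lower().split())         # examination/analysis stage
    hits = words & SUSPECT_WORDS              # inspect for suspect terms
    return "suspect" if hits else "ordinary"  # generated data: a classification

print(classify_message("You are a winner"))  # -> suspect
print(classify_message("See you tomorrow"))  # -> ordinary
```

Whether the step is called “inspection”, “monitoring” or “examination”, the structure is the same: data are analysed, and a new piece of data (the label) is generated.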
The same applies to the “knowledge extraction” mentioned above. First, some insight into this notion: although the term is very provocative and seems to encapsulate the whole of human knowledge, or at least a very essential and/or advanced part of it, this “knowledge” is actually data resulting (output data) from the analysis stage.
Of course, the output data are new, different and most probably superior or more advanced, but only compared with the analysed data. They are not necessarily state of the art, and in most cases they will need further processing/analysis, but from the point of view of the pursued task they are definitely more important for development. Sometimes this knowledge is described using other generic terms, such as “insights”, “ideas” or “principles”, or terms that denote specific kinds of data (“patterns”, “associations”), but all of these are only ways to describe the results of different kinds of processing/analysis.
3. What are the consequences for copyright law? How do these details affect the way the new copyright exceptions should be interpreted?
The diagram shows that analysis and data generation are not specific to the data mining technique alone but are common to multiple other methods linked to machine learning and artificial intelligence. Interpreting the texts of the new exceptions will not depend on data-mining tech specs, because data mining is an umbrella term encompassing all the kinds of techniques described by the concepts illustrated in the diagram.
Each of these concepts describes, in fact, activities exempted from the copyright holder’s consent. Similar to the “private copy”, a well-known exception that sets out the circumstances under which copies may be made freely, the data mining exception identifies the context in which all kinds of data can be used (analysed) without restriction. Finding the common ground of all these technologies is therefore essential to identifying the context of the new copyright exception.
More precisely, other types of technologies, besides those depicted in the diagram, could be identified that will also benefit from the data mining exceptions regulated by the new Copyright Directive. The type of data, and the way in which the data are processed, are not relevant. Nor are the types of data generated, because the directive’s text is very clear in not limiting the types of data resulting from the analysis (“information which includes but is not limited to patterns, trends and correlations”). The only thing one should focus on is correctly identifying the analysis and the data generation/extraction as concrete stages of a specific technology. Those are the criteria a technology must primarily meet in order to benefit from the new copyright exception.