How do I get to data mining

Text and data mining - what is it and how do I get the data?

Text and data mining has already become an everyday scientific method for researchers in a wide variety of disciplines. Algorithm-based analysis methods are used to search for patterns in structured or unstructured data in order, for example, to develop new scientific hypotheses or to test existing ones through data analysis. The data can consist of, for example, texts, images, or measurement data. Text and data mining has become indispensable in the natural sciences, for example in genome sequencing, and it is also used as a method in the humanities.

Text and data mining requires freely available, machine-readable data.

If the data used for text and data mining do not come from your own texts or from experiments you have carried out yourself, the question arises as to where scientists can get the data they want and under what conditions those data may be used.

Unprocessed raw data (e.g. measurement data) are, from a technical point of view, generally structured data, and they are free of copyright. Other "data" such as texts, images, and videos are initially unstructured and have to be technically processed before they are accessible to machine analysis. They can be protected by copyright as a work or by an ancillary copyright, provided that the requirements are met and the protection periods have not already expired. If the authoring scientists have made a clear statement about the legal situation and the options for using the copyrighted objects for text and data mining, other scientists are spared a potentially time-consuming search for the rights holder and the equally time-consuming process of obtaining rights to use the data.
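As a rough illustration of what "technically processed" can mean for text, the following minimal Python sketch turns an unstructured sentence into a structured table of word counts. The function name `term_frequencies` and the sample sentence are assumptions made for this example only; real text mining pipelines usually involve considerably more preprocessing.

```python
import re
from collections import Counter


def term_frequencies(text: str) -> Counter:
    """Turn unstructured text into a structured mapping of word -> count."""
    # Lowercase the text and keep only runs of the letters a-z as tokens.
    # This is a deliberately simple tokenizer, assumed for illustration.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


# Hypothetical sample text; any string works here.
sample = "Data mining finds patterns in data. Text mining finds patterns in text."
counts = term_frequencies(sample)

# Print each word with its frequency, most frequent first.
for word, n in counts.most_common():
    print(word, n)
```

The resulting counts no longer contain the word order of the original sentence, which is one simple way in which extracted data can differ from the text it was derived from.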

Journal articles and other text publications are also suitable material for text and data mining. In many cases, these are issued by publishers. Some publishers are beginning to include text and data mining clauses in their license agreements (Springer) or to allow text and data mining on the publishers' platforms (CrossRef and Elsevier).

Within the framework of its priority initiative "Digital Information", the Alliance of German Science Organizations is trying to negotiate the desired text and data mining rights with publishers on behalf of scientists as part of the national licensing of publishers' products. For this purpose, the following clause has been included in the sample license for the Alliance licenses:

"The Licensed Material may be used for text and data mining to enhance services, to encourage scholarship, teaching and learning and to conduct research by the Licensee and Authorized Users according to the following principles, as long as the purpose is not to create a product for use by third parties that would substitute the Licensed Material: Raw data may be extracted from the Licensed Material. Text and data mining may be performed on the unchanged Licensed Material or on extracted data (including but not limited to reproducing, storing, adapting, assembling large collections or extracting substantial portions of data and analyzing them). The raw data is research data and may be stored, published and distributed in any medium or form under any license in order to ensure reproducibility and sustainability, as long as the Licensed Material cannot be reconstructed in its original, human readable form. The Licensor will cooperate with Licensee and Authorized Users as reasonably necessary in making the Licensed Material available in a manner and form most useful to the Licensee and Authorized Users. Attribution must be made to the Licensor in an appropriate manner and form."

The publishers' models differ in the following points:

  • Physical availability of the full texts for the scientist
  • Non-commercial purposes of the scientist
  • Rights to publish the data used in the form of structured raw data, provided the original texts cannot be reconstructed
  • The purpose must not be to create a product that can be used by third parties and that replaces the publications offered by the publisher
  • Attribution

Where text publications are protected by copyright, scientists in Germany who want to use them for text and data mining currently depend on the license chosen by the authoring scientists or on the models offered by the publisher. Here, too, the discussion and development are not over yet.