European distributed digitisation infrastructure for natural heritage

Biological collections are irreplaceable keepers of core biodiversity information and natural heritage. Yet only 10–20% of their vast holdings in Europe have been databased, and far fewer have been imaged. As a consequence, most of their content is not accessible online.

This project will set Europe firmly on the path to digitising all one billion specimens held in European museums. Modern mass-digitisation technology makes it possible for one person to image 500 samples in a day, which is 100,000 samples per year. A small digitisation centre of 10 people can thus reach one million samples in a year. In this project we will establish 10 such centres around Europe and run them for 5 years, digitising 50 million specimens in total. (WP1)

The budget will be €15 million. We aim at €0.30 per specimen, which is about one tenth of today's cost. We will develop innovative new imaging methods for the kinds of samples that cannot be imaged quickly with current methods, further accelerating the rate of digitisation. (WP2)
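As a sanity check, the throughput and budget figures above are mutually consistent. The sketch below assumes 200 working days per person per year, which is not stated in the proposal but is the figure implied by 500 samples/day yielding 100,000 samples/year:

```python
# Back-of-envelope check of the digitisation throughput and cost figures.
samples_per_person_day = 500
working_days_per_year = 200          # assumed; implied by 500/day -> 100,000/year
people_per_centre = 10
centres = 10
years = 5
budget_eur = 15_000_000

per_person_year = samples_per_person_day * working_days_per_year   # 100,000
per_centre_year = per_person_year * people_per_centre              # 1,000,000
total_specimens = per_centre_year * centres * years                # 50,000,000
cost_per_specimen = budget_eur / total_specimens                   # 0.30 euro

print(per_person_year, per_centre_year, total_specimens, cost_per_specimen)
```

All four quoted figures (100,000 per person-year, one million per centre-year, 50 million specimens, €0.30 each) follow directly from these inputs.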

To transcribe data from those images, we will mobilise a workforce of 1,000 people around the world. Transcription will be distributed to the countries with the expertise required to read the labels correctly and where the cost is right. International collaboration will emerge because, for instance, scientists in an African country can transcribe their own specimens. Other transcription will be prioritised in a demand-driven manner, when the data are actually needed. It will be supported by efficient, adaptive workflows which guarantee that no record needs to be georeferenced twice and that scientific names are interpreted correctly. (WP3)

We will establish a data centre that can handle this volume of images, as about 3 TB will be generated each day. It will use the European HPC infrastructure, where the data will form one virtual pool. These functions will be packaged in a virtual research environment (VRE) for biological collections, ensuring seamless online collaboration that accelerates biodiversity science. (WP4)
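To give a sense of scale, the 3 TB/day rate implies a multi-petabyte pool by the end of the project. The day count per year below is an assumption for illustration, not a figure from the proposal:

```python
# Rough estimate of the total image volume implied by 3 TB/day.
tb_per_day = 3
imaging_days_per_year = 250   # assumed operating days; not stated in the proposal
years = 5

total_tb = tb_per_day * imaging_days_per_year * years   # 3,750 TB
total_pb = total_tb / 1000                              # 3.75 PB

print(total_tb, total_pb)
```

Even under conservative assumptions, the pool approaches 4 PB, which motivates the use of shared HPC storage rather than institutional servers.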

We will mobilise the scientific community to curate and annotate the data, improving its quality and keeping it up to date. We will also digitise “dark species” that have not yet been described, anticipating that the availability of images and data will accelerate the discovery of new species. Such nameless specimens will be DNA-barcoded. (WP5)

An outreach programme will communicate the uses and benefits of this big data pool and build support for the operation. We will cooperate closely with other major digitisation programmes such as NSF's iDigBio. At the end of the project we will transfer this infrastructure to willing nearby host institutions, who will operate it to digitise further collections. During the project we will explore business models for the operation, with the aim that parts of it will continue as independent businesses. (WP6)

Proposed by: 
Hannu Saarenmaa

Comments

Submitted by aaike on

Are you planning to frame these efforts around specific questions, and/or use specific criteria to select the material to be digitised? In other words, will this be a demand-driven exercise, and whom do you regard as the main stakeholders?

Submitted by saarenmaa on

Good question. Because of physical limitations, mass-imaging must proceed from A to Z. That is, an entire collection must be imaged in one go, without prioritisation within the collection.

The transcription, however, can and probably should be selective and demand-driven. This can be achieved if, during the imaging phase, a few important fields such as taxonomic group and major geographical area are entered. This can be done semi-automatically, and it facilitates discovery of the right samples for demand-driven transcription later.

The stakeholders define what will be prioritised. For instance, when the taxonomic treatment of a group begins, that research group may want to transcribe data from the relevant images, and so on.

Submitted by Jiri Frank on

How will those data be stored and accessed? Do you plan to build something like, for example, EOL with CC licences?

New, effective digitisation methodologies will also need to be developed for different collection types: herbaria, fossils, large animals, microscopic material, alcohol-preserved specimens, etc. (Of course, Digitarium has experience in this field.)

Negotiation with collection-holding institutions will be needed, as well as some agreement or consensus, perhaps even before the proposal submission.

Anyway, it is a great idea, which will build on already running national-level digitisation projects and EU projects such as OpenUp!

Those are just a few quick ideas.

Submitted by saarenmaa on

"How will those data be stored and accessed?" I think we need to try something new here and place all the data in one big pool, using the high-performance computing infrastructure in Europe. We need to ensure that the images are immediately accessible and discoverable for annotation and demand-driven transcription. This could work somewhat like a VRE, a virtual research environment.

Submitted by Triebel on

Dear Hannu,

the envisaged project fits well with our interest in the digitisation of our zoological, mycological, botanical and palaeontological collections (around 30 million objects). Therefore, the Staatliche Naturwissenschaftliche Sammlungen Bayerns are interested in becoming a partner. We certainly support the efforts to create a distributed IT infrastructure for the storage of image data.

Dagmar


The Discovery Collections (http://noc.ac.uk/data/discovery-collections), based at the National Oceanography Centre (and in part at the Natural History Museum in London), would like to be considered as a partner in this initiative. We already have a large database of midwater organisms held within OBIS, but would like to add the benthic collections to this. That will require a coordinated funding effort such as this one. It will be important to include some of the smaller and more specialised collections that have little access to infrastructure, support or training for such digitisation efforts.

Tammy

Submitted by saarenmaa on

We have now identified a suitable call, and proposal writing is underway. The first draft will be distributed by Digitarium in late April 2014.

The call being targeted is for designing the infrastructure and consulting the community on its needs regarding features and feasibility. The project will be smaller than indicated here. A construction phase with bigger budgets may come later.
