Monday, April 30, 2012

Will more data lead to different histories being told?

New technology is making information more widely available and, when it launches later this year, the Wellcome Digital Library (WDL) will make it easier to access historical evidence about the foundations of modern genetics. Will this democratise our understanding of the history of genetics and lead to different versions of that history being told?

There is an African proverb which says that history is written by the hunter, not the lion. History inevitably simplifies the past, and the selection process can be subjective. When it launches, the WDL will start to put 21 archive collections and around 2,000 books online. The aim of the project is to digitise as much as we can rather than cherry-pick the highlights. This means that the building blocks used by historians to piece together the past will be made freely available to a wider audience. A lot of this material may seem like mundane, workaday stuff. Users will have to wade through a lot of material to reach the bits they are interested in, but this is probably a more accurate reflection of the scientific research process.

Flashes of genius are essential but they do not happen in isolation. Thomas Edison’s phrase about invention being 1% inspiration and 99% perspiration applies to scientific research too. The discovery process needs both.

Watson and Crick were extremely clever to work out the helical structure of DNA, but they did not get there simply because they were lone geniuses. Before they made their discovery, a lot of people had spent years experimenting, writing and thinking about DNA. There had even been flashes of insight which ended up being wrong. I recently read a letter from Gerald Oster, sent to Aaron Klug after Rosalind Franklin’s death, in which he recalled his time working in London. He reflects that even though he had much of the relevant information by early 1950, he lacked the insight to work out the structure of DNA. This letter (FRKN/06/07/001-2) is held by the Churchill Archives Centre in Cambridge and a digitised version will become part of the WDL.

I am rather hoping that the WDL might help us to recognise that while flashes of inspiration are part of scientific discovery they are only possible because a team of other people paved the way.

Thursday, April 19, 2012

Clearing copyright for books: preliminary ARROW results

As part of the genetics books project, we are tackling issues of copyright clearance and due diligence head on. Up to 90% of this collection is in copyright, or is likely to be in copyright, so developing a copyright clearance strategy was one of our earliest considerations. This turned into a useful project to test-run the EC-funded ARROW system on a large scale. ARROW provides a workflow for libraries and other content repositories to determine whether books are in-commerce, in copyright, and whether the copyright holders can be identified and traced. This system has undergone small tests throughout Europe, including the UK (using collections and metadata from the British Library), but in order to determine whether ARROW is feasible on a large scale, a realistic large-scale project was needed.

The Wellcome's genetics books project provided this opportunity, and the challenge was taken up by the ALCS and the PLS jointly, as announced previously on our Library Blog. Results from ARROW, combined with the responses from contacted rights holders, determine whether the Wellcome Library will publish a work online.

The collection of (roughly) 1,700 potentially in-copyright books is not enormous, but it is diverse, and has already thrown up some interesting wrinkles in the copyright clearance workflow.

For example, according to the AACR2 standard used to catalogue these books, only up to three authors are included in the metadata record (followed by et al.). Works with more than three authors, and collected works such as conference proceedings, had to be manually consulted in order to identify all the named contributors. This inflated the known number of contributors to nearly 7,000 (4 authors on average per book).
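To illustrate the kind of check involved, here is a minimal sketch of how truncated author statements might be flagged for manual consultation. The record layout (a dictionary with "title" and "authors" fields) and the sample records are hypothetical, invented for this example; this is not the ARROW system or the cataloguing workflow actually used. The last line simply repeats the rough average quoted above.

```python
import re

# Matches an author statement that ends in "et al", "et al." or "[et al.]".
ET_AL = re.compile(r"\bet al\.?\]?\s*$", re.IGNORECASE)


def needs_manual_check(record):
    """Return True when the author statement is truncated with 'et al'."""
    return bool(ET_AL.search(record.get("authors", "")))


# Hypothetical sample records, not taken from the real collection.
records = [
    {"title": "Example proceedings", "authors": "Smith, J. ... [et al.]"},
    {"title": "Example monograph", "authors": "Jones, K."},
]

# Titles whose full list of contributors would have to be checked by hand.
flagged = [r["title"] for r in records if needs_manual_check(r)]

# Rough average from the figures in the text:
# nearly 7,000 contributors across roughly 1,700 books.
average_contributors = 7000 / 1700
```

A real catalogue export would record authors in its own fields, so the pattern above would need adapting to whatever form the "et al." marker takes there.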

Embedded below is a presentation I gave at the London Book Fair earlier this week, which provides an overview of the process, and preliminary statistics from the first 500 books to complete the ARROW workflow.

Monday, April 2, 2012

Learning lessons on the Genetics Books digitisation project

A key component of the theme of our digitisation pilot programme - "Foundations of Modern Genetics" - is a set of printed textbooks and secondary sources published between 1850 and 1990 that shed light on the development of genetic and genomic research. The total collection identified is around 2,000 books. The goal is to digitise these texts in full, and make them freely available online via the Wellcome Digital Library (we are of course dealing with copyright clearance).

Digitisation of books often looks and sounds straightforward. It is not always straightforward of course - but the new book scanners on the market these days do make it quick. There are standard ways of scanning books - you put the book on a cradle, and either turn the pages by hand, or use a "robotic" contraption that turns the pages automatically. You can use scanning technology, or one-shot DSLR cameras; panes of glass to hold the pages down, or small grips on the outer margins of the pages. The choice depends on the physical nature of the books and how quickly you want to digitise. Even when outsourcing, it is useful to understand how book scanning really works. Our Genetics Books digitisation project - a pilot project - is giving us this opportunity.

We commissioned local digitisation company Bespoke Archive Digitisation to carry out the digitisation work for this pilot project. As the digitisation is carried out on site, we have been involved to some extent in all aspects of the work, including the setup and use of new types of equipment, the QA process involved in book digitisation, and the workflow of image conversion and delivery. As we have never carried out high-throughput book digitisation at the Wellcome Library before, this has been a steep learning curve for us, and the knowledge gained will be very useful in future (and hopefully larger) projects.


Bespoke Archive Digitisation uses a robotic book scanner and a manual book scanning unit (for books that are not robust enough for the robotic scanner, are outsized, etc.). Both of these "scanners" use Canon 5D Mark II cameras, two per unit, to capture each page of an opening simultaneously. The robotic book scanner is the latest version from Kirtas, the Kabis III. Richard Keenan, owner of Bespoke Archive Digitisation, explains, "this unit has a number of time-saving features such as 'fluffers,' a 'snubber,' and a self-adjusting book cradle which moves to keep the book at the correct angle to be photographed. This is accomplished through various sensors and lasers, which monitor the book throughout imaging to keep it in the correct position, but must also be monitored by the operator."

A key lesson, according to Richard, is that "although all robotic book scanners include a published throughput (2,890 pages per hour for this particular unit), it is important to understand that the published throughputs do NOT mean that you can do 2,890 pages per hour, hour after hour without stopping. Each book must be set up on the cradle, the cameras may need some adjustment/focusing, and page turning does require manual intervention, every time, to ensure the pages are flat, and to prevent page curvature and glare (especially on sealed paper).

"Also, it is very important to remember that this is just the image capture stage. The pages then have to be batch processed, edited and rigorously quality assessed, which can take the same time as imaging, or more. Depending on the book's structure - page thickness, binding type, size of the book, etc. - you will find that speeds vary considerably. A realistic estimate of throughput over a significant period of time is approximately 1,000 pages per hour, but this can be much lower with some books.

"Although these figures differ by a large margin from those published, the Kabis III from Kirtas is still probably the fastest way to digitise books, and the important thing is that the quality of output produced is excellent if operated correctly. The onboard editing software 'Book Scan Editor' is very handy, offering the usual cropping, image adjustment and sharpening options, but also deskewing, XML conversion and even OCR. I would say that another thing to bear in mind here is that there is a steep learning curve with this technology, so for anyone thinking of using one - particularly those who have no experience with robotic book scanners - plan plenty of time in the project for training and testing periods."
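To put Richard's figures into perspective, the sketch below compares capture time at the published rate (2,890 pages per hour) with the realistic rate (about 1,000 pages per hour) for a collection of around 2,000 books. The average of 300 pages per book is purely an assumption made for illustration, not a figure from the project, and the doubling for post-processing takes the "same time as imaging" case as a lower bound.

```python
def capture_hours(total_pages, pages_per_hour):
    """Hours of image capture needed at a given sustained throughput."""
    return total_pages / pages_per_hour


PUBLISHED_RATE = 2890          # pages per hour, published figure quoted above
REALISTIC_RATE = 1000          # pages per hour, realistic estimate quoted above
BOOKS = 2000                   # approximate size of the collection
ASSUMED_PAGES_PER_BOOK = 300   # assumption for illustration only

total_pages = BOOKS * ASSUMED_PAGES_PER_BOOK

published_hours = capture_hours(total_pages, PUBLISHED_RATE)
realistic_hours = capture_hours(total_pages, REALISTIC_RATE)

# Processing and QA "can take the same, or more time than imaging",
# so doubling the capture time gives a lower bound for the whole job.
realistic_total_hours = realistic_hours * 2

print(f"Published rate: {published_hours:.0f} hours of capture")
print(f"Realistic rate: {realistic_hours:.0f} hours of capture")
print(f"With processing and QA: at least {realistic_total_hours:.0f} hours")
```

Whatever page count is assumed, the ratio stays the same: capture at the realistic rate takes nearly three times as long as the published figure suggests, before any processing or quality assessment is counted.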