by Mary Todd Bergmann

Millions of people generously give their time, blood and information to biobanks – all with the goal of improving research into human disease.

Genetic data from the UK Biobank’s major cohort is now distributed through the European Genome–phenome Archive (EGA), ensuring its sustainable, secure delivery to approved biomedical researchers worldwide.

Genomic data from the 500,000 people participating in the UK Biobank initiative will be distributed via the European Genome–phenome Archive (EGA), a resource developed jointly by the European Bioinformatics Institute (EMBL-EBI) and the Centre for Genomic Regulation (CRG). UK Biobank provides extremely detailed, high-quality datasets on individuals. It is an unprecedented collection that offers endless possibilities and substantial efficiency savings for biomedical research into the causes of disease.


UK Biobank has established a partnership with the European Genome–phenome Archive (EGA), a joint resource developed by EMBL-EBI and the Centre for Genomic Regulation (CRG).

UK Biobank, which manages health information on over 500,000 individuals, will share its genetic data in its first release via EGA.

Distribution of the data via the EGA will ensure long-term data security, accessibility and sustainability, which will help researchers to better understand human disease.

UK Biobank

Around 500,000 people from across the UK, between the ages of 40 and 69, participated in UK Biobank between 2006 and 2010, undergoing extensive measurements and genotyping. They provided blood, urine and saliva samples for future analysis – including genetic – and gave detailed information about themselves. They also agreed to allow UK Biobank to integrate information from their electronic health records.

The result is a staggering body of data for human health research, made available to approved researchers. In return for using the data, the researchers are obliged to report their new findings to UK Biobank so that other scientists can benefit.

Because of the sheer size of the datasets, they are offered in two separate parts. The phenotype data – physical traits, health information – are available directly through UK Biobank. To make sure the data are delivered to researchers with maximum efficiency, the genetic information is available through the EGA.

Once an approved researcher has downloaded both parts, he or she matches them up locally to create or boost the control dataset for their experiment. This is standard practice for controlled access. The data management process is set up so that participants can’t be identified.
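
As a rough illustration of that local matching step, here is a minimal Python sketch. It assumes both downloads can be read as lists of records sharing a pseudonymous identifier; the field names (participant_id, variant_a, macular_degeneration) and the sample values are hypothetical and do not reflect the actual UK Biobank or EGA file formats.

```python
# Hypothetical sketch: joins phenotype and genotype records on a shared,
# pseudonymous participant ID. Field names and values are invented for
# illustration and are not the real UK Biobank or EGA formats.

def match_records(phenotypes, genotypes, key="participant_id"):
    """Return merged records for participants present in both datasets."""
    # Index the genotype records by ID so each lookup is a dictionary access.
    geno_by_id = {row[key]: row for row in genotypes}
    matched = []
    for pheno in phenotypes:
        geno = geno_by_id.get(pheno[key])
        if geno is not None:
            # Combine the two records into one row per participant.
            matched.append({**geno, **pheno})
    return matched


# Made-up phenotype records (as if downloaded from the biobank).
phenotypes = [
    {"participant_id": "P001", "age": 52, "macular_degeneration": True},
    {"participant_id": "P002", "age": 61, "macular_degeneration": False},
    {"participant_id": "P004", "age": 47, "macular_degeneration": False},
]

# Made-up genotype records (as if downloaded from the archive).
genotypes = [
    {"participant_id": "P001", "variant_a": "AG"},
    {"participant_id": "P002", "variant_a": "GG"},
    {"participant_id": "P003", "variant_a": "AA"},
]

matched = match_records(phenotypes, genotypes)
for row in matched:
    print(row)
```

Only participants found in both parts appear in the result; anyone present in just one download is left out, which is usually what a researcher wants when building a matched dataset.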

Using UK Biobank genetic data

If a researcher is studying macular degeneration (MD), a serious eye condition, and needs more subjects with certain genetic traits, he or she could go to UK Biobank to see whether it has samples from MD patients. This is much easier than going out and finding new subjects to participate in the study, which could take years. Researchers can apply for access and, once it has been granted, add patient data to their analyses by downloading the phenotype data from UK Biobank and then matching it with genetic data from the EGA.

High-demand datasets

In its first few weeks of activity, more than 300 researchers across 139 institutes requested access to the genetic data from UK Biobank. Half a petabyte of data was transferred in the first two weeks alone, and the partners anticipate up to two petabytes of transfer to the research community over a six- to eight-week period.

“This is an incredible resource for researchers around the world who are studying human disease. By partnering with the EGA, the UK Biobank can make the data available through robust community-agreed standards and practices,” says Jordi Rambla, Team Leader for the EGA at the CRG in Barcelona.

“These are exactly the type of high-demand human genetic datasets we’re here to provide,” says Thomas Keane, Team Leader for the EGA and Archive Infrastructure at EMBL-EBI. “Our collaboration with major genomics initiatives such as the UK Biobank will help approved scientists access the data they need to study the causes of disease.”

Security, encryption and distributed networking

The EGA will distribute genotype data securely for these 500,000 samples, which may be studied any number of times. These deep, baseline measurements cover many different phenotypes, and are in significant demand from people working in all areas of the biomedical sciences.

“We believe that this is the single largest release of a genetic dataset in terms of number of individuals genotyped,” says Mark Effingham, UK Biobank Chief Information Officer. “The dataset is vast, but we hope it will drive innovative and exciting studies to transform research. Working with the EGA has been crucial in delivering these data quickly and efficiently, so that scientists can get on with the work of improving health.”

“The EGA is a public resource that has long offered robust security and encryption for controlled-access data,” says Helen Parkinson, Head of Archival Resources at EMBL-EBI. “Because the EGA is a federated resource, it offers enhanced network capacity from data centres at the Wellcome Genome Campus, at Hemel Hempstead and at the CRG in Barcelona.”

“The 500,000 UK Biobank participants have already revolutionised research into health and disease by participating in this ground-breaking study,” says Ewan Birney, Director of EMBL-EBI. “By having the entire genotype data present across the entire cohort, and accessible to researchers in a secure, well-managed scheme, the UK Biobank will become a bedrock of our understanding of health and disease in humans.”

What’s a petabyte?

One petabyte (PB) is 1,024 terabytes, which is enough data to fill over 32,000 iPhone 7s (the 32 GB kind). That’s a lot of data (and a hefty stack of phones!).
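
For anyone who wants to check the phone arithmetic, here is a small Python sketch. It assumes binary units (1,024 GB per TB, 1,024 TB per PB) and that every gigabyte of a 32 GB phone is usable, which is generous; the function name is just for illustration.

```python
import math

# Assumption: binary units, as in the sentence above.
GB_PER_TB = 1024
TB_PER_PB = 1024
PHONE_CAPACITY_GB = 32  # assumes the full 32 GB is available for data


def phones_to_hold(petabytes, phone_gb=PHONE_CAPACITY_GB):
    """Number of phones needed to store the given number of petabytes."""
    total_gb = petabytes * TB_PER_PB * GB_PER_TB
    # Round up, since a partly filled phone still counts as a phone.
    return math.ceil(total_gb / phone_gb)


print(phones_to_hold(1))    # one petabyte
print(phones_to_hold(0.5))  # the half petabyte moved in the first two weeks
print(phones_to_hold(2))    # the two petabytes the partners anticipate
```

One petabyte works out to 32,768 phones, which is where the "over 32,000" figure comes from.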

This article was originally published on EBI News