Over 33TB of distributed data to secure the future of open science

When the developing Human Connectome Project (dHCP) was launched in 2017, an ambitious project by King’s College London, Imperial College London and the University of Oxford to study in detail the brain development of human beings with the best available imaging techniques, its researchers ran into a problem.

They needed to share terabytes of data, thousands of high-definition brain scans, with institutions around the world and, as Jonathan Passerat-Palmbach, associate researcher at Imperial College London, acknowledged, “they couldn’t rely on HTTP-based downloads to make that dataset always available”.

There were proprietary solutions, but most of them were expensive and, in practice, required some of the world’s largest research institutions to put vast amounts of data into corporate hands. What could they do? Wasn’t there an open alternative to all of this?

Data, data, too much data


They weren’t the first to come across this problem. Four years earlier, at the University of Massachusetts Boston, two PhD students interested in machine learning, computer vision and other emerging areas had one thing in common: they needed large amounts of data in order to move forward.

Joseph Paul Cohen and Henry Z. Lo, as they were called, realized two things, one good and one bad. The good news was that there was more and more data: little by little, the internet was filling up with repositories, APIs and open databases to work with. The bad news was that, for that very reason, transferring these huge datasets was becoming increasingly cumbersome, slow and expensive.

Cohen and Lo began to think about the problem and came to a conclusion that today may seem obvious: the best tool for transferring large files was BitTorrent. Why not build a solution on top of the world’s best-known peer-to-peer protocol? Thus Academic Torrents was born.

How does Academic Torrents work?


Academic Torrents (AT) is a service designed so that researchers can share datasets in a simple way. It is made up of two parts: a web directory where users can search among the available torrents, and a standard BitTorrent client which, as we already know, can transfer large amounts of data quickly and at scale.
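To give a sense of how little machinery this takes on the user’s side, here is a minimal sketch using the Python bindings of libtorrent. It assumes python-libtorrent is installed; “dataset.torrent” and “./data” are placeholder names for a torrent file grabbed from the web directory and a local download folder, not actual Academic Torrents artifacts.

```python
# Minimal sketch: fetching a dataset listed on Academic Torrents with the
# Python bindings of libtorrent (any standard BitTorrent client would do).
# "dataset.torrent" and "./data" are placeholders, not real AT file names.
import time
import libtorrent as lt

ses = lt.session()
info = lt.torrent_info("dataset.torrent")            # .torrent file from the web directory
handle = ses.add_torrent({"ti": info, "save_path": "./data"})

while not handle.status().is_seeding:                 # loop until the download completes
    s = handle.status()
    print(f"{s.progress * 100:.1f}% done, "
          f"{s.download_rate / 1000:.0f} kB/s from {s.num_peers} peers")
    time.sleep(5)

print("Download finished; the client is now seeding the dataset back to others.")
```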

Since 2013, AT has grown a lot. It currently has 33.66TB of research data available and, according to the latest available figures, it serves more than 3TB a day and more than 30,000 users each month. It’s not The Pirate Bay in its heyday, but it just keeps growing.

So much so that Cohen and Lo concluded that the only way to sustain the project was to create a non-profit organization, the Institute for Reproducible Research, dedicated to building tools that help science overcome the now infamous replication crisis, and to enabling collaboration between research groups: just what the developing Human Connectome Project needed.

Beyond data transfer


The truth is that if the story ended here, Academic Torrents would be nothing more than another file-sharing solution. But when we bring the logic of the internet’s most indomitable technologies into any field, wonderful things happen.

On the surface, AT gives access to datasets that are already available on the internet, such as those of the UCI Machine Learning Repository, ImageNet or Wikipedia itself. But if we look at what lies underneath, we realize that what it is actually doing is setting them free.

Yes, it makes downloading huge datasets easier and faster, but it also works as a mirror, a backup against contingencies. The best example of this is the Netflix dataset, which is no longer available on the company’s website but remains safely preserved in the bowels of Academic Torrents.

Using BitTorrent allows (indeed, requires!) the data to always be available, and it guarantees thorough control and traceability by the scientific community in each field. It also amounts to a proposed standard against fragmentation, undiscoverability and chaos in data-intensive scientific research.
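This mirroring role is essentially a matter of keeping a client seeding. As a rough sketch (again with placeholder file names and assuming the python-libtorrent bindings): if the dataset files already exist locally, the client re-checks them against the torrent’s piece hashes and then simply keeps serving them, while the torrent’s info-hash acts as a stable identifier for citing and tracing exactly which data was used.

```python
# Sketch: acting as a mirror for a dataset. If the files already sit in
# "./data", libtorrent re-verifies them against the torrent's piece hashes
# and then seeds them, which is what keeps withdrawn datasets alive.
# "dataset.torrent" and "./data" are placeholder names.
import time
import libtorrent as lt

info = lt.torrent_info("dataset.torrent")
print("info-hash:", info.info_hash())                 # stable identifier for citation/traceability

ses = lt.session()
handle = ses.add_torrent({"ti": info, "save_path": "./data"})

while True:
    s = handle.status()
    state = "seeding" if s.is_seeding else "checking/downloading"
    print(f"{state}: uploaded {s.total_upload / 1e6:.1f} MB to {s.num_peers} peers")
    time.sleep(60)
```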

What can we find in the “data Sci-Hub”?


Among the Academic Torrents datasets we can find a bit of everything: from course materials from numerous universities (such as Stanford, CMU, Caltech or MIT) to the datasets of “development challenges”, competitions open to researchers and companies in which a problem is posed (usually the classification of a novel dataset) and the algorithms that give the best results are rewarded.

In the big data world, these kinds of challenges are very popular (and prestigious) because they let researchers evaluate their developments and approaches in a competitive setting. DeepMind won one of them with AlphaFold at the beginning of the month, and it was celebrated like an Olympic triumph. Academic Torrents serves as infrastructure for many of them.

But, without a doubt, the most interesting are the repositories that let us “train” our artificial intelligences. There we find all the reviews and comments from Yelp and Amazon, and millions of Wikipedia articles, to train AIs in natural language. There are also repositories with more than 600,000 Russian texts, every flower in the UK, and more than one hundred thousand images of food. There is geographic data for all North American highways to train GPS systems, and radiation oncology datasets to train algorithms that detect breast or liver cancer.

The main bottleneck in artificial intelligence and deep learning research is precisely access to the large datasets that allow us to improve our algorithms. This is why the big technology companies have an advantage: they have enormous amounts of information to work with. Research centers and small businesses have a harder time. And that is why the Academic Torrents community is changing things simply by sharing huge amounts of data.

Although it may seem otherwise, the millions of videos, images and files on Academic Torrents are very useful, both for getting started in machine learning and for obtaining new datasets with which to improve research projects. But above all, they help ensure that major AI developments are not left on the other side of the corporate walled gardens.

But wait, BitTorrent is evil


What no one suspected (neither the developers of Academic Torrents nor the team behind the developing Human Connectome Project) is that this would be the easy part. Today, BitTorrent is a well-known, efficient and easy-to-deploy technology. In other words, the main obstacle was not technical.

It was social. In these four years of work, the hardest thing has been convincing researchers that a technology as demonized as torrents could have a legitimate scientific use. And that wasn’t all, because once the researchers were convinced, an even tougher nut remained to be cracked: convincing the institutions.

For example, Jonathan Passerat-Palmbach explained that Imperial College London’s network was set up to block torrenting, and getting an exception made required extraordinary effort. The story is so illustrative that it speaks for itself.

It is also an internet classic: the battle against open technologies ends up pruning not just the uses that hurt the big lobbies, but all of them. It is sad that we have to ‘reinvent’ the torrent for a whole generation of scientists simply because it was abused so terribly in the first decade of the century. That we are doing it at all, however, is excellent news.
