Investigating AI datasets - OpenImages

(Disclaimer: This post is also featured on the Data Culture blog)

Today’s statistical algorithms, including most machine learning and AI techniques, rely on large amounts of training data. Understanding what is in these datasets, how they are collected, and how different groups are represented is important, as biased data often leads to biased algorithms. Inspired by the recent work of Emily Bender et al. and Kate Crawford’s new book Atlas of AI, I decided to look at one popular dataset: the Open Images Dataset. Open Images is a dataset similar to ImageNet; it contains approximately 9 million images, annotated with image-level labels, bounding boxes, segmentation masks, and more.

Research has already shown that Open Images has issues when it comes to where images are from, with more than 50% of images coming from just 4 countries (US, UK, France, and Spain). However, here I want to dig a bit deeper into the dataset itself. I’m not interested in where images are from; rather, I want to know what they actually depict. For instance, the image above is a random sample of the banana class. There are images of single bananas, stalks of bananas, banana bread, somebody using a banana as a phone, and even drawings/art of bananas. In total there are 743 images of bananas in the dataset.

Banana is just one of the dataset’s 601 labels. If we look at the 10 most popular labels (see figure above), we see that around 800,000 images have the label “Person” (~9% of all images), followed by “Clothing” with 600,000 images (~6.6% of images). We also see that there are more images labeled “Man” than “Woman”. Here we also notice a peculiarity: there is a label called “Human face”. I’m not sure why “Human” is added, because there are no other types of faces in the dataset. Anyway, that got me thinking: what other “Human”-like labels are in the dataset?
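For readers who want to reproduce these counts, here is a minimal sketch of the approach: count the distinct images per label, then attach human-readable label names. The file layout of the real Open Images annotation CSVs is not shown here, so a toy DataFrame stands in for the loaded annotations, and the column and label names are illustrative.

```python
import pandas as pd

# Toy stand-ins for the Open Images image-level annotations and the
# label-name lookup table (column names are illustrative).
annotations = pd.DataFrame({
    "ImageID": ["a", "b", "c", "d", "e"],
    "LabelName": ["/m/x1", "/m/x1", "/m/x2", "/m/x1", "/m/x2"],
})
label_names = pd.DataFrame({
    "LabelName": ["/m/x1", "/m/x2"],
    "DisplayName": ["Person", "Clothing"],
})

# Count distinct images per label, then attach readable names
counts = (annotations.groupby("LabelName")["ImageID"]
          .nunique()
          .reset_index(name="NumImages")
          .merge(label_names, on="LabelName")
          .sort_values("NumImages", ascending=False))
print(counts[["DisplayName", "NumImages"]].to_string(index=False))
```

The same table, filtered to labels starting with “Human” or to a list of weapon labels, yields the other figures discussed below.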

In the figure above we see there are 13 different “Human x” labels, from human face, hair, body, nose, and eye to beard and foot. By far the most popular, larger than all the others combined, is “Human face”. The least popular is “Human foot”, with only ~1,000 images (but still more popular than the “Banana” label). The focus on human faces rather than any other body part tells us something about the purpose for which this dataset was assembled: most likely, facial recognition. Please note that developments in facial recognition algorithms have been heavily funded by various defense and intelligence agencies, and these algorithms have been found to be heavily biased in terms of gender and race; see Atlas of AI for more info.

If we dig further into the “Human x” categories and plot a random subset of the images labeled “Human beard”, we see that it mainly represents one specific demographic: predominantly the beards of white men.

But that is not all. If we keep digging we eventually find other issues. For instance, the “Swimwear” label mainly has images of skimpily dressed young women (as perceived through a male gaze). The label “Brassiere” has the same issue. However, the thing that surprised me the most is the existence of labels relating to weapons, including “Weapon”, “Rifle”, “Tank”, “Handgun”, and many others. This raises the question: why are they even in this dataset? It is difficult to say why the developers of Open Images chose to include them. My guess is that the heavy involvement of DARPA, IARPA, and many other defense and military agencies in AI, and the money they have poured into the development of AI tools for their purposes (e.g. facial recognition), has biased the sample towards weapons.

Now, you might think: how big is this problem? There cannot be that many images, can there? If we count the images labeled with a weapon category (Weapon, Rifle, Tank, Handgun, Missile, Shotgun, Sword, Bow and arrow, Submarine, Bomb), we find 10,000 of them, a substantial number. This is comparable to the number of images labeled Book, Bus, Vegetable, and Fruit (see figure below).

It is an open question why our image recognition tools should be as good at identifying weapons as at identifying books and fruit. To summarize: do not assume that a dataset is of good quality just because it is big, or because it has been used by many people and in many academic papers. Datasets are often constructed from easily scraped data sources, and as a result, many widely used big datasets contain significant biases and limitations. To read more, check out Kate Crawford’s book.

Strange Attractors - Part II

Strange attractors are capable of generating amazingly diverse shapes, from abstract to concrete to butterflies; see more here. However, all these images are single realizations of specific parameters and initial conditions. And they are static, so I wondered: is it possible to add some movement, some dynamics, to strange attractors?

Cellular art-omata

Cellular art-omata, or cellular automata as they are usually called, demonstrate how simple mathematical rules can lead to astonishing complexity.
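To give a flavor of how simple the rules really are, here is a minimal sketch of an elementary (one-dimensional) cellular automaton. The rule number, grid width, and starting configuration are my own illustrative choices, not taken from the post’s artwork.

```python
import numpy as np

def step(row, rule=30):
    """One update of an elementary cellular automaton.
    `rule` is the Wolfram rule number (0-255)."""
    # Bit i of `rule` is the new cell value for neighborhood pattern i
    table = [(rule >> i) & 1 for i in range(8)]
    left, right = np.roll(row, 1), np.roll(row, -1)
    idx = 4 * left + 2 * row + right  # encode (left, self, right) as 0-7
    return np.array([table[i] for i in idx], dtype=np.uint8)

# Start from a single live cell and iterate; stacking the rows
# produces the classic triangular pattern of Rule 30.
width, steps = 31, 15
row = np.zeros(width, dtype=np.uint8)
row[width // 2] = 1
history = [row]
for _ in range(steps):
    row = step(row)
    history.append(row)
grid = np.array(history)  # (steps+1, width) image of the evolution
```

Rendering `grid` with any image routine (e.g. `plt.imshow`) shows the complexity emerging from an eight-entry lookup table.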

My favourite papers of 2018

2018 has been a really exciting year. Scientifically speaking, a lot of new and interesting studies were published this year (so many that I have had a hard time keeping up with my to-read list), and personally it has been a fruitful year in which I was lucky to publish in PNAS and Nature Human Behaviour. Here I have included my favorite scientific papers of 2018.

The generic face of Danish politics

As Denmark gets closer to the next election, the debate about refugees, migrants, and their descendants has yet again resurfaced and is beginning to turn sour. Disillusioned with this development and trying to get my mind off the issue, I wondered what the average politician looks like. I was thinking about something along the lines of this work, or something like this piece by Soumitra Agarwal. After a quick online search, where I was unable to find much work on the faces of politicians, I decided to create my own. The basic idea is to take a lot of portrait pictures, overlay them, and take their median.
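The overlay-and-median step can be sketched in a few lines of NumPy. This assumes the portraits have already been cropped and aligned to the same size (the hard part in practice); random arrays stand in here for the loaded photos.

```python
import numpy as np

# Stand-ins for aligned, same-sized portrait photos (H x W x RGB).
rng = np.random.default_rng(0)
portraits = [rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
             for _ in range(25)]

# Stack into (N, H, W, 3) and take the pixel-wise median across
# the N portraits to get the "generic" face.
stack = np.stack(portraits).astype(np.float32)
median_face = np.median(stack, axis=0).astype(np.uint8)
```

Using the median rather than the mean keeps the result sharper, since outlier pixels (stray backgrounds, glasses, etc.) do not drag every pixel value toward a blur.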

Transforming images into networks

Trying to come up with a cool visualization for a small side project, I was contemplating how to draw, or approximate, an object using networks. During my creative process I remembered that my colleague and friend Piotr Sapiezynski once told me how he did something similar (see here and here). Since his visualizations look absolutely stunning, I tried to do my own version.

Random walk (art)

Trying to kill some time on a four-hour train ride, I played around with simulating random walks in two dimensions. Coloring each walker with its own unique color, the motion of the individual walkers looks more or less like confused ants moving around on a piece of paper, resembling the behavior illustrated below (see code below).
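A minimal sketch of the simulation: several independent walkers take unit steps in x and y, and Matplotlib draws each cumulative path in its own color. Walker count, step count, and styling are illustrative choices, not the exact parameters behind the images.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n_walkers, n_steps = 10, 2000

fig, ax = plt.subplots(figsize=(6, 6))
paths = []
for _ in range(n_walkers):
    # Random +/-1 steps in both dimensions, accumulated into a path
    steps = rng.choice([-1, 1], size=(n_steps, 2))
    path = np.cumsum(steps, axis=0)
    paths.append(path)
    ax.plot(path[:, 0], path[:, 1], lw=0.5, alpha=0.8)  # auto color cycle
ax.set_aspect("equal")
ax.axis("off")
fig.savefig("random_walks.png", dpi=150)
```

Letting Matplotlib’s default color cycle pick a color per call to `plot` gives each “ant” its own trail.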

Global Airport Network

I like to keep track of my life, collecting data about random things; one of them happens to be my travel patterns. While visualizing my own travels I started to wonder what the global airport network might look like. I remember reading about the structure of the airport network in “The architecture of complex weighted networks” by A. Barrat et al., but the paper never visualized the network. To figure it out, I first needed some data; luckily, there are open databases of routes as well as airports, which allow us to create some pretty nice looking visualizations (see above figure).
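Before any drawing, the route data has to become a weighted graph. Here is a minimal sketch, assuming the routes have been loaded as (origin, destination) airport-code pairs; the toy route list is illustrative. Edge weights count routes between a pair, and node “strength” (the sum of a node’s edge weights) is the quantity Barrat et al. study.

```python
import networkx as nx

# Toy stand-in for a real routes table: (origin, destination) pairs.
routes = [("JFK", "LHR"), ("JFK", "LHR"), ("LHR", "CDG"), ("CPH", "JFK")]

# Build an undirected weighted graph; weight = number of routes
G = nx.Graph()
for src, dst in routes:
    if G.has_edge(src, dst):
        G[src][dst]["weight"] += 1
    else:
        G.add_edge(src, dst, weight=1)

# Node strength: sum of incident edge weights, as in Barrat et al.
strength = {n: sum(d["weight"] for _, _, d in G.edges(n, data=True))
            for n in G}
```

With airport coordinates joined in, the edges of `G` can then be drawn as great-circle arcs on a world map.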

Strange Attractors

This Sunday while surfing the web I came across a figure depicting the Rössler attractor, and while looking at it, it suddenly struck me that I have always seen it depicted from this specific angle. But what does it look like from other angles? Curious, I sat down, quickly wrote a python script to generate the dynamics, used Matplotlib to plot the figure from multiple angles, and ffmpeg to aggregate them into an animation (see below). One thing led to another and soon I found myself reading about other strange attractors, such as Clifford attractors, and writing code to generate the figures you see above.
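Generating the dynamics is a small numerical-integration exercise. Below is a sketch using a hand-rolled Runge-Kutta 4 integrator with the standard Rössler parameters (a = b = 0.2, c = 5.7); the initial condition, step size, and trajectory length are my own choices, not necessarily those behind the animation.

```python
import numpy as np

def rossler(state, a=0.2, b=0.2, c=5.7):
    """Rössler system: dx=-y-z, dy=x+a*y, dz=b+z*(x-c)."""
    x, y, z = state
    return np.array([-y - z, x + a * y, b + z * (x - c)])

def rk4(f, state, dt):
    """One fourth-order Runge-Kutta step."""
    k1 = f(state)
    k2 = f(state + dt / 2 * k1)
    k3 = f(state + dt / 2 * k2)
    k4 = f(state + dt * k3)
    return state + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

dt, n = 0.01, 20_000
traj = np.empty((n, 3))
traj[0] = (1.0, 1.0, 1.0)
for i in range(1, n):
    traj[i] = rk4(rossler, traj[i - 1], dt)
# Plot traj in 3D (mpl_toolkits.mplot3d), vary ax.view_init(azim=...)
# per frame, and stitch the saved frames together with ffmpeg.
```

Sweeping the azimuth angle one degree per frame gives the rotating view of the attractor.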

Cover of PNAS - code

I received questions from a couple of people asking me how I drew the network featured on the cover of PNAS (read about it here). Well, this blog post is for you, and anybody else.

On the cover of PNAS!!!

We (Sune Lehmann, Arek Stopczynski and yours truly) recently published a paper in PNAS where we give our two cents on how to uncover meaningful, “fundamental” social structures from temporal complex networks. In addition to submitting the paper, we also sent along some pictures which we felt would look good on the cover of PNAS. As it turns out, one of them was actually selected!

Game Of Thrones network visualization

I was watching the season finale of Game of Thrones the other day and wondered: with so many characters in the series, what does the interaction network look like? Well, as it turns out I was not the first person to have this thought. In fact, A. Beveridge and J. Shan read through Storm Of Swords (the third book in the series), mapped all the interactions between characters, and released the data. You can read more about their cool project here. They are, furthermore, planning to release data for the other books as well.

On the cover of KVANT

While finalizing my PhD I was asked, alongside Sune Lehmann, to author a popular article about networks for the magazine Kvant (the Danish journal for physics and astronomy). We wrote and submitted the piece and were fairly confident in our work. Nonetheless, we were surprised when the editor contacted us to ask for permission to use one of my figures for the cover! This is my first cover, and I gotta say, it feels awesome. Next stop … Nature :)

NYC Taxi - Heartbeat of NYC

Have you ever wondered which areas of New York City are the most popular? You need not worry any longer: this little movie will answer your questions. The video shows the dynamics of pick-ups and drop-offs within a representative week. It is interesting to see how the popularity of areas changes over the course of a day, and how certain areas attract more attention during nighttime. To me, the circadian pattern resembles a heartbeat.

NYC Taxi - Statistics

One of the most iconic sights in New York is its yellow cabs. They are ubiquitous and an important lifeline that ties the city and its inhabitants together. Understanding how cabs move around can give us new insights into how people travel within the city, how people use the city, and which neighborhoods are popular.

The next big thing....

Is just around the corner! We have some cool results that hopefully should be published soon. Until then here are two teaser pics.

World Cup 2014

Since watching my first World Cup as a kid (1994), I have been hooked on football (or soccer, as the Americans call it). Back then, I remember that almost every player wore Adidas Copa Mundials: stylish yet simple black leather boots with three white stripes.


Just submitted a paper - Woohoo! Until it is published, you can find it on arXiv. The paper investigates the usability of the Bluetooth sensor as a proxy for real-life face-to-face interactions. You can learn more about the data on the SensibleDTU homepage.

Talk @ KU (4th Dec)

I will be giving a talk at the Niels Bohr Institute on December 4th. The topic will be “Social Contacts and Communities”. It is based on the results and findings from the SensibleDTU project.

Bluetooth Network

How do we as humans interact over the course of a day? The video shows proximity interactions for students participating in the SensibleDTU project during a randomly chosen 24-hour interval.