Custom data visualisation with d3.js

When it comes to creating good-looking visualisations quickly, there's no better package than ggplot (my personal opinion). It implements the grammar of graphics, which makes it concise and intuitive and lets you produce advanced plots in only a few lines of code. This is extremely helpful when performing EDA, where I tend to produce a large number of visualisations in order to familiarise myself with the data. If you want something interactive, though, you have to turn elsewhere. For these cases plotly is a very good alternative, especially in combination with Shiny (or Dash) if you're building dashboards. Every once in a while, however, you find yourself in need of something highly customised. When all the tweaking and workarounds have only taken you so far, what are the alternatives? Well, since plotly is built on d3.js, I thought it could be worth exploring. So I signed up for a Udemy course.

One Udemy course later...

One thing to note about d3 is that you have to set everything up yourself: axes, scales, labels, everything. Your data is represented by binding observations to SVG elements, which can be points, paths, rectangles, etc. Having produced a lot of line, bar and donut charts during the course, I felt I was ready for something a bit more challenging, like a chord diagram. One of the many good things about open source is that there are a lot of good examples, and most likely someone has already made something similar to what you want to do. D3 is no different; there's a lot of free and open source licenced code you can modify to your needs instead of having to write everything from scratch. For the chart below I have relied heavily on code from Delimited.io.

The chord diagram represents passes between players in the Swedish national team during the 2018 World Cup in Russia. Arcs represent the total number of passes played by each player, and each chord is coloured according to the player who made the most passes within that pair. There are three filters:

  • Switch between games by opponent
  • A threshold for the minimum number of passes between two players
  • Exclude/include any player by clicking on the name

You can also hover over chords to see an info box with the numbers explicitly written out.
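
The colouring rule and the minimum-passes filter can be sketched in a few lines of R. This is only an illustration, not the code behind the chart: the data is invented, the column names and `min_passes` are placeholders, and I am assuming the threshold applies to the total number of passes in both directions.

```r
library(dplyr)

# Toy data, invented for illustration: one row per pair of players.
flows <- tibble(
  left          = c("Player A", "Player A", "Player B"),
  right         = c("Player B", "Player C", "Player C"),
  left_to_right = c(12L, 3L, 7L),
  right_to_left = c(9L, 4L, 7L)
)

min_passes <- 8L  # hypothetical value of the threshold filter

chords <- flows %>%
  mutate(
    total = left_to_right + right_to_left,
    # The chord takes the colour of whoever passed more within the pair;
    # a tie has no dominant player.
    dominant = case_when(
      left_to_right > right_to_left ~ left,
      right_to_left > left_to_right ~ right,
      TRUE ~ NA_character_
    )
  ) %>%
  # Assumption: the threshold applies to passes in both directions combined.
  filter(total >= min_passes)
```

How ties are coloured is a design choice; here they are simply left without a dominant player.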

Chord diagrams are good for visualising flows between groups, but I find the arcs, or group sizes, hard to compare when differences are small. Including an axis would help, but when there are only a few groups a stacked bar chart would probably be the better option. The chart does, however, communicate a lot of information at a glance. The arcs are sorted clockwise, and you can easily see which players make more passes than they receive just by looking at the colours of the chords leading up to each player (unless you're colour blind). There are, of course, much better ways to visualise a passing network if you have the coordinates. Overlaying it on a football pitch would also convey where on the pitch the passes were made.
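
If the colours are hard to read, the same comparison of passes made versus passes received can be done as a small table. A minimal sketch on invented data, assuming a long table with hypothetical columns `from`, `to` and `n`:

```r
library(dplyr)

# Toy data, invented for illustration: n passes from one player to another.
passes <- tibble(
  from = c("Player A", "Player A", "Player B", "Player C"),
  to   = c("Player B", "Player C", "Player A", "Player A"),
  n    = c(12L, 3L, 9L, 4L)
)

# Passes made and received per player.
made     <- passes %>% group_by(player = from) %>% summarise(made = sum(n))
received <- passes %>% group_by(player = to) %>% summarise(received = sum(n))

balance <- full_join(made, received, by = "player") %>%
  mutate(
    # A player who never passed (or never received) gets 0 instead of NA.
    across(c(made, received), ~ coalesce(.x, 0L)),
    net = made - received
  ) %>%
  arrange(desc(net))
```

A positive `net` means the player made more passes than they received.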

The data

StatsBomb are some really nice people who provide a football data repository on GitHub. The data comes in JSON files that contain every event that took place on the pitch, and it's free! To get it into a format that can be used with this particular visualisation you have to wrangle it a bit.

library(jsonlite)
library(dplyr)
library(purrr)
library(tidyr)

# Team id used for Sweden in this data set
team_id <- 790

# 43 indicates FIFA World Cup 2018
matches <- read_json("data/matches/43.json", simplifyVector = TRUE)

fixtures <- tibble(
  match_id = as.character(matches$match_id),
  home_team = matches$home_team$home_team_name,
  away_team = matches$away_team$away_team_name
)

mask <- (matches$home_team$home_team_id == team_id) |
    (matches$away_team$away_team_id == team_id)

ids <- matches$match_id[mask]

# Read all events for one match into a (nested) data frame.
read_events <- function(id) {
  f_name <- paste0("data/events/", id, ".json")
  read_json(f_name, simplifyVector = TRUE)
}

events <- setNames(lapply(ids, read_events), ids)

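# Count completed passes between each pair of players in one match.
count_passes <- function(df, team_id) {
  mask <- (df$type$id == 30) &                                      # pass events
    ((df$pass$type$id %in% c(64, 66)) | is.na(df$pass$type$id)) &   # untyped passes plus pass types 64 and 66
    is.na(df$pass$outcome$id) &                                     # no outcome recorded, i.e. the pass was completed
    (df$team$id == team_id)                                         # passes by the selected team only

  df %>%
    filter(mask) %>%
    mutate(from = player$name, to = pass$recipient$name) %>%
    count(from, to)
}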

passes <- map_dfr(events, count_passes, team_id = team_id, .id = "match_id") %>%
  inner_join(fixtures, by = "match_id") %>%
  mutate(fixture = ifelse(home_team == "Sweden", away_team, home_team)) %>%
  select(fixture, from, to, n)

unique_pairs <- passes %>%
  expand(fixture, nesting(from, to)) %>%
  # Order-independent key, so that (A, B) and (B, A) count as the same pair
  mutate(row_id = map2_chr(from, to, ~ paste(sort(c(.x, .y)), collapse = ","))) %>%
  group_by(fixture, row_id) %>%
  filter(row_number() == 1) %>%
  ungroup() %>%
  select(-row_id)

passes_flows <- unique_pairs %>%
  left_join(passes, by = c("fixture", "from", "to")) %>%
  left_join(passes, by = c("fixture" = "fixture", "from" = "to", "to" = "from")) %>%
  replace(is.na(.), 0L) %>%
  filter(n.x + n.y > 0) %>%
  rename(left = from, right = to, left_to_right = n.x, right_to_left = n.y)
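
D3's chord layout takes a square matrix in which cell [i, j] holds the flow from group i to group j. A table shaped like `passes_flows` can be reshaped into such a matrix. The sketch below uses invented data for a single fixture and is not necessarily how the chart in this post is fed.

```r
# Toy data, invented for illustration, with the same columns as passes_flows
# (minus fixture, i.e. a single match).
flows <- data.frame(
  left          = c("Player A", "Player A", "Player B"),
  right         = c("Player B", "Player C", "Player C"),
  left_to_right = c(12L, 3L, 7L),
  right_to_left = c(9L, 4L, 7L),
  stringsAsFactors = FALSE
)

players <- sort(unique(c(flows$left, flows$right)))

# Square matrix: rows are passers, columns are recipients.
pass_matrix <- matrix(
  0L,
  nrow = length(players), ncol = length(players),
  dimnames = list(from = players, to = players)
)

# Fill both directions by indexing with (row name, column name) pairs.
pass_matrix[cbind(flows$left, flows$right)] <- flows$left_to_right
pass_matrix[cbind(flows$right, flows$left)] <- flows$right_to_left
```

The diagonal stays at zero, since a player does not pass to himself.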

Conclusion

So is it worth the effort of learning d3? If you quickly want to produce visualisations for analyses or reports, then I would say no. It does require some effort to learn, and it certainly helps if you have prior experience with JavaScript. But if you're into web development or producing customised visualisations for your notebooks or dashboards, then yes. D3 has a highly active community with many examples you can learn from and use. Once you have your own library of d3 scripts, it's easy to bind data from R to d3 with the r2d3 package. If you're looking to take your data viz to the next level and really be creative without too many constraints, d3 might just be something for you!
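
To give an idea of what binding data from R to d3 looks like, here is a minimal sketch with r2d3. It draws a toy bar chart rather than the chord diagram. The values are invented, and the script is written to a temporary file only to keep the example self-contained. Inside the script, r2d3 makes `data`, `svg`, `width` and `height` available.

```r
# A minimal D3 script: one rectangle per observation.
script_path <- tempfile(fileext = ".js")
writeLines(c(
  "var barHeight = Math.floor(height / data.length);",
  "svg.selectAll('rect')",
  "  .data(data)",
  "  .enter().append('rect')",
  "    .attr('width', function(d) { return d * width; })",
  "    .attr('height', barHeight)",
  "    .attr('y', function(d, i) { return i * barHeight; })",
  "    .attr('fill', 'steelblue');"
), script_path)

shares <- c(0.3, 0.6, 0.8, 0.95, 0.4, 0.2)  # toy values between 0 and 1

# Only build the widget if r2d3 is installed.
if (requireNamespace("r2d3", quietly = TRUE)) {
  widget <- r2d3::r2d3(data = shares, script = script_path)
  # print(widget) renders it in the viewer, a notebook or a browser
}
```

In practice the script would live in its own .js file next to your R code.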