Improving Observability of Wiki Education Dashboard: What it entails

Introducing the Wiki Education Dashboard 🔬

The first thing to note is that the Wiki Education Dashboard is actually two different sites: the Wiki Education Dashboard (dashboard.wikiedu.org) and the Wikimedia Programs & Events Dashboard (outreachdashboard.wmflabs.org) (explained better here). The term ‘Dashboard’, and sometimes ‘Wiki Education Dashboard’, is used as an umbrella term for both sites, since they run the same general dashboard code, with environment-specific code where needed. The current homepages of the two sites are displayed below. Can you spot the difference?

My project is titled ‘Improve observability of Wiki Education Dashboard’. Pretty straightforward, right? Well, unless you don’t know what observability is. A few months ago, I did not.

According to Wikipedia, “Observability measures how well a system's state can be understood from obtained telemetry (metrics, logs, traces, profiling)”. Basically, it is about how much can be learned about a system from data such as metrics, logs and other telemetry. So in short, the purpose of my project is to make it easier for system administrators and end-users (course instructors, students, and others) to detect and understand problems with the system, which here is the Dashboard.

Observability is typically the responsibility of Site Reliability Engineers (SREs), but in full-stack environments, it often falls to Backend or Full-Stack Engineers as well. As a Full-Stack Engineer with a backend focus, this project is a great fit for me.

My project can be divided into three sub-projects, which I detail below:

1. Improving Existing documentation of the Dashboard’s Infrastructure and Deployment(s) 🖹

Phabricator is Wikimedia’s open-source tool for managing issues and projects, and this task was actually born out of a Phabricator ticket. A Wikimedia SRE was trying to debug a downtime incident with the Programs & Events Dashboard but initially could not find any details about where or how it was deployed. It took a while to figure out, so they created a ticket requesting documentation of the Dashboard’s deployments.

In short, improving the existing documentation makes the system more observable by providing clear reference points for diagnosing issues. By clarifying how the different parts of the system interact—the servers, APIs, tools used, and others—I’m helping ensure that future incidents like this can be resolved faster.

2. Reducing Noise in Sentry (and New Relic) 🔇

What is noise, and why does it matter? Noise is the clutter of irrelevant or low-priority error reports in monitoring tools that makes it harder to focus on critical issues. Sentry helps track errors and performance, while New Relic focuses on system metrics like response times and error rates. Together, they provide a full picture of system health, but only if they’re showing the right data. Reducing noise in Sentry is key to making it more effective—helping us focus on actionable issues and giving clear insights into what really matters.

I’ve started tackling this by archiving redundant errors and addressing the actionable ones. It’s a work in progress, but I’m excited to see the difference this will make once it’s complete.
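Besides archiving issues in the Sentry UI, noise can also be filtered before it is ever reported. The sketch below is only an illustration of that idea, not how the Dashboard is actually configured: it assumes the `sentry-ruby` gem's `before_send` hook, and `TransientNetworkError` is a made-up error class standing in for whatever the team decides is not actionable.

```ruby
# Hypothetical error class standing in for a known, non-actionable error.
class TransientNetworkError < StandardError; end

# Hypothetical list of error classes treated as noise.
NOISY_ERROR_CLASSES = [TransientNetworkError].freeze

# A filter in the shape sentry-ruby expects for `before_send`:
# it receives the event and a hint hash, and returning nil drops the event.
NOISE_FILTER = lambda do |event, hint|
  exception = hint[:exception]
  noisy = NOISY_ERROR_CLASSES.any? { |klass| exception.is_a?(klass) }
  noisy ? nil : event
end

# Wiring it up (assumes the sentry-ruby gem is installed and configured):
#   Sentry.init do |config|
#     config.before_send = NOISE_FILTER
#   end
```

Keeping the filter as a plain lambda means it can be unit-tested without a Sentry connection, and the list of noisy classes stays in one reviewable place.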

3. Building a User-facing System Status and Performance User Interface (UI) 👩‍💻

A typical system status site is used to provide essential context to admins and users about the system's status, helping reduce confusion during outages or slowdowns. For example, platforms like Reddit Status and Wikimedia Status use such sites to communicate the current state of their systems, including ongoing issues, scheduled maintenance, or performance disruptions.

The proposed Status UI for the Dashboard is user-facing, in the sense that its purpose is to clearly indicate whether the system components are functioning normally and, if not, provide insights like when the issue might be resolved. For example, the numerous courses in the Dashboard undergo updates that pull in useful statistics such as the number of articles edited, which editors made which contributions, and so on. The duration of such an update scales linearly with the size of the course, and errors can also occur during the process. As a result, the Sidekiq queues processing the updates have varying latencies, and those latencies indicate whether there is a backlog or the queues are keeping up.

An important part of my work involves deciding the level of detail to show and how to convert raw system data like latencies into user-friendly metrics that anyone—regardless of their technical knowledge—can understand.
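To make this concrete, here is a minimal sketch of how a raw queue latency (in seconds) might be turned into wording a non-technical user can read. The thresholds, labels, and method names are all assumptions for illustration and not the Dashboard's actual values; in a real Rails app the latency itself could come from Sidekiq's API, as noted in the comments.

```ruby
# Hypothetical thresholds, in seconds. Real values would need tuning
# against how long course updates normally wait in the queue.
NORMAL_LIMIT  = 60    # under a minute: queue is keeping up
DELAYED_LIMIT = 1800  # under 30 minutes: noticeable delay

# Maps a queue latency in seconds to a user-friendly status label.
# In a real app the latency could be read with
#   require 'sidekiq/api'
#   Sidekiq::Queue.new(queue_name).latency
# which returns how long the oldest job has been waiting, in seconds.
def queue_status(latency_seconds)
  if latency_seconds < NORMAL_LIMIT
    'Operating normally'
  elsif latency_seconds < DELAYED_LIMIT
    'Updates are delayed'
  else
    'Significant backlog'
  end
end

# Converts a latency in seconds into a rough, readable wait time.
def humanize_wait(latency_seconds)
  return 'less than a minute' if latency_seconds < 60

  minutes = (latency_seconds / 60.0).round
  return "about #{minutes} min" if minutes < 60

  "about #{(minutes / 60.0).round} h"
end

puts "#{queue_status(600)} (waiting #{humanize_wait(600)})"
```

Separating the status label from the wait-time wording lets the UI show a simple coloured indicator for most users while still offering the approximate delay to those who want more detail.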

The UI will be integrated into the existing Dashboard code, with its frontend server-rendered using Rails.

The Why

The Dashboard is an application that’s more important than it might seem. While on Phabricator, I stumbled upon these comments thanking my mentor and longtime maintainer, Sage Ross, for getting the Dashboard back up and running after a downtime:

This is what excites me about working on this project: the opportunity to contribute to something genuinely needed by people. This is the biggest project I’ve ever worked on, and the fact that it’s open source means my efforts are out there in the open for anyone to see. Contributing to the Wiki Education Dashboard felt daunting at first, but I’m so glad I took the chance. It’s helping me grow exponentially as a software developer, and I can’t wait to keep making meaningful contributions and learning even more along the way.

That is all for now. Thanks for reading, and I hope you learned something! 😊✨
