Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Page MenuHomePhabricator

Performance review of LoginNotify
Closed, ResolvedPublic

Description

This extension uses many hooks that might potentially slow down login.

I don't think we measure the performance of login per se at the moment.

Event Timeline

This extension uses many hooks that might potentially slow down login.

You're right. It did. Then @MaxSem worked on moving the expensive processing into a job queue:
https://gerrit.wikimedia.org/r/#/c/360787/

That has hopefully fixed all problems but a performance review would still be very welcome.

Yes, I noticed that, it's definitely a great thing that the bulk of the work happens async. There is one caveat to using jobs and I wanted to make sure that your current approach was deliberate.

Inserting a job may fail and throw an exception when it does. In the current way you've written the extension, this probably means that failure to insert these new jobs will make the login request fail.

You could, for performance purposes, wait until the response has been sent to the client before inserting the job, ignoring any potential failure. It would make performance the same as the status quo (any extra work would happen after the login has happened), but it would do so at the expense of possibly missing suspicious logins that wouldn't be reported to the user whenever the job queue has intermittent failures.

One thing to consider as well is that you might be introducing a dependency between the job queue functioning well and login. I'm not familiar with the auth code, so I don't know if it already inserts other jobs that would make login fail, but this is something worth looking into. It wouldn't be a performance concern, but an availability one. I.e. issues with the job queue could make login malfunction for everyone. This particular problems depends on the existing login/auth code and its existing use of jobs or lack thereof, which you would have to investigate (or just ask someone who knows the login/auth code very well).

All of this means that you're looking at a tradeoff between performance, security and availability in this decision of whether job insertion failure should make login fail.

Okay, I tried breaking the job queue on my local install and seeing if it would prevent me from logging in. Here are the things I tried:

  1. Set $wgJobRunRate = 0; in my LocalSettings.
  2. Returned false in this function, right at the top.
  3. Returned false in this core function, right at the top.

I tried all of the above in isolation and then all together. None of these prevented me from logging in. I did all these while making sure I was first generating a login attempt notification and then logging in,

I don't think the extension is introducing a dependency between the job queue and logins.

Change 366160 had a related patch set uploaded (by MaxSem; owner: MaxSem):
[mediawiki/extensions/LoginNotify@master] Enqueue jobs postsend

https://gerrit.wikimedia.org/r/366160

Niharika added a project: Community-Tech.

Change 366160 merged by jenkins-bot:
[mediawiki/extensions/LoginNotify@master] Enqueue jobs postsend

https://gerrit.wikimedia.org/r/366160

@Gilles, anything left here? Are we good to deploy this extension further?

@Gilles is on holiday/vacation now.

@Krinkle - are you able to answer @MaxSem's question? Thanks!

Looks like Gilles is on vacation until August 6.

@kaldari I edited my previous comment to correct type on @Krinkle . @ksmith told me that Performance team will be short staffed for the next 2 weeks. @greg just told me that Timo will be only Performance engineer working during that time.

@Gilles, anything left here? Are we good to deploy this extension further?

The job queue interaction was the main thing that stood out in terms of impact to performance and/or availability (e.g. dependency on job queue).

I can't think of any other immediate concerns, but the only other thing I'd like to have in place going forward is real data.

Real data about how long backend processing takes for login and sign ups. That way we can see the impact before/after this deployment, as well as for the future when the extension changes. Two simple timing measures would suffice. Perhaps using Timing (RequestContext::getTiming), similar to how we already have viewResponseTime and editResponseTime in WikimediaEvents. Could measure the whole request (from a hook, scheduling a deferred update, like for viewResponseTime), although that probably measures too much and would be noisy. Perhaps better to measurer just from the firs to the last relevant hook. This could then be plotted in Grafana to provide login/signup counts/median/p99, etc.

Krinkle reassigned this task from Krinkle to kaldari.