Performance review of LoginNotify
Closed, ResolvedPublic
Actions

Assigned To

Authored By

	• Gilles
	Jul 17 2017, 2:24 PM

Description

This extension uses many hooks that might potentially slow down login.

I don't think we measure the performance of login per se at the moment.

Details

	Subject	Repo	Branch	Lines +/-
	Enqueue jobs postsend	mediawiki/extensions/LoginNotify	master	+1 -1

Customize query in gerrit

Related Objects

Mentioned In: T171882: Add profiling instrumentation for log-in
rELGN19a232531424: Enqueue jobs postsend
rELGN4d141699e12d: Enqueue jobs postsend
rELGNac8602e551e7: Enqueue jobs postsend
T170358: Deploy LoginNotify everywhere
Mentioned Here: T170345: Record LoginNotify statistics
T171882: Add profiling instrumentation for log-in

Event Timeline

• Gilles created this task.Jul 17 2017, 2:24 PM

Restricted Application added a subscriber: Aklapper. · View Herald TranscriptJul 17 2017, 2:24 PM

This extension uses many hooks that might potentially slow down login.

You're right. It did. Then @MaxSem worked on moving the expensive processing into a job queue:
https://gerrit.wikimedia.org/r/#/c/360787/

That has hopefully fixed all problems but a performance review would still be very welcome.

Krinkle subscribed.Jul 17 2017, 6:38 PM

Yes, I noticed that, it's definitely a great thing that the bulk of the work happens async. There is one caveat to using jobs and I wanted to make sure that your current approach was deliberate.

Inserting a job may fail and throw an exception when it does. In the current way you've written the extension, this probably means that failure to insert these new jobs will make the login request fail.

You could, for performance purposes, wait until the response has been sent to the client before inserting the job, ignoring any potential failure. It would make performance the same as the status quo (any extra work would happen after the login has happened), but it would do so at the expense of possibly missing suspicious logins that wouldn't be reported to the user whenever the job queue has intermittent failures.

One thing to consider as well is that you might be introducing a dependency between the job queue functioning well and login. I'm not familiar with the auth code, so I don't know if it already inserts other jobs that would make login fail, but this is something worth looking into. It wouldn't be a performance concern, but an availability one. I.e. issues with the job queue could make login malfunction for everyone. This particular problems depends on the existing login/auth code and its existing use of jobs or lack thereof, which you would have to investigate (or just ask someone who knows the login/auth code very well).

All of this means that you're looking at a tradeoff between performance, security and availability in this decision of whether job insertion failure should make login fail.

Niharika mentioned this in T170358: Deploy LoginNotify everywhere.Jul 18 2017, 5:36 PM

Okay, I tried breaking the job queue on my local install and seeing if it would prevent me from logging in. Here are the things I tried:

Set $wgJobRunRate = 0; in my LocalSettings.
Returned false in this function, right at the top.
Returned false in this core function, right at the top.

I tried all of the above in isolation and then all together. None of these prevented me from logging in. I did all these while making sure I was first generating a login attempt notification and then logging in,

I don't think the extension is introducing a dependency between the job queue and logins.

Change 366160 had a related patch set uploaded (by MaxSem; owner: MaxSem):
[mediawiki/extensions/LoginNotify@master] Enqueue jobs postsend

https://gerrit.wikimedia.org/r/366160

gerritbot added a project: Patch-For-Review.Jul 18 2017, 9:34 PM

MaxSem mentioned this in rELGNac8602e551e7: Enqueue jobs postsend.Jul 18 2017, 9:35 PM

Niharika triaged this task as High priority.Jul 19 2017, 2:20 PM

Niharika added a project: Community-Tech.

MaxSem mentioned this in rELGN4d141699e12d: Enqueue jobs postsend.Jul 19 2017, 6:45 PM

• Gilles claimed this task.Jul 19 2017, 6:54 PM

• Gilles moved this task from Inbox, needs triage to Doing (old) on the Performance-Team board.Jul 19 2017, 7:04 PM

MaxSem mentioned this in rELGN19a232531424: Enqueue jobs postsend.Jul 19 2017, 10:22 PM

Change 366160 merged by jenkins-bot:
[mediawiki/extensions/LoginNotify@master] Enqueue jobs postsend

https://gerrit.wikimedia.org/r/366160

ReleaseTaggerBot added a project: MW-1.30-release-notes (WMF-deploy-2017-07-25_(1.30.0-wmf.11)).Jul 20 2017, 12:01 AM

Krinkle removed a project: Patch-For-Review.Jul 22 2017, 2:35 AM

@Gilles, anything left here? Are we good to deploy this extension further?

@Gilles is on holiday/vacation now.

@Krinkle - are you able to answer @MaxSem's question? Thanks!

Looks like Gilles is on vacation until August 6.

@kaldari I edited my previous comment to correct type on @Krinkle . @ksmith told me that Performance team will be short staffed for the next 2 weeks. @greg just told me that Timo will be only Performance engineer working during that time.

greg unsubscribed.Jul 26 2017, 9:30 PM

In T170825#3473002, @MaxSem wrote:

@Gilles, anything left here? Are we good to deploy this extension further?

The job queue interaction was the main thing that stood out in terms of impact to performance and/or availability (e.g. dependency on job queue).

I can't think of any other immediate concerns, but the only other thing I'd like to have in place going forward is real data.

Real data about how long backend processing takes for login and sign ups. That way we can see the impact before/after this deployment, as well as for the future when the extension changes. Two simple timing measures would suffice. Perhaps using Timing (RequestContext::getTiming), similar to how we already have viewResponseTime and editResponseTime in WikimediaEvents. Could measure the whole request (from a hook, scheduling a deferred update, like for viewResponseTime), although that probably measures too much and would be noisy. Perhaps better to measurer just from the firs to the last relevant hook. This could then be plotted in Grafana to provide login/signup counts/median/p99, etc.

Krinkle moved this task from Doing (old) to Radar on the Performance-Team board.Jul 26 2017, 9:58 PM

kaldari mentioned this in T171882: Add profiling instrumentation for log-in.Jul 27 2017, 6:25 PM

Krinkle claimed this task.Aug 1 2017, 4:04 PM

Krinkle reassigned this task from Krinkle to kaldari.

Krinkle edited projects, added Performance-Team (Radar); removed Performance-Team.Aug 8 2017, 3:14 AM

Krinkle moved this task from Limbo to Perf issue on the Performance-Team (Radar) board.Aug 8 2017, 3:20 AM