Instrument blocked edit attempts
Closed, ResolvedPublic

Description

This task covers implementing the instrumentation needed to log instances where people who are blocked attempt to edit Wikipedia.

More context can be found in T303995.

Questions we're aiming to answer with this instrumentation

(Note: This list is not exhaustive and may be updated):

  • High-level question: To what extent do blocks play a role in preventing potentially productive editors from making an edit?
  • In which countries do blocked edit attempts most frequently occur?
  • What are the differences in frequency of blocked edit attempts by wiki, editing interface, and platform?
  • What types of IP blocks (local/global, short-/long-term) are more frequently encountered?
  • How many distinct users are stopped from editing by a block?
  • Do we see any sudden increases or decreases in blocked edit attempts at a certain time and from a certain country?

Requirements

In Scope
  • Log an event whenever a user (logged in or out) clicks/taps an affordance to open an editing interface [1] and they are prevented from progressing because the account/IP address they are using to access Wikipedia has been blocked.[2]

Event details:

  • Block type (local/global, short-/long-term)
  • Platform [3]
  • Wiki
  • User ID
  • User IP, Agent, and Geo-Location [4]
  • Editing Interface
  • PageNamespace
  • User Edit Count

Sampling Requirements:

  • All of these events should be sampled at 100% so we can collect sufficient data and can more directly compare block rates to overall edit trends (edits, reverts, new active editors, etc)
  • Note: If we consider adding this instrumentation to EditAttemptStep, oversampling would need to be turned on for these events as EditAttemptStep is currently sampled

NOTES:
[1] Editing interface = Reply Tool (mobile + desktop), New Topic Tool (mobile + desktop), 2010 wikitext editor (desktop), source and visual editing modes (mobile, via MobileFrontend), and New Wikitext Editor (desktop).
[2] As noted in T303995#7996670, it is also possible for someone to be blocked while they're editing a page (after they opened the editor). Instrumentation for this is already in place in EditAttemptStep.
[3] It is important to be able to distinguish edits made through MobileFrontend using the platform field. Per the understandings described in T303995#7990254, “Any analyses we do ought to name and consider the fact that many people who are blocked from editing who are accessing the site through MobileFrontend may NOT attempt to initiate an edit because the interface suggests to them – by way of showing a "🔒" next to the top-most edit pencil – that they will be prevented from doing so.”
[4] If we decided to add this logging to EditAttemptStep, http.client_ip including IP and geocoded data will need to be added back in as this was removed from the schema in T262626. We will only be able to retain this data for 90 days; however, it's important for this dataset that we are able to determine where the blocked edit attempt is coming from.

Out of Scope:
  • Log an event whenever a user views a page but does not click/tap an affordance to open an editing interface because they are blocked. See rationale noted in Editing team's recommendations in T303995#7990254

Implementation details

@MNeisler and @DLynch will populate this section with implementation details that will meet the Requirements listed above.

Event details

  • TBD

Done

  • Implementation details are documented
  • All Requirements and Implementation details are implemented
  • Editing QA has verified that the event(s) defined in the Implementation details section above are emitted when someone (logged in or out) attempts to initiate an edit using any editing interface. Reason: new instrumentation mostly impacts server-side.
  • @MNeisler verifies the new events are being logged in the yet-to-be-determined schema they are intended to land in

Event Timeline

mpopov triaged this task as Medium priority. Jun 14 2022, 5:11 PM
mpopov moved this task from Triage to Upcoming Quarter on the Product-Analytics board.
mpopov raised the priority of this task from Medium to High. Jun 14 2022, 5:13 PM
MNeisler updated the task description.
MNeisler edited projects, added Product-Analytics; removed Product-Analytics (Kanban).

@DLynch - I've updated the task description with what I believe are requirements for implementation. Can you please review and let me know if you have any concerns or questions about the identified requirements?

FYI - I'm out next week (4 July - 8 July) but happy to discuss when I'm back.

@MNeisler The data to be logged all seems reasonable. I'm assuming that platform will have the same complicated relationship with reality as the existing schemas' usage of it (i.e. "mobile" will mean "MobileFrontend", without any relationship to whether it's running on a mobile device per se).

Adding this to EditAttemptStep seems like a poor fit, as it'd create a whole new class of session -- one where all the normal editor-initialization doesn't apply and we just log an "edit wasn't even attempted" event.

Should we maybe pursue the new Metrics Platform for this? @phuedx?

Change 820908 had a related patch set uploaded (by DLynch; author: DLynch):

[schemas/event/secondary@master] New schema: editattemptsblocked

https://gerrit.wikimedia.org/r/820908

Reassigning this task to me while I review the patch to create the new schema.

Apparently there is an autoblockipblock event which is a subset of what's asked here. Not sure if it actually gets logged.

Should we maybe pursue the new Metrics Platform for this? @phuedx?

I'm sorry for not having responded to this sooner.

Since we're currently working on migrating the EditAttemptStep, and the MobileWeb- and DesktopWebUIActionsTracking instruments to use the Metrics Platform Client so that we can assess the suitability of the schema and make adjustments as necessary, I think it's best to use the Event Platform for now.

Apparently there is an autoblockipblock event which is a subset of what's asked here. Not sure if it actually gets logged.

FYI the schema was confirmed as deprecated in T267340: AutoblockIpBlock Event Platform Migration. It could easily be reused and updated though as there's no active instrument using it.

@DLynch

I finished taking a look through the proposed schema and just have a few questions and suggestions. See below:

  • We should add in a block ID. This will enable us to look up more information about the block if needed in the logging table.
  • Does it make sense to also add performer.user_is_bot? Note: We'll also need to be able to distinguish if the user is logged-in or registered but we should be able to do that with the user_id field.
  • You mentioned that platform and interface fields are redundant in this case, which makes sense because the proposed field includes the mobilefrontend value needed to distinguish those from desktop site events (and we are currently unable to make the distinction between devices pending resolution of T249944). Just some follow-up questions to clarify the meaning of the possible values:
    • interface = wikieditor, visualeditor or discussiontools would always be associated with desktop site visits, correct? I want to make sure we can use this field to look at desktop site vs mobilefrontend blocks in analytics.
    • What might be counted as other in this case?
  • Block_types: I think the current schema covers the key information we need to track about block types (local vs global, fixed vs indefinite) but it would be useful to align as closely as possible with the block types being identified by the Growth team in T306018 for consistency.

Does it make sense to also add performer.user_is_bot?

I think that's not available from client-side logging, which some of this is going to have to be.

interface = wikieditor, visualeditor or discussiontools would always be associated with desktop site visits, correct? I want to make sure we can use this field to look at desktop site vs mobilefrontend blocks in analytics.

Drat, you're right -- discussiontools could be associated with mobile or desktop. I will add the platform field back in to account for this.

What might be counted as other in this case?

I'll admit that I just threw that in as a placeholder in case I turned up paths to being-blocked during implementation that weren't covered by the main routes.

it would be useful to align as closely as possible with the block types being identified by the Growth team in T306018 for consistency.

I could use their fields exactly. It'd leave you on the hook for working out short/long-term blocks during analysis, though, since they're just logging block_expiry as the timestamp when the block expires.

Thanks @DLynch

I could use their fields exactly. It'd leave you on the hook for working out short/long-term blocks during analysis, though, since they're just logging block_expiry as the timestamp when the block expires.

Let's go ahead and use their block_type, block_scope, and block_expiry fields as they have been defined in https://gerrit.wikimedia.org/r/c/schemas/event/secondary/+/821711/7/jsonschema/analytics/mediawiki/createaccount_blocked_user/current.yaml

I'm fine using the block_expiry field to work out short and long-term blocks. Do you know how any indefinite blocks (blocks without an expiration date) would be indicated by this field? I'm primarily interested in being able to distinguish between fixed and indefinite blocks. As long as I can sort that out with the block_expiry field, we can get rid of block_duration.

Change 824578 had a related patch set uploaded (by Gergő Tisza; author: Gergő Tisza):

[mediawiki/extensions/GlobalBlocking@master] [WIP] Log block ID

https://gerrit.wikimedia.org/r/824578

Sorry, that patch was meant for the other task. Although I guess it's somewhat relevant to any block analytics work.

Re: block_expiry, I think I would go for using a null expiry for indefinite blocks. The current code for createaccount_blocked_user uses the MediaWiki representation as-is (in theory the string infinite but it could probably be something more esoteric, judging from the code in ApiResult::formatExpiry).
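To illustrate how fixed and indefinite blocks could be told apart during analysis from block_expiry alone, here is a minimal Python sketch. It rests on several assumptions that are not confirmed in this task: that indefinite blocks arrive either as null or as one of a handful of MediaWiki-style strings, that fixed expiries are ISO 8601 timestamps, and that a 7-day cut-off separates short- from long-term blocks. The function name and the cut-off are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: representations that may denote a block with no expiry.
INDEFINITE_VALUES = {"infinity", "infinite", "indefinite", "never"}

# Hypothetical cut-off between short- and long-term blocks; not defined in the task.
SHORT_TERM_LIMIT = timedelta(days=7)


def classify_block_expiry(block_expiry: Optional[str], logged_at: datetime) -> str:
    """Return 'indefinite', 'short-term' or 'long-term' for one logged event."""
    if block_expiry is None or block_expiry.strip().lower() in INDEFINITE_VALUES:
        return "indefinite"
    # Assumption: fixed expiries look like 2022-10-20T00:00:00Z.
    expiry = datetime.fromisoformat(block_expiry.strip().replace("Z", "+00:00"))
    if expiry.tzinfo is None:
        # Treat timestamps without an offset as UTC.
        expiry = expiry.replace(tzinfo=timezone.utc)
    # Time left on the block when the edit was attempted.
    remaining = expiry - logged_at
    return "short-term" if remaining <= SHORT_TERM_LIMIT else "long-term"
```

Note that this measures the time remaining at the moment of the attempt, not the block's full duration; recovering the full duration would need the block's start time, for example by looking the block up by its ID.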

Change 824578 merged by jenkins-bot:

[mediawiki/extensions/GlobalBlocking@master] Log block ID

https://gerrit.wikimedia.org/r/824578

Change 832147 had a related patch set uploaded (by DLynch; author: DLynch):

[mediawiki/extensions/DiscussionTools@master] Add mw.track call when comment setup fails

https://gerrit.wikimedia.org/r/832147

Change 832148 had a related patch set uploaded (by DLynch; author: DLynch):

[mediawiki/extensions/WikimediaEvents@master] Log blocked edit attempts

https://gerrit.wikimedia.org/r/832148

Change 833442 had a related patch set uploaded (by DLynch; author: DLynch):

[operations/mediawiki-config@master] Register the editattempt_block schema

https://gerrit.wikimedia.org/r/833442

(I realized I'd forgotten to upload the config patch, not that it's actually needed until everything else is reviewed.)

Change 820908 merged by jenkins-bot:

[schemas/event/secondary@master] New schema: editattemptsblocked

https://gerrit.wikimedia.org/r/820908

Change 832147 merged by jenkins-bot:

[mediawiki/extensions/DiscussionTools@master] Add mw.track call when comment setup fails

https://gerrit.wikimedia.org/r/832147

Change 832148 merged by jenkins-bot:

[mediawiki/extensions/WikimediaEvents@master] Log blocked edit attempts

https://gerrit.wikimedia.org/r/832148

Change 833442 merged by jenkins-bot:

[operations/mediawiki-config@master] Register the editattempt_block schema

https://gerrit.wikimedia.org/r/833442

Mentioned in SAL (#wikimedia-operations) [2022-10-12T20:02:48Z] <samtar@deploy1002> Started scap: Backport for [[gerrit:833442|Register the editattempt_block schema (T310390)]]

Mentioned in SAL (#wikimedia-operations) [2022-10-12T20:03:11Z] <samtar@deploy1002> samtar and kemayo: Backport for [[gerrit:833442|Register the editattempt_block schema (T310390)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet

Mentioned in SAL (#wikimedia-operations) [2022-10-12T20:08:31Z] <samtar@deploy1002> Finished scap: Backport for [[gerrit:833442|Register the editattempt_block schema (T310390)]] (duration: 05m 42s)

Now that the instrumentation appears to be deployed, I'm assigning this over to @MNeisler to complete server-side QA.

I completed QA of the data logged to date for the new mediawiki_editattempt_block schema. All data appears to be logging as expected, with one small issue identified impacting less than 1% of events. See a description of the issue below as well as a summary of confirmed checks and some initial data:

Country Code Error
There are some events (about 0.8% of all events logged) where the country_code field is not logged appropriately. In these cases, instead of the expected two-letter country code, there are long character strings like the following:
'"()&%<acx><ScRiPt >XNeG(9125)</ScRiPt>
"convert(int,sys.fn_sqlvarbasetostr(HashBytes('MD5','1274892938')))"
qRp6RTLxhFdAKjduhyvZAbYK2PQ4FPeJbyTMBWaXLau1BJ3zxYcs8GqAwmwnxDyK

@DLynch - Do you know what might be causing this? Let me know if there is additional data that would be helpful.

Confirmed Checks

  • All expected block types are logged and appear as expected.
    block_type   n_events   n_users   pct_users
    autoblock         134       110        1.7%
    ip            1319912      1029         16%
    range         5829406      2834         44%
    user             9475      2465         38%
  • Both desktop and mobile blocks are logged as expected (90% of blocked edit attempts occur on desktop)
  • All expected interface types are logged
    interface         n_events   n_users   pct_users
    discussiontools       6562       438          6%
    mobilefrontend      588299      1417       19.5%
    visualeditor        181435      1849       25.4%
    wikieditor         6383838      3580         49%
  • Blocked edit attempts by both logged and anon users are logged. All anon users are indicated by user_id = 0.
  • Both local and global block types are logged. (64% of events are local; 36% are global)
  • Infinite blocks can be correctly indicated by block_expiry == 'infinity'
  • Logging of events started on 12 October 2022, when the config patch to register the schema was deployed.
  • Except for the issue identified above, the country code appears to be logging correctly for the majority of events and numbers appear as expected.
  • All related page and edit info (page namespace, page id, and rev id) is logged correctly
  • Blocks by User Edit Count appear as expected. All anon users are logged as having an edit count of 0.
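As a quick arithmetic check on the block-type figures in the checks above, the sketch below assumes that pct_users is each block type's n_users divided by the sum of n_users across the four types. That reading is inferred from the reported numbers, not stated in the QA notes, and the function name is hypothetical.

```python
from typing import Dict

# n_users per block type, as reported in the QA summary.
N_USERS_BY_BLOCK_TYPE: Dict[str, int] = {
    "autoblock": 110,
    "ip": 1029,
    "range": 2834,
    "user": 2465,
}


def share_of_users(n_users: Dict[str, int]) -> Dict[str, float]:
    """Percentage of the summed n_users that falls in each category."""
    total = sum(n_users.values())
    return {key: round(100 * count / total, 1) for key, count in n_users.items()}


shares = share_of_users(N_USERS_BY_BLOCK_TYPE)
```

Under that assumption the shares come out at 1.7, 16.0, 44.0 and 38.3, which agrees with the reported 1.7%, 16%, 44% and 38% after rounding. A user who hit more than one block type would be counted once per type here, so the denominator is not necessarily the number of distinct users.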

QA notebook

@MNeisler Huh, interesting. So, the country code is generated like this:

	private static function getCountryCode() {
		$request = RequestContext::getMain()->getRequest();
		$country = false;
		// Use the GeoIP cookie if available.
		$geoip = $request->getCookie( 'GeoIP', '' );
		if ( $geoip ) {
			$components = explode( ':', $geoip );
			$country = $components[0];
		}
		// If no country was found yet, try to do GeoIP lookup
		// Requires php5-geoip package
		if ( !$country && function_exists( 'geoip_country_code_by_name' ) ) {
			$ip = $request->getIP();
			if ( IPUtils::isValid( $ip ) ) {
				$country = geoip_country_code_by_name( $ip );
			}
		}
		return $country;
	}

As such, I'd speculate that this is users who're doing something intensely weird to their GeoIP cookie. Probably, since we're talking about blocked users, it's people who're throwing potential XSS vectors at the site to see if anything happens.

I could either add some validation to the logging code to make sure the country code is 2 characters, or I could leave it as-is if this is simple for you to filter out.

@DLynch - Interesting! I think we can leave this as is as it's simple to filter out and it might be useful to track these types of events.
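For reference, filtering these values out at analysis time could look roughly like the Python sketch below. It assumes a well-formed country_code is exactly two uppercase ASCII letters and that each event is available as a dictionary with a flat country_code key; the function names and the event layout are hypothetical, not the schema's actual structure.

```python
import re
from typing import Any, Dict, Iterable, List, Tuple

# Assumption: a valid code is exactly two uppercase ASCII letters, e.g. "US".
COUNTRY_CODE_RE = re.compile(r"[A-Z]{2}")


def is_valid_country_code(value: Any) -> bool:
    """True only for strings made of exactly two uppercase ASCII letters."""
    return isinstance(value, str) and COUNTRY_CODE_RE.fullmatch(value) is not None


def split_by_country_code(
    events: Iterable[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Separate events with a well-formed country_code from the rest."""
    valid: List[Dict[str, Any]] = []
    suspicious: List[Dict[str, Any]] = []
    for event in events:
        if is_valid_country_code(event.get("country_code")):
            valid.append(event)
        else:
            suspicious.append(event)
    return valid, suspicious
```

Returning the rejected events rather than dropping them keeps the malformed values available, in case tracking them turns out to be useful.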

@ppelberg - Reassigning to you for review and sign-off. See a summary of the QA checks and some initial results in T310390#8332967