
Kishu: Time-Traveling for Computational Notebooks

Zhaoheng Li, Supawit Chockchowwat, Ribhav Sahu, Areet Sheth, Yongjoo Park
University of Illinois at Urbana-Champaign
{zl20, supawit2, ribhav2, assheth2, yongjoo}@illinois.edu

Abstract.

Computational notebooks (e.g., Jupyter, Google Colab) are widely used by data scientists. A key feature of notebooks is the interactive computing model of iteratively executing cells (i.e., a set of statements) and observing the result (e.g., model or plot). Unfortunately, existing notebook systems do not offer time-traveling to past states: when the user executes a cell, the notebook session state consisting of user-defined variables can be irreversibly modified—e.g., the user cannot ’un-drop’ a dataframe column. This is because, unlike DBMS, existing notebook systems do not keep track of the session state. Existing techniques for checkpointing and restoring session states, such as OS-level memory snapshot or application-level session dump, are insufficient: checkpointing can incur prohibitive storage costs and may fail, while restoration can only be inefficiently performed from scratch by fully loading checkpoint files.

In this paper, we introduce a new notebook system, Kishu, that offers time-traveling to and from arbitrary notebook states using an efficient and fault-tolerant incremental checkpoint and checkout mechanism. Kishu creates incremental checkpoints that are small and correctly preserve complex inter-variable dependencies at a novel Co-variable granularity. Then, to return to a previous state, Kishu accurately identifies the state difference between the current and target states to perform incremental checkout at sub-second latency with minimal data loading. Kishu is compatible with 146 object classes from popular data science libraries (e.g., Ray, Spark, PyTorch), and reduces checkpoint size and checkout time by up to 4.55$\times$ and 9.02$\times$, respectively, on a variety of notebooks.

1. Introduction

Computational notebooks (e.g., Jupyter (Jupyter, 2023; Team, 2023a), RStudio (Posit Software, 2023)) are widely used by data scientists (Ormond, 2018; Perkel, 2018). A key feature of the notebook workflow is iterative code execution and result observation (Amershi et al., 2019; Chattopadhyay et al., 2020), which is highly compatible with the incremental nature of data science tasks, such as interactive tutorials (Johnson, 2020), data exploration (Crotty et al., 2015; Zgraggen et al., 2014; Dunne et al., 2012), visualization (Eichmann et al., 2020), and model tuning (Wagenmakers and Farrell, 2004; Bergstra and Bengio, 2012). This iterative workflow is enabled by notebook systems being stateful: to do work, users start a session; then, as they execute code in the notebook system, the results are held in the session state as user-defined variables (e.g., loaded datasets, fitted models).

Limitation: no Time-Traveling for Notebooks

Oftentimes, during a workflow, users would like to revert changes made to the session state (i.e., time-travel), such as to undo a modification (e.g., restore a dropped column of a dataframe (StackOverflow, 2024)), restore an overwritten variable (Groups, 2024), or perform reverse debugging (Brachmann and Spoth, 2020). Unfortunately, unlike program debuggers (e.g., gdb) (Phang et al., 2013; Barr and Marron, 2014; GDB, 2024), relational databases (e.g., PITR in PostgreSQL and MySQL (Group, 1996a; Oracle, 2024b)), or interactive data systems (Dunne et al., 2012; Crotty et al., 2015; Kraska, 2021), which support time-traveling to past program states, existing notebook systems do not natively keep track of past session states: cell executions cannot be undone, e.g., the user cannot ’un-drop’ a dataframe column. If the user executes a cell that alters the session state, a common approach to restore the previous state would be to restart the kernel and then (painstakingly) re-run past cells in the correct order. While code versions can be saved using tools such as Git (Git, 2024b) or native commands (e.g., Jupyter’s %checkpoint (Team, 2023b), which, despite its name, only stores cell code and not objects in the state) to simplify identifying cells to rerun for restoration, cell reruns can still be time-consuming (e.g., re-training an ML model) and/or result in incorrect restoration (e.g., random train-test splits). Another approach is for the user to periodically checkpoint the session state (e.g., memory dump (CRIU, 2023; Ansel et al., 2009) or session state serialization (Foundation, 2023h)) to storage or a managed database (e.g., KV-store (Team, 2023c)). Then, users can load an appropriate checkpoint file to restore the session state.
However, performing session checkpointing and restoration using these tools is limiting: checkpointing can incur prohibitive costs (§ 7.3, § 7.4) and may fail on certain workloads (e.g., GPU (cri, [n.d.])), and restoration can either (1) only be (inefficiently) performed from scratch, requiring completely loading a checkpoint file (Foundation, 2023h) and/or killing the current kernel (CRIU, 2023), or (2) may be incorrect, breaking inter-variable relations (Team, 2023c).

Our Goal: Generalizable, Correct, and Efficient Time-Traveling

We propose Kishu, a notebook system that enables time-traveling between session states: as the user executes cells, Kishu tracks the session state evolution while writing per-cell incremental checkpoints containing differing data between successive states (i.e., the state delta) for returning to any past state via an incremental checkout later. Kishu pursues three challenging goals. (1) Delta-Efficient Checkpoint: Kishu aims to minimize incremental checkpointing overhead by exploiting the small per-cell deltas typical of data science workflows (§ 7.6.1), while also avoiding high detection overhead in the face of complex access patterns and inter-variable dependencies. (2) Correct & Non-intrusive Checkout: Kishu aims to restore past states in the same session non-intrusively by leveraging existing objects in the kernel (that don’t need updating) to minimize data loading costs, while still guaranteeing checkout accuracy as if it completely loaded a checkpoint file. (3) Generalizability: Kishu aims to support checkpointing/checkout for almost all notebook libraries (and/or use cases), of which there is a large variety, e.g., notebooks can perform distributed computing (e.g., Spark (Zaharia et al., 2010)) or move data off-CPU (e.g., GPUs (Foundation, 2024a)). If Kishu can achieve these goals, it will allow users to undo almost any executed cell that undesirably modifies the state, as if it never occurred, by quickly checking out the pre-execution state at minimal workflow overhead.

Our Approach

Our core idea for achieving the aforementioned goals is to capture the session state delta with low overhead, but at a sufficiently high granularity using information exclusively available at the application level; this will allow us to perform generalizable, correct, and efficient time-traveling, as follows:

First, for delta-efficient incremental checkpointing, Kishu utilizes low-overhead live analysis (e.g., namespace patching) to track session state evolution at a novel Co-variable granularity (i.e., connected components of objects). Then, Kishu writes and versions Co-variables with the checkpoint graph representing the user workflow in terms of cell executions to minimize delta storage overhead.

Second, for correct incremental checkout, Kishu identifies the difference between the current and target session state at the aforementioned Co-variable granularity via state divergence according to the checkpoint graph. Then, it replaces (only) Co-variables that need updating in the session state by loading data from the appropriate incremental checkpoints. This approach minimizes data loading time for checkout and transparently restores the state in the same kernel process without interruption and with sub-second latency.

Third, Kishu achieves generalizability and fault-tolerance through fallback recomputation. If a Co-variable cannot be stored in a checkpoint (e.g., it contains an unserializable object such as a hash (Foundation, 2023c)) or fails to load upon checkout, Kishu can efficiently reconstruct it upon checkout via finding the shortest path combining intermediate data loading and cell re-running according to the checkpoint graph.

Difference from Existing Work

Our work enables high-efficiency time-traveling for computational notebooks through significantly different techniques compared to existing work. While OS-level tools (CRIU, 2023; Ansel et al., 2009) can perform incremental checkpointing, they fail to exploit the fine-grained deltas in data science workflows, cannot perform incremental restore, and fail on remote objects (e.g., Ray (Moritz et al., 2018), on-device data (PyTorch, 2024)). Existing application-level tools (Foundation, 2023g, f; Li et al., [n.d.]) are built for single-time C/R and lack both incremental storage for subsequent checkpoints and incremental restoration featured in this work. Works for variable versioning (Koop and Patel, 2017) and lineage capturing (Brown et al., 2023; Wang et al., 2022) serve significantly different purposes (e.g., visualizing state evolution (Wang et al., 2022)) and cannot be used to travel to a previous/different state, thus are largely orthogonal. Our work shares similarities with time-traveling and versioning in DBMS (Oracle, 2024a; Group, 1996b; Soroush and Balazinska, 2013; Schule et al., 2019); however, our techniques handle complex access patterns and inter-variable dependencies unique to notebook states to enable delta computation for arbitrary objects. We summarize differences in LABEL:tbl:existing_work.

Contributions

According to our motivations in § 2, we implement Kishu (§ 3), a notebook system with the following contributions:

• State Delta Detection. We introduce our modeling of session state evolution at a novel Co-variable granularity, and our correct and efficient delta detection at this granularity. (§ 4)

• State Versioning. We introduce our delta-based session state versioning with the Checkpoint Graph, which enables efficient and fault-tolerant incremental checkpointing and checkout. (§ 5)

• Time-traveling. We show via experimental evaluation that Kishu’s time-traveling is compatible with 146 classes from popular data science libraries and reduces checkpoint size and checkout time by up to 4.55$\times$ and 9.02$\times$, respectively. (§ 7)

2. Motivation

This section describes use cases for time-traveling in notebooks (§ 2.1), our intuition for efficient time-traveling (§ 2.2), and how we choose a granularity for incremental checkpointing/checkout (§ 2.3).

2.1. Why is Time Traveling Useful?

Time-traveling computational notebooks can enable users to efficiently undo cell executions and perform path-based exploration.

Undoing Cell Executions

Data cleaning and visualization operations are oftentimes irreversible (e.g., df = df.drop_col(’a’)) and/or partially performed with unobservable side-effects (e.g., a mid-cell execution error), and the user may want to return to the previous state if the execution was erroneous or undesirable (Groups, 2024; Brachmann and Spoth, 2020). To enable interruption-free time-traveling, we can checkpoint the state delta to storage after each operation such that the session state prior to performing the operation can be returned to via loading the appropriate deltas. We empirically study this use case in § 7.5.1.

Path-based exploration

Reactive execution allows users to quickly investigate alternative paths in data science workflows: when a previously executed cell is re-executed (e.g., train model with different hyperparameters), all its dependent cells (e.g., plotting cells) are re-executed as well (Koop and Patel, 2017; Shankar et al., 2022). If we can efficiently persist all different variations of objects in different execution paths (i.e., as incremental deltas w.r.t. the shared state), users can efficiently evaluate the variations against each other: to switch paths, only the (small portion of) data differing between paths needs to be updated via loading the appropriate deltas. We empirically study this use case in § 7.5.2.

2.2. Enabling Time Traveling

We discuss pros and cons of different checkpointing and checkout approaches for enabling time-traveling to a previous state.

OS-level Memory Snapshots

Tools such as CRIU (CRIU, 2023) and DMTCP (Ansel et al., 2009) can be used to create memory snapshots of notebook processes, which contain all data in the session state. (Incremental Checkpointing) Subsequent snapshots can be made incrementally w.r.t. previous snapshots such that only dirty memory pages are stored. However, memory page granularity is coarse and incurs high checkpoint storage overhead (§ 7.3), and OS-level checkpointing is limited to single processes, failing on notebooks utilizing multiple/remote processors (e.g., Spark (Zaharia et al., 2010) and Ray (Moritz et al., 2018) pipelines) or moving data out of CPU (e.g., onto the GPU). (Complete Checkout) Memory snapshots must be entirely loaded to restore the notebook state and additionally require the existing notebook process to be killed (otherwise, a PID conflict occurs) before restoration. This process is both not seamless and incurs high data loading costs (§ 7.5).

Application-level Session Dump

Application-level tools such as Dill’s dump_session (Foundation, 2023g) and ElasticNotebook (Li et al., [n.d.]) can be used to create checkpoint files by serializing data (e.g., into bytestrings) in the session state. (Complete Checkpointing) Notably, no existing application-level tool supports incremental checkpointing: each checkpoint must independently contain sufficient data for restoring the session state. (Complete Checkout) Checkpoint files need to be entirely loaded (via data deserialization) for session restoration; despite being capable of restoring into an existing notebook process, these tools do not reuse variables that are already present within the kernel and may not need updating, which would allow faster checkout (§ 7.5).

Application-level Incremental Checkpoint and Checkout (Ours)

If we can detect the delta between session states at a finer granularity (compared to OS-level tools) and utilize existing data within the kernel to speed up checkout (compared to application-level dumps), we can achieve efficient and generalized application-level incremental checkpointing and checkout, as follows: (Incremental and Generalized Checkpointing) We utilize application-level information to track state deltas at a finer-than-memory-page granularity for storage-efficient incremental checkpointing. For generalizability, we can utilize an object’s reduction (i.e., __reduce__ (Foundation, 2023f)) as storage instructions to handle checkpointing multiprocessing and off-CPU workloads. (Incremental Checkout) If we know the contents of the target session state to restore to (e.g., in terms of a state snapshot (Bernstein and Goodman, 1983)) and accurately compute its difference from the current state, we can perform incremental checkout by only loading and updating data that differ between the current and target state.
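To make the role of an object's reduction concrete, the following is a minimal sketch, not Kishu's implementation. It assumes a hypothetical `DeviceArray` class whose payload lives outside the Python heap (simulated here by the module-level dict `_FAKE_DEVICE`); both names are invented for illustration. `__reduce__` returns a callable and its arguments, which together act as storage instructions: only the arguments are persisted, and replaying the callable rebuilds the object.

```python
import pickle

# Hypothetical stand-in for off-CPU or remote memory (e.g., a GPU or a worker).
_FAKE_DEVICE = {}


class DeviceArray:
    """Toy object whose payload is kept outside the Python heap."""

    def __init__(self, values):
        self.key = len(_FAKE_DEVICE)          # slot in the fake device
        _FAKE_DEVICE[self.key] = list(values)

    def to_host(self):
        # Copy the payload back into ordinary Python memory.
        return list(_FAKE_DEVICE[self.key])

    def __reduce__(self):
        # Storage instructions: "call DeviceArray with a host copy of the data".
        return (DeviceArray, (self.to_host(),))


arr = DeviceArray([1.0, 2.0, 3.0])

# Checkpoint: ask the object how to rebuild itself and persist the arguments.
rebuild, args = arr.__reduce__()
stored_args = pickle.dumps(args)

# Checkout: replay the instructions, which re-creates the off-heap payload.
restored = rebuild(*pickle.loads(stored_args))
print(restored.to_host())
```

The pickle protocol performs the same two steps internally when it encounters an object that defines `__reduce__`; they are spelled out here only to show where the off-heap data is pulled back to the host.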

Specifically, we compute state differences at the Co-variable granularity, which we describe and motivate in the next section.

2.3. Tracking State Delta for Time Traveling

Delta detection between states is required for efficient incremental checkpointing and checkout. We discuss pros and cons of the different methods of tracking the state delta.

Dirty Memory Page Tracking

System-level checkpointing tools (CRIU, 2023; Ansel et al., 2009) track state deltas via dirty memory pages. While fast, the delta granularity is insufficient for tracking Python notebook session states, as (1) modifying any object, regardless of size (e.g., x += 1), will cause the entire page (e.g., 4KB) holding the object to become dirty, and (2) Python data structures (e.g., lists) are often constructed and stored in a fragmented manner, so that simple operations (e.g., mapping elements in a list in-place) dirty multiple pages and incur high incremental checkpoint costs.

Provenance-based Tracking

Notebook systems such as IPyFlow (Macke et al., 2020), Dataflow (Koop and Patel, 2017), and ElasticNotebook (Li et al., [n.d.]) track session state evolution at the variable-level via provenance-based code analysis (Foundation, 2023a). While capable of producing fine-grained deltas, the tracker has to be either conservative on identifying changed variables (e.g., w.r.t. control flows and external function calls (Li et al., [n.d.])) causing many false positives and large deltas, or perform extensive live instrumentation for resolution (Macke et al., 2020), which can result in high overhead (§ 7.6).

Co-variable Granularity Live Tracking (Ours)

In order to avoid the coarse granularity of memory-page tracking and the potential efficiency issues of provenance tracking, we propose to use live object comparison to track updates to Co-variables, i.e., connected components of objects (w.r.t. pointer references). Our intuition is that (live) tracking and checkpointing/checking out individual objects are expensive and risky (i.e., may break shared references (Li et al., [n.d.])), respectively. However, at Co-variable granularity, we can achieve low-overhead tracking by reasoning in a principled way, based on access patterns, about which Co-variables were updated by each cell execution, and correctly store/load Co-variables during incremental checkpointing/checkout as if they were independent data tables (§ 4). We depict this idea in LABEL:fig:background_variable_blob: {ls,obj} is a Co-variable (red) as the objects reachable from these variables overlap, i.e., &ls[1] = &obj.foo. {df} is another Co-variable (blue), and there is no way to reach objects under df from objects under ls or obj via references. Notably, Co-variables are the minimum granularity at which data in the session state can be stored and loaded without risking breaking shared references (e.g., unlike variable-level KV-stores (shove, 2024; 0xnurl, 2024)). We formally describe Co-variables and how we correctly and efficiently capture state delta at this granularity in § 4.
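The risk of breaking shared references can be reproduced with the standard `pickle` module. The sketch below is an illustration under stated assumptions rather than Kishu's storage code: `obj` is modeled as a plain dict with a `"foo"` key instead of a class instance, and the shared object is a mutable list so that object identity is meaningful.

```python
import pickle

shared = ["b"]                      # one object referenced from two variables
ls = ["a", shared, "c"]
obj = {"foo": shared}               # stand-in for obj.foo in the running example

# Variable-level storage: each variable is serialized on its own.
ls_1 = pickle.loads(pickle.dumps(ls))
obj_1 = pickle.loads(pickle.dumps(obj))
broken = ls_1[1] is obj_1["foo"]    # the shared reference is lost

# Co-variable-level storage: {ls, obj} is serialized as one unit.
ls_2, obj_2 = pickle.loads(pickle.dumps((ls, obj)))
preserved = ls_2[1] is obj_2["foo"]  # the shared reference survives

# A later in-place update shows why this matters for correctness.
ls_1[1].append("x")
ls_2[1].append("x")
print(broken, preserved, obj_1["foo"], obj_2["foo"])
```

After restoring variable by variable, an update through `ls_1` is no longer visible through `obj_1`, whereas the jointly restored pair behaves like the original session.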

Motivating Example

Suppose a data analyst is performing text mining (LABEL:fig:background_example). They load the corpus (Cell 1), define category lists (Cell 2), and sort texts by sentiment into the lists (Cell 3). The analyst incrementally checkpoints the state after each cell execution.

(1) Incremental Checkpointing: The analyst wishes to map the lists to clean the contained text and tests with sad_ls (Cell 4, blue). Due to its interleaved construction (with other lists), the list sad_ls is fragmented; incrementally checkpointing the state at memory page granularity for Cell 4 (w.r.t. Cell 3) results in all pages overlapping with sad_ls being copied. However, a Co-variable granularity incremental checkpoint stores only (the bytestring of) sad_ls.

(2) Incremental Checkout: The analyst decides to undo the mapping function in Cell 4 as the results were unsatisfactory. Returning to the state of Cell 3 by (completely) loading a memory snapshot is slow as it requires reloading the corpus. However, by identifying that the states between Cells 3 and 4 differ only by sad_ls, we can load only the value of sad_ls from Cell 3 (red) to replace the value of sad_ls from Cell 4 (blue), performing incremental checkout while keeping the rest of the session state untouched.

3. System Overview

This section presents Kishu components (§ 3.1) and workflow (§ 3.2).

3.1. Kishu Components

Kishu (see LABEL:fig:system_overview) interacts with notebook sessions through inserting non-intrusive hooks. These hooks allow Kishu to transparently (1) monitor the kernel namespace to track session state evolution, (2) write data in the session state to storage for incremental checkpointing, and (3) alter the session state when checkout is requested.

Patched Namespace

At the start of the notebook session, Kishu patches the session namespace in order to monitor accesses to its contents between successive cell executions (§ 4.3). It identifies the candidate Co-variables to check for updates through tracking user-referenced variable names, and passes the candidates to the Delta Detector to compute the Co-variable granularity state delta.

Delta Detector

The Delta Detector computes the state delta based on the candidates identified from the Patched Namespace (i.e., which of the candidate Co-variables were actually updated by the cell execution). We discuss Kishu’s delta detection in § 4.

Checkpoint Graph

The Checkpoint Graph is a tree-like structure analogous to Git’s commit graph (Git, 2024a), in which Kishu writes, stores, and versions incremental checkpoints consisting of the updated Co-variables (i.e., the state delta) of each cell execution (§ 5.1). The incremental checkpoints stored in the Checkpoint Graph are used by the State Loader to perform incremental checkout.

State Loader

The State Loader restores a session state when checkout is requested. It first identifies the difference between the current session state (i.e., existing items in the namespace) and the target state according to the Checkpoint Graph, then loads only the necessary data from the Checkpoint Graph for replacing the variables that need updating (§ 5.2). If required data for checkout is missing (e.g., Kishu failed to serialize it into the Checkpoint Graph) or fails to load, the Data Restorer is invoked to restore the data.
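The difference computation performed by the State Loader can be sketched as a comparison of two version maps. This is a simplified illustration, assuming each state is summarized as a dict from a Co-variable (a frozenset of variable names) to the timestamp of the cell execution that last updated it; the function `checkout_plan` and the variable name `happy_ls` are hypothetical.

```python
def checkout_plan(current, target):
    """Return (Co-variables to load, variable names to delete)."""
    # Load every Co-variable whose version in the target differs from the
    # current one, including Co-variables absent from the current state.
    to_load = {cv: t for cv, t in target.items() if current.get(cv) != t}
    # Names that exist now but belong to no Co-variable of the target state.
    stale = {name for cv in current if cv not in target for name in cv}
    restored = {name for cv in to_load for name in cv}
    return to_load, stale - restored


# States after Cell 4 (current) and Cell 3 (target) of the motivating example.
corpus, happy, sad = (frozenset({n}) for n in ("corpus", "happy_ls", "sad_ls"))
current = {corpus: 1, happy: 3, sad: 4}
target = {corpus: 1, happy: 3, sad: 3}

to_load, to_delete = checkout_plan(current, target)
print(to_load, to_delete)
```

Only `sad_ls` is scheduled for loading; the corpus stays in memory untouched, which is what keeps the checkout cheap.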

Data Restorer

The Data Restorer is a mechanism that utilizes fallback recomputation to restore missing data for checkout (e.g., Kishu failed to serialize the data during prior checkpointing). It reconstructs missing data by combining loading dependent data and cell re-runs according to the Checkpoint Graph (§ 5.3).
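Fallback recomputation can be sketched as a recursive procedure: load a versioned Co-variable if its bytes exist, otherwise restore the inputs of the cell that produced it and re-run that cell. The node layout (`code`, `stored`, `deps`) and the function `restore` are assumptions made for this illustration, and the sketch does not search for the cheapest combination of loads and re-runs.

```python
import pickle


def restore(versioned, nodes, cache=None):
    """Return {name: value} for a versioned Co-variable (names, t)."""
    cache = {} if cache is None else cache
    if versioned in cache:
        return cache[versioned]
    names, t = versioned
    node = nodes[t]
    blob = node["stored"].get(names)
    if blob is not None:
        values = pickle.loads(blob)               # normal path: load from storage
    else:
        scope = {}
        for dep in node["deps"]:                  # restore the cell's inputs first
            scope.update(restore(dep, nodes, cache))
        exec(node["code"], scope)                 # fallback: re-run the cell
        values = {name: scope[name] for name in names}
    cache[versioned] = values
    return values


data_cv, gen_cv = frozenset({"data"}), frozenset({"gen"})
nodes = {
    1: {"code": "data = [3, 1, 2]",
        "stored": {data_cv: pickle.dumps({"data": [3, 1, 2]})},
        "deps": []},
    # Generators cannot be pickled, so nothing was stored for this cell.
    2: {"code": "gen = (x * 2 for x in sorted(data))",
        "stored": {},
        "deps": [(data_cv, 1)]},
}

restored = restore((gen_cv, 2), nodes)
doubled = list(restored["gen"])
print(doubled)
```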

3.2. Kishu Workflow

This section describes how users interact with Kishu during a notebook workflow. Users will attach Kishu to a notebook session upon session start; then, Kishu will monitor the namespace for state deltas to incrementally checkpoint after each cell execution (if automatic checkpointing is enabled), and perform checkout back to a previous state when requested.

Attaching Kishu to a Notebook Session

When initializing a notebook session, Kishu will be attached to the kernel; it will patch the namespace and initialize the Checkpoint Graph on storage.

Performing Incremental Checkpointing

After each cell execution, the Delta Detector utilizes the Patched Namespace to identify the updated Co-variables, then stores them as a new incremental checkpoint (i.e., node) in the Checkpoint Graph.

Incrementally Restoring a State

Kishu will restore a previous session state on request. The State Loader first identifies the difference between the current and target state at Co-variable granularity according to the Checkpoint Graph (i.e., which variables need to be restored), then loads only the necessary data according to the difference to complete the restoration. If necessary, the Data Restorer reconstructs data that is missing (i.e., failed to serialize during checkpointing earlier) or failed to load via fallback recomputation.

4. Accurate and Fast Delta Detection

In this section, we describe how Kishu correctly detects the Co-variable granularity state deltas necessary for incremental checkpointing and checkout (§ 2.3). We formally describe the Co-variable in § 4.1, how we correctly detect Co-variable updates in § 4.2, and how we speed up the detection process in § 4.3.

4.1. Co-variables

In this section, we introduce the Co-variable—Kishu’s granularity for efficient incremental checkpointing and checkout.

Preliminary: Variables, Objects and Reachability

In Python and the Jupyter Notebook ecosystem, variables and objects are two distinct concepts: a variable is a named entity from which one or more objects are reachable. For example, for a list ls=[1,2,3], the list name (ls) is a variable and each of the elements (1, 2, 3) is an object. We define reachability according to references, i.e., object y is reachable from variable x if y can be accessed from x through a chain of references. Some common reachability patterns include subscripting (e.g., y = x[0]), class member (e.g., y = x.attr), and attribution (e.g., y = x.__dict__). Given our distinction between variables and objects and our definition of reachability, we now define the Co-variable as follows:

Definition.

A Co-variable is a set of variable names $\mathcal{X}=\{x_{1},...,x_{i}\}$ from which the reachable objects form a maximally connected component. That is, for any variable $y$ not in the set, the objects reachable from $x_{1},...,x_{i}$ are not reachable from $y$.

A Co-variable can consist of one name (e.g., a primitive, x = 1) or multiple names from which the same object can be reached (i.e., shared references). LABEL:fig:checkpoint_variable_update shows an example—the string object ’b’ is reachable from both list ls and object obj via subscript and class member respectively, hence {ls,obj} is a Co-variable. Co-variables are self-contained by definition, i.e., there are no inter-Co-variable references. Co-variables can be modified by cell executions:

Definition.

A Co-variable $\mathcal{X}=\{x_{1},...,x_{i}\}$ is modified by a cell execution if the graph structure of the connected component of objects reachable from $x_{1},...,x_{i}$ is modified, counting both node (i.e., object) and edge (i.e., reference) additions and deletions.

For example, the Co-variable {ls,obj} in LABEL:fig:checkpoint_variable_update is modified node-wise with “ls[0] = ’e’” (bottom), and is modified edge-wise with “obj.foo = ls[2]” (bottom-right). Co-variables can also be created (or deleted) via split and merge (right): the Co-variable {obj,ls} is deleted via a split (as obj and ls no longer share references), and the Co-variable {obj, st} is created through a merge. For brevity, we collectively refer to Co-variable modifications, creations, and deletions as updates; the set of Co-variables updated by a cell execution forms the execution’s state delta.

4.2. Accurate State Delta Detection

This section describes how Kishu accurately detects Co-variable membership (i.e., which variables form a Co-variable) and updates.

VarGraphs

Kishu uses VarGraphs to detect Co-variable membership and updates. The VarGraph is a graph structure constructed from each variable in the namespace that captures its reachable objects. An example is shown in LABEL:fig:checkpoint_id_graph: each node in a variable’s VarGraph corresponds to a reachable object, containing the (1) object type, (2) memory address, and one of (3) pointers to other reachable objects (i.e., children) for non-primitive objects, or (4) value for primitive objects. For example, the node for the list reachable from ls contains 3 child pointers to the 3 nodes for strings ’a’, ’b’, and ’c’, and the node for string ’b’ holds its value ’b’. (Footnote: The VarGraph is inspired by ElasticNotebook’s ID graph (Li et al., [n.d.]), which captures reachable objects’ memory addresses; VarGraphs uniquely contain datatypes and primitive values for additional robustness, e.g., detecting a different primitive in the same address.)
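A VarGraph of this shape can be sketched with a small recursive traversal. The following is an illustrative approximation, not Kishu's traversal: the `Node` class, `build_vargraph`, and the set of types treated as primitive are assumptions, and only lists, tuples, sets, dicts, and objects with a `__dict__` are traversed.

```python
from dataclasses import dataclass, field

PRIMITIVES = (int, float, bool, str, bytes, type(None))


@dataclass(eq=False)
class Node:
    type_name: str                                  # (1) object type
    address: int                                    # (2) memory address
    children: list = field(default_factory=list, repr=False)  # (3) child pointers
    value: object = None                            # (4) value, primitives only


def build_vargraph(obj, seen=None):
    """Build the graph of objects reachable from obj, one node per object."""
    seen = {} if seen is None else seen
    if id(obj) in seen:                             # shared reference or cycle
        return seen[id(obj)]
    node = Node(type(obj).__name__, id(obj))
    seen[id(obj)] = node
    if isinstance(obj, PRIMITIVES):
        node.value = obj
    elif isinstance(obj, dict):
        for key, val in obj.items():
            node.children.append(build_vargraph(key, seen))
            node.children.append(build_vargraph(val, seen))
    elif isinstance(obj, (list, tuple, set, frozenset)):
        node.children = [build_vargraph(item, seen) for item in obj]
    elif hasattr(obj, "__dict__"):
        node.children = [build_vargraph(vars(obj), seen)]
    return node


ls = ["a", "b", "c"]
graph = build_vargraph(ls)
print(graph.type_name, [child.value for child in graph.children])
```

Keeping a `seen` map keyed by address means an object reachable along two paths yields a single node, which is what later makes intersection between variables meaningful.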

Detecting Co-variable membership

Co-variable membership is determined by intersecting VarGraphs. For example, in LABEL:fig:checkpoint_id_graph, ls and obj form a Co-variable as the node ’b’ is in both graphs (red).
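Membership detection by intersection can be sketched by collecting the addresses reachable from each variable and grouping variables whose address sets overlap. The helper names `reachable_ids` and `co_variables` are hypothetical. One simplification is made deliberately: immutable primitives are ignored when intersecting, because CPython may reuse a single object for equal small integers or strings across unrelated variables; the source does not state how Kishu treats this case.

```python
from types import SimpleNamespace

PRIMITIVES = (int, float, bool, str, bytes, type(None))


def reachable_ids(obj, seen=None):
    """Addresses of all non-primitive objects reachable from obj."""
    seen = set() if seen is None else seen
    if isinstance(obj, PRIMITIVES) or id(obj) in seen:
        return seen
    seen.add(id(obj))
    if isinstance(obj, dict):
        children = list(obj.keys()) + list(obj.values())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        children = list(obj)
    elif hasattr(obj, "__dict__"):
        children = [vars(obj)]
    else:
        children = []
    for child in children:
        reachable_ids(child, seen)
    return seen


def co_variables(namespace):
    """Group variable names whose reachable objects overlap, transitively."""
    groups = []                                   # disjoint (names, ids) pairs
    for name, value in namespace.items():
        names, ids = {name}, set(reachable_ids(value))
        remaining = []
        for g_names, g_ids in groups:
            if ids & g_ids:                       # overlap: merge into one group
                names |= g_names
                ids |= g_ids
            else:
                remaining.append((g_names, g_ids))
        groups = remaining + [(names, ids)]
    return {frozenset(names) for names, _ in groups}


shared = ["b"]
ns = {"ls": ["a", shared, "c"],
      "obj": SimpleNamespace(foo=shared),
      "df": {"col": [1, 2]}}
groups = co_variables(ns)
print(groups)
```

Re-running `co_variables` after `ns["obj"].foo = ["z"]` yields three singleton groups, which corresponds to the split described in § 4.1.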

Detecting Co-variable updates

Co-variable updates are determined by comparing VarGraphs before and after cell executions. A graph structure modification and/or a node attribute change (e.g., object memory address or type) indicates an update to the Co-variable.
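The before/after comparison can be sketched by flattening each variable's reachable objects into a comparable snapshot. This is a simplified stand-in for VarGraph comparison: the function `snapshot` and its flattened `{address: (type, payload)}` layout are assumptions, where the payload is the value for primitives and the tuple of child addresses otherwise.

```python
PRIMITIVES = (int, float, bool, str, bytes, type(None))


def snapshot(obj, nodes=None):
    """Flatten the objects reachable from obj into {address: (type, payload)}."""
    nodes = {} if nodes is None else nodes
    if id(obj) in nodes:
        return nodes
    if isinstance(obj, PRIMITIVES):
        nodes[id(obj)] = (type(obj).__name__, obj)
        return nodes
    if isinstance(obj, dict):
        children = [part for item in obj.items() for part in item]
    elif isinstance(obj, (list, tuple)):
        children = list(obj)
    elif hasattr(obj, "__dict__"):
        children = [vars(obj)]
    else:
        children = []
    # Edges are recorded as the ordered addresses of the children.
    nodes[id(obj)] = (type(obj).__name__, tuple(id(c) for c in children))
    for child in children:
        snapshot(child, nodes)
    return nodes


ls = ["a", "b", "c"]
df = {"col": [1, 2, 3]}

before = {"ls": snapshot(ls), "df": snapshot(df)}
ls[0] = "e"                                        # the "cell execution"
after = {"ls": snapshot(ls), "df": snapshot(df)}

updated = {name for name in before if before[name] != after[name]}
print(updated)
```

Because snapshots record addresses, types, and primitive values, the comparison flags changes to nodes as well as to edges, at the cost of possible false positives when equal objects are re-created at new addresses.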

Accuracy Guarantee

As Kishu constructs VarGraphs following object reachability, it detects Co-variable updates with no false negatives (verified empirically in § 7.2.1). However, Kishu’s update detection is conservative: there may be false positives if objects are dynamically generated (e.g., datatype objects) with a different memory address each time during VarGraph construction/object traversal, or cannot be traversed into (i.e., lacking referencing instructions, e.g., generators (Foundation, 2023b), which Kishu assumes to be updated on access).

4.3. Efficient State Delta Detection

In this section, we describe how Kishu speeds up the Co-variable update detection process. Identifying Co-variable updates across the entire global namespace via VarGraphs can be expensive (due to object traversals); hence, Kishu needs to reduce the number of Co-variables (hence the portion of the namespace) it checks after each cell execution while maintaining accuracy of detection.

Identifying Possibly Updated Co-variables

Cell executions in Jupyter Notebook interact with the global namespace (i.e., globals()). Therefore, if Kishu can capture variable references in the cell execution, it can reason about which Co-variables were possibly updated (and which ones were definitely not), as follows:

Definition.

A Co-variable $\mathcal{X}=\{x_{1},...,x_{i}\}$ is accessed by a cell execution if any variable $x_{1},...,x_{i}$ is accessed (via getting, setting, or deletion) during the cell execution.

A Co-variable being accessed indicates the possibility of it being updated (e.g., a subscription access ls[0] = ’e’). Kishu patches the accessor, setter, and deletion methods of the global namespace (LABEL:fig:checkpoint_detect_fast) to capture variable (hence Co-variable) accesses. Kishu identifies candidate Co-variables for update checking via the captured accessed variables: if a Co-variable’s members $(x_{1},...,x_{i})$ overlap with the accessed variables of the cell execution, then the Co-variable may have been updated (e.g., {ls,obj} in LABEL:fig:checkpoint_detect_fast) and Kishu will have to verify the update by (1) re-generating VarGraphs for its member variables, (2) comparing the VarGraphs with those before the cell execution to identify modifications, and (3) intersecting the VarGraphs amongst variables of accessed Co-variables to identify merges and splits. Otherwise, the Co-variable was certainly not updated and Kishu skips its check for this cell execution (e.g., {df} in LABEL:fig:checkpoint_detect_fast, greyed out). We provide a short proof by contradiction:

Lemma.

A Co-variable $\mathcal{X}=\{x_{1},...,x_{i}\}$ can be updated by a cell execution only if at least one of $x_{1},...,x_{i}$ was accessed in the code.

Proof.

Suppose not, i.e., Co-variable $\mathcal{X}$ has empty intersection with the accessed variables and was updated. Then, the Co-variable must have been updated through another variable $y$ that was not part of the Co-variable $\{x_{1},...,x_{i}\}$ before the start of the cell execution. Due to Co-variables’ self-containment (§ 4.1), the user cannot possibly access objects reachable from $x_{1},...,x_{i}$ via $y$ during the cell execution without first creating a reference by using one of $x_{1},...,x_{i}$ (e.g., $y.foo = x_{i}$), but doing so violates our assumption. ∎

As only a small portion of variables are accessed per cell in a typical data science notebook, Kishu significantly reduces delta detection overhead with this approach (empirically verified in § 7.6.1).
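The candidate selection above can be sketched with a dict subclass that records name accesses and is passed to `exec` as the global namespace. This is a toy model of namespace patching and not Kishu's hook into the Jupyter kernel: `TrackedNamespace` and `run_cell` are hypothetical names, names starting with a double underscore are skipped as interpreter-internal, and the sketch only covers statements at the top level of a cell.

```python
class TrackedNamespace(dict):
    """Global namespace that records which variable names a cell touches."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.accessed = set()

    def _record(self, name):
        if not name.startswith("__"):          # skip interpreter-internal names
            self.accessed.add(name)

    def __getitem__(self, name):
        value = super().__getitem__(name)       # KeyError for builtins such as len
        self._record(name)
        return value

    def __setitem__(self, name, value):
        self._record(name)
        super().__setitem__(name, value)

    def __delitem__(self, name):
        self._record(name)
        super().__delitem__(name)


def run_cell(code, namespace, co_variables):
    """Execute a cell and return the Co-variables that must be re-checked."""
    namespace.accessed.clear()
    exec(code, namespace)
    return [cv for cv in co_variables if cv & namespace.accessed]


shared = ["b"]
ns = TrackedNamespace(ls=["a", shared, "c"], obj={"foo": shared}, df={"col": [1, 2]})
covs = [frozenset({"ls", "obj"}), frozenset({"df"})]

candidates = run_cell("ls[0] = 'e'", ns, covs)
print(candidates)
```

Only {ls, obj} is returned as a candidate, so the potentially large `df` is never traversed for this cell, mirroring the greyed-out Co-variable in the figure.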

Remark

As Kishu patches the global namespace of the notebook session, it is impossible for the user to use variables from within the notebook (i.e., to modify objects) undetected. Hence, Kishu will not misidentify Co-variables possibly updated via references.

The user may still use non-referencing methods such as direct C-pointer-based modifications, but these cases are rare in notebooks (found in 0 out of 60 surveyed notebooks (Li et al., [n.d.])). Some libraries, such as NumPy (Team, 2024a), do perform memory-based updates (e.g., via slicing). However, these objects are supported by Kishu, as their memory-based updates are still invoked via referencing (e.g., arr[0,1] += 1, empirically verified in § 7.2.1).

5. Inc. Checkpoint & Checkout

This section describes Kishu’s efficient time-traveling with the Co-variable granularity state deltas. We describe Kishu’s incremental checkpointing in § 5.1, Kishu’s incremental checkout in § 5.2, and how Kishu time-travels to and from notebook states with problematic (e.g., unserializable) data in a fault-tolerant manner in § 5.3.

5.1. Incremental Checkpointing

This section describes how Kishu performs incremental checkpointing by writing and managing per-cell-execution checkpoints containing the updated Co-variables with the Checkpoint Graph.

Checkpoint Graph

The Checkpoint Graph is a directed tree of (incremental) checkpoints representing the branch-based evolution of session states. It is grown (i.e., via adding nodes) with each checkpoint Kishu performs, and nodes are timestamped according to the completion time of the corresponding cell execution $t$ (which we refer to as CE $t$ and node $t$ for simplicity). The Checkpoint Graph maintains a head node, which tracks the user’s current state. Each node $t$ contains the state delta consisting of (only) Co-variables updated by CE $t$ . When Co-variables are stored in the Checkpoint Graph, they are versioned according to their corresponding CE $t$ :

Definition 0.

A Versioned Co-variable is a Co-variable-timestamp pair $(\mathcal{X},t)$ representing the Co-variable $\mathcal{X}$ updated by CE $t$ .

Versioned Co-variables are analogous to versioned datasets: the same Co-variable (w.r.t. variable membership, i.e., $\mathcal{X}=\{x_{1},...,x_{i}\}$ ) can take on multiple values throughout a notebook session as it is updated by different cell executions. LABEL:fig:checkpoint_graph shows an example: CE $t_{3}$ creates the Co-variable {plot}, hence it is stored in node $t_{3}$ (red) as the Versioned Co-variable $(\{plot\},t_{3})$ .

Writing into the Checkpoint Graph

Kishu creates a new node $t$ in the Checkpoint Graph after each CE $t$ . The new node contains (1) the state delta of CE $t$ consisting of Versioned Co-variables, (2) the code of CE $t$ , and (3) the Versioned Co-variables (possibly stored in previous checkpoints) accessed by CE $t$ (§ 4.3). For example, the node $t_{3}$ in LABEL:fig:restore_diff contains the code of CE $t_{3}$ (“plot=gmm.result()”) and its dependency on the Versioned Co-variable $(\{gmm\},t_{2})$ from node $t_{2}$ (dashed line), analogous to the transformation, transaction, and dependencies in database versioning. The new node $t$ is written into the Checkpoint Graph under the head node $s$ , and an edge (i.e., a parent-child relationship) is added from the head node $s$ to the new node $t$ (which is now the new head node).
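As a rough illustration of this structure, the sketch below models the Checkpoint Graph as a tree of nodes plus a head pointer. The class and method names (Node, CheckpointGraph, commit, move_head) are hypothetical, the t1 cell code is a placeholder, and the contents of nodes t4 and t5 are assumptions; only the t2 and t3 code strings come from the running example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Node:
    timestamp: str
    code: str                 # code of the cell execution (CE t)
    delta: list               # Co-variables (frozensets of names) updated by CE t
    accessed: list            # Versioned Co-variables (Co-variable, timestamp) read by CE t
    parent: Optional["Node"] = None
    children: list = field(default_factory=list)


class CheckpointGraph:
    def __init__(self):
        self.nodes = {}
        self.head = None      # tracks the user's current state

    def commit(self, timestamp, code, delta, accessed):
        """Write a new node under the current head and make it the new head."""
        node = Node(timestamp, code, delta, accessed, parent=self.head)
        if self.head is not None:
            self.head.children.append(node)
        self.nodes[timestamp] = node
        self.head = node
        return node

    def move_head(self, timestamp):
        """Checkout: the next commit starts a new branch rooted here."""
        self.head = self.nodes[timestamp]


df, gmm, plot = frozenset({"df"}), frozenset({"gmm"}), frozenset({"plot"})
graph = CheckpointGraph()
graph.commit("t1", "<placeholder: create df and gmm>", [df, gmm], [])
graph.commit("t2", "gmm.fit(k=3)", [gmm], [(gmm, "t1")])
graph.commit("t3", "plot=gmm.result()", [plot], [(gmm, "t2")])
graph.move_head("t1")
graph.commit("t4", "gmm.fit(k=10)", [gmm], [(gmm, "t1")])
graph.commit("t5", "plot=gmm.result()", [plot], [(gmm, "t4")])
```

After these calls, node t1 has two children (t2 and t4), one per branch, and the head points at t5.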

Handling Unserializable Data

If Kishu cannot write an updated Co-variable in the state delta into the Checkpoint Graph (e.g., it contains an unserializable object such as a generator (Foundation, 2023b) or hash (Foundation, 2023c)), Kishu simply skips its storage. Instead, upon checkout, the missing (unserializable) Co-variable will be restored through fallback recomputation enabled by the cell code and dependencies stored in the Checkpoint Graph node, which we discuss in § 5.3.
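A minimal sketch of this skip-on-failure behavior follows. It uses the standard-library pickle module as a stand-in for the serializers Kishu actually uses, and the function name serialize_delta is hypothetical.

```python
import pickle


def serialize_delta(delta):
    """Serialize each updated Co-variable on its own; a Co-variable that
    cannot be serialized is recorded as None (skipped) rather than
    aborting the whole checkpoint."""
    stored = {}
    for covariable, values in delta.items():
        try:
            stored[covariable] = pickle.dumps(values)
        except Exception:
            stored[covariable] = None  # to be recomputed at checkout time
    return stored


delta = {
    frozenset({"df"}): {"df": [1, 2, 3]},
    frozenset({"gen"}): {"gen": (i * i for i in range(3))},  # generators cannot be pickled
}
stored = serialize_delta(delta)
```

The checkpoint still succeeds for {df}; only {gen} is left to fallback recomputation.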

5.2. Efficient State Restoration

When (incremental) checkout is requested, Kishu aims to accurately restore the current state to the target state in the fastest manner possible. To do so, it must identify the contents of the target state according to its timestamp, analogous to how timestamped state snapshots are defined in MVCC (Bernstein and Goodman, 1983); instead of versioned tables, we identify Versioned Co-variables in the (timestamped) target state:

Definition 0.

The Session State at timestamp $t$ (footnote 3: unlike state deltas, cell code, and dependencies, session states are not stored in nodes on the Checkpoint Graph; they are inferred by timestamp at checkout time) is a set of $n$ Versioned Co-variables $\{(\mathcal{X}_{1},t_{1}),...,(\mathcal{X}_{n},t_{n})\}$ such that for each $(\mathcal{X}_{j},t_{j})$ , $1\leq j\leq n$ :

1.

$t_{j}$ is an ancestor of $t$ on the Checkpoint Graph.
2.

There must not exist another Versioned Co-variable $(\mathcal{Y}_{k},t_{k})$ such that $\mathcal{X}_{j}\cap\mathcal{Y}_{k}\neq\emptyset$ and $t_{k}$ is a child of $t_{j}$ and ancestor of $t$ .

The session state at timestamp $t$ (session state $t$ for brevity) is the set of all Versioned Co-variables that are (1) available in the namespace after CE $t$ and (2) have not been overwritten by a newer Versioned Co-variable prior to CE $t$ . For example, in LABEL:fig:restore_diff, the session state at timestamp $t_{3}$ (top-left) consists of the Versioned Co-variables $(\{plot\},t_{3})$ , $(\{gmm\},t_{2})$ , and $(\{df\},t_{1})$ . It does not contain $(\{gmm\},t_{1})$ as it was overwritten by CE $t_{2}$ (gmm.fit(k=3)) which writes $(\{gmm\},t_{2})$ . Each session state $t$ dictates which Versioned Co-variables should be loaded from various nodes on the Checkpoint Graph for checkouts; to efficiently accomplish (incremental) checkout, Kishu identifies the difference between the current and target session states in terms of the (versioned) Co-variables that need to be updated. That is, some Co-variables do not need to be updated when converting the current state to the target state. They can be identified via the Checkpoint Graph:

Definition 0.

A Co-variable $\mathcal{X}$ is identical between the current state $t_{a}$ and target state $t_{b}$ if a Versioned Co-variable $(\mathcal{X},t_{x})$ exists in the session states of $t_{a}$ , $t_{b}$ , and $t_{c}$ , where $t_{c}$ is the lowest common ancestor of node $t_{a}$ and node $t_{b}$ . Otherwise, if no such $(\mathcal{X},t_{x})$ exists, then the Co-variable $\mathcal{X}$ has diverged between $t_{a}$ and $t_{b}$ .

A Co-variable $\mathcal{X}$ is identical between current and target session states $t_{a}$ and $t_{b}$ if its versioned counterpart has the same version across $t_{a}$ , $t_{b}$ , and $t_{c}$ , i.e., none of the cell executions between (1) node $t_{a}$ and node $t_{c}$ and (2) node $t_{b}$ and node $t_{c}$ updated the Co-variable $\mathcal{X}$ ; hence, it does not need to be updated when checking out from $t_{a}$ to $t_{b}$ . For example, in LABEL:fig:restore_diff, if checking out from current state $t_{5}$ to target state $t_{3}$ , the Co-variable {df} (blue) is identical between the states as no cell execution between (1) node $t_{1}$ and node $t_{3}$ and (2) node $t_{1}$ and node $t_{5}$ updated it.

Otherwise, if the Co-variable $\mathcal{X}$ has diverged between the current and target session states, it will need to be updated (by either loading the appropriate Versioned Co-variable or deleting it) to complete the checkout to the target state. For example, the Co-variable {gmm} (red) has diverged between nodes $t_{5}$ and $t_{3}$ as their parents ( $t_{4}$ and $t_{2}$ ) both updated gmm with their cell executions (fitting with k=10 and k=3, respectively); hence, gmm (and plot) need to be updated via loading ({gmm}, $t_{2}$ ) (and ({plot}, $t_{3}$ )) if checking out from state $t_{5}$ to state $t_{3}$ .
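The definitions of session state, identical, and diverged Co-variables can be sketched as follows. This is an illustrative reading rather than Kishu’s implementation: the graph is given as plain dictionaries (parent and delta are hypothetical names), a node counts as its own ancestor, any node on the root-to-target path may overwrite an overlapping Co-variable, variable deletions are ignored, and node t5 is assumed to write {plot}.

```python
def path_from_root(parent, t):
    """Nodes from the root down to t (inclusive)."""
    path = []
    while t is not None:
        path.append(t)
        t = parent[t]
    return path[::-1]


def session_state(parent, delta, t):
    """Infer the Versioned Co-variables visible at node t: replay deltas
    along the root-to-t path, letting newer overlapping Co-variables
    overwrite older ones."""
    state = {}
    for node in path_from_root(parent, t):
        for cv in delta[node]:
            for old in [c for c in state if c & cv]:
                del state[old]
            state[cv] = node
    return set(state.items())


def lowest_common_ancestor(parent, a, b):
    ancestors_a = set(path_from_root(parent, a))
    while b not in ancestors_a:
        b = parent[b]
    return b


def diff_states(parent, delta, a, b):
    """Split Co-variables into identical (same version in a, b, and their
    lowest common ancestor) and diverged (everything else)."""
    c = lowest_common_ancestor(parent, a, b)
    state_a, state_b, state_c = (session_state(parent, delta, x) for x in (a, b, c))
    identical = state_a & state_b & state_c
    diverged = {cv for cv, _ in (state_a | state_b) - identical}
    return identical, diverged


df, gmm, plot = frozenset({"df"}), frozenset({"gmm"}), frozenset({"plot"})
parent = {"t1": None, "t2": "t1", "t3": "t2", "t4": "t1", "t5": "t4"}
delta = {"t1": [df, gmm], "t2": [gmm], "t3": [plot], "t4": [gmm], "t5": [plot]}

state_t3 = session_state(parent, delta, "t3")
identical, diverged = diff_states(parent, delta, "t5", "t3")
```

For the checkout from $t_{5}$ to $t_{3}$ , only ({df}, $t_{1}$ ) is identical, while {gmm} and {plot} have diverged and must be loaded.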

Performing State Checkout

When checking out to the state at node $t$ , the State Restorer (§ 3.1) performs the following steps:

1.

Load the appropriate Versioned Co-variables from nodes (i.e., node $t$ and ancestors of node $t$ ) to update diverged Co-variables between the state of the current head node $s$ and node $t$ .
2.

Update/re-generate VarGraphs (§ 4.2) for updated Co-variables.
3.

Move the head from node $s$ to the checked out node $t$ .

Notably, the next cell execution will create a node in a new branch rooted at $t$ in the Checkpoint Graph, e.g., the graph in LABEL:fig:restore_diff is generated through the sequence $t_{1}\rightarrow t_{2}\rightarrow t_{3}\rightarrow(\text{checkout to }\,t_{1})\rightarrow t_{4}\rightarrow t_{5}$ . If during checkout, a required Versioned Co-variable is missing (i.e., due to serialization failure, § 5.1) or fails to load (i.e., deserialization failure), Kishu restores it via fallback recomputation.
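The loading step of the checkout can be sketched as below, under several simplifying assumptions: session states are given as dictionaries from Co-variable to timestamp, each stored Versioned Co-variable is one pickle of a name-to-value dictionary (so that references shared inside a Co-variable survive), and VarGraph regeneration and head movement are omitted. In a tree, a Versioned Co-variable present in both the current and target states is also present at their lowest common ancestor, so comparing versions directly suffices here. The name checkout and the store layout are hypothetical.

```python
import pickle


def checkout(namespace, store, current_state, target_state):
    """Update only diverged Co-variables in the live namespace."""
    # Remove members of Co-variables whose version differs in the target state.
    for cv, ts in current_state.items():
        if target_state.get(cv) != ts:
            for name in cv:
                namespace.pop(name, None)
    # Load the target versions of diverged Co-variables; identical ones are untouched.
    loaded = []
    for cv, ts in target_state.items():
        if current_state.get(cv) != ts:
            namespace.update(pickle.loads(store[(cv, ts)]))
            loaded.append((cv, ts))
    return loaded


df, gmm, plot = frozenset({"df"}), frozenset({"gmm"}), frozenset({"plot"})
big_df = list(range(1000))
namespace = {"df": big_df, "gmm": {"k": 10}, "plot": "plot for k=10"}
store = {
    (gmm, "t2"): pickle.dumps({"gmm": {"k": 3}}),
    (plot, "t3"): pickle.dumps({"plot": "plot for k=3"}),
}
current_state = {df: "t1", gmm: "t4", plot: "t5"}
target_state = {df: "t1", gmm: "t2", plot: "t3"}

loaded = checkout(namespace, store, current_state, target_state)
```

The dataframe stand-in is never read from storage: the object bound to df after checkout is the same in-memory object as before.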

5.3. Robust Restoration

In this section, we describe how Kishu restores problematic data to achieve generalizable and fault-tolerant incremental checkout.

Fallback Recomputation

As each Checkpoint Graph node $t$ contains (1) the code of CE $t$ and (2) which Versioned Co-variables $(\mathcal{X}_{j},t_{j})$ (stored in previous deltas, $t_{j}<t$ ) CE $t$ accessed (§ 5.1), any Versioned Co-variable $(\mathcal{X},t)$ in the state delta of node $t$ can be recomputed by (1) loading the accessed Versioned Co-variables from previous checkpoints into a temporary namespace, then (2) re-running CE $t$ . For example, in LABEL:fig:restore_fallback, suppose Versioned Co-variable ({plot}, $t_{3}$ ) (green) fails to load when checking out to session state $t_{3}$ . The Versioned Co-variable ({gmm}, $t_{2}$ ) is required to rerun CE $t_{3}$ (red); therefore, ({gmm}, $t_{2}$ ) is loaded from the parent node $t_{2}$ , and after rerunning CE $t_{3}$ on the input ({gmm}, $t_{2}$ ), ({plot}, $t_{3}$ ) is restored.

Dynamic and Recursive Fallbacks

Kishu’s fallback recomputation is dynamic and recursive—if another Co-variable is missing or fails to load when retrieving the inputs (i.e., accessed Co-variables) for performing fallback recomputation, fallback recomputation can be recursively performed for that Co-variable. For example, if ({gmm}, $t_{2}$ ) from node $t_{2}$ fails to load (as part of fallback recomputation for ({plot}, $t_{3}$ ) from node $t_{3}$ ), it itself can be recomputed by loading ({gmm}, $t_{1}$ ) from node $t_{1}$ and rerunning CE $t_{2}$ (blue).
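The recursive fallback can be sketched as follows. For brevity, every Co-variable here has a single member variable, blobs are serialized with the standard-library pickle, and the node layout (code, deps, blobs), the cell code strings, and the function name restore are hypothetical.

```python
import pickle


def restore(nodes, name, t, recomputed):
    """Return variable `name` as written by CE t. Load it if possible;
    otherwise restore its inputs (recursively) and rerun the cell code."""
    blob = nodes[t]["blobs"].get(name)
    if blob is not None:
        try:
            return pickle.loads(blob)
        except Exception:
            pass  # deserialization failure: fall through to recomputation
    # Temporary namespace holding only the Versioned Co-variables CE t accessed.
    temp_ns = {dep: restore(nodes, dep, dep_t, recomputed)
               for dep, dep_t in nodes[t]["deps"]}
    exec(nodes[t]["code"], temp_ns)
    recomputed.append(t)
    return temp_ns[name]


nodes = {
    "t1": {"code": "gmm = {'k': None}", "deps": [],
           "blobs": {"gmm": pickle.dumps({"k": None})}},
    "t2": {"code": "gmm['k'] = 3", "deps": [("gmm", "t1")],
           "blobs": {"gmm": b"\x00not a pickle"}},   # simulated load failure
    "t3": {"code": "plot = 'GMM plot, k=%d' % gmm['k']", "deps": [("gmm", "t2")],
           "blobs": {"plot": None}},                  # simulated missing blob
}

recomputed = []
plot = restore(nodes, "plot", "t3", recomputed)
```

Restoring ({plot}, $t_{3}$ ) first triggers recomputation of ({gmm}, $t_{2}$ ) from the stored ({gmm}, $t_{1}$ ), then reruns CE $t_{3}$ , so the cells are rerun in the order t2, t3.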

Remark

Kishu guarantees exact restoration for all serializable Co-variables during checkout, i.e., they will have the same bytestring representation before and after checkout if there are no hidden serialization errors (§ 6.2). While Kishu is capable of restoring problematic Co-variables via fallback recomputation, it currently does not support exactly restoring Co-variables that both (1) fail to store or load and (2) are created through non-deterministic means (e.g., random generators). This is a limitation similarly present in Spark (Zaharia et al., 2010) and Ray’s (Moritz et al., 2018) lineage-based fault tolerance; however, in our case, unserializable objects are sufficiently rare in data science libraries (§ 7.2), hence we consider this limitation to be acceptable.

6. Implementation and Discussion

This section describes Kishu’s implementation details (§ 6.1) and design considerations (§ 6.2).

6.1. Implementation

Integrating with Jupyter

For seamless integration, Kishu is implemented as a separate application from the notebook process, and can be used without altering the base Jupyter application. When a session is initialized, Kishu places hooks into the kernel (pre_run_cell and post_run_cell (IPython, 2024b)) and patches the namespace (user_ns (IPython, 2024a)) (§ 3.1), allowing the standalone Kishu process to perform delta detection, write data in the namespace to storage, and overwrite data in the namespace upon checkout transparently.

Serialization Protocol

The Pickle protocol (i.e., __reduce__ (Foundation, 2023f)) is employed for (1) object serialization and (2) constructing VarGraphs (hence identifying Co-variables), i.e., an object y is reachable from another object x if pickle(x) includes y. As Pickle is the de-facto standard (in Python) observed by almost all data science libraries (e.g., NumPy, PyTorch (et al., 2018)), Kishu can be used for almost all use cases. Furthermore, Kishu’s per-Co-variable storage enables mixing and matching serialization libraries for coverage. Currently, Kishu will try CloudPickle (cloudpipe, 2024) first, then use Dill (Foundation, 2023g) as a fallback for Co-variables containing objects that CloudPickle fails on.
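One way to realize the reachability rule above with the standard library is to subclass pickle.Pickler and override its persistent_id hook, which is invoked for each object the pickler visits. The sketch below uses this to group variables that share reachable objects; it is an illustration under stated assumptions, not Kishu’s VarGraph construction. The names ReachabilityPickler, reachable_objects, and covariables are hypothetical, and immutable atoms and classes are excluded because unrelated variables routinely share them.

```python
import io
import pickle

# Objects of these types are ignored when testing for shared references (an assumption).
_IGNORED = (str, bytes, int, float, complex, bool, type(None), tuple, frozenset, type)


class ReachabilityPickler(pickle.Pickler):
    """Pickles into a throwaway buffer while recording every visited object."""

    def __init__(self):
        self.reached = {}  # id -> object; holding the object keeps its id stable
        super().__init__(io.BytesIO())

    def persistent_id(self, obj):
        if not isinstance(obj, _IGNORED):
            self.reached[id(obj)] = obj
        return None  # None means: pickle the object as usual


def reachable_objects(value):
    pickler = ReachabilityPickler()
    pickler.dump(value)
    return pickler.reached


def covariables(namespace):
    """Group variable names whose reachable objects overlap."""
    reach = {name: reachable_objects(value) for name, value in namespace.items()}
    groups = []  # list of (member names, reachable ids)
    for name, objs in reach.items():
        members, ids, rest = {name}, set(objs), []
        for g_members, g_ids in groups:
            if g_ids & ids:
                members |= g_members
                ids |= g_ids
            else:
                rest.append((g_members, g_ids))
        groups = rest + [(members, ids)]
    return {frozenset(members) for members, _ in groups}


shared = ["a", "b"]
groups = covariables({"ls": shared, "obj": {"ref": shared}, "df": [1, 2, 3]})
```

Because obj reaches the list bound to ls, the two names end up in one Co-variable, while df stays separate.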

Storing Checkpoints

Currently, Kishu uses SQLite (SQLite, 2024) to store Versioned Co-variables in the Checkpoint Graph. However, any storage mechanism can be used in its place—even in-memory ones if the user wants to maximize checkpointing/checkout efficiency.

6.2. Design Considerations

Hidden and External Items

In Kishu, the session state is formally defined according to Co-variables in the user namespace (user_ns), which contains key-value pairs of variable names to their reachable objects. The session state does not include (1) local/module/hidden variables and (2) external items outside of the notebook session (e.g., on-disk files), which we do not aim to checkpoint nor restore.

Silent Serialization Errors

Certain object classes may contain incorrect serialization instructions, which, despite allowing objects to be stored to and loaded from storage, result in silent errors. Kishu currently assumes that the serialization instructions are correctly implemented for all objects in terms of equality before and after pickling (i.e., x == unpickle(pickle(x))) and does not prevent these silent errors. However, these cases are rare (§ 7.2.1), and Kishu provides a blocklist file for users to force fallback recomputation for Co-variables containing objects belonging to these classes.

7. Experimental Evaluation

In this section, we empirically study the effectiveness of Kishu’s time-traveling. We make the following claims:

1.

Generalizable and Robust Mechanism: Kishu is capable of identifying modifications to, and correctly restoring session states containing 146 object classes from commonly used Data Science libraries. (§ 7.2)
2.

Low Checkpoint Storage Cost: Kishu’s incremental checkpointing optimizations result in its per-cell-execution checkpoints being up to 4.55 $\times$ smaller compared to those from the next best mechanism. (§ 7.3)
3.

Low Checkpoint Times: Kishu’s incremental per-cell-execution checkpoints are created in up to 5.12 $\times$ less time compared to the next best mechanism. (§ 7.4)
4.

Fast Incremental Checkout: Kishu’s novel incremental restoration is crucial to its sub-second checkout times — up to 8.18 $\times$ and 9.02 $\times$ faster than the next best mechanism for undoing cell executions and switching branches, respectively. (§ 7.5)
5.

Low Overhead Delta Detection: Kishu incurs negligible runtime overheads on data science notebooks for capturing the state delta — less than 3.0% of the notebook session runtime and up to 4.08 $\times$ less than alternative tracking approaches. (§ 7.6)

7.1. Experiment Setup

Datasets

We select a total of 8 data science notebooks from Kaggle Grandmaster-level users or tutorials of established tools (e.g., Ray). LABEL:tbl:workload reports our selected notebooks’ dataset sizes and runtimes.

Additionally, we select 146 data science library classes commonly used in data science notebooks, on which we evaluate the robustness of Kishu’s modification detection and time-traveling correctness. LABEL:tbl:library reports a classification of the libraries.

Methods

We evaluate Kishu against existing tools capable of enabling time-travelling on notebooks to various degrees:

•

CRIU (CRIU, 2023): Performs a system-level memory dump of the process hosting the notebook session. The session state is restored by loading the memory dump and reviving the process.
•

CRIU-Incremental (CRIU, 2023): CRIU with snapshot deduplication, storing only dirty memory pages in subsequent snapshots.
•

DumpSession (Foundation, 2023h): An application-level checkpointing tool that serializes the entire session state into one single file.
•

ElasticNotebook (Li et al., [n.d.]): An application-level notebook migration tool that balances data serialization and cell recomputation to achieve optimized session replication times.

Ablation Study

We additionally compare the overhead of Kishu’s update detection mechanism with the following methods:

•

IPyFlow (Shankar et al., 2022): A hybrid dynamic-static analysis tool (i.e., AST analysis with live symbol resolution) that tracks updates at sub-variable (i.e., symbol, e.g., ls[x]) granularity to perform reactive cell executions.
•

Kishu (Check all): Always perform update detection for all Co-variables in the session state after each cell execution, regardless of whether they were accessed by the user in the previous cell.

We consider these methods regarding session state delta tracking time (§ 7.6) to gauge the effectiveness of our overhead reduction via identifying candidates for updated Co-variables.

Environment

Experiments are performed on an Ubuntu server with 2 AMD EPYC 7552 48-Core Processors and 1TB RAM. The checkpoints for all methods are written to a mounted NFS, with disk read and write speeds of 519.8 MB/s and 358.9 MB/s, respectively.

Time measurement

We measure (1) (incremental) checkpoint time as time spent checkpointing (including both tracking and data writing) after each cell execution, (2) checkout time as time to restore the state from checkpoint files, and (3) tracking overhead as time spent after each cell execution tracking Co-variable granularity updates. We clear the page cache between runs.

Reproducibility

Our implementation of Kishu, experiment notebooks, full list of data science library classes we test with, and scripts can be found in our GitHub repository (https://anonymous.4open.science/r/kishu-vldb-E724).

7.2. Generalized and Robust Time Traveling

This section compares the robustness of Kishu’s time-traveling to existing methods. We attempt to checkpoint and checkout session states containing objects from the 146 data science library classes, and compare the number of classes each method fails to checkout. We report the results in LABEL:fig:experiment_robust. Kishu completes checkpoint and checkout for all 146 classes, notably handling 6 classes involving multiprocessing and/or off-CPU data and 7 unserializable classes that CRIU and DumpSession fail to checkpoint and/or checkout, respectively. This is because (1) unlike CRIU, Kishu relies on object reductions to store Co-variables, hence it can store distributed or off-CPU data (e.g., Ray’s dataset (Team, 2024b) or on-GPU tensors (TensorFlow, 2024b; PyTorch, 2024)) and (2) unlike DumpSession, Kishu utilizes fallback recomputation, allowing it to restore Co-variables containing (a) unserializable objects (e.g., pl.LazyFrame (Polars, 2024b)) or (b) serializable objects that cannot be deserialized (e.g., bokeh.figure (Bokeh, 2024)). We present a summary of these noteworthy object classes in LABEL:tbl:experiment_robust_example_table.

7.2.1. Accurate Delta Detection

This section verifies the accuracy of Kishu’s delta detection. For each class, we test whether two VarGraphs generated for a class object differ before and after (1) updating a class attribute (e.g., model.key = ’A’) or (2) updating nothing. We count VarGraph differences in case (1) as successes and in case (2) as false positives.

We report the results in LABEL:tbl:experiment_fast_capture_accuracy. Kishu’s delta detection accurately captures object updates in 120 classes. While Kishu reports false positives in 14 classes (e.g., due to dynamically generated reachable objects), these false positives only affect Kishu’s efficiency (i.e., during checkpointing/checkout); Kishu maintains accuracy by considering these objects to be updated on access. We also find that 12 classes contain silent pickling errors (§ 6.2); nevertheless, Kishu reports these objects to be updated on access similar to false positives, and users may force their (fallback) recomputation if needed (§ 6.2). Notably, Kishu does not have false negatives: Kishu will always report if an object is changed.

7.3. Small Incremental Checkpoint Sizes

This section compares Kishu’s checkpoint sizes with those of existing tools: we checkpoint the session state after each cell execution with each method and measure the total storage size of checkpoints.

We report the results in LABEL:fig:experiment_cheap_checkpoint. Kishu’s cumulative checkpoint size is consistently the smallest on all notebooks and is up to 4.55 $\times$ smaller than the next best alternative (HW-LM). ElasticNotebook is the next best method on 6/8 notebooks and similarly has fault-tolerant mechanisms that let it successfully checkpoint all 8 notebooks, but it can fall short in checkpointing time (§ 7.4). CRIU-Incremental, despite also incrementally checkpointing, is not the next best method on any notebook, losing to ElasticNotebook and DumpSession on 6 and failing to checkpoint on 2, as it (1) incrementally checkpoints at the coarser memory page level (§ 2.3), and (2) does not handle off-CPU data and multiprocessing (§ 7.2). DumpSession fails on the Qiskit notebook due to its inability to handle unserializable data, and checkpointing with CRIU incurs prohibitive storage costs (94GB on TPS) as it non-incrementally checkpoints at the OS level.

7.4. Low Incremental Checkpoint Time

This section compares the checkpoint time of Kishu with that of existing tools: we measure the total time spent by each method creating checkpoints after each cell execution.

We report the results in LABEL:fig:experiment_fast_checkpoint. Kishu’s cumulative checkpointing time is the lowest on 5/8 notebooks, being only up to 15.5% of notebook runtime (HW-LM) and up to 5.12 $\times$ faster (HW-LM) than the next best alternative on these notebooks. While CRIU-Incremental checkpoints faster compared to Kishu on 3/8 notebooks owing to writing memory pages being faster than serialization for unit amount of data, the difference is negligible (up to 3.03 $\times$ , StoreSales) compared to the reliability issues (§ 7.2), space inefficiency (§ 7.3), and it being consistently slower than Kishu by more than an order of magnitude for checkout (§ 7.5). Compared to ElasticNotebook, Kishu achieves fast checkpointing by being EAFP-based (Goodger, 2014): if it fails to store a Co-variable, it will simply recompute it upon checkout via fallback recomputation. This allows it to skip the per-execution profiling steps (i.e., for data sizes and serializability) present in ElasticNotebook’s LBYL-based (Goodger, 2014) optimization (for which data to store/recompute), which can cause checkpoint times slower than DumpSession on 2/8 notebooks.

7.5. Fast Incremental Checkout

This section compares the efficiency of Kishu’s incremental checkout with the (non-incremental) checkout of existing methods. We generate per-cell-execution checkpoints on the notebooks following the methodology in § 7.3 and § 7.4, then measure the time it takes for each method to checkout to a previous state (i.e., undo, § 7.5.1) or checkout to a different execution branch (§ 7.5.2).

7.5.1. Fast Execution Undo

For each notebook, we measure the time it takes to undo cells containing dataframe operations (notebook-D) and plot modifications (notebook-P).

We report the results in LABEL:fig:experiment_checkout_undo. Kishu achieves sub-second cell execution rollbacks on all test cases, and is up to 8.18 $\times$ faster than the next best alternative (StoreSales-D). While CRIU-Incremental achieves checkpoint times comparable with Kishu, it is up to 36 $\times$ slower for checking out (StoreSales-D) and the slowest method for undos on 5/6 notebooks, because it needs to piece together the memory snapshot of the notebook process from multiple (incremental) checkpoint files. Compared to CRIU, DumpSession, and ElasticNotebook, Kishu is the only method that consistently performs sub-second checkouts, showcasing the importance of performing incremental checkout (i.e., vs. non-incremental) by identifying state differences. For example, in the Sklearn notebook, the test case (cell 28) drops a column in an auxiliary dataframe that is 1.4MB in size (compared to the 133MB main dataframe). Kishu identifies that it only needs to load the auxiliary dataframe from before the cell execution and undoes the operation in 0.4 seconds; however, other methods all require the entire session state to be overwritten with a complete load of checkpoint data, taking upwards of 6 seconds to do so (and in CRIU and CRIU-Incremental’s case, also killing and restarting the current notebook process).

7.5.2. Fast Path Exploration

For each notebook, we (1) run the notebook end-to-end, (2) checkout to the state before any models are trained, (3) rerun to the end of the notebook (thus creating a second branch), and measure the time taken to switch back to the first branch containing different models and plots.

We report the results in LABEL:fig:experiment_checkout_branch. Similar to § 7.5.1, Kishu performs sub-second branch switching on 4/6 notebooks by updating (only) the models and plots differing between the branches (i.e., not the input dataframes) and does so up to 9.02 $\times$ faster than the next best alternative (TorchGPU). In the StoreSales notebook, even when there is considerable divergence between the branches (i.e., new auxiliary dataframes are created along with ML models and plots), Kishu still performs branch switching in a relatively fast 1.73 seconds, which is 4.19 $\times$ faster than the next best method (DumpSession).

7.6. Fast Delta Detection

This section investigates overhead of Kishu’s Co-variable granularity state tracking. We compare the cumulative time Kishu spends to track per-execution state delta with Kishu (Check all) and IPyFlow.

We report the results in LABEL:fig:experiment_fast_capture. Kishu is consistently the fastest method for detecting the state delta and is up to 11.42 $\times$ faster than the next best method out of IPyFlow and Kishu (Check all). Notably, the detection overhead is a maximum of 3.0% of the notebook runtime (Sklearn), which is (only) an average of 0.23 seconds per cell. Kishu’s fast delta detection can be attributed to it operating at a coarser-grained granularity uniquely suitable for Kishu’s use case compared to IPyFlow’s symbol-level detection (§ 2.3), which can incur up to 2.30 $\times$ overhead in terms of notebook runtime (HW-LM). Finally, as shown by the comparison between Kishu and Kishu (Check all), Kishu’s ability to limit the number of objects it has to check for changes after each cell execution via identifying candidate Co-variables for updates is necessary: it reduces the overhead from 136s to 10s on the Sklearn notebook (University, 2021), which we study in § 7.6.1.

7.6.1. Usefulness of Candidates

This section verifies the usefulness of Kishu’s identification of candidate Co-variable updates by studying the number of variables (i.e., names, e.g., x) and percentage of session state (w.r.t. memory size) that Kishu checks for updates after each cell execution on the Sklearn notebook.

We report the results in LABEL:fig:experiment_fast_capture_candidates: we observe that while the user continually defines new variables throughout the session, each cell execution is highly atomic and refers to only a few variables (0-8) at a time (LABEL:fig:experiment_fast_capture_candidates_vars). Kishu successfully captures the references, which it utilizes to then check only a small portion of objects ( $\sim$ 15%, LABEL:fig:experiment_fast_capture_candidates_objs). This effectively bounds Kishu’s update detection overhead to be largely independent of the memory size of the session state.

8. Related Work

This section covers related work in Time-Traveling in Databases and conventional programs, other applicable methods for checkpointing/restoring notebook states, and notebook lineage tracing.

Time-Traveling Databases

Many existing DBMS support Point-in-time-Recovery (PITR) by using a combination of (incremental) checkpointing and logging (Oracle, 2024a; Group, 1996b; Soroush and Balazinska, 2013; Schule et al., 2019; Xu et al., 2017; Antonopoulos et al., 2019). Incremental checkpoints are commonly created at the row level where deltas would store differing records between versions of tables (Antonopoulos et al., 2019; Oracle, 2024a), while individual operations would be recorded in logs using techniques such as ARIES (Mohan et al., 1992; Mohan and Rothermel, 1988) or WBL (Arulraj et al., 2016). Then, to return to a previous state (e.g., in the case of a database failure), the DBMS will load the appropriate checkpoint and replay the logs up to the desired state (e.g., defined by state snapshots as in MVCC (Bernstein and Goodman, 1983)). More recently, PITR features have been added to general object stores such as SQLite (LITEREPLICA, 2024), MLFlow (Zaharia et al., 2018), and DeltaLake (Armbrust et al., 2020), which can hold arbitrary objects (i.e., blobs) in addition to conventional tables. PITR for DBMS or object stores is largely orthogonal and inapplicable to time-traveling in notebooks, as unlike (full) PITR after DBMS failure, notebook state restoration needs to be performed into an existing kernel at interactive speeds. Hence, time-traveling (i.e., restoring) via operation replay is less desirable for notebooks: while notebook cell executions are typically fewer in count (Google and X, 2022) compared to row transactions, each operation can take much longer (e.g., training an ML model vs. writing a row). Kishu addresses these differences in problem setting by incrementally checkpointing the state at every step to ensure interactive state restoration times while utilizing novel delta detection techniques to maintain low checkpoint storage overhead, only utilizing (logged) cell replay for fallback recomputations to restore missing or problematic data.

Data Serialization for Checkpoint/Restore

Notebook processes on Jupyter-based platforms can be checkpointed by serializing the data in the session state with various libraries (Foundation, 2023f, g, e, d; MongoDB, 2023; MessagePack, 2024; DuVall, 2015; cloudpipe, 2024). There exists a variety of checkpointing tools built on these libraries: on-disk KV-stores can save individual variables (Team, 2023c; shove, 2024; 0xnurl, 2024; Python, 2024; Foundation, 2024b; chest, 2024), DumpSession (Foundation, 2023h) writes dumps containing the serialized bytestring of the session state, ElasticNotebook (Li et al., [n.d.]) combines data storage/loading with cell replay for optimized session migration, and Tensorflow (Google, 2023b) and Pytorch (et al., 2018) offer periodic checkpointing during ML model training. These works do not checkpoint or restore incrementally, or come with limitations or require significant user effort: Dill and ElasticNotebook always checkpoint the entire state, and their checkpoint files must be loaded in their entirety for restoration. The checkpointing in Tensorflow (Google, 2023b) and Pytorch (et al., 2018) is limited to objects in their respective libraries. While KV stores can store and load individual variables, they do not include delta detection and inter-variable reference preservation, which must be manually handled by the user. In comparison, Kishu addresses all of these limitations by performing low-overhead incremental checkpointing and restoration while featuring shared reference preservation (i.e., correctness), and works with almost all data science libraries (§ 7.2).

Memory Snapshotting for Checkpoint/Restore

There exists a variety of OS-level checkpointing tools that support incrementally checkpointing a process for later restoration (CRIU, 2023; Ansel et al., 2009; Chen et al., 1997; Plank et al., 1994; Garg et al., 2018; Jain and Cooperman, 2020; Li and Lan, 2010). Typically, these tools create incremental checkpoints by identifying and storing dirty memory pages, then piece back together the process image (across multiple files) for restoration (CRIU, 2023). The main limitations of these tools are (1) large incremental checkpoint file sizes resulting from coarse page-level deltas (Elnozahy et al., 2002), (2) inability to checkpoint across multiple processes (CRIU, 2024), and (3) that they can only restore a process from scratch: while we found a patent (Neary et al., 2007) and paper (Ferreira et al., 2011) addressing these limitations by enabling OS-level checkpointing for multiprocessing jobs and sub-memory-page granularity incremental checkpointing, respectively, we have been unable to locate working implementations. In comparison, for notebook states, Kishu is able to achieve significantly lower checkpoint overheads via finer Co-variable granularity deltas, checkpoint multiprocessing and off-CPU notebooks via application-level instructions, and perform fast incremental restore with minimal data loading via state difference detection and leveraging existing data in the process/kernel.

Reverse Debugging

Reverse debuggers periodically checkpoint the session state, such that a previous intermediate program state can be returned to by loading the appropriate checkpoint and then replaying certain program executions (GDB, 2022; Engblom, 2012; Pothier and Tanter, 2009; Bellard, 2005; VMWare, 2009). Specifically for Python, IPyFlow (Shankar et al., 2022), IncPy (Guo and Engler, 2011), and PyTrace (PyCrunch, 2020) memoize cell and/or function execution results, which can be returned to later for reverse debugging. To the best of our knowledge, existing reverse debuggers do not checkpoint incrementally: users have to balance high overhead from frequent checkpointing against long restore times (via replay) from sparse checkpointing. In comparison, Kishu performs low-overhead incremental checkpointing while enabling fast incremental restore for notebook states.
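The trade-off between checkpoint frequency and replay length can be sketched with simple arithmetic. The model below is an assumption made for illustration (full checkpoints taken at steps 0, k, 2k, and so on, with restoration replaying every step after the nearest earlier checkpoint); the function names are hypothetical and do not correspond to any particular debugger.

```python
def restore_plan(target_step: int, interval: int) -> tuple[int, int]:
    """Return (checkpoint step to load, number of steps to replay)."""
    checkpoint_step = (target_step // interval) * interval
    return checkpoint_step, target_step - checkpoint_step


def checkpoints_taken(total_steps: int, interval: int) -> int:
    """Count checkpoints at steps 0, interval, 2*interval, ... up to total_steps."""
    return total_steps // interval + 1


# Frequent checkpointing: many checkpoints, short replay.
print(checkpoints_taken(100, 5), restore_plan(99, 5))    # prints: 21 (95, 4)

# Sparse checkpointing: few checkpoints, long replay.
print(checkpoints_taken(100, 50), restore_plan(99, 50))  # prints: 3 (50, 49)
```

Under this model, each of the 21 or 3 checkpoints is a full copy of the state, so shrinking the replay distance multiplies storage and checkpoint time accordingly.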

Lineage Tracing in Computational Notebooks

Lineage tracing in notebook systems has recently been explored for multi-version notebook replay (Manne et al., 2022), recommending notebook interactions (Macke et al., 2020), creating reproducible notebook containers (Ahmad et al., 2022), and session state migration (Li et al., [n.d.]; Cunha et al., 2021). Tracing methods can be divided into two categories: static code analysis (Foundation, 2023a) and live code instrumentation (GDB, 2022). Tracers in notebook systems need to balance the use of the two: relying too heavily on static analysis results in many false positives (e.g., due to control flows) (Li et al., [n.d.]), while live analysis can incur prohibitive overhead, especially when tracing at a fine-grained (e.g., object) level (Shankar et al., 2022; Macke et al., 2020). In Kishu, we propose the Co-variable granularity and its accompanying tracing technique using the VarGraph structure: by solely utilizing live analysis at a sufficient granularity (for our use case), we simultaneously minimize false positives while avoiding high overhead from fine-grained live analysis.
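The control-flow false positive mentioned above can be reproduced with a small sketch built on Python's ast module. The cell contents and the helper statically_assigned are hypothetical, and the live check shown here (comparing namespace bindings before and after execution) is a deliberate simplification that detects only rebinding, not in-place mutation; it is not the VarGraph technique.

```python
import ast

# A hypothetical notebook cell whose branch is never taken at run time.
cell = """
flag = False
if flag:
    df = None
total = 1
"""


def statically_assigned(source: str) -> set[str]:
    """Collect every name that appears as an assignment target."""
    return {
        node.id
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store)
    }


static_names = statically_assigned(cell)

# Live check: run the cell and see which bindings actually changed.
namespace = {"df": "original"}
before = dict(namespace)
exec(cell, namespace)
live_names = {
    name
    for name, value in namespace.items()
    if name != "__builtins__" and (name not in before or before[name] is not value)
}

print(sorted(static_names))  # prints: ['df', 'flag', 'total']
print(sorted(live_names))    # prints: ['flag', 'total']
```

Static analysis reports df as modified because the assignment exists in the source, while the live check shows that df was never touched; a checkpointing system relying on the static result would store df unnecessarily.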

9. Conclusion

In this work, we have proposed Kishu, a new computational notebook system that offers efficient and fault-tolerant time-traveling between arbitrary notebook states. To achieve this, Kishu captures the evolution of session states at the novel Co-variable granularity, enabling time- and space-efficient incremental checkpointing of state deltas, which Kishu uses in conjunction with accurate state delta identification to perform incremental checkout with minimal data loading. Its core contributions include (1) low-overhead state delta detection, (2) branch-based session state versioning, and (3) efficient time-traveling with high generalizability (preserving complex inter-variable dependencies) and fault tolerance (handling missing or corrupted data through fallback recomputation). We have demonstrated that Kishu is compatible with 146 object classes from popular data science libraries, and can reduce incremental checkpoint storage overhead and checkout time by up to 4.55 $\times$ and 9.02 $\times$ , respectively, on real-world data science notebooks.

References

cri ([n.d.]) [n.d.]. CRIU CUDA Support. https://criu.org/What_cannot_be_checkpointed#Devices.
0xnurl (2024) 0xnurl. 2024. shelve — Python object persistence. https://github.com/0xnurl/redis-shelve.
Ahmad et al. (2022) Raza Ahmad, Naga Nithin Manne, and Tanu Malik. 2022. Reproducible Notebook Containers using Application Virtualization. In 2022 IEEE 18th International Conference on e-Science (e-Science). IEEE, 1–10.
Amershi et al. (2019) Saleema Amershi, Andrew Begel, Christian Bird, Robert DeLine, Harald Gall, Ece Kamar, Nachiappan Nagappan, Besmira Nushi, and Thomas Zimmermann. 2019. Software engineering for machine learning: A case study. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE, 291–300.
Ansel et al. (2009) Jason Ansel, Kapil Arya, and Gene Cooperman. 2009. DMTCP: Transparent checkpointing for cluster computations and the desktop. In 2009 IEEE International Symposium on Parallel & Distributed Processing. IEEE, 1–12.
Antonopoulos et al. (2019) Panagiotis Antonopoulos, Peter Byrne, Wayne Chen, Cristian Diaconu, Raghavendra Thallam Kodandaramaih, Hanuma Kodavalla, Prashanth Purnananda, Adrian-Leonard Radu, Chaitanya Sreenivas Ravella, and Girish Mittur Venkataramanappa. 2019. Constant time recovery in Azure SQL database. Proceedings of the VLDB Endowment 12, 12 (2019), 2143–2154.
Anyscale (2024) Anyscale. 2024. Ray - Effortlessly Scale Your Most Complex Workloads. https://www.ray.io/.
Armbrust et al. (2020) Michael Armbrust, Tathagata Das, Liwen Sun, Burak Yavuz, Shixiong Zhu, Mukul Murthy, Joseph Torres, Herman van Hovell, Adrian Ionescu, Alicja Łuszczak, et al. 2020. Delta lake: high-performance ACID table storage over cloud object stores. Proceedings of the VLDB Endowment 13, 12 (2020), 3411–3424.
Arrow (2024) Apache Arrow. 2024. PyArrow - Apache Arrow Python bindings. https://arrow.apache.org/docs/python/index.html.
Arulraj et al. (2016) Joy Arulraj, Matthew Perron, and Andrew Pavlo. 2016. Write-behind logging. Proceedings of the VLDB Endowment 10, 4 (2016), 337–348.
babreu ncsa (2023) babreu ncsa. 2023. NCSA Qiskit Demo May 2023. https://github.com/babreu-ncsa/qiskit/blob/main/demo/QiskitDemo_NCSA_May2023.ipynb.
Barr and Marron (2014) Earl T Barr and Mark Marron. 2014. Tardis: Affordable time-travel debugging in managed runtimes. ACM SIGPLAN Notices 49, 10 (2014), 67–82.
Bayar (2022) Ekrem Bayar. 2022. Store Sales TS Forecasting - A Comprehensive Guide. https://www.kaggle.com/code/ekrembayar/store-sales-ts-forecasting-a-comprehensive-guide/notebook.
Bellard (2005) Fabrice Bellard. 2005. QEMU, a fast and portable dynamic translator.. In USENIX annual technical conference, FREENIX Track, Vol. 41. California, USA, 10–5555.
Bergstra and Bengio (2012) James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of machine learning research 13, 2 (2012).
Bernstein and Goodman (1983) Philip A Bernstein and Nathan Goodman. 1983. Multiversion concurrency control—theory and algorithms. ACM Transactions on Database Systems (TODS) 8, 4 (1983), 465–483.
Bokeh (2024) Bokeh. 2024. bokeh.figure. https://docs.bokeh.org/en/latest/docs/reference/plotting/figure.html.
Brachmann and Spoth (2020) Michael Brachmann and William Spoth. 2020. Your notebook is not crumby enough, REPLace it. In Conference on Innovative Data Systems Research (CIDR).
Brown et al. (2023) Colin Brown, Hamed Alhoori, and David Koop. 2023. Facilitating Dependency Exploration in Computational Notebooks. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics. 1–7.
Chattopadhyay et al. (2020) Souti Chattopadhyay, Ishita Prasad, Austin Z Henley, Anita Sarma, and Titus Barik. 2020. What’s wrong with computational notebooks? Pain points, needs, and design opportunities. In Proceedings of the 2020 CHI conference on human factors in computing systems. 1–12.
Chen et al. (1997) Yuqun Chen, James S Plank, and Kai Li. 1997. CLIP: A checkpointing tool for message-passing parallel programs. In Proceedings of the 1997 ACM/IEEE conference on Supercomputing. 1–11.
chest (2024) chest. 2024. chest - Simple on-disk dictionary. https://pypi.org/project/chest/.
cloudpipe (2024) cloudpipe. 2024. CloudPickle. https://github.com/cloudpipe/cloudpickle.
contributors (2024) Optuna contributors. 2024. Optuna: A hyperparameter optimization framework. https://optuna.readthedocs.io/en/stable/.
Contributors (2024) Torch Contributors. 2024. torchvision - Image Transformers. https://pytorch.org/vision/stable/index.html.
CRIU (2023) CRIU. 2023. Linux CRIU. https://criu.org/Main_Page.
CRIU (2024) CRIU. 2024. CRIU - What cannot be checkpointed. https://criu.org/What_cannot_be_checkpointed.
Crotty et al. (2015) Andrew Crotty, Alex Galakatos, Emanuel Zgraggen, Carsten Binnig, and Tim Kraska. 2015. Vizdom: interactive analytics through pen and touch. Proceedings of the VLDB Endowment 8, 12 (2015), 2024–2027.
Cunha et al. (2021) Renato LF Cunha, Lucas C Villa Real, Renan Souza, Bruno Silva, and Marco AS Netto. 2021. Context-aware Execution Migration Tool for Data Science Jupyter Notebooks on Hybrid Clouds. In 2021 IEEE 17th International Conference on eScience (eScience). IEEE, 30–39.
Devastator (2023) The Devastator. 2023. Bruteforce Clustering. https://www.kaggle.com/code/thedevastator/bruteforce-clustering.
Developers (2024) Photutils Developers. 2024. An Astropy Package for Photometry. https://photutils.readthedocs.io/en/stable/.
development team (2024) Matplotlib development team. 2024. Matplotlib - Figure. https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.figure.html.
Dunne et al. (2012) Cody Dunne, Nathalie Henry Riche, Bongshin Lee, Ronald Metoyer, and George Robertson. 2012. GraphTrail: Analyzing large multivariate, heterogeneous networks while supporting exploration history. In Proceedings of the SIGCHI conference on human factors in computing systems. 1663–1672.
DuVall (2015) Clark DuVall. 2015. serpy: ridiculously fast object serialization. https://serpy.readthedocs.io/en/latest/.
Eichmann et al. (2020) Philipp Eichmann, Emanuel Zgraggen, Carsten Binnig, and Tim Kraska. 2020. Idebench: A benchmark for interactive data exploration. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 1555–1569.
Elnozahy et al. (2002) Elmootazbellah Nabil Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B Johnson. 2002. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys (CSUR) 34, 3 (2002), 375–408.
Engblom (2012) Jakob Engblom. 2012. A review of reverse debugging. In Proceedings of the 2012 System, Software, SoC and Silicon Debug Conference. IEEE, 1–6.
et al. (2018) Lightning AI et al. 2018. PyTorch ModelCheckpoint. https://pytorch-lightning.readthedocs.io/en/stable/api/pytorch_lightning.callbacks.ModelCheckpoint.html.
Face (2024) Hugging Face. 2024. Hugging Face - The AI community building the future. https://huggingface.co/.
Ferreira et al. (2011) Kurt B Ferreira, Rolf Riesen, Ron Brightwell, Patrick Bridges, and Dorian Arnold. 2011. libhashckpt: hash-based incremental checkpointing using gpu’s. In European MPI Users’ Group Meeting. Springer, 272–281.
Foundation (2023a) Python Software Foundation. 2023a. Python - AST. https://docs.python.org/3/library/ast.html.
Foundation (2023b) Python Software Foundation. 2023b. Python - Generators. https://wiki.python.org/moin/Generators.
Foundation (2023c) Python Software Foundation. 2023c. Python Hashlib. https://docs.python.org/3/library/hashlib.html.
Foundation (2023d) Python Software Foundation. 2023d. Python JSON. https://docs.python.org/3/library/json.html.
Foundation (2023e) Python Software Foundation. 2023e. Python Marshal. https://docs.python.org/3/library/marshal.html.
Foundation (2023f) Python Software Foundation. 2023f. Python Pickle Documentation. https://docs.python.org/3/library/pickle.html.
Foundation (2024a) The Linux Foundation. 2024a. PyTorch. https://pytorch.org/.
Foundation (2023g) The Uncertainty Quantification Foundation. 2023g. Dill - PyPi. https://pypi.org/project/dill/.
Foundation (2023h) The Uncertainty Quantification Foundation. 2023h. Dill dump session. https://dill.readthedocs.io/en/latest/dill.html.
Foundation (2024b) Zope Foundation. 2024b. ZODB programming guide. https://zodb.org/en/latest/guide/index.html.
Garg et al. (2018) Rohan Garg, Apoorve Mohan, Michael Sullivan, and Gene Cooperman. 2018. CRUM: Checkpoint-restart support for CUDA’s unified memory. In 2018 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 302–313.
GDB (2022) GDB. 2022. GDB Watchpoints. https://sourceware.org/gdb/download/onlinedocs/gdb/Set-Watchpoints.html.
GDB (2024) GDB. 2024. GDB - Running programs backward. https://sourceware.org/gdb/current/onlinedocs/gdb.html/Reverse-Execution.html.
Geron (2023) Aurélien Geron. 2023. Chapter 4 – Training Models. github.com/ageron/handson-ml3/blob/main/04_training_linear_models.ipynb.
Git (2024a) Git. 2024a. Git - Commit Graph. https://git-scm.com/docs/commit-graph.
Git (2024b) Git. 2024b. git --fast-version-control. https://git-scm.com/.
Goodger (2014) David Goodger. 2014. Code like a pythonista: Idiomatic python. Archived from the original on 27 (2014).
Google (2023a) Google. 2023a. Keras. https://keras.io/.
Google (2023b) Google. 2023b. Tensorflow Checkpoint. https://www.tensorflow.org/guide/checkpoint.
Google and X (2022) Google and X. 2022. Google AI4Code – Understand Code in Python Notebooks. https://www.kaggle.com/competitions/AI4Code.
Group (1996a) The PostgreSQL Global Development Group. 1996a. PostgreSQL - Continuous Archiving and Point-in-Time Recovery (PITR). https://www.postgresql.org/docs/current/continuous-archiving.html.
Group (1996b) The PostgreSQL Global Development Group. 1996b. PostgreSQL: The World’s Most Advanced Open Source Relational Database. https://www.postgresql.org/.
Groups (2024) Google Groups. 2024. Time Travel Analysis or Undo in Jupyter. https://groups.google.com/g/jupyter/c/hMPDL7Iw_BQ/m/MWYv1d5cAwAJ.
Guo and Engler (2011) Philip J Guo and Dawson Engler. 2011. Using automatic persistent memoization to facilitate data analysis scripting. In Proceedings of the 2011 International Symposium on Software Testing and Analysis. 287–297.
HuggingFace (2024a) HuggingFace. 2024a. HuggingFace - BERT Tokenizer. https://huggingface.co/docs/transformers/en/main_classes/tokenizer.
HuggingFace (2024b) HuggingFace. 2024b. HuggingFace - Pipelines. https://huggingface.co/docs/transformers/en/main_classes/pipelines.
IPython (2024a) IPython. 2024a. IPython Class. https://ipython.org/ipython-doc/3/api/generated/IPython.html.
IPython (2024b) IPython. 2024b. IPython Events. https://ipython.readthedocs.io/en/stable/config/callbacks.html.
Jain and Cooperman (2020) Twinkle Jain and Gene Cooperman. 2020. Crac: Checkpoint-restart architecture for cuda with streams and uvm. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–15.
Johnson (2020) Jeremiah W Johnson. 2020. Benefits and pitfalls of jupyter notebooks in the classroom. In Proceedings of the 21st Annual Conference on Information Technology Education. 32–37.
Jupyter (2023) Project Jupyter. 2023. Jupyter Notebook. https://jupyter.org/.
Juric et al. (2021) Mario Juric, Steven Stetzler, and Colin T Slater. 2021. Checkpoint, Restore, and Live Migration for Science Platforms. arXiv preprint arXiv:2101.05782 (2021).
Koop and Patel (2017) David Koop and Jay Patel. 2017. Dataflow notebooks: encoding and tracking dependencies of cells. In 9th USENIX Workshop on the Theory and Practice of Provenance (TaPP 2017).
Kraska (2021) Tim Kraska. 2021. Northstar: An interactive data science system. (2021).
KUMBHAKAR (2024) ROUNAK KUMBHAKAR. 2024. pytorch resnet34 96.6 accuracy. https://www.kaggle.com/code/rounakkumbhakar/pytorch-resnet34-96-6-accuracy/notebook.
Li and Lan (2010) Yawei Li and Zhiling Lan. 2010. FREM: A fast restart mechanism for general checkpoint/restart. IEEE Trans. Comput. 60, 5 (2010), 639–652.
Li et al. ([n.d.]) Zhaoheng Li, Pranav Gor, Rahul Prabhu, Hui Yu, Yuzhou Mao, and Yongjoo Park. [n.d.]. ElasticNotebook: Enabling Live Migration for Computational Notebooks. ([n. d.]).
LITEREPLICA (2024) LITEREPLICA. 2024. SQLite - Point in time recovery. https://litereplica.io/sqlite-point-in-time-recovery.html.
Loria (2024) Steven Loria. 2024. TextBlob: Simplified Text Processing. https://textblob.readthedocs.io/en/dev/.
Macke et al. (2020) Stephen Macke, Hongpu Gong, Doris Jung-Lin Lee, Andrew Head, Doris Xin, and Aditya Parameswaran. 2020. Fine-grained lineage for safer notebook interactions. arXiv preprint arXiv:2012.06981 (2020).
Manne et al. (2022) Naga Nithin Manne, Shilvi Satpati, Tanu Malik, Amitabha Bagchi, Ashish Gehani, and Amitabh Chaudhary. 2022. CHEX: Multiversion Replay with Ordered Checkpoints. arXiv preprint arXiv:2202.08429 (2022).
MessagePack (2024) MessagePack. 2024. MessagePack. https://github.com/msgpack.
Mohan et al. (1992) Chandrasekaran Mohan, Don Haderle, Bruce Lindsay, Hamid Pirahesh, and Peter Schwarz. 1992. ARIES: A transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging. ACM Transactions on Database Systems (TODS) 17, 1 (1992), 94–162.
Mohan and Rothermel (1988) C Mohan and K Rothermel. 1988. Recovery protocol for nested transactions using writeahead logging. IBM Tech. Disclosure Bull. 31, 4 (Sept 1988) (1988).
MongoDB (2023) Inc. MongoDB. 2023. BSON. https://pymongo.readthedocs.io/en/stable/api/bson/index.html.
Moritz et al. (2018) Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I Jordan, et al. 2018. Ray: A distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 561–577.
Mueller (2024) Andreas Mueller. 2024. WordCloud for Python documentation. https://amueller.github.io/word_cloud/.
Neary et al. (2007) Michael Oliver Neary, Ashwani Wason, Shvetima Gulati, and Fabrice Ferval. 2007. Method and system for providing transparent incremental and multiprocess checkpointing to computer applications. US Patent 7,293,200.
NumFOCUS (2023) Inc. NumFOCUS. 2023. Pandas. https://pandas.pydata.org/docs/index.html.
NumFOCUS (2024) Inc. NumFOCUS. 2024. Pandas - DataFrame. https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html.
Oracle (2024a) Oracle. 2024a. MySQL. https://www.mysql.com/.
Oracle (2024b) Oracle. 2024b. MySQL - Point-in-Time (Incremental) Recovery. https://dev.mysql.com/doc/refman/8.0/en/point-in-time-recovery.html.
Ormond (2018) Jim Ormond. 2018. ACM Recognizes Innovators Who Have Shaped the Digital Revolution.
Perkel (2018) Jeffrey M Perkel. 2018. Why Jupyter is data scientists’ computational notebook of choice. Nature 563, 7732 (2018), 145–147.
Phang et al. (2013) Khoo Yit Phang, Jeffrey S Foster, and Michael Hicks. 2013. Expositor: Scriptable time-travel debugging with first-class traces. In 2013 35th International Conference on Software Engineering (ICSE). IEEE, 352–361.
photutils (2024) photutils. 2024. photutils - ImageDepth. https://photutils.readthedocs.io/en/stable/api/photutils.utils.ImageDepth.html.
Plank et al. (1994) James S Plank, Micah Beck, Gerry Kingsley, and Kai Li. 1994. Libckpt: Transparent checkpointing under unix. Computer Science Department.
Plotly (2024) Plotly. 2024. Plotly - Low-Code Python Data Apps. https://plotly.com/.
Polars (2024a) Polars. 2024a. Polars - DataFrames for the new era. https://pola.rs/.
Polars (2024b) Polars. 2024b. Polars - LazyFrame. https://docs.pola.rs/py-polars/html/reference/lazyframe/index.html.
Posit Software (2023) PBC Posit Software, PBC formerly RStudio. 2023. Posit RStudio. https://posit.co/.
Pothier and Tanter (2009) Guillaume Pothier and Éric Tanter. 2009. Back to the future: Omniscient debugging. IEEE software 26, 6 (2009), 78–85.
Project (2024) NLTK Project. 2024. NLTK - Natural Language Toolkit. https://www.nltk.org/.
PyCrunch (2020) PyCrunch. 2020. PyTrace - Time Travel Debugging for Python. https://pytrace.com/.
Python (2024) Python. 2024. shelve — Python object persistence. https://docs.python.org/3/library/shelve.html.
PyTorch (2024) PyTorch. 2024. torch.tensor. https://pytorch.org/docs/stable/tensors.html.
Qiskit (2024) Qiskit. 2024. Qiskit - An open-source SDK for working with quantum computers at the level of extended quantum circuits, operators, and primitives. https://pypi.org/project/qiskit/.
ray project (2024) ray project. 2024. Overview of Ray. https://github.com/ray-project/ray-educational-materials/blob/main/Introductory_modules/Overview_of_Ray.ipynb.
Schule et al. (2019) Maximilian E Schule, Lukas Karnowski, Josef Schmeißer, Benedikt Kleiner, Alfons Kemper, and Thomas Neumann. 2019. Versioning in main-memory database systems: From musaeusdb to tardisdb. In Proceedings of the 31st International Conference on Scientific and Statistical Database Management. 169–180.
scikit learn (2024) scikit learn. 2024. scikit-learn - Machine Learning in Python. https://scikit-learn.org/stable/.
scikit-learn developers (2024) scikit-learn developers. 2024. sklearn - GaussianMixture. https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html.
scikit-learn intelex (2024) scikit-learn intelex. 2024. Intel(R) Extension for Scikit-learn is a seamless way to speed up your Scikit-learn application. https://pypi.org/project/scikit-learn-intelex/.
SciPy (2024) SciPy. 2024. SciPy - Fundamental algorithms for scientific computing in Python. https://scipy.org/.
Shankar et al. (2022) Shreya Shankar, Stephen Macke, Sarah Chasins, Andrew Head, and Aditya Parameswaran. 2022. Bolt-on, compact, and rapid program slicing for notebooks. Proceedings of the VLDB Endowment 15, 13 (2022), 4038–4047.
shove (2024) shove. 2024. shove. https://pypi.org/project/shove/.
Soroush and Balazinska (2013) Emad Soroush and Magdalena Balazinska. 2013. Time travel in a scientific array database. In 2013 IEEE 29th International Conference on Data Engineering (ICDE). IEEE, 98–109.
Spark (2024a) Apache Spark. 2024a. PySpark Documentation. https://spark.apache.org/docs/3.3.1/api/python/index.html.
Spark (2024b) Apache Spark. 2024b. pyspark.sql module. https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html.
SQLite (2024) SQLite. 2024. SQLite. https://www.sqlite.org/.
StackOverflow (2024) StackOverflow. 2024. Undo Pandas Dataframe Column Drop - StackOverflow. https://stackoverflow.com/questions/54284994/how-to-get-columnsseries-back-from-dropped-table.
statsmodels developers (2024) statsmodels developers. 2024. statsmodels - statistical models, hypothesis tests, and data exploration. https://www.statsmodels.org/stable/index.html.
Team (2024a) NumPy Team. 2024a. NumPy - the fundamental package for scientific computing with Python. https://numpy.org/.
Team (2024b) Ray Team. 2024b. ray.data.Dataset. https://docs.ray.io/en/latest/data/api/doc/ray.data.Dataset.html.
Team (2023a) The IPython Development Team. 2023a. IPython Interactive Computing. https://ipython.org/.
Team (2023b) The IPython Development Team. 2023b. Jupyter checkpoint. https://jupyter-server.readthedocs.io/en/latest/developers/contents.html.
Team (2023c) The IPython Development Team. 2023c. Jupyter store magic. https://ipython.readthedocs.io/en/stable/config/extensions/storemagic.html.
Team (2023d) The Matplotlib Development Team. 2023d. Matplotlib. https://matplotlib.org/.
TensorFlow (2024a) TensorFlow. 2024a. TensorFlow - An end-to-end platform for machine learning. https://www.tensorflow.org/.
TensorFlow (2024b) TensorFlow. 2024b. tf.Tensor. https://www.tensorflow.org/api_docs/python/tf/Tensor.
transformers (2024) transformers. 2024. transformers - State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow. https://pypi.org/project/transformers/.
University (2021) Cornell University. 2021. SKLearn Tweet Classification. https://github.com/CornellCAC/CVW_PyDataSci2/blob/master/code/sklearn_tweet_classification.ipynb.
Vlad (2022) Devlikamov Vlad. 2022. [TPS-Mar] Fast workflow using scikit-learn-intelex. https://www.kaggle.com/code/lordozvlad/tps-mar-fast-workflow-using-scikit-learn-intelex/notebook.
VMWare (2009) VMWare. 2009. VMWare - Reverse debugging. http://www.replaydebugging.com/2008/08/vmware-workstation-65-reverse-and.html.
Wagenmakers and Farrell (2004) Eric-Jan Wagenmakers and Simon Farrell. 2004. AIC model selection using Akaike weights. Psychonomic bulletin & review 11, 1 (2004), 192–196.
Wang et al. (2022) April Yi Wang, Will Epperson, Robert A DeLine, and Steven M Drucker. 2022. Diff in the loop: Supporting data comparison in exploratory data analysis. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–10.
Waskom (2024) Michael Waskom. 2024. seaborn: statistical data visualization. https://seaborn.pydata.org/.
xgboost developers (2024) xgboost developers. 2024. XGBoost Documentation. https://xgboost.readthedocs.io/en/stable/.
Xu et al. (2017) Liqi Xu, Silu Huang, SiLi Hui, Aaron J Elmore, and Aditya Parameswaran. 2017. Orpheusdb: a lightweight approach to relational dataset versioning. In Proceedings of the 2017 ACM International Conference on Management of Data. 1655–1658.
Zaharia et al. (2018) Matei Zaharia, Andrew Chen, Aaron Davidson, Ali Ghodsi, Sue Ann Hong, Andy Konwinski, Siddharth Murching, Tomas Nykodym, Paul Ogilvie, Mani Parkhe, et al. 2018. Accelerating the machine learning lifecycle with MLflow. IEEE Data Eng. Bull. 41, 4 (2018), 39–45.
Zaharia et al. (2010) Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster computing with working sets. In 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 10).
Zgraggen et al. (2014) Emanuel Zgraggen, Robert Zeleznik, and Steven M Drucker. 2014. PanoramicData: Data analysis through pen & touch. IEEE transactions on visualization and computer graphics 20, 12 (2014), 2112–2121.