Dealing with Long-Running Jupyter Notebooks
We’ve had a number of customers struggle with long-running Jupyter notebooks, ones that take several hours or more to execute. Often they came to us because these notebooks would, at some point, lose connectivity between the server and the browser, as is common with cloud services. Normally a cloud service reconnects gracefully and there are no issues, but if a Jupyter connection is lost, Jupyter stops saving any output. Jupyter notebooks store all their state in the browser, so if there is a connectivity issue between the server running the code and the browser viewing it, the state of the notebook is lost.
Struggling with long-running notebooks? Saturn Cloud offers powerful solutions for efficient and effective notebook execution. Request a free demo to learn more.
If our customer’s long-running code has an error in it and the connection cuts out, the user has no way to see what output the code produced or what error messages it created. Trying to debug these models without output is an exercise in futility. This isn’t an issue when using Jupyter locally, since a computer’s connection to itself is perfectly stable, but it is an issue when working in the cloud.
Background
Jupyter notebooks store all their state in the browser and thus require constant network connectivity. This is a well-known design issue with many implications. While network issues won’t cause the code in a notebook to stop executing, they will affect how the output gets saved to your notebook. The flow of a Jupyter notebook is:
- the server pushes output to your browser.
- your browser adds it to the notebook object (and renders it to the screen).
- your browser saves the notebook back to the server.
When the network cuts out, this flow breaks and no output is saved. The long-term solution is for Jupyter itself to be modified to handle intermittent connections, which is an active area of discussion, but there is no current timeline for this to be added to open-source Jupyter.
However, there is a short-term strategy.
Solution
We can adjust Jupyter with just a pinch of code so that it saves the output directly to a file on the server. That way, even if network connectivity cuts out, the server still has the output stored on it. It’s not perfect: in an ideal world this output would still show up in the notebook itself, but it’s an improvement to have it stored somewhere instead of lost. Put this code at the top of your long-running notebook:
import logging
import sys

# Open the log file with a small buffer so output is flushed to disk often.
so = open("data.log", "w", 10)

# Jupyter replaces sys.stdout/sys.stderr with OutStream objects; setting
# their echo attribute mirrors everything they receive to the file.
sys.stdout.echo = so
sys.stderr.echo = so

# Point the IPython logger at the same file so tracebacks are captured too.
get_ipython().log.handlers[0].stream = so
get_ipython().log.setLevel(logging.INFO)
Execute that at the top of your notebook. TADA! Now, when you run the notebook, all output will be mirrored in the data.log flat file.
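To check that the mirroring works, you can run something like the following in a later cell. This snippet is our own illustration, not part of the original patch:

print("starting long-running job...")

# Read the log back; it should now contain the line above (and this cell's
# own output as well, since everything written to stdout is echoed).
with open("data.log") as f:
    print(f.read())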
How it works: In the Jupyter notebook, the normal stdout and stderr file objects are replaced with ipykernel.iostream.OutStream objects (that’s how output gets displayed in the browser). These objects have an echo attribute, which defaults to None and which can propagate output to another stream. So the first set of lines sticks a Python file object in place of the echo, and all your normal stdout and stderr output is now also copied to disk. Exceptions are handled by the Python logging system, which in the default configuration isn’t writing to our file, so the last two lines point the IPython log handler at the same file and set the log level.
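If you use this trick in many notebooks, you can wrap the patch in a small helper. This is a minimal sketch of our own (the function name mirror_output_to is hypothetical, not a Jupyter or ipykernel API), assuming it runs inside a notebook where get_ipython() is available:

import logging
import sys

def mirror_output_to(path):
    """Echo stdout/stderr and IPython log records to a file on the server."""
    log_file = open(path, "w", 10)        # small buffer so writes flush often
    sys.stdout.echo = log_file            # OutStream.echo duplicates writes
    sys.stderr.echo = log_file
    ip = get_ipython()                    # available inside IPython/Jupyter
    ip.log.handlers[0].stream = log_file  # send tracebacks to the same file
    ip.log.setLevel(logging.INFO)
    return log_file

log_file = mirror_output_to("data.log")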
Conclusion
With this workaround, the worst pain of having long-running Jupyter notebooks is gone. That said, at Saturn Cloud we generally recommend using better hardware (GPUs) or parallelization (Dask) to avoid having to wait 10 hours for your notebook to run. If your problem isn’t parallelizable, this is a reasonable workaround. If you don’t know how to parallelize it but wish you did, you should talk to us! We’re really good at it.
About Saturn Cloud
Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Request a demo today to learn more.
Saturn Cloud provides customizable, ready-to-use cloud environments for collaborative data teams.
Try Saturn Cloud and join thousands of users moving to the cloud without having to switch tools.