[javaone 2025] Jump-Start Your Data Science Learning with Jupyter Notebooks and Java

Speaker Brian Sam-Bodden @bsbodden

See the table of contents for more posts

General

At school, everything was Python. He then saw that the code had to be rewritten to get to production because it was one giant notebook.
Data science – multidisciplinary field that uses scientific methods, algorithms, processes and systems to extract knowledge and actionable insights from structured/unstructured data
ML – algorithms capable of learning patterns from data and making predictions
Deep learning – specialized ML methods based on neural networks
Foundations – statistics, big data, ML, distributed computing, NLP, Gen AI
Present day – Applied AI – RAG evolves, agents are back

Jupyter Notebooks

Can run tests against notebooks from CI/CD
Spun off from the IPython project in 2014
Browser based notebook interface
Supports code, text, math, plots and other media
Originally supported Julia, Python and R
Code and markdown cells
Can replace some documentation with notebooks. Can add button to run in codelabs
Can only use one language per notebook; can’t mix and match
Important to run cells in order to avoid errors
Good for experimentation and discovery, executable docs, de facto communication medium in the data science/ML/AI community

Build data science stack with Java

Java is the dominant force in the enterprise
Rich data science ecosystem – DJL, Weka, Mahout, MALLET, Flink, H2O.ai, Semantic Kernel, Spark, Smile, MLlib, Jenetics and many more libraries
DL4J no longer maintained. DJL (Deep Java Library) is different
Python – pandas covers matrix operations, displaying data and data frames

Java stack

JupyterLab Docker Stacks image (Notebook is a single notebook; Lab is the interface for showing notebooks)
JJava Jupyter kernel – there are others; this is one of the most stable.
Curated set of Java libraries
Glue code to streamline API usage

JJava Jupyter Kernel

Well maintained
Fork of IJava Kernel
Uses JShell
Java 21
Can write code in a cell without any ceremony – e.g. just a println, or a full class followed by the code to run it [like how the Java Playground at dev.java deals with classes]
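
To make the "full class, then the code to run it" style concrete, here is a small sketch. The names (CellDemo, Greeter) are made up for illustration and are not from the talk. In a notebook, the record would likely sit in one cell and the call in a later cell; it is wrapped in a class with a main method here only so that it also compiles as an ordinary Java file.

```java
// Hypothetical example; names are illustrative, not from the talk.
public class CellDemo {

    // In a notebook, this record could be declared in a cell of its own...
    record Greeter(String name) {
        String greet() {
            return "Hello, " + name + "!";
        }
    }

    // ...and a later cell could contain just the expression:
    //   new Greeter("JavaOne").greet()
    public static String run() {
        return new Greeter("JavaOne").greet();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```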

Glue code

Glue code can live in a jar. Don’t have to put all the code in the notebook.
Good to have the methods be static
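
As a rough idea of what static glue code could look like, here is a minimal sketch using only the JDK. The Glue class and its methods are hypothetical (this is not the speaker's DS helper), and it assumes simple CSV with a header row and no quoted fields. The sample rows are only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical glue class; not the helper shown in the talk.
public final class Glue {

    private Glue() {}

    /** Parses simple comma-separated text (no quoted fields) into rows, skipping the header line. */
    public static List<String[]> readCsv(String text) {
        List<String[]> rows = new ArrayList<>();
        String[] lines = text.strip().split("\\R");
        for (int i = 1; i < lines.length; i++) {
            if (!lines[i].isBlank()) {
                rows.add(lines[i].split(","));
            }
        }
        return rows;
    }

    /** Extracts one column of the parsed rows as doubles. */
    public static double[] column(List<String[]> rows, int index) {
        return rows.stream()
                .mapToDouble(r -> Double.parseDouble(r[index].trim()))
                .toArray();
    }

    public static void main(String[] args) {
        // Sample rows for illustration only.
        String csv = """
            sepal_length,petal_length
            5.1,1.4
            4.9,1.4
            """;
        List<String[]> rows = readCsv(csv);
        System.out.println(rows.size() + " rows, first sepal length " + column(rows, 0)[0]);
    }
}
```

Because the methods are static, a notebook cell can call Glue.readCsv(...) directly once the jar is on the classpath, with no object setup.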

Example: linear regression with Iris dataset

Python version – uses pandas/data frame, showed tables and plots, linear regression class, test vs training data
Java version – loaded dependencies via a Maven command, then DS.read() to get a data frame; DS was a three-line glue code method. Then showed tables/plots. Code was too long to show; mostly abstracted away
JFreeChart to show plot
DFLib for linear regression along with commons math
Created a linear regression class using commons math
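
For reference, a one-variable linear regression class can be sketched in a few lines. This is not the class from the talk: it swaps Commons Math for the closed-form least-squares formulas using only the JDK, and the class name and the tiny data set in main are made up for illustration.

```java
// A dependency-free sketch of simple (one-variable) least-squares regression.
// Assumption: this stands in for the Commons Math based class from the talk.
public final class SimpleLinearRegression {

    private final double slope;
    private final double intercept;

    public SimpleLinearRegression(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        double meanX = mean(x);
        double meanY = mean(y);
        double sxy = 0.0; // sum of (x - meanX) * (y - meanY)
        double sxx = 0.0; // sum of (x - meanX)^2
        for (int i = 0; i < x.length; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        if (sxx == 0.0) {
            throw new IllegalArgumentException("x values must not all be equal");
        }
        this.slope = sxy / sxx;
        this.intercept = meanY - slope * meanX;
    }

    private static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    public double slope() { return slope; }

    public double intercept() { return intercept; }

    public double predict(double x) { return intercept + slope * x; }

    public static void main(String[] args) {
        // Made-up points lying on y = 2x + 1.
        double[] x = {1, 2, 3, 4};
        double[] y = {3, 5, 7, 9};
        SimpleLinearRegression model = new SimpleLinearRegression(x, y);
        System.out.println("slope=" + model.slope() + " intercept=" + model.intercept());
    }
}
```

In practice the model would be fit on the training split and checked against the held-out test split, as in the Python version.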

More examples

Object detection
Visualize embeddings of Vectors
Code is very short
https://github.com/bsbodden/data-science-with-java

RAG with Spring AI and Redis

Redis – very fast vector database, also caching
Vectorize question to retrieve
Enhance question to augment
And then ask LLM for answer
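
The retrieve and augment steps above can be sketched without any framework. This is a toy under stated assumptions, not Spring AI or Redis code: the "embedding" is just a word count, the documents live in an in-memory list rather than a vector database, and the ToyRag class and its methods are hypothetical. The final step (sending the prompt to an LLM) is left out.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of retrieve + augment; all names are hypothetical.
public final class ToyRag {

    private ToyRag() {}

    /** Toy "embedding": word counts. A real system would call an embedding model. */
    public static Map<String, Integer> embed(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Cosine similarity between two sparse count vectors. */
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        double norms = norm(a) * norm(b);
        return norms == 0.0 ? 0.0 : dot / norms;
    }

    private static double norm(Map<String, Integer> v) {
        double sumOfSquares = 0.0;
        for (int count : v.values()) {
            sumOfSquares += (double) count * count;
        }
        return Math.sqrt(sumOfSquares);
    }

    /** Retrieve: vectorize the question and pick the most similar document. */
    public static String retrieve(String question, List<String> docs) {
        Map<String, Integer> q = embed(question);
        return docs.stream()
                .max(Comparator.comparingDouble((String d) -> cosine(q, embed(d))))
                .orElseThrow();
    }

    /** Augment: put the retrieved context in front of the question. */
    public static String augment(String question, String context) {
        return "Context: " + context + "\nQuestion: " + question;
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
                "Redis can store vectors and serve as a cache",
                "JFreeChart renders plots",
                "JShell evaluates Java snippets");
        String question = "Which tool renders plots?";
        String prompt = augment(question, retrieve(question, docs));
        System.out.println(prompt); // this prompt would then be sent to the LLM
    }
}
```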

My take

Good intro. It assumed you didn’t know anything about notebooks to start. I like that he showed the Iris example in both Python and Java. This was all new to me and great to see. I’m confused about who it is for, though. I thought the sales pitch of Python was that it is easier for data scientists to code in than Java. Maybe creating a DSL for the data scientists? At the end, he talked about reusing skill sets.

Down Home Country Coding With Scott Selikoff and Jeanne Boyarsky

Java/J2EE Software Development and Technology Discussion Blog
