[javaone 2025] Jump-Start Your Data Science Learning with Jupyter Notebooks and Java

Speaker Brian Sam-Bodden @bsbodden

See the table of contents for more posts


General

  • At school, everything was Python. Then saw that it had to be rewritten to get to production because it was one giant notebook.
  • Data science – multidisciplinary field that uses scientific methods, algorithms, processes and systems to extract knowledge and actionable insights from structured/unstructured data
  • ML – algorithms capable of learning patterns from data and making predictions
  • Deep learning – specialized ML methods based on multi-layer neural networks
  • Foundations – statistics, big data, ML, distributed computing, NLP, Gen AI
  • Present day – Applied AI – RAG evolves, agents are back

Jupyter Notebooks

  • Can run tests against notebooks from CI/CD
  • Spun off from the IPython project in 2014
  • Browser based notebook interface
  • Supports code, text, math, plots and other media
  • Originally supported Julia, Python and R
  • Code and markdown cells
  • Can replace some documentation with notebooks. Can add button to run in codelabs
  • Can only use one language per notebook; can’t mix and match
  • Important to run cells in order to avoid errors
  • Good for experimentation and discovery, executable docs; de facto communication medium in the data science/ML/AI community

Build data science stack with Java

  • Java is the dominant force in the enterprise
  • Rich data science ecosystem – DJL, Weka, Mahout, MALLET, Flink, H2O.ai, semantic-kernel, Spark, Smile, MLlib, Jenetics and many more libraries
  • DL4J is no longer maintained. DJL (Deep Java Library) is a different project
  • Python – pandas covers matrix apps, displaying data and data frames

Java stack

  • JupyterLab Docker Stacks image (Notebook is a single notebook; Lab is the interface for working with notebooks)
  • JJava Jupyter kernel – there are others; this is one of the most stable.
  • Curated set of Java libraries
  • Glue code to streamline API usage

JJava Jupyter Kernel

  • Well maintained
  • Fork of IJava Kernel
  • Uses JShell
  • Java 21
  • Can write code in a cell without any ceremony, e.g. just a println, or a full class followed by the code to run it [like how the Java Playground at dev.java deals with classes]

Glue code

  • Glue code can live in a jar. Don’t have to put all the code in the notebook.
  • Good to have the methods be static
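
As a rough illustration of the glue-code idea, here is a minimal sketch of a static helper that could be packaged in a jar and called from a notebook cell in one line. The class name `Glue` and its methods `readCsv` and `numericColumn` are hypothetical, not taken from the talk's repository, and the sketch uses only the JDK.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical glue class: the names are illustrative, not from the talk's repo.
final class Glue {
    private Glue() {} // static helpers only, so cells can call Glue.readCsv(...) directly

    // Parses simple comma-separated text (no quoting support) into rows keyed by header name.
    static List<Map<String, String>> readCsv(String csvText) {
        String[] lines = csvText.strip().split("\\R");
        String[] header = lines[0].split(",");
        List<Map<String, String>> rows = new ArrayList<>();
        for (int i = 1; i < lines.length; i++) {
            if (lines[i].isBlank()) {
                continue; // skip empty lines
            }
            String[] cells = lines[i].split(",");
            Map<String, String> row = new LinkedHashMap<>();
            for (int c = 0; c < header.length && c < cells.length; c++) {
                row.put(header[c].strip(), cells[c].strip());
            }
            rows.add(row);
        }
        return rows;
    }

    // Extracts one column as doubles, a typical step before plotting or fitting a model.
    static double[] numericColumn(List<Map<String, String>> rows, String column) {
        return rows.stream()
                .mapToDouble(row -> Double.parseDouble(row.get(column)))
                .toArray();
    }
}
```

With the class on the classpath, a cell would only need something like `var rows = Glue.readCsv(text);`, which keeps the notebook focused on the analysis rather than on parsing.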

Example: linear regression with Iris dataset

  • Python version – uses pandas/data frame, showed tables and plots, linear regression class, test vs training data
  • Java version – loads dependencies via a Maven command, then DS.read() returns the data frame (DS.read() was a three-line glue code method). Then showed tables/plots. Code too long; mostly abstracted away
  • JFreeChart to show plot
  • DFLib for linear regression along with commons math
  • Created a linear regression class using commons math
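
To make the regression step concrete, here is a minimal sketch of a single-feature linear regression class. It is a plain-JDK stand-in using ordinary least squares, not the Commons Math based class from the talk; the class name `SimpleLinearFit` is hypothetical and the data points are made up for illustration (they are not Iris measurements).

```java
// Minimal ordinary least squares for one feature: y ≈ intercept + slope * x.
// Plain JDK stand-in for illustration; the talk's version used Commons Math.
final class SimpleLinearFit {
    final double slope;
    final double intercept;

    private SimpleLinearFit(double slope, double intercept) {
        this.slope = slope;
        this.intercept = intercept;
    }

    // Fits the line that minimizes the squared error over the given points.
    static SimpleLinearFit fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need at least two (x, y) pairs of equal length");
        }
        int n = x.length;
        double meanX = 0;
        double meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0; // sum of (x - meanX) * (y - meanY)
        double sxx = 0; // sum of (x - meanX)^2
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            sxy += dx * (y[i] - meanY);
            sxx += dx * dx;
        }
        if (sxx == 0) {
            throw new IllegalArgumentException("all x values are identical");
        }
        double slope = sxy / sxx;
        return new SimpleLinearFit(slope, meanY - slope * meanX);
    }

    double predict(double x) {
        return intercept + slope * x;
    }

    public static void main(String[] args) {
        // Made-up training points, not taken from the Iris dataset.
        double[] x = {1, 2, 3, 4};
        double[] y = {3, 5, 7, 9};
        SimpleLinearFit model = fit(x, y);
        System.out.printf("slope=%.3f intercept=%.3f prediction for x=5: %.3f%n",
                model.slope, model.intercept, model.predict(5));
    }
}
```

In a notebook, the class definition would go in one cell and the `fit`/`predict` calls in the next, with the training and test split done on the data frame beforehand.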

More examples

  • Object detection
  • Visualize vector embeddings
  • Code is very short
  • https://github.com/bsbodden/data-science-with-java

RAG with Spring AI and Redis

  • Redis – very fast vector database, also caching
  • Vectorize question to retrieve
  • Enhance question to augment
  • And then ask LLM for answer
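
The retrieve/augment steps above can be sketched in plain Java. This is a toy under loud assumptions: it does not use Spring AI or Redis APIs, an in-memory list stands in for the vector database, and word counts stand in for a real embedding model. The class `ToyRag` and its methods are hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy retrieval-augmented generation flow; no Spring AI or Redis calls are used here.
final class ToyRag {
    private ToyRag() {}

    // Toy "embedding": lower-cased word counts. A real system would call an embedding model.
    static Map<String, Integer> embed(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Cosine similarity between two sparse word-count vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        double norm = Math.sqrt(sumSquares(a)) * Math.sqrt(sumSquares(b));
        return norm == 0 ? 0 : dot / norm;
    }

    private static double sumSquares(Map<String, Integer> v) {
        double sum = 0;
        for (int count : v.values()) {
            sum += count * count;
        }
        return sum;
    }

    // Vectorize the question and retrieve the most similar document (null if the list is empty).
    static String retrieve(String question, List<String> documents) {
        Map<String, Integer> questionVector = embed(question);
        String best = null;
        double bestScore = -1;
        for (String doc : documents) {
            double score = cosine(questionVector, embed(doc));
            if (score > bestScore) {
                bestScore = score;
                best = doc;
            }
        }
        return best;
    }

    // Augment the question with the retrieved context; the result is what would be sent to the LLM.
    static String augment(String question, String context) {
        return "Context:\n" + context + "\n\nQuestion: " + question;
    }
}
```

The final step, sending the augmented prompt to an LLM, is deliberately left out, since it depends on whichever model client is in use.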

My take

Good intro. It assumed you didn’t know anything about notebooks to start. I like that he showed the Iris example in both Python and Java. This was all new to me and great to see. I’m confused about who it is for. I thought the sales pitch of Python was that it is easier to code in for data scientists than for Java devs. Maybe the idea is creating a DSL for the data scientists? At the end, he mentioned reusing skillsets.
