[javaone 2025] Jump-Start Your Data Science Learning with Jupyter Notebooks and Java

Speaker Brian Sam-Bodden @bsbodden

See the table of contents for more posts


General

  • At school everything was Python. Then he saw it had to be rewritten to get to production because it was a giant notebook.
  • Data science – multi disciplinary field that uses scientific methods, algorithms, processes and systems to extract knowledge and actionable insights from structured/unstructured data
  • Data science – extracts insights/knowledge from data.
  • ML – algorithms capable of learning patterns from data and making predictions
  • Deep learning – specialized methods
  • Foundations – statistics, big data, ML, distributed computing, NLP, Gen AI
  • Present day – Applied AI – RAG evolves, agents are back

Jupyter Notebooks

  • Can run tests against notebooks from CI/CD
  • Spun off from the IPython project in 2014
  • Browser based notebook interface
  • Supports code, text, math, plots and other media
  • Originally supported Julia, Python and R
  • Code and markdown cells
  • Can replace some documentation with notebooks. Can add button to run in codelabs
  • Can only use one language per notebook; can’t mix and match
  • Important to run cells in order to avoid errors
  • Good for experimentation and discovery, executable docs, de facto communication medium in the data science/ML/AI community

Build data science stack with Java

  • Java is the dominant force in the enterprise
  • Rich data science ecosystem – DJL, Weka, Mahout, Mallet, Flink, H2O.ai, Semantic Kernel, Spark, Smile, MLlib, Jenetics and many more libraries
  • DL4J no longer maintained. DJL (Deep Java Library) is different
  • Python – pandas – matrix operations, displaying data, data frames

Java stack

  • Jupyter Lab Docker Stack Image (Jupyter Notebook is a single notebook; JupyterLab is an interface for working with multiple notebooks)
  • JJava Jupyter kernel – there are others; this is one of the most stable.
  • Curated set of Java libraries
  • Glue code to streamline API usage

JJava Jupyter Kernel

  • Well maintained
  • Fork of IJava Kernel
  • Uses JShell
  • Java 21
  • Can write code in a cell without any ceremony – ex: just a println, or a full class and then the code to run it [like how the Java Playground at dev.java deals with classes]

Glue code

  • Glue code can be in jar. Don’t have to put all code in notebook.
  • Good to have the methods be static
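The speaker's glue code wasn't shown in full; the sketch below is my own illustration of the idea, assuming a hypothetical `DS` class with a static `read()` method (the talk mentioned `DS.read()` was about three lines). Static methods mean a notebook cell can call `DS.read(...)` with no object setup:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical glue class: the names DS and read are illustrative,
// not the speaker's actual code (which used DFLib data frames).
public final class DS {
    // Static so a notebook cell can just call DS.read(path) with no ceremony
    public static List<String[]> read(Path csv) {
        try {
            return Files.readAllLines(csv).stream()
                    .map(line -> line.split(","))   // naive CSV split for illustration
                    .toList();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Packaging this in a jar and loading it via a Maven magic keeps the notebook cells short, which was the speaker's point.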

Example: linear regression with Iris dataset

  • Python version – uses pandas/data frame, showed tables and plots, linear regression class, test vs training data
  • Java version – load dependencies via maven command, DS.read() to get dataframe, DS was a three line glue code method. Then showed tables/plots. Code too long; mostly abstracted
  • JFreeChart to show plot
  • DFLib for linear regression along with Commons Math
  • Created a linear regression class using Commons Math
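The talk used Commons Math for the regression class; to keep this self-contained, here is a dependency-free sketch of the same idea (ordinary least squares for one feature), with made-up data points rather than the actual Iris columns:

```java
public class SimpleLinearRegression {
    // Ordinary least squares fit for y = slope * x + intercept
    final double slope, intercept;

    SimpleLinearRegression(double[] x, double[] y) {
        double meanX = 0, meanY = 0;
        for (int i = 0; i < x.length; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= x.length;
        meanY /= y.length;
        double num = 0, den = 0;
        for (int i = 0; i < x.length; i++) {
            num += (x[i] - meanX) * (y[i] - meanY);
            den += (x[i] - meanX) * (x[i] - meanX);
        }
        slope = num / den;
        intercept = meanY - slope * meanX;
    }

    double predict(double x) { return slope * x + intercept; }

    public static void main(String[] args) {
        // Illustrative points on the line y = 2x, not real Iris data
        var model = new SimpleLinearRegression(
                new double[]{1, 2, 3, 4}, new double[]{2, 4, 6, 8});
        System.out.println(model.predict(5)); // 10.0
    }
}
```

With Commons Math the same thing is a `SimpleRegression` with `addData()` and `predict()` calls.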

More examples

  • Object detection
  • Visualize embeddings of Vectors
  • Code is very short
  • https://github.com/bsbodden/data-science-with-java

RAG with Spring AI and Redis

  • Redis – very fast vector database, also caching
  • Vectorize question to retrieve
  • Enhance question to augment
  • And then ask LLM for answer
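The retrieve/augment/generate steps above can be sketched in plain Java. All interface names below are my own placeholders, not Spring AI's actual API (which has its own `VectorStore` and `ChatClient` abstractions):

```java
import java.util.List;

// Hypothetical sketch of the RAG loop: vectorize -> retrieve -> augment -> generate.
// Interface names are illustrative stand-ins, not the Spring AI types.
public class RagSketch {
    interface Embedder { float[] embed(String text); }
    interface VectorStore { List<String> similar(float[] queryVector, int k); }
    interface Llm { String complete(String prompt); }

    static String answer(String question, Embedder embedder, VectorStore store, Llm llm) {
        // 1. Vectorize the question
        float[] vector = embedder.embed(question);
        // 2. Retrieve the most similar documents from the vector database (Redis in the talk)
        List<String> context = store.similar(vector, 3);
        // 3. Augment the question with the retrieved context
        String prompt = "Context:\n" + String.join("\n", context)
                + "\n\nQuestion: " + question;
        // 4. Ask the LLM for the answer
        return llm.complete(prompt);
    }
}
```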

My take

Good intro. It assumed you didn't know anything about notebooks to start. I like that he showed the Iris example in both Python and Java. This was all new to me and great to see. I'm confused about who it is for, though. I thought the sales pitch of Python was that it's easier for data scientists to code in than Java. Maybe creating a DSL for the data scientists? He did say at the end it's about reusing skillsets.

[javaone 2025] how netflix uses java 2025 edition

Speaker: Paul Bakker

See the table of contents for more posts


He started out by showing the social media reaction to his few minutes in yesterday’s keynote. It included “how much do you pay Oracle”, to which he said 0 (they use Azul, but OpenJDK also exists). And my favorite: “Java is heavyweight; you should use Kotlin”, which is entertaining because it is literally the same runtime.

For streaming

  • High RPS (requests per second)
  • Multi region – 4 regions. Expensive/slow (milliseconds) to communicate across region, but needs to be near customers.
  • Large fanout to backend services
  • Retry on failure, aggressive timeouts
  • Non relational data store
  • GraphQL query to API gateway, federated so it can get to multiple data sources. DGS (domain graph service)
  • Spring boot
  • Kafka
  • gRPC
  • EVCache
  • Stream processing – ex: Spark
  • Also have Go and Python, but mostly Java

Enterprise/studio apps (ex: managing movie production)

  • Low RPS
  • Single region
  • Relational data store
  • Failure not acceptable
  • UI and backend
  • Similar – GraphQL, Federated Gateway, DGS, spring boot
  • Database could be postgres

General

  • Were on Java 8 until recently
  • Relied on old libraries and old in house framework which were incompatible with modern Java
  • Java 11 wasn’t enough incentive to upgrade
  • Went to 17 as a big migration
  • Migrated all services to Spring Boot – 3000 apps
  • Patched unmaintained libraries for JDK compatibility – “might look hard; it’s not”

Garbage Collection

  • G1 is better on Java 17 than Java 8
  • About 20% less CPU on garbage collection
  • Switched to Generational ZGC in Java 21. More predictable. Pause times are effectively gone
  • Important to have generational garbage collector so doesn’t have to go thru whole heap each collection
  • Error rates also dropped due to not having GC related timeouts
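The switch described above is a JVM flag change. On Java 21, Generational ZGC is opt-in (it became the ZGC default in Java 23); `app.jar` here is a placeholder:

```shell
java -XX:+UseZGC -XX:+ZGenerational -jar app.jar
```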

Virtual Threads

  • Added virtual thread support to internal frameworks
  • Virtual threads and structured concurrency will replace reactive
  • Through Java 23 – mixing synchronized and reentrant locks could lead to deadlocks due to thread pinning. A virtual thread waiting on a lock inside synchronized is pinned; if no more platform threads are available, the result is deadlock
  • Had to back off virtual threads some because of that. Fixed in Java 24
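Before the Java 24 fix, a common workaround for the pinning issue was to prefer `ReentrantLock` over `synchronized` in code run on virtual threads. A minimal sketch (my own, assumes Java 21+):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

public class VirtualThreadsDemo {
    // Run n tasks on virtual threads, guarding shared state with ReentrantLock,
    // which (unlike synchronized before Java 24) does not pin the carrier thread.
    static int runTasks(int n) {
        var counter = new AtomicInteger();
        var lock = new ReentrantLock();
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < n; i++) {
                executor.submit(() -> {
                    lock.lock();
                    try {
                        counter.incrementAndGet();
                    } finally {
                        lock.unlock();
                    }
                });
            }
        } // try-with-resources close() waits for all submitted tasks
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(runTasks(1_000)); // 1000
    }
}
```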

Spring Boot

  • Added Netflix modules to open source spring boot
  • Looks like regular Spring Boot to developers
  • Upgrade Netflix Spring boot to OSS minor releases in days
  • Added: security, gRPC, IPC clients, etc
  • Use WebMVC
  • Not using Webflux since not using reactive
  • Spring 3 – went to Java 17, jakarta packages. Need to upgrade libraries at same time. Used bytecode rewrite in Gradle to change package names during migration

GraphQL vs gRPC

  • GraphQL – flexible schema to query data, think in data rather than methods
  • gRPC- highly performant for server to server communication. Think in methods rather than data
  • REST – easier than GraphQL but doesn’t recommend for UI. Often returns more data than UI needs

Deployment

  • Either AWS or Titus (in house k8s)
  • Exploded JAR with embedded Tomcat
  • Not using native images – not working well enough yet. Hard to get right, and the development experience is worse – build time is longer, and you don’t want to build a native image for development
  • Experimenting with AOT and Leyden

My take

Great case study!

[javaone 2025] stream gatherers: the architect’s cut

Speaker: Viktor Klang

See the table of contents for more posts


Oracle has sample code so I didn’t take notes on all the code

General

  • Reviewed source, intermediate operations, terminal operations vocabulary
  • Imagine if could have any intermediate Stream operations; can grow API
  • Features needed (collectors don’t meet all needs) – consume/produce ratios, finite/infinite, stateful/stateless, frugal/greedy, sequential/parallelizable, whether to react to end of stream
  • Stream gatherers preview in Java 22/23. Released in Java 24

New interface

  • Gatherer<T, A, R> – T is the input element type, A the private state type, and R is what goes to the next step
  • Supplier<A> initializer()
  • Integrator<A, T, R> integrator() – its single abstract method is boolean integrate(A state, T element, Downstream<R> downstream); Downstream<R> in turn has a single abstract method, boolean push(R element)
  • BinaryOperator<A> combiner()
  • BiConsumer<A, Downstream<R>> finisher()

Basic Examples

  • Showed code to implement map()
  • Gatherer.of() to create
  • Call as .gather(map(i -> i + 1))
  • Other examples: mapMulti(), limit()
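Oracle has the official sample code; this is my own reconstruction of the map() example, assuming Java 24+ (names and structure from memory of the talk, not the actual slides):

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Gatherer;
import java.util.stream.Stream;

public class MapGatherer {
    // map() re-implemented as a stateless Gatherer:
    // the integrator just pushes the transformed element downstream
    static <T, R> Gatherer<T, ?, R> map(Function<? super T, ? extends R> mapper) {
        return Gatherer.of((state, element, downstream) ->
                downstream.push(mapper.apply(element)));
    }

    public static void main(String[] args) {
        List<Integer> out = Stream.of(1, 2, 3)
                .gather(map(i -> i + 1))
                .toList();
        System.out.println(out); // [2, 3, 4]
    }
}
```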

Named Gatherers

  • Progression – start as inline code and then refactor to be own class for reuse.

Parallel vs Sequential

  • For sequential, start with evaluate() and call in a loop while source.hasNext() and integrator.integrate() returns true
  • For parallel, recursively split the upstream until the chunks are small. (Split/fork into distinct parts)
  • For takeWhile(), need to deal with short circuiting/infinite streams. Can cancel() or propagate a short circuit signal

Other built in Gatherers

  • scan() – kind of like an incremental add/accumulator
  • windowFixed() – get immutable lists of a certain size, optionally keeping the last (partial) window
  • mapConcurrent() – specify maximum concurrency level
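A quick sketch of the first two built-in gatherers from `java.util.stream.Gatherers` (my own example data, Java 24+):

```java
import java.util.List;
import java.util.stream.Gatherers;
import java.util.stream.Stream;

public class BuiltInGatherers {
    public static void main(String[] args) {
        // scan: running accumulation – emits 1, 1+2, 1+2+3, ...
        List<Integer> running = Stream.of(1, 2, 3, 4)
                .gather(Gatherers.scan(() -> 0, Integer::sum))
                .toList();
        System.out.println(running); // [1, 3, 6, 10]

        // windowFixed: immutable windows of a given size; last window may be smaller
        List<List<Integer>> windows = Stream.of(1, 2, 3, 4, 5)
                .gather(Gatherers.windowFixed(2))
                .toList();
        System.out.println(windows); // [[1, 2], [3, 4], [5]]
    }
}
```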

Other notes

  • Can compose
  • Stream pipeline – spliterator + Gatherer? + Collector

My take

This is the first time I’ve seen a presentation on this topic. It was great hearing the explanation and seeing a bunch of examples. The font for the code was a little smaller than I’d like, but I was able to make it out. Only a bit blurry. Most made sense. A few parts I’m going to need to absorb. He did say “it’s a bit tricky” so I don’t feel bad it wasn’t immediately obvious! The diagrams for parallel were helpful.