Speaker Brian Sam-Bodden @bsbodden
See the table of contents for more posts
General
- At school everything was Python. Then saw that it had to be rewritten to get to production because it was one giant notebook.
- Data science – multidisciplinary field that uses scientific methods, algorithms, processes and systems to extract knowledge and actionable insights from structured/unstructured data
- Data science – extracts insights/knowledge from data.
- ML – algorithms capable of learning patterns from data and making predictions
- Deep learning – specialized methods
- Foundations – statistics, big data, ML, distributed computing, NLP, Gen AI
- Present day – Applied AI – RAG evolves, agents are back
Jupyter Notebooks
- Can run tests against notebooks from CI/CD
- Spun off from the IPython project in 2014
- Browser based notebook interface
- Supports code, text, math, plots and other media
- Originally supported Julia, Python and R
- Code and markdown cells
- Can replace some documentation with notebooks. Can add button to run in codelabs
- Can only use one language per notebook; can’t mix and match
- Important to run cells in order to avoid errors
- Good for experimentation and discovery, executable docs; the de facto communication medium in the data science/ML/AI community
Build data science stack with Java
- Java is the dominant force in the enterprise
- Rich data science ecosystem – DJL, Weka, Mahout, MALLET, Flink, H2O.ai, semantic-kernel, Spark, Smile, MLlib, Jenetics and many more libraries
- DL4J no longer maintained. DJL (Deep Java Library) is different
- Python – pandas – matrix apps, display data, data frame
Java stack
- JupyterLab Docker Stack image (Notebook is a single notebook; Lab is the interface for working with notebooks)
- JJava Jupyter kernel – there are others; this is one of the most stable.
- Curated set of Java libraries
- Glue code to streamline API usage
JJava Jupyter Kernel
- Well maintained
- Fork of IJava Kernel
- Uses JShell
- Java 21
- Can write code in a cell without any ceremony – ex: just a println, or a full class followed by the code to run it [like how the Java playground at dev.java deals with classes]
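To make the "no ceremony" point concrete, here is a small sketch. The `Temperature` class and its method are made up for illustration and are not from the talk. In a JShell-backed cell, the class declaration could go in a cell as-is and the two statements inside `main` could be typed on their own; the `CellDemo` wrapper is only there so the same code also compiles as a regular Java file.

```java
public class CellDemo {
    // A small class of the kind you might declare in one notebook cell (hypothetical example).
    static class Temperature {
        private final double celsius;

        Temperature(double celsius) {
            this.celsius = celsius;
        }

        double toFahrenheit() {
            return celsius * 9.0 / 5.0 + 32.0;
        }
    }

    public static void main(String[] args) {
        // In a notebook cell these statements could stand alone, without a class or main method.
        System.out.println("Hello from a cell");
        System.out.println(new Temperature(100).toFahrenheit());
    }
}
```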
Glue code
- Glue code can live in a jar. Don’t have to put all the code in the notebook.
- Good to make the glue methods static so they can be called directly from a cell
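A minimal sketch of what static glue code might look like, assuming a helper that turns simple CSV text into rows. The `Glue` class and `parseCsv` method are hypothetical names invented here; the talk's own glue code may look quite different. The point is only that a static method packaged in a jar can be called from a cell in one line.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public final class Glue {
    private Glue() {
        // Static helpers only; no instances needed.
    }

    // Parses simple comma-separated text whose first line is a header.
    // Assumption: no quoted fields or escaped commas.
    public static List<Map<String, String>> parseCsv(String text) {
        List<Map<String, String>> rows = new ArrayList<>();
        String[] lines = text.trim().split("\\R");
        String[] header = lines[0].split(",");
        for (int i = 1; i < lines.length; i++) {
            String[] cells = lines[i].split(",");
            Map<String, String> row = new LinkedHashMap<>();
            for (int c = 0; c < header.length && c < cells.length; c++) {
                row.put(header[c].trim(), cells[c].trim());
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample data in the spirit of the Iris dataset.
        String csv = "sepal_length,species\n5.1,setosa\n7.0,versicolor\n";
        List<Map<String, String>> rows = parseCsv(csv);
        System.out.println(rows.size() + " rows, first species: " + rows.get(0).get("species"));
    }
}
```

From a notebook cell, the call would then be a single line such as `Glue.parseCsv(csv)`, with the parsing details kept out of the notebook.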
Example: linear regression with Iris dataset
- Python version – uses pandas/data frame, showed tables and plots, linear regression class, test vs training data
- Java version – loaded dependencies via a Maven command, then DS.read() to get the data frame; DS.read() was a three-line glue code method. Then showed tables/plots. Code too long; mostly abstracted
- JFreeChart to show plot
- DFLib for linear regression along with commons math
- Created a linear regression class using commons math
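For readers who want to see the underlying idea, here is a sketch of simple (one-variable) linear regression by ordinary least squares. It is written in plain Java rather than with Commons Math so that it runs without dependencies, so it is not the class shown in the talk. The class name and the sample numbers are assumptions made up for illustration.

```java
public class SimpleLinearRegression {
    private final double slope;
    private final double intercept;

    // Fits y = intercept + slope * x by ordinary least squares.
    public SimpleLinearRegression(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("x and y need the same length, at least 2");
        }
        double meanX = mean(x);
        double meanY = mean(y);
        double sxy = 0.0; // sum of (x - meanX) * (y - meanY)
        double sxx = 0.0; // sum of (x - meanX)^2
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            sxy += dx * (y[i] - meanY);
            sxx += dx * dx;
        }
        if (sxx == 0.0) {
            throw new IllegalArgumentException("all x values are identical");
        }
        this.slope = sxy / sxx;
        this.intercept = meanY - this.slope * meanX;
    }

    private static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    public double slope() {
        return slope;
    }

    public double intercept() {
        return intercept;
    }

    public double predict(double x) {
        return intercept + slope * x;
    }

    public static void main(String[] args) {
        // Made-up training data that lies on the line y = 2x + 1.
        double[] x = {1, 2, 3, 4};
        double[] y = {3, 5, 7, 9};
        SimpleLinearRegression model = new SimpleLinearRegression(x, y);
        System.out.println("slope=" + model.slope() + " intercept=" + model.intercept());
        System.out.println("prediction for x=5: " + model.predict(5));
    }
}
```

A train/test split like the one in the Python version would fit the model on one part of the data and call `predict` on the held-out part.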
More examples
- Object detection
- Visualize embeddings of Vectors
- Code is very short
- https://github.com/bsbodden/data-science-with-java
RAG with Spring AI and Redis
- Redis – very fast vector database, also caching
- Vectorize question to retrieve
- Enhance question to augment
- And then ask LLM for answer
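The retrieve/augment/ask steps above can be sketched in plain Java. This toy version does not use Spring AI or Redis: word counts stand in for a real embedding model, an in-memory list stands in for the vector database, and the final LLM call is left out, so the program only prints the augmented prompt. All class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ToyRag {
    // Toy "embedding": lower-cased word counts. A real setup would call an embedding model.
    static Map<String, Integer> embed(String text) {
        Map<String, Integer> vector = new HashMap<>();
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                vector.merge(word, 1, Integer::sum);
            }
        }
        return vector;
    }

    static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int count : v.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity between two sparse word-count vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    // Step 1: vectorize the question and retrieve the most similar document.
    static String retrieve(String question, List<String> docs) {
        Map<String, Integer> q = embed(question);
        String best = null;
        double bestScore = -1.0;
        for (String doc : docs) {
            double score = cosine(q, embed(doc));
            if (score > bestScore) {
                bestScore = score;
                best = doc;
            }
        }
        return best;
    }

    // Step 2: augment the question with the retrieved context.
    static String augment(String question, String context) {
        return "Context: " + context + "\nQuestion: " + question;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "Redis can store vectors and run similarity search",
                "JFreeChart draws plots in Java");
        String question = "How does Redis search vectors?";
        String prompt = augment(question, retrieve(question, docs));
        // Step 3 would send this prompt to an LLM; here it is only printed.
        System.out.println(prompt);
    }
}
```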
My take
Good intro. He assumed the audience didn’t know anything about notebooks to start. I like that he showed the Iris example in both Python and Java. This was all new to me and great to see. I’m confused about who it is for, though. I thought the sales pitch of Python was that it is easier to code in for data scientists, as opposed to Java devs. Maybe the idea is creating a DSL for the data scientists? At the end he said something about reusing skillsets.