[2023 kcdc] data leakage – why your ML model knows too much

Speaker: Leah Berg

For more, see the table of contents.


Notes

Data Leakage

  • Also known as leakage or target leakage
  • Different meaning for information security (data leaking to outside organization)
  • Can be difficult to spot
  • Training data includes information about the test data.
  • Or the model is trained on information that is not available in production.

How models learn

  • Split data into training data and test data.
  • Test data – data the model has never seen before; used to make sure the model gets it right
  • Can also have an optional validation set
  • Randomly pick whether each data point goes into training or test data. This is called a random train/test split.
  • More training data than test data
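
A random train/test split can be sketched in plain Python as below. The helper name `random_train_test_split`, the 80/20 ratio, and the seed are assumptions for illustration, not something shown in the talk.

```python
import random


def random_train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    # A seeded generator makes the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))  # hypothetical data points
train, test = random_train_test_split(rows)
print(len(train), len(test))
```

The rest of these notes are mostly about the cases where this simple split is the wrong choice.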

Don’t include data from the future

  • Using a random split on time series data doesn’t work because the model learns about future data.
  • Better to use a sliding window. Use the first few months to predict the next month. Then add that next value and predict the one after. And keep going. Adding up the errors across predictions gives you the accuracy of the model.
  • This works because the model only knows about data from before the point it is asked to predict.
  • Create a timeline for when events happen. That way you make sure you aren’t using data from after the prediction.
  • Don’t always know where/when data was created. Important to understand business process
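
The "predict the next month, then add it and predict again" idea can be sketched as below. The function names, the sample numbers, and the naive mean forecast standing in for a real model are all assumptions made for illustration.

```python
def walk_forward_errors(series, initial_window, predict):
    """Predict each point using only the values that came before it."""
    errors = []
    for t in range(initial_window, len(series)):
        history = series[:t]          # nothing from time t or later
        forecast = predict(history)
        errors.append(abs(series[t] - forecast))
    return errors


def mean_forecast(history):
    """Stand-in model: forecast the average of everything seen so far."""
    return sum(history) / len(history)


monthly_sales = [10, 12, 11, 13, 14, 15]  # hypothetical values
errors = walk_forward_errors(monthly_sales, initial_window=3, predict=mean_forecast)
mean_abs_error = sum(errors) / len(errors)
print(errors, mean_abs_error)
```

Here the history grows by one month per step. Slicing `series[t - initial_window:t]` instead would give a fixed-size window.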

Don’t randomly split groups

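  • With a random split, the training set can have some data from the same group (ex: the same student) you are then predicting in the test set.
  • The test results look good, but it becomes a problem when a new student shows up because the prediction will be bad.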
  • scikit-learn has GroupShuffleSplit() to get full group in same set – testing or training
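
A minimal sketch of `GroupShuffleSplit`, assuming scikit-learn is installed. The student names, features, and labels are made up for illustration.

```python
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: two rows per student.
student_ids = ["ann", "ann", "bob", "bob", "cat", "cat", "dan", "dan"]
features = [[i] for i in range(len(student_ids))]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(features, labels, groups=student_ids))

train_students = {student_ids[i] for i in train_idx}
test_students = {student_ids[i] for i in test_idx}

# No student appears on both sides of the split.
print(train_students, test_students)
```

Because the split is made on groups rather than rows, the test score reflects how the model does on students it has never seen.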

Don’t forget your data is a snapshot

  • In school, have pristine data set.
  • In real world, data is always changing.
  • A snapshot could tell the model about data that occurred after the prediction. Again, think about the data on a timeline.

Don’t randomly split data when retraining

  • Want to use the same training/test data on the production and challenger models to see which is better.
  • With a new random split, one model may have already seen, during training, the data points you are now testing on, so you don’t know if it is really better.
  • The challenger model can get more data that wasn’t available originally. Ok to split the new data into test/train as long as the original data is split the same way as before.
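
One way to keep the original split stable while new data arrives is to assign each record to a set by hashing a stable ID. This is a swapped-in technique for illustration, not necessarily what the talk showed; the `assign_split` helper and the `cust-N` IDs are hypothetical.

```python
import hashlib


def assign_split(record_id, test_percent=20):
    """Deterministically map a record ID to 'train' or 'test'."""
    digest = hashlib.sha256(str(record_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99
    return "test" if bucket < test_percent else "train"


original_ids = ["cust-1", "cust-2", "cust-3"]
later_ids = original_ids + ["cust-4", "cust-5"]  # new data arrives

before = {rid: assign_split(rid) for rid in original_ids}
after = {rid: assign_split(rid) for rid in later_ids}

# Original records keep their assignment when the data set grows.
print(all(after[rid] == before[rid] for rid in original_ids))
```

The assignment depends only on the ID, so the production model's test records never become training records for the challenger.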

Split data immediately

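  • Risky to rescale before the split because the min/max then come from the full data set, including the test data. The min/max can be different if computed after the split.
  • Run normalization after the split, on the separate sets of data, rather than on the combined data.
  • Before the split, do analysis with the business and exploratory data analysis. Split the data before you start modeling.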
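
A small sketch of why the order matters, using min/max scaling. The helper names and numbers are assumptions for illustration; it takes the common approach of fitting the scaling on the training set and reusing it on the test set.

```python
def fit_min_max(values):
    """Learn the scaling parameters from one set of values."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Rescale values using previously learned parameters."""
    span = (hi - lo) or 1.0  # avoid dividing by zero on constant data
    return [(v - lo) / span for v in values]


train = [10.0, 20.0, 30.0]  # hypothetical feature values
test = [40.0]

# Split first: parameters come from the training data only.
lo, hi = fit_min_max(train)
train_scaled = apply_min_max(train, lo, hi)
test_scaled = apply_min_max(test, lo, hi)

# Leaky version: the test maximum changes how training data is scaled.
leaky_lo, leaky_hi = fit_min_max(train + test)
leaky_train_scaled = apply_min_max(train, leaky_lo, leaky_hi)

print(train_scaled, test_scaled, leaky_train_scaled)
```

In the leaky version the training values no longer reach 1.0, which quietly tells the model that a larger value exists in the test set.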

Use Cross Validation

  • K-fold cross validation – split the training data into K parts
  • ex: 3 fold validation – two parts stay as training and one is validation. The test data remains as test data and is kept separate for final evaluation.
  • The validation set is for an initial test.
  • Gives more options to train model
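
The fold rotation can be sketched in plain Python as below. The `k_fold_indices` helper is hypothetical; scikit-learn ships its own `KFold` class for real use.

```python
def k_fold_indices(n_samples, k):
    """Yield (training, validation) index lists for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size


# 3-fold example over six training rows; the test set is not involved.
folds = list(k_fold_indices(6, 3))
for training, validation in folds:
    print(training, validation)
```

Each row is used for validation exactly once, and the separate test set stays untouched until the final evaluation.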

Be Skeptical of High Performance

  • If validation performance is much higher than train/test, be suspicious.
  • If train/test/validation performance is all very high or all the same, be suspicious.

Use scikit-learn pipeline

  • Helps avoid leaking test data into training data
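
A minimal sketch of a scikit-learn pipeline, assuming scikit-learn is installed. The synthetic data set, the scaler, and the classifier are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data stands in for a real data set.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scaling and model travel together as one estimator.
model = make_pipeline(MinMaxScaler(), LogisticRegression())

# During cross validation the scaler is refit on each fold's training part.
cv_scores = cross_val_score(model, X_train, y_train, cv=3)

model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)
print(cv_scores, test_score)
```

Because the scaler is fit inside the pipeline, it never sees the validation fold or the test set when learning its min/max.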

Check for features correlated with target

  • If another attribute has a high match with what you are looking for, make sure you are not mixing up correlation and causation.
  • Also, avoid timeline errors from reverse causation. Ex: the thing you are looking for causes something else, which then shows up as a feature.
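
A quick screen for features that match the target suspiciously well might look like the sketch below. The column names, values, and the 0.95 threshold are assumptions for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def suspicious_features(columns, target, threshold=0.95):
    """Names of features whose absolute correlation with the target is very high."""
    return [
        name
        for name, values in columns.items()
        if abs(pearson(values, target)) >= threshold
    ]


# Hypothetical churn data: 1 means the customer churned.
target = [0, 1, 0, 1, 1, 0]
columns = {
    "monthly_spend": [50, 20, 45, 30, 35, 40],
    # Recorded as a result of churning, so it would not exist at prediction time.
    "cancellation_calls": [0, 1, 0, 1, 1, 0],
}
flagged = suspicious_features(columns, target)
print(flagged)
```

A flagged feature is not automatically leakage. It is a prompt to place that feature on the timeline and check whether it exists before the prediction is made.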

My take

Great talk. Almost all of this was new to me. It was understandable and I learned a lot.

[2019 oracle code one] Machine Learning

Machine Learning for Java Developers in 45 Minutes

Speakers: Zoran Sevarac & Frank Greco – @zsevarac & @frankgreco

For more blog posts, see The Oracle Code One table of contents


General

  • “AI is the new electricity” – Andrew Ng (societies with AI were above those without)
  • For many tasks, algorithms are well known
  • Other algorithms harder – image recognition. Rule based. Constantly add rules. Large number of rules. Complex.
  • When complexity goes up, bells should go off. Avoid complexity.
  • When complexity index is too big, it isn’t scalable. Breeding ground for bugs.
  • Not all use cases are good for ML
  • Core of ML – recognizing patterns in data and making predictions against the data
  • Learn language by understanding all the rules (algorithm) or observing patterns (ML)

Terms

  • AI – type of algorithm where machine emulates aspects of human behavior
  • ML – subset of AI. Allows machine to learn from experience/data
  • Deep learning – subset of ML. Uses powerful computing and advanced neural networks

Deep learning

  • Accuracy grows with more data.
  • Older learning algorithms get outperformed after a certain amount of data.
  • Think of deep learning as a graph. Each node performs computation. Computation can be reconfigured by tweaking coefficients on edges
  • Layer – groups of nodes
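
The graph picture can be sketched as a tiny forward pass. The sigmoid activation, the bias terms, and all the numbers are assumptions for illustration; the talk's own samples used Java APIs.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, weights, biases):
    """Compute one layer. weights[j][i] is the coefficient on the edge
    from input i to node j."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
        for node_weights, bias in zip(weights, biases)
    ]


# Two inputs -> hidden layer of two nodes -> one output node.
hidden_weights = [[0.5, -0.5], [1.0, 1.0]]
hidden_biases = [0.0, 0.0]
output_weights = [[1.0, -1.0]]
output_biases = [0.0]

hidden = layer_forward([1.0, 1.0], hidden_weights, hidden_biases)
output = layer_forward(hidden, output_weights, output_biases)
print(hidden, output)
```

Changing any number in `hidden_weights` or `output_weights` changes what the network computes, which is the "tweaking coefficients on edges" part.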

Examples

  • Image recognition
  • Spam classification
  • Data classification
  • Identifying handwritten characters/image transformation

Data

  • Training data
  • Try to minimize the differences (the error) as training goes through the data
  • Once the error goes below a certain threshold, training stops
  • Determine whether false positives or false negatives are worse for your use case
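
The "minimize the differences, stop below a threshold" loop can be sketched with a one-weight model. Gradient descent, the learning rate, the threshold, and the data are assumptions chosen for illustration.

```python
def train(xs, ys, learning_rate=0.05, error_threshold=1e-6, max_epochs=10_000):
    """Fit y = w * x by gradient descent; stop once the error is small enough."""
    w = 0.0
    for epoch in range(1, max_epochs + 1):
        errors = [w * x - y for x, y in zip(xs, ys)]
        mse = sum(e * e for e in errors) / len(xs)
        if mse < error_threshold:
            break  # error is below the threshold, training stops
        gradient = 2 * sum(e * x for e, x in zip(errors, xs)) / len(xs)
        w -= learning_rate * gradient
    return w, mse, epoch


# Hypothetical training data following y = 2x.
w, mse, epochs = train([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(w, mse, epochs)
```

The learned weight ends up close to 2, and `max_epochs` is a safety net in case the error never drops below the threshold.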

JSR381 – Visual Recognition API

  • Standard API for computer vision tasks using machine learning
  • Provides generic ML API design to support other libraries
  • Next phase is to figure out how to get wider support/adoption
  • Brings ML closer to general Java dev audience
  • App programmers need to know this. Don’t need to become a data scientist to use.

Why it matters

  • Patterns
  • Can change data structures
  • The Case for Learned Index Structures – https://arxiv.org/abs/1712.01208
  • New hardware for API
  • What happens to countries that host call centers and their economy?

Issues

  • Need clean data
  • Privacy and ethics
  • Correlation vs causality
  • Data hacking/poisoning
  • DeepFakes – can create people that don’t exist
  • Interpretability
  • AI/ML talent is scarce

My take

This was a great way to get started. There were a bunch of code samples as well using Java APIs.