[2023 kcdc] data leakage – why your ML model knows too much

Speaker: Leah Berg

For more, see the table of contents.


Notes

Data Leakage

  • Also known as leakage or target leakage
  • Means something different in information security (data leaking outside the organization)
  • Can be difficult to spot
  • Training data includes information about the test data.
  • Model is trained on information that is not available in production.

How models learn

  • Split data into training data and test data.
  • Test data – data the model has never seen before; used to check that the model gets it right
  • Can also have an optional validation set
  • Randomly pick whether each data point is training or test data. This is called a random train/test split
  • More training data than test data

Don’t include data from the future

  • A random split of time series data doesn’t work because the model learns from future data.
  • Better to use a sliding window: use the first few months to predict the next month, then add that month’s actual value and predict the one after, and keep going. Adding up the errors tells you how accurate the model is.
  • This works because the model only knows about data from before the point it is asked to predict.
  • Create a timeline of when events happen. That way you can make sure you aren’t using data from after the prediction.
  • You don’t always know where or when data was created, so it is important to understand the business process.
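
The sliding-window idea above can be sketched in a few lines. This is only an illustration, not code from the talk: the monthly values are made up, and the "model" is a stand-in that predicts the mean of everything seen so far.

```python
from statistics import mean

def walk_forward_errors(series, initial_window):
    """Predict each point using only the points that came before it."""
    errors = []
    for i in range(initial_window, len(series)):
        history = series[:i]          # only data from before the prediction point
        prediction = mean(history)    # stand-in model (assumption): mean of history
        errors.append(abs(series[i] - prediction))
    return errors

# Hypothetical monthly values, oldest first.
monthly_sales = [10, 12, 11, 13, 15, 14]
errors = walk_forward_errors(monthly_sales, initial_window=3)
total_error = sum(errors)
```

Each prediction sees only earlier months, so the summed error reflects what the model could have known at the time.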

Don’t randomly split groups

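  • With a random split, the training data can contain some data from the same group (ex: the same student) you are then predicting for.
  • This is a problem when a new student shows up, because the prediction will be bad.
  • scikit-learn has GroupShuffleSplit() to keep each full group in the same set, either training or test.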
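
A minimal sketch of GroupShuffleSplit() keeping groups together. The arrays and student IDs are hypothetical, and the split sizes are my own choice rather than anything from the talk.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: 8 rows belonging to 4 students (2 rows each).
X = np.arange(16).reshape(8, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
student_ids = np.array(["a", "a", "b", "b", "c", "c", "d", "d"])

# test_size is applied to groups, so one of the four students goes to test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=student_ids))

train_students = set(student_ids[train_idx])
test_students = set(student_ids[test_idx])
```

No student appears in both sets, so the test score reflects how the model does on students it has never seen.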

Don’t forget your data is a snapshot

  • In school, you have a pristine data set.
  • In real world, data is always changing.
  • You could accidentally tell the model about data that was recorded after the prediction. Again, think about data on a timeline.

Don’t randomly split data when retraining

  • You want to use the same training/test split for the production and challenger models to see which is better.
  • Otherwise, one model may have already seen, during training, data points you are testing on, so you can’t tell whether it is really better.
  • The challenger model can get additional data that wasn’t available originally. It is OK to split the new data into train/test as long as the original data is split the same way as before.
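
One possible way to keep the original split stable when retraining is to assign each record to a set by hashing a stable ID. This is my own assumption about how to do it, not a technique named in the talk; the record IDs and the 20% test share are hypothetical.

```python
import hashlib

def assign_split(record_id, test_percent=20):
    """Deterministically map a record ID to 'train' or 'test'."""
    digest = hashlib.sha256(str(record_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100   # stable bucket in 0..99
    return "test" if bucket < test_percent else "train"

# Hypothetical IDs: the original data plus records that arrived later.
original_ids = [f"order-{n}" for n in range(1000)]
new_ids = [f"order-{n}" for n in range(1000, 1200)]

split_for_production = {rid: assign_split(rid) for rid in original_ids}
split_for_challenger = {rid: assign_split(rid) for rid in original_ids + new_ids}
```

Because the assignment depends only on the ID, every original record lands in the same set for both models, and the new records are split without disturbing the old ones.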

Split data immediately

  • Rescaling before the split is risky because the data isn’t represented the same way: the min/max computed on all the data can differ from the min/max of the training data alone.
  • Run normalization on the separate sets of data, after the split.
  • Before the split, do analysis with the business and exploratory data analysis. Split the data before you start modeling.
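
A small sketch of why the order matters. The notes don’t spell out the mechanics, so this follows one common approach (an assumption on my part): learn the min/max from the training set only and reuse those values for the test set. The numbers are made up.

```python
def fit_min_max(values):
    """Learn scaling parameters from the given values."""
    return min(values), max(values)

def apply_min_max(values, low, high):
    """Scale values with previously learned parameters."""
    return [(v - low) / (high - low) for v in values]

# Hypothetical feature values; the test set holds a value larger than any training value.
train_values = [10.0, 20.0, 30.0]
test_values = [40.0]

low, high = fit_min_max(train_values)           # test data is not consulted here
scaled_train = apply_min_max(train_values, low, high)
scaled_test = apply_min_max(test_values, low, high)

# Leaky alternative for comparison: min/max computed before the split.
leaky_low, leaky_high = fit_min_max(train_values + test_values)
leaky_train = apply_min_max(train_values, leaky_low, leaky_high)
```

In the leaky version the training values are squeezed by a maximum that only exists in the test set, which is information the model would not have in production.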

Use Cross Validation

  • KFold Validation – split training data into K parts
  • Ex: 3-fold validation – two parts stay as training and one is validation. The test data remains as test data and is kept separate for final evaluation.
  • The validation set is for an initial test.
  • Gives more options for training the model
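
The 3-fold setup described above might look like this with scikit-learn’s KFold and train_test_split. The data is a hypothetical 12-row array, and the sizes are chosen only to make the folds easy to follow.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Hypothetical data set of 12 rows.
X = np.arange(24).reshape(12, 2)
y = np.arange(12)

# Hold out the test set first; it is not touched during cross validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

kfold = KFold(n_splits=3, shuffle=True, random_state=0)
fold_sizes = []
for fit_idx, val_idx in kfold.split(X_train):
    # In each round, two parts are used for fitting and one part for validation.
    fold_sizes.append((len(fit_idx), len(val_idx)))
```

With 9 training rows, each round fits on 6 rows and validates on 3, while the 3 test rows stay untouched for the final evaluation.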

Be Skeptical of High Performance

  • If validation performance is much higher than train/test, be suspicious.
  • If performance on the train/test/validation sets is all high or all the same, be suspicious.

Use scikit-learn pipeline

  • Helps avoid leaking test data into training data
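
A minimal sketch of a scikit-learn pipeline used with cross validation. The synthetic data, MinMaxScaler, and LogisticRegression are my own picks for illustration; the talk only recommended pipelines in general.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data (assumption): 60 rows, 3 features, label depends on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] > 0).astype(int)

# The scaler is re-fit on the training part of each fold, never on the held-out part.
model = make_pipeline(MinMaxScaler(), LogisticRegression())
scores = cross_val_score(model, X, y, cv=3)
```

Bundling the scaler with the model means the preprocessing step cannot accidentally be fit on rows that are later used for scoring.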

Check for features correlated with target

  • If another attribute correlates highly with what you are looking for, make sure you are not mixing up correlation and causation.
  • Also, avoid timeline errors from reverse causation. Ex: the thing you are looking for causes something else.
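
A rough sketch of such a check, flagging features that track the target almost perfectly. The churn data, feature names, and the 0.95 threshold are all hypothetical; a flagged feature is only a prompt to look at the timeline, not proof of leakage.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def suspicious_features(features, target, threshold=0.95):
    """Return names of features whose |correlation| with the target meets the threshold."""
    return [
        name for name, values in features.items()
        if abs(pearson(values, target)) >= threshold
    ]

# Hypothetical example: predicting whether a customer churned.
churned = [0, 1, 0, 1, 1, 0]
features = {
    "monthly_spend": [50, 45, 60, 20, 55, 30],
    # Only filled in after a customer churns, so it mirrors the target.
    "exit_survey_sent": [0, 1, 0, 1, 1, 0],
}
flagged = suspicious_features(features, churned)
```

Here the exit survey is a result of churning rather than a cause, which is the reverse-causation case: the value would not exist yet at prediction time.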

My take

Great talk. Almost all of this was new to me. It was understandable and I learned a lot.

[2023 kcdc] 10 things about postman everyone should know

Speaker: Pooja Mistry

Twitter: @poojamakes

Public workspace: https://www.postman.com/devrel/workspace/2023-10-postman-features-everyone-should-know/overview


Notes

  • Moving towards an API first world
  • Postman started in 2012 as a Chrome extension. It evolved into a full API platform
  • More than just sending requests – ex: collections, documentation, servers
  • Web and app versions
  • Newman – CLI for Postman
  • Collections, env vars, queries, etc. each have their own ID
  • Different life cycles for two personas: producers of APIs (define, design, develop, test, secure, deploy, observe, distribute) and consumers of APIs (discover, evaluate, integrate, test, deploy, observe)
  • Test tab to test the API. Example – pm.test("assert text", function () {})
  • Protocols – GraphQL, WebSocket, gRPC, Socket.IO, etc.
  • Scripts – can run before and after a GraphQL request
  • Pre-request script – ex: debugging
  • Can pass in $randomXXX values of various types in your Postman call

Postman API

  • Sign in and fork the workspace if you want to play with the public workspace for this talk
  • Postman has its own API. Ex: CRUD for collections, environments, etc.
  • Some clients use collection as the deliverable and then get metrics on it.

Postman echo

  • Sends back whatever you send in.
  • When you pass in GET params, it sends back JSON whose args map holds your params.
  • POST sends the text back under the data key in the JSON.
  • Always echoes headers as well

Postman visualizer

  • Can build UI in postman
  • Visualize tab on the result. Put pm.visualizer.set(template, { response: pm.response.json() }) in the test tab.
  • Can use to make charts, maps, csv, etc
  • The template is HTML (which can contain JavaScript)
  • Postman provides a library of templates that you can copy/paste
  • Also see https://learning.postman.com/docs/sending-requests/visualizer/ and https://www.postman.com/postman/workspace/more-visualizer-examples/overview

Built in Libraries

  • Can automatically use faker.js, lodash, moment.js, chai.js, and crypto-js
  • Ex: lodash.functionName()

Workflow Control

  • Scripting allows loops and conditionals
  • postman.setNextRequest() lets you change the order of requests in a collection
  • pm.sendRequest() allows sending multiple API requests at once
  • Collection and environment variables let you communicate between APIs

Mock Servers

  • Create a mock server in UI
  • This gives you a URL
  • Can deactivate mock server
  • Set data to return

Code Generation

  • Includes Java, curl, Node.js, etc. for requests
  • For providers, fewer choices but still a number

Test Automation

  • Bread and butter of Postman
  • Can run manually
  • Can schedule API runs
  • Can report on results of API over time – ex: monitoring
  • Can use Newman, and can generate instructions for running the CLI on CI/CD tools. Ex: Jenkins, CircleCI, GitHub Actions, GitLab, etc.
  • New: June 15 – can do performance testing using the desktop client. Gives a response time graph

Flows

  • Visual diagram showing order/connection/variables.
  • Can include dashboards in flow

Docs

  • Markdown syntax: https://daringfireball.net/projects/markdown/syntax
  • Can embed images
  • If documented well, can share with others
  • Explore tab shows all public APIs across Postman. Best ones are well documented.
  • Can include a link to show which person/company created it.
  • Can have creator workspace and aggregate your collections
  • Get help at – community.postman.com

Can try most features for free. The CLI is not free.

My take

I like that she used Postman (a public collection) and demos for most of the presentation. A lot of the features described were new to me. Excellent start to the morning.