[2023 kcdc] DRYing out your GitLab Pipeline

Speaker: Lynn Owens

For more, see the table of contents.


Intro/Problem

  • Every GitLab project has its own .gitlab-ci.yml file. Great for getting started
  • You quickly end up with hundreds of projects
  • The goal is to eliminate copy/paste by centralizing the pipeline code in a few projects

What NAIC has

  • 200+ projects maintained by 11 teams in 2 dev orgs
  • Pipeline is inner source
  • Version 6 of pipeline; working on version 7
  • Reduced the maintenance burden by making a change once instead of in each project
  • Hosted directly on gitlab.com

Milestone 1 – Hidden jobs for pipeline project

  • GitLab has “hidden” jobs
  • Their names start with a period
  • They don’t appear in any pipeline; they exist just to hold the common code
  • The “pipeline” project has a .gitlab-ci-base.yml which contains common code
  • Common code makes no assumptions about teams and is configurable for all known use cases
  • v1 was about two dozen lines of common code
  • The client projects include the pipeline code (the include can reference any project on the GitLab instance, so it doesn’t need to be your own)
include:
  - project: 'NAIC/pipeline'
    file: '.gitlab-ci-base.yml'
  • Then client projects add jobs that extend the hidden jobs to call the functionality in the base code. Here .deploy_s3 is the hidden job defined in the base code:
deploy_foo:
  stage: deploy
  extends: .deploy_s3
  variables:
    ...

Suggested practices

  • He advises against pinning the pipeline include to a tag because projects then don’t get bug fixes and everyone has to upgrade manually
  • Don’t define stages in the shared pipeline, as that forces one opinion on everyone. Many groups had written a pipeline for their own use case, and they were not all the same.

Milestone 2 – Profiles

  • Found a half dozen use cases, ex: Maven for Java, NPM building Angular, etc.
  • Within a use case, each project’s .gitlab-ci.yml was a copy/paste of the others.
  • Made profiles/maven-java.yml and the like in the common pipeline project
  • Profiles are not one size fits all; there are a bunch of different ones, and projects can still use the milestone 1 approach.

Milestone 3 – Pipeline scripts

  • Common code like logging, calling REST APIs, etc.
  • Switched from bash scripting to Python so the common code could live in modules and the modules could be unit tested (see the sketch below)
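
The talk didn’t show the actual script code, so this is just a minimal sketch (module, function, and test names are hypothetical) of the kind of shared Python module that becomes unit testable once you move off bash:

# pipeline_scripts/versions.py - hypothetical shared module
def is_release_tag(ref: str) -> bool:
    """Return True if a git ref looks like a semantic version tag such as v6.1.0."""
    parts = ref.removeprefix("v").split(".")
    return len(parts) == 3 and all(part.isdigit() for part in parts)

# test_versions.py - the payoff of Python over bash: plain pytest-style unit tests
from pipeline_scripts.versions import is_release_tag

def test_release_tag():
    assert is_release_tag("v6.1.0")
    assert not is_release_tag("feature-branch")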

Options to get scripts

  • Could have the pipeline create a tar/zip archive and upload it to a repository. This is a little slow.
  • Could have a global before_script that does a git clone of pipeline-scripts. Uses a network connection.
  • Could bake the scripts into an image. Requires a pipeline for the image.

If he were doing it again, he wouldn’t create a separate pipeline-scripts project because it is tightly coupled to the pipeline. That doesn’t change the problem of getting the scripts into jobs, though.

Testing

  • If client projects are all using the default branch, small changes will affect them all.
  • Use a testing framework for script code (ex: python/go)
  • Follow development practices
  • Write a sample app for each profile. Have the common pipeline trigger a downstream pipeline on this project. For any merge to master, the downstream jobs must pass.
  • Before major refactors, inventory the profile jobs and audit them afterwards.

Milestone 4 – Profile Fragments

  • Had about 24 profiles (ex: maven-java-jar, maven-java-pom, maven-java-k8s, etc)
  • Typically three components – build tool, language, deployment method
  • These profiles had a lot of copy/paste
  • Decomposed them into fragments (ex: maven, npm, java, angular, k8s, s3)

Selling the idea

  • Needed to convince people to use this pipeline instead of writing their own or using another team’s.
  • Offer flexibility
  • Show value
  • Follow semantic versioning to a T (he tags every merge to master of the pipeline even though he encourages use of the default branch; the tags are good rollback points or for when a project needs something older)
  • Changelog everything
  • Document well
  • Train and evangelize
  • Record training sessions so you have a library

My take

This was a good case study and useful to see concrete examples and techniques. I wish we could see the code, but I understand that belongs to their org.

[2023 kcdc] CVE 101: the unfolding of a zero day attack

Speaker: Theresa Mammarella

Twitter: @t_mammarella

For more, see the table of contents.


Notes

  • Annual cost of cybercrime is predicted to top $8 trillion. Only the US and China have a GDP larger than that.

Terminology

  • Vulnerability – weakness/flaw in system
  • Threat – attack vector, potential action
  • Risk – probable frequency of that loss.
  • The goal of cybersecurity is to minimize risk. We can’t control the intent to do harm, so focus on the vulnerabilities

CVEs

  • CVE – Common Vulnerabilities and Exposures
  • Format is CVE-xxxx-yyyyy, where xxxx is the year it came out and yyyyy is the identifier
  • CVSS scoring – how bad is it on a scale of 0-10. Ten is worst
  • A CVSS score has three parts – base (exploitability, impact), temporal, and environmental. Good description here
  • The base score is the one we see on the CVE
  • A CVE can be rejected. The number stays used and cannot be reused. Example: someone thought they had found a vulnerability, but the investigation was flawed and it wasn’t an actual issue. Story about it here.

How to talk about

  • Private disclosure – organization can choose when/whether to fix/share
  • Coordinated/responsible disclosure – best practice – agreed upon time frame
  • Full/public disclosure – share everything
  • Best to report via the company website, a security.md file, security files on the server, or GitHub private vulnerability reporting

Zero day vulnerability

Examples

  • Log4Shell – remote code execution. It was reported responsibly, but the fix was incomplete, so there were zero days on those CVEs
  • A vulnerability could be as simple as a missing bounds check, as with OpenSSL. They announced that something big was coming and to get ready. When it was announced, we learned it only affected OpenSSL 3 (not earlier versions) and was rated high, not critical, so it was a boy-who-cried-wolf situation.

Security Practices for Developers

  • Insider threat includes poor training
  • There are a lot more developers than info security people, so it is increasingly hard for security teams to keep up.
  • Cost of finding and fixing bugs increases over time
  • Ask: does this touch the internet? Does it take untrusted input? Does it handle sensitive data?
  • OWASP Top 10. Updated in 2021 to add insecure design, software/data integrity failures, and server-side request forgery (SSRF). Some categories were merged, such as injection.
  • OWASP is starting a Top 10 for Large Language Model Applications. A draft version is available
  • mitre/hipcheck – scorecard for supply chain risk. Similarly, Sonatype security rating and OpenSSF Scorecard
  • Open source dependency management matters because open source is embedded in many projects; 90% of an app is open source on average. North Korea attacked many apps, including PuTTY

Attack types

  • Typosquatting – look-alike name with one or two wrong characters
  • Open source repo attacks – attempts to get malware/weaknesses added into dependency source
  • Build tool attacks
  • Dependency confusion – a different version that shows up as the latest

Trust?

  • Sometimes third-party projects can help assess trust. ex: OpenSSF Scorecard
  • NPM and PyPI often have supply chain attacks; Maven Central less so
  • Scanning tools to find issues can be helpful
  • You are responsible when things go wrong

My take

Good talk. Covered concepts and good real life examples. I learned a few things like the OWASP Top 10 for LLMs. Appreciated the shout out to “the Java people in the front row” when talking about log4j. I added a few links in my blog that weren’t in the original presentation for things I wanted to learn more about.

[2023 kcdc] data leakage – why your ML model knows too much

Speaker: Leah Berg

For more, see the table of contents.


Notes

Data Leakage

  • Also known as leakage or target leakage
  • Different meaning for information security (data leaking to outside organization)
  • Can be difficult to spot
  • Training data includes information about the test data.
  • Or the model is trained on information that isn’t available in production

How models learn

  • Split data into training data and test data.
  • Test data – data the model has never seen before; used to make sure the model gets it right
  • Can also have an optional validation set
  • Randomly pick whether each data point goes into training or test data – called a random train/test split (see the sketch below)
  • More training data than test data
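
As a minimal illustration (toy data, not from the talk), a random train/test split with scikit-learn’s train_test_split looks like this:

# Random 80/20 train/test split on toy data
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)

# The test rows are held back so the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)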

Don’t include data from the future

  • Using a random split on time series data doesn’t work because the model has learned about future data.
  • Better to use a sliding window: use the first few months to predict the next month, then add that next value and predict the one after, and keep going. Adding up the errors gives you the accuracy of the model (see the sketch after this list).
  • This works because the model only knows about data from before the point it is asked to predict.
  • Create a timeline of when events happen. That way you make sure you aren’t using data from after the prediction point.
  • You don’t always know where/when data was created, so it is important to understand the business process
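
scikit-learn’s TimeSeriesSplit is one ready-made way to get this behavior; it is a stand-in for the sliding-window approach described above, not the speaker’s exact code (toy data):

# Time-ordered splits: always train on the past, validate on the future
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)   # 12 observations in time order
y = np.arange(12)

for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    # training indices always come before the test indices, so no future data leaks in
    print("train:", train_idx, "test:", test_idx)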

Don’t randomly split groups

  • With a random split, some data from each group lands in training, so the model has already seen the group it is later asked to predict
  • The problem appears when a new student (group) shows up; the prediction for them will be bad
  • scikit-learn has GroupShuffleSplit() to keep a full group in the same set – either testing or training (see the sketch below)
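
A minimal sketch of GroupShuffleSplit on toy data (the group ids are hypothetical, e.g. one per student):

# Keep each group entirely in either the training set or the test set
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.arange(8).reshape(-1, 1)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])   # e.g., one id per student

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))
# every group id ends up in exactly one of the two index arrays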

Don’t forget your data is a snapshot

  • In school, have pristine data set.
  • In real world, data is always changing.
  • You could accidentally tell the model about data that occurred after the prediction. Again, think about the data on a timeline

Don’t randomly split data when retraining

  • You want to use the same training/test data on the production and challenger models to see which is better.
  • Otherwise one model has already seen data points during training that you are now testing on, so you don’t know if it is really better.
  • The challenger model can get more data that wasn’t available originally. It is OK to split the new data into test/train as long as the original data is split the same way as before (see the sketch below).
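
One way to do this (a sketch under the assumption that the original rows and random_state are unchanged; not the speaker’s code) is to reproduce the original split and give only the new data a fresh split:

# Keep the original split stable; only the newly collected data gets a fresh split
import numpy as np
from sklearn.model_selection import train_test_split

X_old, y_old = np.arange(20).reshape(-1, 1), np.arange(20) % 2      # original data
X_new, y_new = np.arange(20, 30).reshape(-1, 1), np.arange(10) % 2  # collected since then

# Re-running with the same inputs and random_state reproduces the original split
X_tr_old, X_te_old, y_tr_old, y_te_old = train_test_split(X_old, y_old, test_size=0.2, random_state=42)
X_tr_new, X_te_new, y_tr_new, y_te_new = train_test_split(X_new, y_new, test_size=0.2, random_state=42)

X_train, y_train = np.vstack([X_tr_old, X_tr_new]), np.concatenate([y_tr_old, y_tr_new])
X_test, y_test = np.vstack([X_te_old, X_te_new]), np.concatenate([y_te_old, y_te_new])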

Split data immediately

  • It is risky to rescale before the split: the min/max of the full data set can differ from the min/max of just the training data, so scaling first leaks information from the test data
  • Run the normalization separately on the split sets: fit it on the training data, then apply it to the test data (see the sketch below)
  • Before the split, do analysis with the business and exploratory data analysis. Split the data before you start modeling
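
A minimal sketch of scaling after the split, fitting the scaler on the training data only (toy data):

# Fit the scaler on training data only, then apply the same scaling to the test data
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # min/max come from the training data only
X_test_scaled = scaler.transform(X_test)        # the test data never influences the scaling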

Use Cross Validation

  • KFold validation – split the training data into K parts
  • ex: 3-fold validation – two parts stay as training and one becomes validation. The test data remains separate and is kept for the final evaluation.
  • The validation set is for an initial test.
  • Gives more options to train the model (see the sketch below)
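
A minimal sketch of 3-fold cross validation on the training portion while the test set stays untouched (toy data):

# 3-fold cross validation on the training data; the held-out test set is not touched
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=300, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train,
                         cv=KFold(n_splits=3, shuffle=True, random_state=42))
print(scores.mean())   # validation estimate; the final check still uses X_test/y_test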

Be Skeptical of High Performance

  • If the validation score is much higher than the train/test scores, be suspicious.
  • If the train/test/validation scores are all high or all the same, be suspicious.

Use scikit-learn pipeline

  • Helps avoid leaking test data into the training data by keeping preprocessing steps inside the pipeline (see the sketch below)
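
A minimal sketch: with the scaler and model wrapped in a Pipeline, each cross-validation fold refits the scaler on that fold’s training data only (toy data):

# The Pipeline refits the scaler inside each CV fold, so validation folds never leak into it
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=42)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
print(cross_val_score(pipe, X, y, cv=5).mean())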

Check for features correlated with target

  • If another attribute is a very close match to what you are looking for, make sure you aren’t mixing up correlation and causation.
  • Also, avoid timeline errors from reverse causation. Ex: the thing you are looking for causes something else (see the sketch below)
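
One quick check (column names here are hypothetical) is to look at each feature’s correlation with the target and investigate anything that matches too well:

# Check feature/target correlations; a near-perfect correlation deserves investigation
import pandas as pd

df = pd.DataFrame({
    "feature_a": [1, 2, 3, 4, 5],
    "feature_b": [5, 3, 4, 1, 2],
    "target":    [1, 2, 3, 4, 5],   # feature_a matches the target exactly - ask why
})
print(df.corr()["target"].drop("target").sort_values(ascending=False))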

My take

Great talk. Almost all of this was new to me. It was understandable and I learned a lot.