Speaker: Barry Burd
See the table of contents
Opener
- Imagine having to find the lowest point on a line when you can only see a few steps ahead of or behind you. A one-dimensional problem.
- On a mountain, finding the lowest point is the same problem in two dimensions.
- An LLM does the same thing, but in many more dimensions – ex: billions
- It uses a lot of tricks, not just an extension of the one-dimensional approach
Problem
- Training GPT-3 required 10K NVIDIA GPUs
- PyTorch is highly optimized: built-in libraries, deep integration with GPU hardware (NVIDIA CUDA)
- Apple has its own GPU stack
- We want to do this with Java
Solution
- HAT (Heterogeneous Acceleration Toolkit)
- Work in progress
- Part of project Babylon
- Code models/reflection
- Barry’s goal: algorithms to run on these
Deeplearning4j (ND4j)
- CUDA support
- No MPS (Apple Metal) support
- Arrays stored off-heap (outside JVM)
- Several arrays can be views pointing into subarrays of the same underlying data.
What LLM does
- After analyzing a possibly incomplete string, the LLM decides which token to add to the string next.
- There are too many words to predict word by word
- Characters are too granular because they carry no meaning
Tokens
- “I’ve grokked Heinlein’s works” as tokens:
*I
've
gro
k
ked
Hein
lein
's
work
s
- A token ID is a number that identifies each token in the vocabulary
- A token is a sequence of characters that occurs together frequently enough, found using byte-pair encoding
- Suppose the string is “a b r a c a d a b r a”; the initial tokens are a, b, c, d, r. Then observe that the pair “ab” appears frequently, so “ab” also becomes a token. Now we have “ab r a c a d ab r a” with “ab” added to the vocabulary. Repeat: “ab” and “r” appear next to each other, giving “abr a c a d abr a” with “abr” added to the token list. Then “abra”
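A toy Java sketch of that merge loop (illustration only – the class name and tie-breaking rule are my own; real BPE tokenizers are more involved):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy byte-pair encoding: repeatedly fuse the most frequent adjacent pair.
public class ToyBpe {

    // One merge round: find the most frequent adjacent pair
    // (earliest-seen pair wins ties) and fuse every occurrence of it.
    static List<String> mergeOnce(List<String> tokens) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            counts.merge(tokens.get(i) + tokens.get(i + 1), 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        List<String> merged = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (i + 1 < tokens.size()
                    && (tokens.get(i) + tokens.get(i + 1)).equals(best)) {
                merged.add(best); // the fused pair joins the vocabulary as one token
                i++;              // skip the second half of the fused pair
            } else {
                merged.add(tokens.get(i));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("a b r a c a d a b r a".split(" "));
        for (int round = 0; round < 3; round++) {
            tokens = mergeOnce(tokens);
            System.out.println(tokens);
        }
    }
}
```

Three rounds reproduce the talk's sequence: “ab”, then “abr”, then “abra”.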
- Python library tiktoken
- Java library JTokkit
Brain
- 86 billion neurons in human brain
- Dendrite – input from another cell
- Soma – cell body
- Axon – output to other cells
- Oversimplified: imagine the cell body multiplies each input by a certain weight (different per cell) and adds them up. That’s like multiplying a vector by a matrix
Math terms
- Vector – an array/list of numbers. Can represent a point in n-dimensional space. Usually visualized as an arrow from the origin to that point. ex: 1526 dimensions means 1526 numbers in the vector
- Matrix – rectangular array of numbers (a stack of vectors). Multiplying by a matrix turns one vector into another
- Tensor – a stack of matrices; an array of arrays of matrices. Not important here.
- Dot product of two vectors – multiply the elements in the same spot in each vector and add them up
- Matrix multiplication – had a nice animation
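The two definitions above in plain Java (no library needed; a matrix-vector product is just one dot product per matrix row):

```java
// Dot product and matrix-vector multiplication on plain Java arrays.
public class VectorMath {

    // Dot product: multiply matching elements, then sum.
    static double dot(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // A matrix turns one vector into another: each output element
    // is the dot product of one matrix row with the input vector.
    static double[] multiply(double[][] matrix, double[] vector) {
        double[] result = new double[matrix.length];
        for (int row = 0; row < matrix.length; row++) {
            result[row] = dot(matrix[row], vector);
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] m = {{2, 0}, {0, 3}}; // scales x by 2 and y by 3
        double[] v = {1, 1};
        System.out.println(java.util.Arrays.toString(multiply(m, v)));
    }
}
```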
ND4j
- N-Dimensional arrays for Java.
- Knows how to do vector/matrix math
Embedding
- Each token gets assigned an arbitrary vector at first. This is the token embedding
- Picture adjusting a bunny-ears antenna: it only works while you’re touching it, and walking away breaks the reception
- Each number in the initial arbitrary vector is like a dial that needs to be tuned
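The “dials” picture maps directly onto code: a token embedding is just one row in a table of tunable numbers, looked up by token ID. A minimal sketch (the class, sizes, and initialization scale are illustrative, not from the talk):

```java
import java.util.Random;

// Token embeddings as a lookup table: one row of "dials" per token ID,
// initialized at random and tuned during training.
public class EmbeddingTable {
    final double[][] table; // table[tokenId] is that token's vector

    EmbeddingTable(int vocabSize, int dimensions, long seed) {
        Random random = new Random(seed);
        table = new double[vocabSize][dimensions];
        for (double[] row : table) {
            for (int i = 0; i < row.length; i++) {
                row[i] = random.nextGaussian() * 0.02; // small random start
            }
        }
    }

    // Embedding lookup is just array indexing by token ID.
    double[] lookup(int tokenId) {
        return table[tokenId];
    }

    public static void main(String[] args) {
        EmbeddingTable embeddings = new EmbeddingTable(50000, 8, 42L);
        System.out.println(java.util.Arrays.toString(embeddings.lookup(123)));
    }
}
```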
Gradient Descent
- Normally there are millions of minimum points, and it’s easy to get stuck at a local minimum – a point that looks lowest in all directions even though a lower one exists elsewhere. LLM training is meant to avoid that pitfall
- Eclipse DeepLearning4J – can configure neural network and make a model
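The descent idea in one dimension, as a toy (the function and learning rate are made up for illustration; real training uses stochastic variants over billions of parameters):

```java
// One-dimensional gradient descent: repeatedly step downhill along the slope.
// f(x) = (x - 3)^2 has its single minimum at x = 3.
public class GradientDescent {

    static double minimize(double start, double learningRate, int steps) {
        double x = start;
        for (int i = 0; i < steps; i++) {
            double gradient = 2 * (x - 3); // derivative of (x - 3)^2
            x -= learningRate * gradient;  // move opposite the slope
        }
        return x;
    }

    public static void main(String[] args) {
        System.out.println(minimize(0.0, 0.1, 100)); // converges toward 3.0
    }
}
```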
Vector meaning
- Similar vectors have similar semantics
- Applying related vectors to others should have consistent semantic meaning
- RGB colors are vectors with three components representing red, green, and blue.
- The dot product of cyan (0, 255, 255) and red (255, 0, 0) is zero because they have nothing in common
- Add a positional embedding to the token embedding so the model knows where each token sits in the sentence. (Added to each element at a different scale, so the model knows which part goes with which.) Combined, these form the input embedding
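The color example, concretely – the dot product as a crude similarity measure (a sketch; the magenta comparison is my own addition):

```java
// Colors as vectors: the dot product is zero when two vectors
// share no components, and larger when they overlap.
public class ColorSimilarity {

    static int dot(int[] a, int[] b) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] cyan = {0, 255, 255};
        int[] red = {255, 0, 0};
        int[] magenta = {255, 0, 255};
        System.out.println(dot(cyan, red));     // 0: nothing in common
        System.out.println(dot(cyan, magenta)); // positive: both contain blue
    }
}
```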
Attention
- Attention is all you need https://arxiv.org/abs/1706.03762
- Attention examples: grammatical structure, meaning, word order
- Long range dependency – like a pronoun that refers to something many words away
- Attention helps the model focus on the important parts. ex: “The cat sat on the mat” – at “mat”, the model has to know what the cat is sitting on
- Apply a key matrix to “cat” and a query matrix to “mat”. The key matrix offers info; the query matrix asks for what you want to know
- Start with random values in the matrices. Multiply them by the tokens, then take the dot product to get a number that predicts something about the next word. See how far the prediction is off from the actual word, nudge the values in the direction that makes the error smaller, and repeat a very large number of times
- A lot of this can be done in parallel
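A single-query sketch of scaled dot-product attention (the vectors here are made up; real models apply learned query/key/value matrices to every token and run many attention heads in parallel):

```java
// Scaled dot-product attention for one query over a few key/value vectors:
// score each key against the query, softmax the scores, then return
// the weighted sum of the value vectors.
public class ToyAttention {

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            s += a[i] * b[i];
        }
        return s;
    }

    static double[] attend(double[] query, double[][] keys, double[][] values) {
        int n = keys.length;
        double scale = Math.sqrt(query.length); // dampens large dot products
        double[] weights = new double[n];
        double total = 0;
        for (int i = 0; i < n; i++) {
            weights[i] = Math.exp(dot(query, keys[i]) / scale); // softmax numerator
            total += weights[i];
        }
        double[] out = new double[values[0].length];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < out.length; j++) {
                out[j] += (weights[i] / total) * values[i][j];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] query = {1, 0};            // "what am I looking for?"
        double[][] keys = {{1, 0}, {0, 1}}; // "what does each token offer?"
        double[][] values = {{10, 0}, {0, 10}};
        // The first key matches the query, so the first value dominates the mix.
        System.out.println(java.util.Arrays.toString(attend(query, keys, values)));
    }
}
```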
Feed forward
- Linear – straight – ex: 2x + 3y (2 and 3 are the knobs to tune)
- Non-linear – wavy – more dimensions
- Language isn’t linear by nature
- “great, terrific meal” – the positive words can just add up
- “not good” – “not” flips the meaning of the sentence, so you can’t just add up word meanings to predict the next word
Universal approximation theorem
- Imagine a wavy line as a series of bumps from the ground to that line
- The more granular the bump, the more accurate the result.
- Each bump can be represented by a linear formula
- Apply the GeLU (non-linear) function to make the bumps
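GeLU itself is a small formula; this is the widely used tanh approximation as a sketch:

```java
// GELU activation, tanh approximation:
// gelu(x) = 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
// Smoothly non-linear: roughly 0 for very negative x, roughly x for large x.
public class Gelu {

    static double gelu(double x) {
        double inner = Math.sqrt(2.0 / Math.PI) * (x + 0.044715 * x * x * x);
        return 0.5 * x * (1.0 + Math.tanh(inner));
    }

    public static void main(String[] args) {
        for (double x : new double[]{-2, -1, 0, 1, 2}) {
            System.out.printf("gelu(%.0f) = %.4f%n", x, gelu(x));
        }
    }
}
```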
Experiment
- Translated Karpathy’s 253-line LLM into Java, training on a list of baby names
- Took about a minute to train.
- Generated new names. A couple looked legit, but most looked random
My take
This was great. Tokens and other critical concepts usually get used without being defined, and we take them for granted. So being able to think them through was informative and helpful. I like that Barry showed the math but said it was OK not to understand it. [it was understandable]