This page is an annotated reference of machine learning papers and related material. There are similar lists all over the internet, and this one will not be complete or up to date. These are topics I am (or was) interested in, or that I think will be useful later.
Of course, after starting this list I ran into the Methods section of Papers with Code. Highly recommended. Another favorite of mine is NLP-progress.
Overview
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
- DOI: 10.1145/3442188.3445922
- year: 2021
Overview of the development of LLMs and the risks of depending on them. Lists well-known models with their parameter counts and the datasets they are trained on. Also argues that Natural Language Understanding (NLU) is not general intelligence.
Gathering data
Common Crawl
Large collection of text scraped from the internet, used to train language models.
The Colossal Clean Crawled Corpus (C4) is the Common Crawl dataset after cleaning and removal of 'bad words'.
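To get a feel for the data, C4 can be streamed rather than downloaded in full. A minimal sketch, assuming the Hugging Face `datasets` package and the `allenai/c4` dataset name on the hub (not something from the papers above):

```python
# Sketch: stream a few C4 records via Hugging Face datasets (assumed dataset name).
from datasets import load_dataset

# streaming=True avoids downloading the full (hundreds of GB) corpus.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, example in enumerate(c4):
    print(example["text"][:80])  # records contain "text", "timestamp", "url"
    if i == 2:
        break
```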
Pre-processing data
Network design
Attention Is All You Need
- URL
- year: 2017
The Transformer architecture, built entirely on (self-)attention instead of recurrence.
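A minimal sketch (mine, not the paper's code) of the core operation, single-head scaled dot-product self-attention:

```python
# Sketch: single-head self-attention, softmax(Q K^T / sqrt(d)) V.
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    # x: (batch, seq_len, d_model); w_*: (d_model, d_head) projection matrices
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])  # (batch, seq, seq)
    return torch.softmax(scores, dim=-1) @ v

x = torch.randn(2, 5, 16)
w_q, w_k, w_v = (torch.randn(16, 8) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([2, 5, 8])
```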
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
- DOI: 10.48550/arXiv.2006.16236
- URL
- year: 2020
Linearized attention that reduces the quadratic cost (in sequence length) of standard self-attention to linear, and shows the connection between autoregressive Transformers and RNNs.
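A rough sketch of the idea, using the feature map φ(x) = elu(x) + 1 from the paper and summing over keys and values once instead of forming the full attention matrix (non-causal case; see the paper for the exact, causal formulation):

```python
# Sketch: linear attention, phi(Q) (phi(K)^T V) with normalisation, O(N) in sequence length.
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # q, k: (batch, seq, d); v: (batch, seq, d_v)
    q, k = F.elu(q) + 1, F.elu(k) + 1            # feature map phi
    kv = torch.einsum("bnd,bne->bde", k, v)      # sum over the sequence once
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

q = k = v = torch.randn(2, 100, 16)
print(linear_attention(q, k, v).shape)  # torch.Size([2, 100, 16])
```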
ReZero is All You Need: Fast Convergence at Large Depth
- DOI: 10.48550/arXiv.2003.04887
- year: 2021 (2020)
Not an activation function but a residual-connection trick: each residual branch gets a learnable gate initialized at zero, y = x + α·F(x) with α = 0, which makes very deep networks trainable. The trick works regardless of the activation used (ReLU, GELU, attention blocks, etc.).
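A minimal sketch (mine) of a ReZero residual block:

```python
# Sketch: ReZero residual block, y = x + alpha * F(x), with alpha starting at 0.
import torch
import torch.nn as nn

class ReZeroBlock(nn.Module):
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                               nn.Linear(d_hidden, d_model))
        self.alpha = nn.Parameter(torch.zeros(1))  # the whole trick

    def forward(self, x):
        return x + self.alpha * self.f(x)

x = torch.randn(4, 32)
print(ReZeroBlock(32, 64)(x).shape)  # torch.Size([4, 32])
```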
Training
Initialization
Understanding the difficulty of training deep feedforward neural networks
Glorot (also called Xavier) uniform / normal initialization of weights and biases.
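In PyTorch this is available out of the box; a short sketch:

```python
# Sketch: Glorot/Xavier initialisation of a linear layer in PyTorch.
import torch.nn as nn

layer = nn.Linear(128, 64)
nn.init.xavier_uniform_(layer.weight)  # U(-a, a) with a = sqrt(6 / (fan_in + fan_out))
nn.init.zeros_(layer.bias)
```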
Activation
Gaussian Error Linear Units (GELUs)
- DOI: 10.48550/arXiv.1606.08415
- year: 2023 (first 2016)
Activation function that weights its input by the standard normal CDF, GELU(x) = x·Φ(x); a smooth, deterministic take on ReLU and dropout, assuming the layer inputs are roughly normalized.
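The function itself, as a small sketch next to PyTorch's built-in version:

```python
# Sketch: exact GELU, x * Phi(x), with Phi the standard normal CDF (via erf).
import math
import torch

def gelu(x):
    return 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))

x = torch.linspace(-3, 3, 7)
print(gelu(x))
print(torch.nn.functional.gelu(x))  # built-in; should match closely
```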
Loss functions
Focal Loss for Dense Object Detection
- DOI: 10.48550/arXiv.1708.02002
- year: 2018
Loss function, a variant of cross-entropy, that down-weights easy, confidently correct examples and focuses on hard ones and on confident mistakes: cross-entropy scaled by (1 − p_t)^γ.
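A minimal sketch (mine, binary case) of the focal loss FL(p_t) = −α_t (1 − p_t)^γ log(p_t):

```python
# Sketch: binary focal loss, cross-entropy scaled by (1 - p_t)^gamma.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class weighting
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

logits = torch.randn(8)
targets = torch.randint(0, 2, (8,)).float()
print(focal_loss(logits, targets))
```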
Supervised Contrastive Learning
- DOI: 10.48550/arXiv.2004.11362
- year: 2021 (2020)
Contrastive loss that uses label information: representations of samples from the same class are pulled together while other classes are pushed apart; an alternative to plain cross-entropy for training encoders.
Backpropagation
Adam: A Method for Stochastic Optimization
- DOI: 10.48550/arXiv.1412.6980
- year: 2015
Adaptive moment estimation; an extension of stochastic gradient descent with per-parameter learning rates derived from running estimates of the gradient's first and second moments. Simple, efficient, and mostly good enough.
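Typical usage in PyTorch, as a short sketch:

```python
# Sketch: a standard PyTorch training loop with the Adam optimizer.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x, y = torch.randn(32, 10), torch.randn(32, 1)
for _ in range(5):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
print(loss.item())
```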
Evaluation
Rogue scores
- URL
- year: 2023
Reproducibility of ML papers that report ROUGE scores is very low, blamed on faulty software implementations and poor statistical practices.
Silhouettes: A graphical aid to the interpretation and validation of cluster analysis
A score between -1 (bad, overlapping clusters) and 1 (perfect: compact and clearly separated) for evaluating (unsupervised) clusterings, based on within-cluster and between-cluster distances.
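Readily available in scikit-learn; a small sketch:

```python
# Sketch: silhouette score for a k-means clustering (scikit-learn).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))  # close to 1 for compact, well-separated clusters
```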
Comparing Clusterings – an information based distance
VI, the variation of information, is a score for comparing two clusterings. Unlike many such scores, it is a true metric.
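A minimal sketch (mine, not the paper's code) of VI(U, V) = H(U) + H(V) − 2·I(U; V), computed in nats:

```python
# Sketch: variation of information between two flat clusterings.
import numpy as np
from sklearn.metrics import mutual_info_score

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def variation_of_information(u, v):
    return entropy(u) + entropy(v) - 2.0 * mutual_info_score(u, v)

# Identical clusterings (up to relabelling) have VI = 0.
print(variation_of_information([0, 0, 1, 1], [1, 1, 0, 0]))  # 0.0
```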
On the Surprising Behavior of Distance Metrics in High Dimensional Space
In high-dimensional spaces (in this paper, roughly 20 dimensions and up) the usual Euclidean distance loses its discriminative power: nearest and farthest neighbours become almost equally far away. The Manhattan distance (L1), or even a fractional 'norm' such as L0.1, can work much better.
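An illustrative sketch (mine, with made-up uniform data, not the paper's experiments) of how the relative contrast between nearest and farthest neighbour shrinks with dimension for L2 but less so for L1 and L0.1:

```python
# Sketch: relative contrast (d_max - d_min) / d_min for Minkowski-style distances.
import numpy as np

def relative_contrast(X, q, p):
    d = np.sum(np.abs(X - q) ** p, axis=1) ** (1.0 / p)
    return (d.max() - d.min()) / d.min()

rng = np.random.default_rng(0)
for dim in (2, 20, 200):
    X, q = rng.random((1000, dim)), rng.random(dim)
    print(dim, [round(relative_contrast(X, q, p), 3) for p in (2.0, 1.0, 0.1)])
```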
Hyperparameter tuning
Inference
Frameworks
I like to work with
PyTorch
The current go-to framework in Python for machine learning.
TensorBoard
Undecided
I prefer not to work with
TensorFlow
Alternative to PyTorch. I had some issues in the past with networks no longer working after some time.