This page is an annotated reference of machine learning papers and related resources. There are similar lists all over the internet, and this list will not be complete, nor up-to-date. These are topics I am (or was) interested in, or that I think will be useful later.

Of course, after starting this list I ran into the Methods section of Papers with Code. Highly recommended. Another favorite of mine is NLP-progress.

Overview

On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?

Overview of the development of LLMs and the risks of depending on them. Lists famous models with their parameter counts and the datasets they were trained on. Also argues that Natural Language Understanding (NLU) is not general intelligence.

Gathering data

Common Crawl

Large collection of texts from the internet to train language models.

The Colossal Clean Crawled Corpus (C4) is the Common Crawl dataset, cleaned up and with ‘bad words’ removed.

Pre-processing data

Network design

Attention is all you need

  • URL
  • year: 2017

The Transformer network with self-attention.
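
A minimal sketch of the scaled dot-product self-attention at its core (single head, no masking; the tensor shapes are my assumptions):

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (batch, seq_len, d_model); w_*: (d_model, head_dim) projections
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.size(-1)
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # (batch, seq_len, seq_len)
    return F.softmax(scores, dim=-1) @ v
```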

Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

A linearized version of the Transformer’s self-attention, which is otherwise quadratic in the sequence length.
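
A sketch of the non-causal case, using the φ(x) = elu(x) + 1 feature map from the paper; the einsum formulation and the `eps` guard are my own choices:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # q, k: (batch, seq_len, d_k); v: (batch, seq_len, d_v)
    q, k = F.elu(q) + 1, F.elu(k) + 1  # positive feature map phi from the paper
    # Associativity: phi(Q) (phi(K)^T V) costs O(N d^2) instead of O(N^2 d)
    kv = torch.einsum('bnd,bne->bde', k, v)
    z = 1 / (torch.einsum('bnd,bd->bn', q, k.sum(dim=1)) + eps)  # row normalizer
    return torch.einsum('bnd,bde,bn->bne', q, kv, z)
```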

ReZero is All You Need: Fast Convergence at Large Depth

A residual-connection trick for training very deep networks: each residual branch is scaled by a learnable scalar initialized to zero, so every layer starts as the identity. Works alongside any activation (ReLU, GELU, etc.).
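
A minimal sketch of the idea in PyTorch (the wrapper class is hypothetical, not the authors’ code):

```python
import torch
import torch.nn as nn

class ReZeroBlock(nn.Module):
    def __init__(self, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer
        self.alpha = nn.Parameter(torch.zeros(1))  # gate starts closed

    def forward(self, x):
        # x_{i+1} = x_i + alpha_i * F(x_i): at init every block is the
        # identity, so signals and gradients pass through unchanged
        return x + self.alpha * self.sublayer(x)
```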

Training

Initialization

Understanding the difficulty of training deep feedforward neural networks

Glorot (a.k.a. Xavier) uniform and normal initialization of weights and biases.
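
In PyTorch this is built in; a short example (zero-initializing the bias is my own convention here):

```python
import torch.nn as nn

layer = nn.Linear(256, 128)
# Glorot/Xavier uniform: W ~ U(-a, a) with a = sqrt(6 / (fan_in + fan_out)),
# chosen to keep activation and gradient variance roughly constant across layers
nn.init.xavier_uniform_(layer.weight)
nn.init.zeros_(layer.bias)
```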

Activation

Gaussian Error Linear Units (GELUs)

Activation function that mimics dropout; assumes layers are normalized.
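
The exact form is GELU(x) = x · Φ(x), with Φ the standard normal CDF; a quick check against PyTorch’s built-in:

```python
import torch

def gelu(x):
    # GELU(x) = x * Phi(x), where Phi is the standard normal CDF
    return x * 0.5 * (1 + torch.erf(x / 2 ** 0.5))

x = torch.linspace(-3, 3, 7)
print(torch.allclose(gelu(x), torch.nn.functional.gelu(x)))  # True
```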

Loss functions

Focal Loss for Dense Object Detection

Loss function that focuses training on uncertain examples and on confident mistakes, by down-weighting easy, well-classified ones. A variant of cross-entropy.
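
A binary-classification sketch; gamma = 2 follows the paper, and I leave out its optional alpha class-balancing factor:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    # FL(p_t) = -(1 - p_t)^gamma * log(p_t); gamma = 0 recovers plain
    # cross-entropy, larger gamma down-weights easy, confident examples
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
    p_t = torch.exp(-ce)  # probability the model assigns to the true class
    return ((1 - p_t) ** gamma * ce).mean()
```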

Supervised Contrastive Learning

Backpropagation

Adam: A Method for Stochastic Optimization

Adaptive moment estimation; an extension to Stochastic Gradient Descent. Simple, efficient, and mostly good enough.
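
Typical usage in PyTorch, with the paper’s default hyperparameters:

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(10, 1)
# Adam tracks running estimates of the gradient's first moment (mean) and
# second moment (uncentered variance), with bias correction for the zero init
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = F.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```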

Evaluation

Rogue scores

  • URL
  • year: 2023

Reproducibility of ML papers that use ROUGE scores is very low. Blamed on bad software implementations and bad statistical practices.

Silhouettes: A graphical aid to the interpretation and validation of cluster analysis

A score between -1 (bad, overlapping) and 1 (perfect: compact and clearly separated) for evaluating (unsupervised) clusterings. It compares within-cluster and between-cluster distances.
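
Per point, s(i) = (b(i) − a(i)) / max(a(i), b(i)), with a(i) the mean distance to the point’s own cluster and b(i) the mean distance to the nearest other cluster. A small example with scikit-learn (the data here is made up):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.random((200, 2))
labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)
# Mean of s(i) over all points; near 1 = compact and well separated
print(silhouette_score(X, labels))
```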

Comparing Clusterings – an information based distance

VI, the variation of information, is a score for comparing two clusterings. Unlike many comparison scores, this one really is a metric.
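
A sketch of the computation, VI(A, B) = H(A) + H(B) − 2·I(A, B); the helper function is mine, and scikit-learn’s `mutual_info_score` returns values in nats:

```python
from sklearn.metrics import mutual_info_score

def variation_of_information(labels_a, labels_b):
    # Entropy via the identity H(A) = I(A, A)
    h_a = mutual_info_score(labels_a, labels_a)
    h_b = mutual_info_score(labels_b, labels_b)
    # VI is 0 for identical clusterings and satisfies the triangle inequality
    return h_a + h_b - 2 * mutual_info_score(labels_a, labels_b)
```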

On the Surprising Behavior of Distance Metrics in High Dimensional Space

In high-dimensional spaces (20 dimensions or more in this paper), traditional distances no longer mean much: the contrast between the nearest and farthest points shrinks. The Manhattan distance (L1), or even fractional norms like L0.1, can perform much better.
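
A small sketch of the L_p (Minkowski) distances involved; the function and test data are my own:

```python
import numpy as np

def minkowski(x, y, p):
    # L_p distance; p < 1 gives the fractional 'norms' studied in the paper
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

rng = np.random.default_rng(0)
x, y = rng.random(50), rng.random(50)  # two 50-dimensional points
for p in (2.0, 1.0, 0.1):
    print(p, minkowski(x, y, p))
```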

Hyperparameter tuning

Inference

Frameworks

I like to work with

PyTorch

The current go-to framework in Python for machine learning.

TensorBoard

Undecided

I prefer not to work with

TensorFlow

Alternative to PyTorch. I had some issues in the past with networks no longer working after some time.