Deep Learning for Action Recognition: From Basics to Efficiency Advancements

Rahul Deora
10 min read · Jan 23, 2024
The many actions in a video

Action recognition is an important task in computer vision that entails classifying human actions depicted in video. Think of it as the video counterpart of image classification: instead of identifying objects in static 2D images, action recognition involves discerning actions within dynamic video clips, where each frame is essentially a 2D image connected to other 2D images in a sequence.

Action recognition is more challenging than 2D classification for the following reasons:

  • Densely Packed Actions: Videos often present scenarios where numerous actions unfold concurrently or in quick succession
  • Long-Range Processing: Actions may extend over extended intervals, requiring long-range processing to capture the nuances and transitions effectively
  • Irrelevant Frames: Not every frame contributes to the action recognition process, and there may be many irrelevant frames which need to be ignored
  • Expensive and Time-Consuming Training: Video models are harder and more compute-intensive to train than image models
  • Generalization Challenges: Harder to generalize due to the amount of variations possible in the video space

Videos typically run at around 25–60 frames per second, so it is common to lower the frame rate (i.e., subsample in the temporal dimension) before processing them.
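As a simple illustration, here is a minimal sketch of uniform temporal subsampling, assuming a PyTorch tensor of shape [T, H, W, C] (the shapes and frame counts are illustrative, not tied to any particular paper):

```python
import torch

def subsample_frames(video: torch.Tensor, num_frames: int) -> torch.Tensor:
    """Uniformly pick `num_frames` frames from a [T, H, W, C] video tensor."""
    t = video.shape[0]
    # Evenly spaced indices across the whole clip (e.g. 64 frames -> 8 frames).
    indices = torch.linspace(0, t - 1, num_frames).long()
    return video[indices]

# Example: a fake 64-frame clip reduced to 8 frames.
clip = torch.randn(64, 224, 224, 3)
print(subsample_frames(clip, 8).shape)  # torch.Size([8, 224, 224, 3])
```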

In this blog, we’ll explore some of the early prominent approaches to action recognition and then cover some efficient methods that will help you get a strong overview of this field.

Single Stream Network

Paper: Large-scale Video Classification with Convolutional Neural Networks: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42455.pdf

4 different ways to process videos

This was an early seminal paper showing different ways in which information from different frames can be merged. The paper makes use of 3D convolutions to merge multiple frames.

Single-frame: Only the middle frame of the clip is used and processed by a 2D convolutional network, providing a naive baseline that shows how much accuracy is achievable while disregarding temporal information.
Early Fusion: Here we pick the middle T frames and process them through a convolutional network whose first filter has size 11×11×3×T pixels, where T is the temporal extent of the filter. Only the middle T frames are processed, as shown in the diagram above.

Late Fusion: Here two separate single-frame networks with shared parameters process two frames 15 frames apart. Their features are then merged for a final classification. Because information is fused only at the end, this method is known as Late Fusion.
Slow Fusion: The Slow Fusion model slowly fuses temporal information throughout the network, so that higher layers get access to progressively more global information in both the spatial and temporal dimensions. This is implemented with temporal (3D) convolutions whose temporal receptive field grows iteratively as the network deepens.
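To make the contrast concrete, here is a rough PyTorch sketch of the two extremes, Early Fusion versus Slow Fusion. The channel counts, strides, and clip length are my own illustrative choices, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

T = 10  # temporal extent of the clip (illustrative)

# Early Fusion: the very first filter spans all T frames at once.
early_fusion_stem = nn.Conv3d(3, 96, kernel_size=(T, 11, 11), stride=(1, 3, 3))

# Slow Fusion: small temporal kernels, so the temporal receptive field
# grows gradually as the network gets deeper.
slow_fusion = nn.Sequential(
    nn.Conv3d(3, 96, kernel_size=(4, 11, 11), stride=(2, 3, 3)), nn.ReLU(),
    nn.Conv3d(96, 256, kernel_size=(2, 5, 5), stride=(2, 2, 2)), nn.ReLU(),
)

clip = torch.randn(1, 3, T, 224, 224)  # [batch, channels, time, height, width]
print(early_fusion_stem(clip).shape)   # temporal dimension collapses to 1 immediately
print(slow_fusion(clip).shape)         # temporal dimension shrinks gradually
```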

Results

As one would expect, the results showed that Slow Fusion performed the best of all the above methods; however, Single-frame was a close second.

Two Stream Networks

Paper: Two-Stream Convolutional Networks for Action Recognition in Videos: https://arxiv.org/pdf/1406.2199.pdf

One of the reasons networks such as the above, known as Single Stream Networks, failed to live up to their promise is that single-frame image classification is a strong baseline: it is often possible to classify a whole video from just the center frame run through a 2D CNN.

Inspired by the two-streams hypothesis of the human visual system, which states that the visual cortex contains two pathways, the ventral stream (which performs object recognition) and the dorsal stream (which recognises motion), this work aggregates spatial and temporal information by processing the spatial and temporal components separately.

Two Stream Network with Spatial and Temporal streams

Video can naturally be decomposed into spatial and temporal components. Here the spatial stream performs action classification from still video frames, whilst the temporal stream is trained to recognise actions from motion in the form of dense optical flow. Optical flow isolates motion better than raw RGB, making it easier for the network to infer movements.
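As a rough illustration, dense optical flow between two consecutive frames can be computed with OpenCV. The Farneback method below is simply a readily available stand-in, not necessarily the flow algorithm the authors used:

```python
import cv2
import numpy as np

def dense_flow(prev_frame: np.ndarray, next_frame: np.ndarray) -> np.ndarray:
    """Return an [H, W, 2] array of horizontal/vertical pixel displacements."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    # Farneback dense optical flow (a common off-the-shelf choice).
    return cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0,
    )
```

The temporal stream's input stacks the horizontal and vertical flow components of several consecutive frame pairs, giving 2L input channels for L frame pairs.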

Decoupling the spatial and temporal nets also allows us to exploit the availability of large amounts of annotated image data by pre-training the spatial net on the ImageNet challenge dataset.

Optical flow

The model aims to learn structure from the Spatial Stream and movement from the Temporal Stream. The class scores of the two streams are then fused at the end (by averaging or with a simple linear classifier). This method significantly outperforms the Slow Fusion approach mentioned above.
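Putting the pieces together, here is a minimal sketch of the two-stream idea. The tiny placeholder CNNs are mine; the paper uses much deeper AlexNet/VGG-style backbones:

```python
import torch
import torch.nn as nn

def tiny_cnn(in_channels: int, num_classes: int) -> nn.Module:
    # Placeholder backbone; the paper's networks are far deeper.
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, 7, stride=2), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, num_classes),
    )

num_classes, L = 101, 10
spatial_stream = tiny_cnn(3, num_classes)        # a single RGB frame
temporal_stream = tiny_cnn(2 * L, num_classes)   # 2L stacked optical-flow channels

rgb = torch.randn(1, 3, 224, 224)
flow = torch.randn(1, 2 * L, 224, 224)
# Late fusion: average the class scores of the two streams.
scores = (spatial_stream(rgb).softmax(-1) + temporal_stream(flow).softmax(-1)) / 2
```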

C3D: Learning Spatiotemporal Features with 3D Convolutional Networks

Paper: Learning Spatiotemporal Features with 3D Convolutional Networks: https://arxiv.org/pdf/1412.0767.pdf

This is another very important but simple paper, which aims to replace the hand-engineered optical flow used previously with a 3D CNN. Optical flow methods are not perfect, and 3D CNNs can pick up more granular features given enough data and compute.

C3D Architecture

The architecture is short and simple. 3D convolutions are significantly more expensive than their 2D counterparts, and the authors simply stack 3D convolutions in a single stream, since 3D convolutions can pick up structure and motion concurrently. Quite a bit of data augmentation is used for robustness and generalisation.

The authors first experiment with a smaller 5-layer 3D CNN to determine the optimal temporal kernel size, which they find to be 3. They then train a larger 8-layer 3D CNN for the results below.
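A rough sketch of the C3D idea, a homogeneous stack of 3×3×3 convolutions interleaved with pooling, is shown below. The layer and channel counts are simplified and not the paper's exact 8-layer configuration:

```python
import torch
import torch.nn as nn

def c3d_block(in_ch: int, out_ch: int) -> nn.Sequential:
    # Homogeneous 3x3x3 kernels, the temporal kernel size the paper found optimal.
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool3d(kernel_size=2, stride=2),
    )

model = nn.Sequential(
    c3d_block(3, 64),
    c3d_block(64, 128),
    c3d_block(128, 256),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),
    nn.Linear(256, 101),  # e.g. 101 classes for UCF101
)

clip = torch.randn(1, 3, 16, 112, 112)  # a 16-frame clip at 112x112
print(model(clip).shape)  # torch.Size([1, 101])
```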

Results

In a post-hoc analysis, they find that their 3D convolutions learn temporal Gabor-like filters.

Temporal Gabor Filters

I3D: Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset

Paper: Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset: https://arxiv.org/abs/1705.07750

This paper introduced the famous Kinetics dataset for action recognition, summarises the research before it in the diagram below, and proposes a Two-Stream version of 3D CNNs (just when we thought two streams had been replaced by 3D CNNs), feeding optical flow to one pathway, using 3D CNNs in both pathways, and fusing their outputs at the end.

The different video architectures

This work used pretrained 2D convolutional models and converted them to 3D models by replicating the learned filters along the temporal dimension. They also find that including optical flow as an additional input helps: most optical flow methods are iterative and capture motion information that is difficult for a 3D CNN to fully learn on its own. Flow also provides readily usable motion information from the very start of training, helping the model learn early.
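A minimal sketch of the "inflation" trick, in my own simplified form: a pretrained 2D kernel is repeated along the new temporal axis and rescaled by 1/T, so that the response on a "boring" video of identical frames matches the original 2D network's response:

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_dim: int) -> nn.Conv3d:
    """Turn a pretrained 2D conv into a 3D conv by replicating weights in time."""
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(time_dim, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(time_dim // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    # Repeat the 2D kernel T times along the temporal axis and divide by T so a
    # video of repeated frames produces the same activations as the 2D network.
    weight_3d = conv2d.weight.data.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim
    conv3d.weight.data.copy_(weight_3d)
    if conv2d.bias is not None:
        conv3d.bias.data.copy_(conv2d.bias.data)
    return conv3d

conv2d = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)  # e.g. a pretrained stem
conv3d = inflate_conv2d(conv2d, time_dim=3)
```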

Results

As can be seen in their results, the ImageNet-pretrained Two-Stream I3D using optical flow obtains the best results (74.2).

Advances in Efficient Video Recognition

The above video models are quite heavy and require long training times. In this section we will cover some important papers focused on efficient processing of videos.

We will talk about three new approaches that largely aim at reducing the heavy computational cost of video models.

SlowFast Networks for Video Recognition

Paper: SlowFast Networks for Video Recognition: https://arxiv.org/pdf/1812.03982.pdf

SlowFast Networks take inspiration from P-cells and M-cells in the brain which are responsible for visual processing. The M-cells operate at high temporal frequency and are responsive to fast temporal changes, but not sensitive to spatial detail or color. P-cells provide fine spatial detail and color, but lower temporal resolution, responding slowly to stimuli.

This paper tries to replicate that behaviour with a two-stream network in which one stream mimics the M-cells and the other the P-cells: one operates at a high temporal rate to capture motion, while the other operates at a low temporal rate but with greater capacity to capture spatial detail.

SlowFast Network

The top, or slow, pathway subsamples frames at a low frame rate, so its frames are spread out in time. It uses only a fraction of the total input (1/8th of the frames). This way it is focused on capturing contextual information, or structure.

The lower, or fast, pathway processes all the frames but is made very lightweight, accounting for only ~20% of total computation despite its high temporal rate. This pathway is focused on determining motion.

The slow pathway has a lower frame rate and more parameters, encouraging it to learn structure thanks to its high capacity. The fast pathway has fewer parameters, encouraging it to learn motion via simpler Gabor-like filters.

SlowFast Network

You can see above that the slow pathway has non-temporal convolutions up to res4, so it is mostly focused on semantics, whereas the fast pathway applies temporal convolutions from res2 onwards but with far fewer channels, biasing it towards motion-like (Gabor) filters.
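A rough sketch of how the two pathways sample their inputs is below. The α and β values follow the paper's defaults, but the sampling code itself is my own simplification:

```python
import torch

alpha = 8      # the fast pathway sees alpha x more frames than the slow pathway
beta = 1 / 8   # the fast pathway's backbone uses beta x as many channels (not shown here)

def slowfast_inputs(clip: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """clip: [B, C, T, H, W]. Returns (slow_input, fast_input)."""
    fast = clip                 # high frame rate, processed by a thin (low-channel) network
    slow = clip[:, :, ::alpha]  # every alpha-th frame, processed by a wide network
    return slow, fast

clip = torch.randn(1, 3, 64, 224, 224)
slow, fast = slowfast_inputs(clip)
print(slow.shape, fast.shape)  # [1, 3, 8, 224, 224] and [1, 3, 64, 224, 224]
```

In the full model, features from the fast pathway are also fused into the slow pathway through lateral connections at several stages.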

They also extend the approach to other tasks such as action detection in videos.

SlowFast for Action Detection

This model beats previous methods at a much lower FLOP count thanks to its carefully chosen pathway and filter sizes.

Results

X3D: Expanding Architectures for Efficient Video Recognition

Paper: X3D: Expanding Architectures for Efficient Video Recognition: https://arxiv.org/abs/2004.04730

Perhaps one of my favorite papers of all time, X3D aims to determine exactly how large a model needs to be for efficient video recognition. This work was done by a single author, Christoph Feichtenhofer.

Previously, model size choices such as layer count and filter count were based on heuristics, and it was unclear how many parameters are required to reach a given level of accuracy.

There are various dimensions that affect computation, such as input spatial resolution, temporal resolution, number of layers, number of filters, and bottleneck dimension. These factors appear as the expansion operations on the right below:

Computation Breakdown in a Video Model

X3D takes an iterative approach to finding the best model:

1) Train a tiny base model to convergence

2) For each of the 6 dimensions of computation, increase it so as to roughly double the computation, producing 6 different candidate models

3) From the set of 6 new models, pick the one with the highest accuracy, discard the rest, permanently adopt that expansion in the base model, and repeat from step 1)

In this way we know approximately which dimension of compute is most worth increasing, and we progressively grow the model with the best accuracy trade-off at each level of available compute.
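In pseudocode, the greedy expansion loop looks roughly like this. The `expand`, `compute_cost`, and `train_and_evaluate` helpers are placeholders for the paper's actual expansion operations, FLOP accounting, and training setup, so treat this as a sketch rather than the exact algorithm:

```python
# Hypothetical sketch of X3D's greedy, one-axis-at-a-time expansion.
AXES = ["temporal_duration", "frame_rate", "spatial_resolution",
        "width", "bottleneck_width", "depth"]

def expand_model(base_config, target_cost, expand, compute_cost, train_and_evaluate):
    config = base_config
    while compute_cost(config) < target_cost:
        # Expand each axis individually so that compute roughly doubles.
        candidates = [expand(config, axis) for axis in AXES]
        # Train each candidate and keep only the most accurate one; discard the rest.
        config = max(candidates, key=train_and_evaluate)
    return config
```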

This sounds like it would take a lot of training time, but because we start with a very small base model, the search completes after training only 30 tiny models that cumulatively require over 25 times fewer multiply-add operations for training than one previous large state-of-the-art network.

Model Capacity vs Accuracy

As seen above, each point is a new model obtained as we double model capacity using the steps outlined. After 10 GFLOPs the gains slow down, and we can clearly see how accuracy scales with compute.

Comparison with other video models at different compute levels

As we see in the comparison above, X3D beats SlowFast in accuracy with less than half the TFLOPs! Refer to the paper for more details. This is a really interesting way to optimise the compute-accuracy trade-off.

A Multigrid Method for Efficiently Training Video Models

Paper: A Multigrid Method for Efficiently Training Video Models: https://arxiv.org/pdf/1912.00998.pdf

Results first!

Speed!

How can we train video models faster without tweaking architecture? Is there a more efficient way to train our models?

Image vs Video Models training time

Video models are typically trained using a fixed mini-batch shape, which includes a specific number of video clips, frames, and spatial dimensions. This fixed shape is chosen based on heuristics to balance accuracy and training speed. The choice of mini-batch shape involves trade-offs. Higher spatial resolutions can improve accuracy but slow down training, while lower resolutions speed up training but reduce accuracy.

Training at low spatial resolutions can speed up training drastically, since we can use larger batch sizes and higher learning rates; however, the final accuracy is capped. Can we get the benefits of quick training at low resolution together with the added accuracy of training at high resolution?

For this, the authors propose variable mini-batch shapes with different spatio-temporal resolutions. These shapes are determined by resampling the training data on multiple grids, so the model being trained benefits from seeing different spatial and temporal resolutions.

Sampling a grid differently

The authors propose multiple strategies for alternating between low and high spatial resolution. When using a low resolution they can train with a larger batch size, which is what the y-axis below represents:

The long cycle follows a coarse-to-fine strategy in which the model sees progressively higher input resolutions, so it trains very quickly at the start and is then refined at a higher resolution to reach maximum accuracy. The short cycle mixes multiple shapes over a few iterations, while the long+short cycle combines both.
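A toy sketch of the long-cycle idea is below. The shapes are illustrative rather than the paper's exact schedule; the key point is that the batch size is scaled up whenever the spatio-temporal resolution is scaled down, so each mini-batch costs roughly the same number of FLOPs:

```python
# Hypothetical long-cycle schedule: (batch_multiplier, num_frames, spatial_size).
# Smaller clips allow proportionally larger batches at roughly constant cost.
long_cycle = [
    (8, 8, 158),   # coarse: few frames, low resolution, large batches
    (4, 16, 158),
    (2, 16, 224),
    (1, 32, 224),  # fine: full shape used for the final phase and evaluation
]

base_batch_size = 8
num_epochs = 40
for epoch in range(num_epochs):
    multiplier, num_frames, size = long_cycle[epoch * len(long_cycle) // num_epochs]
    batch_size = base_batch_size * multiplier
    # ...sample clips of shape [batch_size, 3, num_frames, size, size] and train...
```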

As can be seen below, training with this method reaches close to the maximum accuracy around 3 times faster than regular training.

Speed!

Conclusion:

In this blog we covered early prominent approaches such as single-stream and two-stream networks, C3D, and I3D. We then explored advances in efficient video recognition, with a spotlight on pioneering methods such as SlowFast Networks, X3D's architecture-expansion strategy, and the Multigrid training method for faster convergence. The last two works are particularly impactful in their approach and provide methods that can be applied in other domains of learning as well.

Hope you found this useful!
