Andrew Zisserman

University of Oxford

H-index: 194

Europe-United Kingdom

About Andrew Zisserman

Andrew Zisserman is a distinguished researcher at the University of Oxford specializing in Computer Vision and Machine Learning, with an exceptional h-index of 194 overall and a recent h-index of 120 (since 2020).

His recent articles reflect a diverse array of research interests and contributions to the field:

Action classification in video clips using attention-based neural networks

FlexCap: Generating Rich, Localized, and Flexible Captions in Images

Parallel video processing systems

Look, Listen and Recognise: Character-Aware Audio-Visual Subtitling

AutoAD III: The Prequel--Back to the Pixels

N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields

The Manga Whisperer: Automatically Generating Transcriptions for Comics

Moving Object Segmentation: All You Need Is SAM (and Flow)

Andrew Zisserman Information

University

University of Oxford

Position

___

Citations(all)

396658

Citations(since 2020)

228604

Cited By

253692

hIndex(all)

194

hIndex(since 2020)

120

i10Index(all)

632

i10Index(since 2020)

445

Email

University Profile Page

University of Oxford

Andrew Zisserman Skills & Research Interests

Computer Vision

Machine Learning

Top articles of Andrew Zisserman

Action classification in video clips using attention-based neural networks

Published Date

2024/1/25

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for classifying actions in a video. One of the methods includes obtaining a feature representation of a video clip; obtaining data specifying a plurality of candidate agent bounding boxes in a key video frame of the video clip; and, for each candidate agent bounding box, processing the feature representation through an action transformer neural network.

FlexCap: Generating Rich, Localized, and Flexible Captions in Images

Authors

Debidatta Dwibedi,Vidhi Jain,Jonathan Tompson,Andrew Zisserman,Yusuf Aytar

Journal

arXiv preprint arXiv:2403.12026

Published Date

2024/3/18

We introduce a versatile vision-language model (VLM) capable of generating region-specific descriptions of varying lengths. The model, FlexCap, is trained to produce length-conditioned captions for input bounding boxes, which allows control over the information density of its output, with descriptions ranging from concise object labels to detailed captions. To achieve this, we create large-scale training datasets of image region descriptions of varying length, starting from captioned images. This flexible-captioning capability has several valuable applications. First, FlexCap demonstrates superior performance in dense captioning tasks on the Visual Genome dataset. Second, a visual question answering (VQA) system can be built by employing FlexCap to generate localized descriptions as inputs to a large language model. The resulting system achieves state-of-the-art zero-shot performance on a number of VQA datasets. We also demonstrate that an approach based on FlexCap can be better at open-ended object detection than approaches based on other VLMs. We highlight a novel characteristic of FlexCap, which is its ability to extract diverse visual information through prefix conditioning. Finally, we qualitatively demonstrate FlexCap's broad applicability in tasks such as image labeling, object attribute recognition, and visual dialog. Project webpage: https://flex-cap.github.io
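
The length-conditioning idea can be pictured with a small sketch: the bounding box and a desired caption length are embedded as prefix tokens for a caption decoder. This is an illustrative PyTorch sketch, not the FlexCap implementation; the module name, dimensions and prefix layout are assumptions.

```python
# Illustrative sketch of length-conditioned, box-conditioned captioning prefixes.
# Not the FlexCap code: module name, sizes and prefix layout are assumptions.
import torch
import torch.nn as nn

class LengthConditionedPrefix(nn.Module):
    def __init__(self, d_model=512, max_len_tokens=64):
        super().__init__()
        self.box_proj = nn.Linear(4, d_model)                    # (x1, y1, x2, y2), normalised to [0, 1]
        self.len_embed = nn.Embedding(max_len_tokens, d_model)   # desired caption length in tokens

    def forward(self, boxes, target_lengths):
        # boxes: (B, 4) floats in [0, 1]; target_lengths: (B,) ints
        box_tok = self.box_proj(boxes).unsqueeze(1)              # (B, 1, D)
        len_tok = self.len_embed(target_lengths).unsqueeze(1)    # (B, 1, D)
        return torch.cat([box_tok, len_tok], dim=1)              # prefix fed to a caption decoder

prefix = LengthConditionedPrefix()
out = prefix(torch.tensor([[0.1, 0.2, 0.5, 0.8]]), torch.tensor([5]))
print(out.shape)  # torch.Size([1, 2, 512])
```

Varying the length token at inference is what would trade concise labels for detailed captions.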

Parallel video processing systems

Published Date

2023/6/15

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for parallel processing of video frames using neural networks. One of the methods includes receiving a video sequence comprising a respective video frame at each of a plurality of time steps; and processing the video sequence using a video processing neural network to generate a video processing output for the video sequence, wherein the video processing neural network includes a sequence of network components, wherein the network components comprise a plurality of layer blocks each comprising one or more neural network layers, wherein each component is active for a respective subset of the plurality of time steps, and wherein each layer block is configured to, at each time step at which the layer block is active, receive an input generated at a previous time step and to process the input to generate a …

Look, Listen and Recognise: Character-Aware Audio-Visual Subtitling

Authors

Bruno Korbar,Jaesung Huh,Andrew Zisserman

Journal

arXiv preprint arXiv:2401.12039

Published Date

2024/1/22

The goal of this paper is automatic character-aware subtitle generation. Given a video and a minimal amount of metadata, we propose an audio-visual method that generates a full transcript of the dialogue, with precise speech timestamps, and the character speaking identified. The key idea is to first use audio-visual cues to select a set of high-precision audio exemplars for each character, and then use these exemplars to classify all speech segments by speaker identity. Notably, the method does not require face detection or tracking. We evaluate the method over a variety of TV sitcoms, including Seinfeld, Frasier and Scrubs. We envision this system being useful for the automatic generation of subtitles to improve the accessibility of the vast amount of video available on modern streaming services. Project page: https://www.robots.ox.ac.uk/~vgg/research/look-listen-recognise/
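
The exemplar-then-classify step can be illustrated with a minimal nearest-centroid sketch over speaker embeddings; the helper name, embedding size and character names below are hypothetical, not the authors' code.

```python
# Hypothetical sketch: label each speech segment by cosine similarity to
# per-character exemplar centroids. Embedding size and names are placeholders.
import torch
import torch.nn.functional as F

def classify_segments(segment_embs, exemplars):
    # segment_embs: (N, D); exemplars: dict character -> (K, D) exemplar embeddings
    names = list(exemplars)
    centroids = torch.stack([F.normalize(e, dim=-1).mean(0) for e in exemplars.values()])
    sims = F.normalize(segment_embs, dim=-1) @ F.normalize(centroids, dim=-1).T  # (N, C)
    return [names[int(i)] for i in sims.argmax(dim=1)]

exemplars = {"Jerry": torch.randn(3, 192), "Elaine": torch.randn(3, 192)}  # toy embeddings
print(classify_segments(torch.randn(5, 192), exemplars))
```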

AutoAD III: The Prequel--Back to the Pixels

Authors

Tengda Han,Max Bain,Arsha Nagrani,Gül Varol,Weidi Xie,Andrew Zisserman

Journal

Conference on Computer Vision and Pattern Recognition (CVPR 2024)

Published Date

2024/4/22

Generating Audio Description (AD) for movies is a challenging task that requires fine-grained visual understanding and an awareness of the characters and their names. Currently, visual language models for AD generation are limited by a lack of suitable training data, and also their evaluation is hampered by using performance measures not specialized to the AD domain. In this paper, we make three contributions: (i) We propose two approaches for constructing AD datasets with aligned video data, and build training and evaluation datasets using these. These datasets will be publicly released; (ii) We develop a Q-former-based architecture which ingests raw video and generates AD, using frozen pre-trained visual encoders and large language models; and (iii) We provide new evaluation metrics to benchmark AD quality that are well-matched to human performance. Taken together, we improve the state of the art on AD generation.

N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields

Authors

Yash Bhalgat,Iro Laina,João F Henriques,Andrew Zisserman,Andrea Vedaldi

Journal

arXiv preprint arXiv:2403.10997

Published Date

2024/3/16

Understanding complex scenes at multiple levels of abstraction remains a formidable challenge in computer vision. To address this, we introduce Nested Neural Feature Fields (N2F2), a novel approach that employs hierarchical supervision to learn a single feature field, wherein different dimensions within the same high-dimensional feature encode scene properties at varying granularities. Our method allows for a flexible definition of hierarchies, tailored to either the physical dimensions or semantics or both, thereby enabling a comprehensive and nuanced understanding of scenes. We leverage a 2D class-agnostic segmentation model to provide semantically meaningful pixel groupings at arbitrary scales in the image space, and query the CLIP vision-encoder to obtain language-aligned embeddings for each of these segments. Our proposed hierarchical supervision method then assigns different nested dimensions of the feature field to distill the CLIP embeddings using deferred volumetric rendering at varying physical scales, creating a coarse-to-fine representation. Extensive experiments show that our approach outperforms the state-of-the-art feature field distillation methods on tasks such as open-vocabulary 3D segmentation and localization, demonstrating the effectiveness of the learned nested feature field.

The Manga Whisperer: Automatically Generating Transcriptions for Comics

Authors

Ragav Sachdeva,Andrew Zisserman

Journal

arXiv preprint arXiv:2401.10224

Published Date

2024/1/18

In the past few decades, Japanese comics, commonly referred to as Manga, have transcended both cultural and linguistic boundaries to become a true worldwide sensation. Yet, the inherent reliance on visual cues and illustration within manga renders it largely inaccessible to individuals with visual impairments. In this work, we seek to address this substantial barrier, with the aim of ensuring that manga can be appreciated and actively engaged with by everyone. Specifically, we tackle the problem of diarisation, i.e. generating a transcription of who said what and when, in a fully automatic way. To this end, we make the following contributions: (1) we present a unified model, Magi, that is able to (a) detect panels, text boxes and character boxes, (b) cluster characters by identity (without knowing the number of clusters a priori), and (c) associate dialogues to their speakers; (2) we propose a novel approach that is able to sort the detected text boxes in their reading order and generate a dialogue transcript; (3) we annotate an evaluation benchmark for this task using publicly available [English] manga pages. The code, evaluation datasets and the pre-trained model can be found at: https://github.com/ragavsachdeva/magi

Moving Object Segmentation: All You Need Is SAM (and Flow)

Authors

Junyu Xie,Charig Yang,Weidi Xie,Andrew Zisserman

Journal

arXiv preprint arXiv:2404.12389

Published Date

2024/4/18

The objective of this paper is motion segmentation -- discovering and segmenting the moving objects in a video. This is a much studied area with numerous careful, and sometimes complex, approaches and training schemes, including self-supervised learning, learning from synthetic datasets, object-centric representations, amodal representations, and many more. Our interest in this paper is to determine if the Segment Anything model (SAM) can contribute to this task. We investigate two models for combining SAM with optical flow that harness the segmentation power of SAM with the ability of flow to discover and group moving objects. In the first model, we adapt SAM to take optical flow, rather than RGB, as an input. In the second, SAM takes RGB as an input, and flow is used as a segmentation prompt. These surprisingly simple methods, without any further modifications, outperform all previous approaches by a considerable margin in both single and multi-object benchmarks. We also extend these frame-level segmentations to sequence-level segmentations that maintain object identity. Again, this simple model outperforms previous methods on multiple video object segmentation benchmarks.
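
The second variant (flow as a segmentation prompt) can be sketched by sampling point prompts from pixels with large flow magnitude and handing them to any SAM-style promptable segmenter; `flow_to_point_prompts` and the commented `predictor` call below are placeholders, not the paper's pipeline.

```python
# Hedged sketch: derive point prompts from optical flow magnitude for a
# SAM-style promptable segmenter. Function name and thresholds are assumptions.
import numpy as np

def flow_to_point_prompts(flow, num_points=5, quantile=0.98):
    # flow: (H, W, 2) optical flow; returns (num_points, 2) prompts as (x, y)
    mag = np.linalg.norm(flow, axis=-1)
    ys, xs = np.where(mag >= np.quantile(mag, quantile))     # fast-moving pixels
    idx = np.random.choice(len(xs), size=min(num_points, len(xs)), replace=False)
    return np.stack([xs[idx], ys[idx]], axis=1)

flow = np.random.randn(240, 320, 2).astype(np.float32)       # toy flow field
points = flow_to_point_prompts(flow)
labels = np.ones(len(points))                                 # all positive prompts
# masks, scores, _ = predictor.predict(point_coords=points, point_labels=labels)
print(points.shape)
```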

No representation rules them all in category discovery

Authors

Sagar Vaze,Andrea Vedaldi,Andrew Zisserman

Journal

Advances in Neural Information Processing Systems

Published Date

2024/2/13

In this paper we tackle the problem of Generalized Category Discovery (GCD). Specifically, given a dataset with labelled and unlabelled images, the task is to cluster all images in the unlabelled subset, whether or not they belong to the labelled categories. Our first contribution is to recognise that most existing GCD benchmarks only contain labels for a single clustering of the data, making it difficult to ascertain whether models are leveraging the available labels to solve the GCD task, or simply solving an unsupervised clustering problem. As such, we present a synthetic dataset, named 'Clevr-4', for category discovery. Clevr-4 contains four equally valid partitions of the data, i.e. based on object 'shape', 'texture', 'color' or 'count'. To solve the task, models are required to extrapolate the taxonomy specified by the labelled set, rather than simply latch onto a single natural grouping of the data. We use this dataset to demonstrate the limitations of unsupervised clustering in the GCD setting, showing that even very strong unsupervised models fail on Clevr-4. We further use Clevr-4 to examine the weaknesses of existing GCD algorithms, and propose a new method which addresses these shortcomings, leveraging consistent findings from the representation learning literature to do so. Our simple solution, which is based on 'Mean Teachers' and termed μGCD, substantially outperforms implemented baselines on Clevr-4. Finally, when we transfer these findings to real data on the challenging Semantic Shift Benchmark suite, we find that μGCD outperforms all prior work, setting a new state-of-the-art.

A SOUND APPROACH: Using Large Language Models to generate audio descriptions for egocentric text-audio retrieval

Authors

Andreea-Maria Oncescu,João F Henriques,Andrew Zisserman,Samuel Albanie,A Sophia Koepke

Published Date

2024/4/14

Video databases from the internet are a valuable source of text-audio retrieval datasets. However, given that sound and vision streams represent different "views" of the data, treating visual descriptions as audio descriptions is far from optimal. Even if audio class labels are present, they commonly are not very detailed, making them unsuited for text-audio retrieval. To exploit relevant audio information from video-text datasets, we introduce a methodology for generating audio-centric descriptions using Large Language Models (LLMs). In this work, we consider the egocentric video setting and propose three new text-audio retrieval benchmarks based on the EpicMIR and EgoMCQ tasks, and on the EpicSounds dataset. Our approach for obtaining audio-centric descriptions gives significantly higher zero-shot performance than using the original visual-centric descriptions. Furthermore, we show that using the same …

BootsTAP: Bootstrapped Training for Tracking-Any-Point

Authors

Carl Doersch,Yi Yang,Dilara Gokay,Pauline Luc,Skanda Koppula,Ankush Gupta,Joseph Heyward,Ross Goroshin,João Carreira,Andrew Zisserman

Journal

arXiv preprint arXiv:2402.00847

Published Date

2024/2/1

To endow models with greater understanding of physics and motion, it is useful to enable them to perceive how solid surfaces move and deform in real scenes. This can be formalized as Tracking-Any-Point (TAP), which requires the algorithm to be able to track any point corresponding to a solid surface in a video, potentially densely in space and time. Large-scale ground-truth training data for TAP is only available in simulation, which currently has limited variety of objects and motion. In this work, we demonstrate how large-scale, unlabeled, uncurated real-world data can improve a TAP model with minimal architectural changes, using a self-supervised student-teacher setup. We demonstrate state-of-the-art performance on the TAP-Vid benchmark surpassing previous results by a wide margin: for example, TAP-Vid-DAVIS performance improves from 61.3% to 66.4%, and TAP-Vid-Kinetics from 57.2% to 61.5%.
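
The student-teacher setup mentioned above is, in generic form, an exponential-moving-average (EMA) teacher that supplies pseudo-labels on unlabeled data while the student trains against them. The sketch below shows only that generic pattern with a stand-in model; it is not the BootsTAP architecture or loss.

```python
# Generic EMA student-teacher sketch (stand-in model, not BootsTAP itself).
import copy
import torch
import torch.nn as nn

student = nn.Linear(128, 2)                       # placeholder for a point-tracking model
teacher = copy.deepcopy(student).requires_grad_(False)

@torch.no_grad()
def ema_update(student, teacher, momentum=0.999):
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)

opt = torch.optim.Adam(student.parameters(), lr=1e-4)
feats = torch.randn(32, 128)                      # unlabeled real-world features (toy)
pseudo = teacher(feats)                           # teacher predictions used as targets
loss = nn.functional.mse_loss(student(feats), pseudo)
loss.backward()
opt.step()
ema_update(student, teacher)
print(float(loss))
```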

TIM: A Time Interval Machine for Audio-Visual Action Recognition

Authors

Jacob Chalk,Jaesung Huh,Evangelos Kazakos,Andrew Zisserman,Dima Damen

Journal

arXiv preprint arXiv:2404.05559

Published Date

2024/4/8

Diverse actions give rise to rich audio-visual signals in long videos. Recent works showcase that the two modalities of audio and video exhibit different temporal extents of events and distinct labels. We address the interplay between the two modalities in long videos by explicitly modelling the temporal extents of audio and visual events. We propose the Time Interval Machine (TIM) where a modality-specific time interval poses as a query to a transformer encoder that ingests a long video input. The encoder then attends to the specified interval, as well as the surrounding context in both modalities, in order to recognise the ongoing action. We test TIM on three long audio-visual video datasets: EPIC-KITCHENS, Perception Test, and AVE, reporting state-of-the-art (SOTA) for recognition. On EPIC-KITCHENS, we beat previous SOTA that utilises LLMs and significantly larger pre-training by 2.9% top-1 action recognition accuracy. Additionally, we show that TIM can be adapted for action detection, using dense multi-scale interval queries, outperforming SOTA on EPIC-KITCHENS-100 for most metrics, and showing strong performance on the Perception Test. Our ablations show the critical role of integrating the two modalities and modelling their time intervals in achieving this performance. Code and models at: https://github.com/JacobChalk/TIM
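
The interval-as-query idea can be illustrated with a minimal sketch: the interval bounds and modality are embedded as a query token, the encoder attends over the query plus the long-video context, and the output at the query position is classified. Dimensions, layer counts and the class count are placeholders, not the released TIM code.

```python
# Minimal sketch of a time-interval query to a transformer encoder (placeholder sizes).
import torch
import torch.nn as nn

class IntervalQueryClassifier(nn.Module):
    def __init__(self, d=256, n_classes=100):
        super().__init__()
        self.interval_mlp = nn.Sequential(nn.Linear(3, d), nn.ReLU(), nn.Linear(d, d))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(d, n_classes)

    def forward(self, context, start, end, modality):
        # context: (B, T, d) audio+visual features; start/end in [0, 1]; modality: 0=audio, 1=visual
        q = self.interval_mlp(torch.stack([start, end, modality], dim=-1)).unsqueeze(1)
        out = self.encoder(torch.cat([q, context], dim=1))
        return self.head(out[:, 0])                # logits for the queried interval

model = IntervalQueryClassifier()
logits = model(torch.randn(2, 50, 256), torch.tensor([0.1, 0.3]),
               torch.tensor([0.2, 0.7]), torch.tensor([0.0, 1.0]))
print(logits.shape)  # torch.Size([2, 100])
```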

Synchformer: Efficient Synchronization from Sparse Cues

Authors

Vladimir Iashin,Weidi Xie,Esa Rahtu,Andrew Zisserman

Journal

arXiv preprint arXiv:2401.16423

Published Date

2024/1/29

Our objective is audio-visual synchronization with a focus on 'in-the-wild' videos, such as those on YouTube, where synchronization cues can be sparse. Our contributions include a novel audio-visual synchronization model, and training that decouples feature extraction from synchronization modelling through multi-modal segment-level contrastive pre-training. This approach achieves state-of-the-art performance in both dense and sparse settings. We also extend synchronization model training to AudioSet, a million-scale 'in-the-wild' dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.
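
The segment-level contrastive pre-training mentioned above is, in its generic form, a symmetric InfoNCE loss between temporally aligned audio and visual segment embeddings; the sketch below shows only that generic loss, with placeholder dimensions and temperature.

```python
# Generic symmetric InfoNCE between aligned audio/visual segment embeddings (toy sizes).
import torch
import torch.nn.functional as F

def info_nce(audio_emb, visual_emb, temperature=0.07):
    # audio_emb, visual_emb: (N, D) embeddings of temporally aligned segments
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.T / temperature                 # (N, N) similarity matrix
    targets = torch.arange(len(a))                 # aligned pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

print(float(info_nce(torch.randn(16, 512), torch.randn(16, 512))))
```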

Made to Order: Discovering monotonic temporal changes via self-supervised video ordering

Authors

Charig Yang,Weidi Xie,Andrew Zisserman

Journal

arXiv preprint arXiv:2404.16828

Published Date

2024/4/25

Our objective is to discover and localize monotonic temporal changes in a sequence of images. To achieve this, we exploit a simple proxy task of ordering a shuffled image sequence, with 'time' serving as a supervisory signal since only changes that are monotonic with time can give rise to the correct ordering. We also introduce a flexible transformer-based model for general-purpose ordering of image sequences of arbitrary length with built-in attribution maps. After training, the model successfully discovers and localizes monotonic changes while ignoring cyclic and stochastic ones. We demonstrate applications of the model in multiple video settings covering different scene and object types, discovering both object-level and environmental changes in unseen sequences. We also demonstrate that the attention-based attribution maps function as effective prompts for segmenting the changing regions, and that the learned representations can be used for downstream applications. Finally, we show that the model achieves the state of the art on standard benchmarks for ordering a set of images.
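
A toy version of the ordering proxy task fits in a few lines: shuffle a sequence of frame features, predict each frame's original position, and train with cross-entropy, so only changes that are monotonic with time can be recovered. The model and feature sizes below are placeholders, not the paper's transformer.

```python
# Toy sketch of self-supervised ordering: predict each shuffled frame's original index.
import torch
import torch.nn as nn

T, D = 8, 64
frames = torch.cumsum(torch.rand(1, T, D), dim=1)        # toy monotonic signal over time
perm = torch.randperm(T)
shuffled = frames[:, perm]

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True), num_layers=2)
position_head = nn.Linear(D, T)                          # logits over original positions

logits = position_head(encoder(shuffled))                # (1, T, T)
targets = perm.unsqueeze(0)                              # true original index of each shuffled frame
loss = nn.functional.cross_entropy(logits.reshape(-1, T), targets.reshape(-1))
print(float(loss))
```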

Text-Conditioned Resampler For Long Form Video Understanding

Authors

Bruno Korbar,Yongqin Xian,Alessio Tonioni,Andrew Zisserman,Federico Tombari

Journal

arXiv preprint arXiv:2312.11897

Published Date

2023/12/19

Videos are a highly redundant data source, and it is often enough to identify a few key moments to solve any given task. In this paper, we present a text-conditioned video resampler (TCR) module that uses a pre-trained and frozen visual encoder and large language model (LLM) to process long video sequences for a task. TCR localises relevant visual features from the video given a text condition and provides them to an LLM to generate a text response. Due to its lightweight design and use of cross-attention, TCR can process more than 100 frames at a time, allowing the model to use much longer chunks of video than earlier works. We make the following contributions: (i) we design a transformer-based sampling architecture that can process long videos conditioned on a task, together with a training method that enables it to bridge pre-trained visual and language models; (ii) we empirically validate its efficacy on a wide variety of evaluation tasks, and set a new state-of-the-art on NextQA, EgoSchema, and the EGO4D-LTA challenge; and (iii) we determine tasks which require longer video contexts and that can thus be used effectively for further evaluation of long-range video models.
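
The core of a text-conditioned resampler can be sketched as a small set of learned queries, shifted by a text condition, that cross-attend to many frame features and return a fixed number of tokens for the LLM. Shapes, layer counts and the conditioning scheme below are assumptions, not the released TCR module.

```python
# Sketch of a text-conditioned cross-attention resampler (assumed shapes, single layer).
import torch
import torch.nn as nn

class CrossAttentionResampler(nn.Module):
    def __init__(self, d=512, num_queries=64, nhead=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d) * 0.02)
        self.text_proj = nn.Linear(d, d)
        self.attn = nn.MultiheadAttention(d, nhead, batch_first=True)

    def forward(self, video_feats, text_feat):
        # video_feats: (B, T, d) frame features; text_feat: (B, d) task/text condition
        B = video_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1) + self.text_proj(text_feat).unsqueeze(1)
        out, _ = self.attn(q, video_feats, video_feats)   # (B, num_queries, d) tokens for the LLM
        return out

resampler = CrossAttentionResampler()
print(resampler(torch.randn(2, 120, 512), torch.randn(2, 512)).shape)  # torch.Size([2, 64, 512])
```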

Tapir: Tracking any point with per-frame initialization and temporal refinement

Authors

Carl Doersch,Yi Yang,Mel Vecerik,Dilara Gokay,Ankush Gupta,Yusuf Aytar,Joao Carreira,Andrew Zisserman

Published Date

2023

We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence. Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations. The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximately 20% absolute average Jaccard (AJ) improvement on DAVIS. Our model facilitates fast inference on long and high-resolution video sequences. On a modern GPU, our implementation has the capacity to track points faster than real-time. Given the high-quality trajectories extracted from a large dataset, we demonstrate a proof-of-concept diffusion model which generates trajectories from static images, enabling plausible animations. Visualizations, source code, and pretrained models can be found at https://deepmind-tapir.github.io.
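
The matching stage can be illustrated with a simple cost-volume argmax over per-frame feature maps (the local refinement stage is omitted); this is a hedged sketch with toy shapes, not the TAPIR implementation.

```python
# Sketch of the matching stage only: per-frame argmax of a dot-product cost volume.
import torch
import torch.nn.functional as F

def match_query(query_feat, frame_feats):
    # query_feat: (D,) feature at the query point; frame_feats: (T, D, H, W) per-frame features
    T, D, H, W = frame_feats.shape
    cost = torch.einsum("d,tdhw->thw", F.normalize(query_feat, dim=0),
                        F.normalize(frame_feats, dim=1))              # (T, H, W) similarities
    flat = cost.view(T, -1).argmax(dim=1)
    return torch.stack([flat % W, flat // W], dim=1)                  # (T, 2) candidates as (x, y)

tracks = match_query(torch.randn(128), torch.randn(24, 128, 32, 32))
print(tracks.shape)  # per-frame candidate positions, to be refined locally in stage (2)
```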

AutoAD: Movie description in context

Authors

Tengda Han,Max Bain,Arsha Nagrani,Gül Varol,Weidi Xie,Andrew Zisserman

Published Date

2023

The objective of this paper is an automatic Audio Description (AD) model that ingests movies and outputs AD in text form. Generating high-quality movie AD is challenging due to the dependency of the descriptions on context, and the limited amount of training data available. In this work, we leverage the power of pretrained foundation models, such as GPT and CLIP, and only train a mapping network that bridges the two models for visually-conditioned text generation. In order to obtain high-quality AD, we make the following four contributions: (i) we incorporate context from the movie clip, AD from previous clips, as well as the subtitles; (ii) we address the lack of training data by pretraining on large-scale datasets, where visual or contextual information is unavailable, e.g. text-only AD without movies or visual captioning datasets without context; (iii) we improve on the currently available AD datasets, by removing label noise in the MAD dataset, and adding character naming information; and (iv) we obtain strong results on the movie AD task compared with previous methods.
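
The "mapping network" idea, in its generic ClipCap-style form, is a small MLP that turns frozen CLIP features into prefix embeddings for a frozen language model; the sketch below shows that generic pattern with assumed sizes, not the paper's exact architecture.

```python
# Generic prefix-mapping sketch: frozen CLIP features -> prefix embeddings for a frozen LM.
import torch
import torch.nn as nn

class PrefixMapper(nn.Module):
    def __init__(self, clip_dim=512, lm_dim=768, prefix_len=10):
        super().__init__()
        self.prefix_len = prefix_len
        hidden = lm_dim * prefix_len // 2
        self.mlp = nn.Sequential(nn.Linear(clip_dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, lm_dim * prefix_len))

    def forward(self, clip_feats):
        # clip_feats: (B, clip_dim) frozen visual/contextual features of the movie clip
        B = clip_feats.size(0)
        return self.mlp(clip_feats).view(B, self.prefix_len, -1)   # (B, prefix_len, lm_dim)

mapper = PrefixMapper()
print(mapper(torch.randn(4, 512)).shape)  # torch.Size([4, 10, 768]), prepended to LM token embeddings
```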

Bounding an archiving: assessing the relative completeness of the Jacques Toussele archive using pattern-matching and face-recognition

Authors

David Zeitlyn,Ernesto Coto,Andrew Zisserman

Journal

Visual Studies

Published Date

2023/8/8

Archival research is haunted by the question of how complete the archival fonds being consulted are. In this article, we describe how we have used a combination of pattern-matching and face-recognition to evaluate the completeness of the Jacques Toussele photographic archive, which was established as part of the Endangered Archives Programme at the British Library. A random set of scanned prints from the archive was matched with originating negatives also in the archive, suggesting a survival rate of only 30%. Separately, an envelope of negatives from a single event in 1982 was analysed, looking at frame numbers from the surviving negatives. In this case, the survival rate was as high as 70%. Combinations of face-recognition and pattern-matching, for example of fabric patterns or parts of backdrops, allow us to set some limits to the relative completeness or exhaustiveness of an otherwise relatively …
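
The sampling logic behind such estimates can be made explicit with a back-of-the-envelope calculation: check a random sample of prints for a surviving negative and report the observed proportion with a simple normal-approximation confidence interval. The counts below are hypothetical, chosen only to be consistent with the reported 30% rate.

```python
# Back-of-the-envelope survival-rate estimate with a normal-approximation 95% CI.
# The counts are hypothetical; the article does not state the exact sample size here.
import math

def survival_estimate(matched, sampled, z=1.96):
    p = matched / sampled
    half = z * math.sqrt(p * (1 - p) / sampled)
    return p, (max(0.0, p - half), min(1.0, p + half))

rate, ci = survival_estimate(matched=30, sampled=100)
print(f"estimated survival rate {rate:.0%}, 95% CI {ci[0]:.0%}-{ci[1]:.0%}")
```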

What Does Stable Diffusion Know about the 3D Scene?

Authors

Guanqi Zhan,Chuanxia Zheng,Weidi Xie,Andrew Zisserman

Journal

arXiv preprint arXiv:2310.06836

Published Date

2023/10/10

Recent advances in generative models like Stable Diffusion enable the generation of highly photo-realistic images. Our objective in this paper is to probe the diffusion network to determine to what extent it 'understands' different properties of the 3D scene depicted in an image. To this end, we make the following contributions: (i) We introduce a protocol to evaluate whether a network models a number of physical 'properties' of the 3D scene by probing for explicit features that represent these properties. The probes are applied on datasets of real images with annotations for the property. (ii) We apply this protocol to properties covering scene geometry, scene material, support relations, lighting, and view dependent measures. (iii) We find that Stable Diffusion is good at a number of properties including scene geometry, support relations, shadows and depth, but less performant for occlusion. (iv) We also apply the probes to other models trained at large-scale, including DINO and CLIP, and find their performance inferior to that of Stable Diffusion.
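
The probing protocol is, in outline, a frozen feature extractor plus a small linear probe fitted per property; the sketch below shows that outline with random placeholder features and labels, not the paper's probe design.

```python
# Outline of a linear probe on frozen features for one binary 3D-scene property.
# Features and labels are random placeholders standing in for real annotations.
import torch
import torch.nn as nn

features = torch.randn(1000, 1280)               # frozen diffusion/DINO/CLIP features (placeholder)
labels = torch.randint(0, 2, (1000,)).float()    # annotations for one property (placeholder)

probe = nn.Linear(1280, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.binary_cross_entropy_with_logits(probe(features).squeeze(1), labels)
    loss.backward()
    opt.step()

acc = ((probe(features).squeeze(1) > 0).float() == labels).float().mean()
print(f"probe training accuracy: {float(acc):.2f}")  # compare probes across models and properties
```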

RESHAPING THE FUTURE OF PORTUGUESE AZULEJO PATTERNS

Authors

RS Carvalho,A Pais,F Cabral,A Dias,G Bergel,A Dutta,A Zisserman,RA Coelho

Journal

The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences

Published Date

2023/6/24

This paper introduces a new approach to the inventory and catalogue of azulejo patterns found in Portuguese buildings. It uses computer-vision-based software tools for automatic search and matching of azulejo patterns, thereby improving the scalability and speed of existing cataloguing methodologies. The online catalogue of azulejo patterns is called Az Infinitum (Azulejo Referencing and Indexation System), a publicly accessible online portal suitable for both researchers and the general public who are interested in exploring and understanding this cultural heritage of Portugal. The effectiveness of this catalogue as a research support tool is demonstrated using a case study based on the Marvila pattern (i.e. P-17-00999). The online catalogue has inspired the development of an engaging application, called Azulejar, which allows one to create new patterns or understand the mathematical process behind existing azulejo patterns. This application has the potential to become an effective educational tool for inspiring everyone to explore and understand the science behind the beauty of azulejo patterns.


Andrew Zisserman FAQs

What is Andrew Zisserman's h-index at University of Oxford?

Andrew Zisserman's h-index at the University of Oxford is 194 overall and 120 based on citations since 2020.

What are Andrew Zisserman's top articles?

The articles titled

Action classification in video clips using attention-based neural networks

FlexCap: Generating Rich, Localized, and Flexible Captions in Images

Parallel video processing systems

Look, Listen and Recognise: Character-Aware Audio-Visual Subtitling

AutoAD III: The Prequel--Back to the Pixels

N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields

The Manga Whisperer: Automatically Generating Transcriptions for Comics

Moving Object Segmentation: All You Need Is SAM (and Flow)

...

are among the top articles of Andrew Zisserman at the University of Oxford.

What are Andrew Zisserman's research interests?

The research interests of Andrew Zisserman are Computer Vision and Machine Learning.

What is Andrew Zisserman's total number of citations?

Andrew Zisserman has 396,658 citations in total.

What are the co-authors of Andrew Zisserman?

The co-authors of Andrew Zisserman include Philip Torr, Pietro Perona, Andrea Vedaldi, David Forsyth, Richard Hartley, and Josef Sivic.

Co-Authors

H-index: 131
Philip Torr
University of Oxford

H-index: 121
Pietro Perona
California Institute of Technology

H-index: 97
Andrea Vedaldi
University of Oxford

H-index: 86
David Forsyth
University of Illinois at Urbana-Champaign

H-index: 85
Richard Hartley
Australian National University

H-index: 75
Josef Sivic
České vysoké učení technické v Praze
