Ricard Marxer

I'm a Research Fellow in Speech Technology at the University of Sheffield, working on unsupervised and self-supervised learning approaches for speech and audio. I previously did research at Universitat Pompeu Fabra and the University of Toulon, working with various groups on music information retrieval, bioacoustics, and speech processing. My work focuses on developing novel machine learning methods for processing speech and audio signals. I'm particularly interested in self-supervised representation learning and in how meaningful features can be extracted from raw audio without relying on labeled data. Some of my recent projects involve studying the scaling properties of speech language models, improving speaker diarization through joint optimization with speech separation, and developing models for predicting speech intelligibility.

I collaborate extensively with researchers in speech, music, and marine bioacoustics. Recent work includes developing systems for underwater audio processing and marine mammal monitoring, as well as applications of deep learning to hearing aid technology and speech enhancement. I'm also interested in the intersections between speech technology and cognitive science, studying how computational models can help us understand human speech perception.

Publications

TalTech-IRIT-LIS Speaker and Language Diarization Systems for DISPLACE 2024

Joonas Kalda, Tanel Alumäe, Martin Lebourdais, Hervé Bredin, Séverin Baroudi, R. Marxer

Interspeech 2024

Transfer Learning from Whisper for Microscopic Intelligibility Prediction

Paul Best, Santiago Cuervo, R. Marxer

Interspeech 2024

Scaling Properties of Speech Language Models

Santiago Cuervo, R. Marxer

Conference on Empirical Methods in Natural Language Processing 2024

PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings

Joonas Kalda, Clément Pagés, R. Marxer, Tanel Alumäe, Hervé Bredin

The Speaker and Language Recognition Workshop 2024

Speech Foundation Models on Intelligibility Prediction for Hearing-Impaired Listeners

Santiago Cuervo, R. Marxer

IEEE International Conference on Acoustics, Speech, and Signal Processing 2024

Vocal interactivity in-and-between humans, animals and robots

M. Chetouani, E. Briefer, Angela Dassow, R. Marxer, Roger K. Moore, Nicolas Obin, D. Stowell

Interaction Studies 2023

Progress and Prospects for Spoken Language Technology: Results from Five Sexennial Surveys

Roger K. Moore, R. Marxer

Interspeech 2023

On the Benefits of Self-supervised Learned Speech Representations for Predicting Human Phonetic Misperceptions

1st Year of running MIR at UJI

Eiffel Tower: A deep-sea underwater dataset for long-term visual localization

Deep audio embeddings for vocalisation clustering

SUCRe: Leveraging Scene Structure for Underwater Color Restoration

Author Correction: Temporal evolution of the Mediterranean fin whale song

Blind Speech Separation Through Direction of Arrival Estimation Using Deep Neural Networks with a Flexibility on the Number of Speakers

Temporal evolution of the Mediterranean fin whale song

Variable-rate hierarchical CPC leads to acoustic unit discovery in speech

Homography-Based Loss Function for Camera Pose Regression

Contrastive Prediction Strategies for Unsupervised Segmentation and Categorization of Phonemes and Words

Marine and Maritime Intelligent Robotics (MIR)

Information Retrieval for ZeroSpeech 2021: The Submission by University of Wroclaw

Aligned Contrastive Predictive Coding

Voice Restoration with Silent Speech Interfaces (ReSSInt)

Stereo to five-channels bombyx sonobuoys: from four years cetacean monitoring to real-time whale-ship anti-collision system

The “ScribbleLens” Dutch Historical Handwriting Corpus

DOCC10: Open access dataset of marine mammal transient studies and end-to-end CNN classification

Deep Learning and Domain Transfer for Orca Vocalization Detection

A Convolutional Deep Markov Model for Unsupervised Speech Representation Learning

Robust Training of Vector Quantized Bottleneck Models

Deep Learning Classification with Noisy Labels

Unsupervised Neural Segmentation and Clustering for Unit Discovery in Sequential Data

Wave propagation in the biosonar organ of sperm whales using a finite difference time domain method

High-frequency Near-field Physeter macrocephalus Monitoring by Stereo-Autoencoder and 3D Model of Sonar Organ

Efficient artifacts filter by density-based clustering in long term 3D whale passive acoustic monitoring with five hydrophones fixed under an Autonomous Surface Vehicle

Real-time Passive Acoustic 3D Tracking of Deep Diving Cetacean by Small Non-uniform Mobile Surface Antenna

Lexical frequency effects in English and Spanish word misperceptions

Deep learning for ethoacoustical mapping: Application to a single Cachalot long term recording on joint observatories in Vancouver Island

Towards the topology of autoencoder of calls versus clicks of marine mammal

DNN driven Speaker Independent Audio-Visual Mask Estimation for Speech Separation

Sperm whales ultra high frequency near field multichannel analysis

A corpus of audio-visual Lombard speech with frontal and profile views

The impact of the Lombard effect on audio and visual speech recognition systems

The CHiME Challenges: Robust Speech Recognition in Everyday Environments

An analysis of environment, microphone and data simulation mismatches in robust speech recognition

The third 'CHiME' speech separation and recognition challenge: Analysis and outcomes

Binary Mask Estimation Strategies for Constrained Imputation-Based Speech Enhancement

Multi-microphone speech recognition in everyday environments

Guest Editorial for the special issue on Multi-Microphone Speech Recognition in Everyday Environments

A Data Driven Approach to Audiovisual Speech Mapping

Vocal Interactivity in-and-between Humans, Animals, and Robots

CloudCAST - Remote Speech Technology for Speech Professionals

Progress and Prospects for Spoken Language Technology: Results from Four Sexennial Surveys

Language Effects in Noise-Induced Word Misperceptions

An Innovative Speech-Based Interface to Control AAL and IoT Solutions to Help People with Speech and Motor Disability

Evaluation and combination of pitch estimation methods for melody extraction in symphonic classical music

The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines

Exploiting synchrony spectra and deep neural networks for noise-robust automatic speech recognition

Knowledge transfer between speakers for personalised dialogue management

Remote Speech Technology for Speech Professionals - the CloudCAST initiative

Automatic dysfluency detection in dysarthric speech using deep belief networks

Unsupervised Incremental Online Learning and Prediction of Musical Audio Signals

Unsupervised Incremental Learning and Prediction of Music Signals

Score-informed and timbre independent lead instrument separation in real-world scenarios

Combining a harmonic-based NMF decomposition with transient analysis for instantaneous percussion separation

A Tikhonov regularization method for spectrum decomposition in low latency audio source separation

Low-Latency Instrument Separation in Polyphonic Audio Using Timbre Models

What/when causal expectation modelling applied to audio signals

Computational models of music perception and cognition I: the perceptual and cognitive processing chain

Computational models of music perception and cognition II: Domain-specific music processing

Dynamical hierarchical self‐organization of harmonic and motivic musical categories

What/when causal expectation modelling applied to percussive audio

Model-based language-instructed reinforcement learning

Proceedings of the 3rd International Workshop on Vocal Interactivity in-and-between Humans, Animals and Robots VIHAR 2021

An Innovative Speech-Based User Interface for Smarthomes and IoT Solutions to Help People with Speech and Motor Disabilities

Towards Multi-modal Hearing Aid Design and Evaluation in Realistic Audio-Visual Settings : Challenges and Opportunities

A corpus of noise-induced word misperceptions for English

Unsupervised Learning of Structural Representation of Percussive Audio Using a Hierarchical Dirichlet Process Hidden Markov Model

“Are we playing like Music-Stars?” Placing Emerging Artists on the Italian Music Scene

Vocal Interactivity in-and-between Humans, Animals and Robots (VIHAR) (Dagstuhl Seminar 16442)