thesis on supervised learning

Princeton University Doctoral Dissertations, 2011-2024
Computer Science


2023 Theses Doctoral

Learning Video Representation from Self-supervision

Chen, Brian

This thesis investigates the problem of learning video representations for video understanding. Previous works have explored the use of data-driven deep learning approaches, which have been shown to be effective in learning useful video representations. However, obtaining large amounts of labeled data can be costly and time-consuming. We investigate self-supervised approaches for multimodal video data to overcome this challenge. Video data typically contains multiple modalities, such as visual, audio, transcribed speech, and textual captions, which can serve as pseudo-labels for representation learning without needing manual labeling. By utilizing these modalities, we can train deep representations over large-scale video data consisting of millions of video clips collected from the internet. We demonstrate the scalability benefits of multimodal self-supervision by achieving new state-of-the-art performance in various domains, including video action recognition, text-to-video retrieval, and text-to-video grounding. We also examine the limitations of these approaches, which often rely on the association assumption involving multiple modalities of data used in self-supervision. For example, the text transcript is often assumed to be about the video content, and two segments of the same video are assumed to share similar semantics. To overcome this problem, we propose new methods for learning video representations with more intelligent sampling strategies to capture samples that share high-level semantics or consistent concepts. The proposed methods include a clustering component to address false negative pairs in multimodal paired contrastive learning, a novel sampling strategy for finding visually groundable video-text pairs, an investigation of object tracking supervision for temporal association, and a new multimodal task for demonstrating the effectiveness of the proposed model.
We aim to develop more robust and generalizable video representations for real-world applications, such as human-to-robot interaction and event extraction from large-scale news sources.

Computer science
Video recordings
Deep learning (Machine learning)
Human-robot interaction



Doctoral Thesis: Self-Supervised Learning for Speech Processing

32-449 (Kiva)

Yu-An Chung

Abstract: Deep neural networks trained with supervised learning algorithms on large amounts of labeled speech data have achieved remarkable performance on various spoken language processing applications, often setting the state of the art on the corresponding leaderboards. However, the fact that training these systems relies on large amounts of annotated speech poses a scalability bottleneck for the continued advancement of state-of-the-art performance, and an even more fundamental barrier for deployment of deep neural networks in speech domains where labeled data are intrinsically rare, costly, or time-consuming to collect.

In contrast to annotated speech, untranscribed audio is often much cheaper to accumulate. In this thesis, we explore the use of self-supervised learning—a learning paradigm where the learning target is generated from the input itself—for leveraging such easily scalable resources to improve the performance of spoken language technology. Specifically, we propose two self-supervised algorithms, one based on the idea of “future prediction” and the other based on the idea of “predicting the masked from the unmasked,” for learning contextualized speech representations from unlabeled speech data. We show that our self-supervised algorithms are capable of learning representations that transform high-level properties of speech signals such as their phonetic contents and speaker characteristics into a more accessible form than traditional acoustic features, and demonstrate their effectiveness in improving the performance of deep neural networks on a wide range of speech processing tasks. In addition to presenting new learning algorithms, we also provide extensive analysis aiming to understand the properties of the learned self-supervised representations, as well as disclosing the design factors that make one self-supervised model different from the other.
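The two ideas named in the abstract, "future prediction" and "predicting the masked from the unmasked", both derive the learning target from the input itself. The sketch below only illustrates how such input/target pairs can be built; it is not the thesis's algorithms. Scalar values stand in for acoustic feature frames, and the function names, the `shift` and `mask_prob` parameters, and the zero mask value are all assumptions made for the example.

```python
import random

def future_prediction_pairs(frames, shift):
    """Pair each frame with the frame `shift` steps ahead, which serves as its target."""
    return [(frames[t], frames[t + shift]) for t in range(len(frames) - shift)]

def masked_prediction_example(frames, mask_prob, rng, mask_value=0.0):
    """Replace random frames with mask_value; targets are the original values there."""
    masked, targets = [], {}
    for t, frame in enumerate(frames):
        if rng.random() < mask_prob:
            masked.append(mask_value)   # the model only sees the mask here
            targets[t] = frame          # and must reconstruct the original frame
        else:
            masked.append(frame)
    return masked, targets

# Toy "utterance": five scalar frames (real systems use feature vectors).
frames = [0.1, 0.4, 0.9, 0.3, 0.7]
pairs = future_prediction_pairs(frames, shift=2)
masked, targets = masked_prediction_example(frames, 0.5, random.Random(0))
```

In both cases no transcription is needed: the supervision signal is a held-out part of the same recording.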

Date: Thursday, April 14
Time: 3:00 pm
Location: 32-449 (Kiva)

Thesis Supervisor(s): James Glass, Jacob Andreas, Phillip Isola

To attend this defense via zoom, please contact the doctoral candidate at [email protected]

Learning In The Wild With Limited Supervision

Over the past decade, machine visual perception has experienced remarkable progress due to advancements in the field of deep learning. However, the performance of deep learning systems remains far from ideal in real-world tasks that lack large training datasets. In this thesis, we study learning under limited supervision with a focus on unsupervised domain adaptation (no labelled examples) and few-shot learning (few labelled examples). We propose adaptation schemes that can leverage prior knowledge from a large labelled base domain and transfer it to the domain of limited supervision (target domain). We consider both image classification and semantic segmentation tasks in the limited supervision regime. The key findings in this thesis are as follows.

In unsupervised domain adaptation, object-level adaptation is more effective than pixel-level adaptation for semantic segmentation. We propose a multi-modal objectness constraint that improves self-training based approaches for this problem. In the few-shot learning setup, full model finetuning is crucial for effective transfer when the base domain lacks sufficient diversity. We propose a contrastive finetuning approach that leverages negative exemplars (distractors) to alleviate the issue of data scarcity in few-shot learning. As the size and diversity of the base domain scale up, parameter-efficient techniques can outperform full finetuning on a variety of tasks including image classification and semantic segmentation. To that end, we propose expres, which augments a frozen base model with a few learnable parameters in the form of input and residual prompts and optimizes them for the few-shot task. While base domain scaling improves few-shot performance, using the right pretraining objective is equally important. We show that, compared to supervised representations, self-supervised representations are more suitable for few-shot semantic segmentation and a combination of the two achieves the best of both worlds.

Through our research work, we expand the palette of adaptation techniques suitable for different scales of base domain and degrees of target domain supervision. We show that a careful design of the adaptation method can strike a much better trade-off between performance and forgetting. On popular benchmarks for few-shot image classification and semantic segmentation, our proposed approaches lead to significant performance gains, reducing the gap with fully supervised methods.

Degree Type

Dissertation
Electrical and Computer Engineering

Degree Name

Doctor of Philosophy (PhD)

Electrical and Electronic Engineering not elsewhere classified

A list of completed theses and new thesis topics from the Computer Vision Group.

Are you about to start a BSc or MSc thesis? Please read our instructions for preparing and delivering your work.

Below we list possible thesis topics for Bachelor and Master students in the areas of Computer Vision, Machine Learning, Deep Learning and Pattern Recognition. The project descriptions leave plenty of room for your own ideas. If you would like to discuss a topic in detail, please contact the supervisor listed below and Prof. Paolo Favaro to schedule a meeting. Note that for MSc students in Computer Science it is required that the official advisor is a professor in CS.

AI deconvolution of light microscopy images

Level: master.

Background Light microscopy has become an indispensable tool in life sciences research. Deconvolution is an important image processing step for improving the quality of microscopy images: it removes out-of-focus light and yields higher resolution and a better signal-to-noise ratio. Currently, classical deconvolution methods, such as regularisation or blind deconvolution, are implemented in numerous commercial software packages and widely used in research. Recently, AI deconvolution algorithms have been introduced and are being actively developed, as they have shown high application potential.

Aim Adaptation of available AI algorithms for deconvolution of microscopy images. Validation of these methods against state-of-the-art commercially available deconvolution software.

Material and Methods Student will implement and further develop available AI deconvolution methods and acquire test microscopy images of different modalities. Performance of developed AI algorithms will be validated against available commercial deconvolution software.

AI algorithm development and implementation: 50%.
Data acquisition: 10%.
Comparison of performance: 40%.

Requirements

Interest in imaging.
Solid knowledge of AI.
Good programming skills.

Supervisors Paolo Favaro, Guillaume Witz, Yury Belyaev.

Institutes Computer Vision Group, Digital Science Lab, Microscopy Imaging Center.

Contact Yury Belyaev, Microscopy Imaging Center, [email protected] , +41 78 899 0110.

Instance segmentation of cryo-ET images

Level: bachelor/master.

In the 1600s, a pioneering Dutch scientist named Antonie van Leeuwenhoek embarked on a remarkable journey that would forever transform our understanding of the natural world. Armed with a simple yet ingenious invention, the light microscope, he delved into uncharted territory, peering through its lens to reveal the hidden wonders of microscopic structures. Fast forward to today, where cryo-electron tomography (cryo-ET) has emerged as a groundbreaking technique, allowing researchers to study proteins within their natural cellular environments. Proteins, functioning as vital nano-machines, play crucial roles in life and understanding their localization and interactions is key to both basic research and disease comprehension. However, cryo-ET images pose challenges due to inherent noise and a scarcity of annotated data for training deep learning models.

Credit: S. Albert et al./PNAS (CC BY 4.0)

To address these challenges, this project aims to develop a self-supervised pipeline utilizing diffusion models for instance segmentation in cryo-ET images. By leveraging the power of diffusion models, which iteratively diffuse information to capture underlying patterns, the pipeline aims to refine and accurately segment cryo-ET images. Self-supervised learning, which relies on unlabeled data, reduces the dependence on extensive manual annotations. Successful implementation of this pipeline could revolutionize the field of structural biology, facilitating the analysis of protein distribution and organization within cellular contexts. Moreover, it has the potential to alleviate the limitations posed by limited annotated data, enabling more efficient extraction of valuable information from cryo-ET images and advancing biomedical applications by enhancing our understanding of protein behavior.

Methods The segmentation pipeline for cryo-electron tomography (cryo-ET) images consists of two stages: training a diffusion model for image generation and training an instance segmentation U-Net using synthetic and real segmentation masks.

1. Diffusion Model Training:
   a. Data Collection: Collect and curate cryo-ET image datasets from the EMPIAR database (https://www.ebi.ac.uk/empiar/).
   b. Architecture Design: Select an appropriate architecture for the diffusion model.
   c. Model Evaluation: Cryo-ET experts will help assess image quality and fidelity through visual inspection and quantitative measures.
2. Building the Segmentation dataset:
   a. Synthetic and real mask generation: Use the trained diffusion model to generate synthetic cryo-ET images. The diffusion process will be seeded from either a real or a synthetic segmentation mask. This will yield pairs of cryo-ET images and segmentation masks.
3. Instance Segmentation U-Net Training:
   a. Architecture Design: Choose an appropriate instance segmentation U-Net architecture.
   b. Model Evaluation: Evaluate the trained U-Net using precision, recall, and F1 score metrics.
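The evaluation step names precision, recall, and F1 score. As a minimal sketch, the snippet below computes them pixel-wise on flattened binary masks. This is an assumption made for brevity: a full instance-segmentation evaluation would first match predicted instances to reference instances, which is not shown, and the function name is hypothetical.

```python
def precision_recall_f1(pred, truth):
    """Pixel-wise precision, recall and F1 for two flat binary masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)        # true positives
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)    # false positives
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)    # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy masks: two hits, one false alarm, one miss.
pred = [1, 1, 0, 0, 1, 0]
truth = [1, 0, 0, 1, 1, 0]
precision, recall, f1 = precision_recall_f1(pred, truth)
```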

By combining the diffusion model for cryo-ET image generation and the instance segmentation U-Net, this pipeline provides an efficient and accurate approach to segment structures in cryo-ET images, facilitating further analysis and interpretation.

References
1. Kwon, Diana. "The secret lives of cells-as never seen before." Nature 598.7882 (2021): 558-560.
2. Moebel, Emmanuel, et al. "Deep learning improves macromolecule identification in 3D cellular cryo-electron tomograms." Nature Methods 18.11 (2021): 1386-1394.
3. Rice, Gavin, et al. "TomoTwin: generalized 3D localization of macromolecules in cryo-electron tomograms with structural data mining." Nature Methods (2023): 1-10.

Contacts Prof. Thomas Lemmin Institute of Biochemistry and Molecular Medicine Bühlstrasse 28, 3012 Bern ( [email protected] )

Prof. Paolo Favaro Institute of Computer Science Neubrückstrasse 10 3012 Bern ( [email protected] )

Adding and removing multiple sclerosis lesions in imaging with diffusion networks

Background Multiple sclerosis lesions are the result of demyelination: they appear as dark spots on T1-weighted MRI imaging and as bright spots on FLAIR MRI imaging. Image analysis for MS patients requires both the accurate detection of new and enhancing lesions, and the assessment of atrophy via local thickness and/or volume changes in the cortex. Detection of new and growing lesions is possible using deep learning, but is made difficult by the relative lack of training data; meanwhile, cortical morphometry can be affected by the presence of lesions, meaning that removing lesions prior to morphometry may be more robust. Existing ‘lesion filling’ methods are rather crude, yielding unrealistic-appearing brains where the borders of the removed lesions are clearly visible.

Aim: Denoising diffusion networks are the current gold standard in MRI image generation [1]: we aim to leverage this technology to remove and add lesions to existing MRI images. This will allow us to create realistic synthetic MRI images for training and validating MS lesion segmentation algorithms, and for investigating the sensitivity of morphometry software to the presence of MS lesions at a variety of lesion load levels.

Materials and Methods: A large, annotated, heterogeneous dataset of MRI data from MS patients, as well as images of healthy controls without white matter lesions, will be available for developing the method. The student will work in a research group with a long track record in applying deep learning methods to neuroimaging data, as well as experience training denoising diffusion networks.

Nature of the Thesis:

Literature review: 10%

Replication of Blob Loss paper: 10%

Implementation of the sliding window metrics: 10%

Training on MS lesion segmentation task: 30%

Extension to other datasets: 20%

Results analysis: 20%

Fig. Results of an existing lesion filling algorithm, showing inadequate performance

Requirements:

Interest/Experience with image processing

Python programming knowledge (Pytorch bonus)

Interest in neuroimaging

Supervisor(s):

PD. Dr. Richard McKinley

Institutes: Diagnostic and Interventional Neuroradiology

Center for Artificial Intelligence in Medicine (CAIM), University of Bern

References: [1] Brain Imaging Generation with Latent Diffusion Models , Pinaya et al, Accepted in the Deep Generative Models workshop @ MICCAI 2022 , https://arxiv.org/abs/2209.07162

Contact : PD Dr Richard McKinley, Support Centre for Advanced Neuroimaging ( [email protected] )

Improving metrics and loss functions for targets with imbalanced size: sliding window Dice coefficient and loss.

Background The Dice coefficient is the most commonly used metric for segmentation quality in medical imaging, and a differentiable version of the coefficient is often used as a loss function, in particular for small target classes such as multiple sclerosis lesions. The Dice coefficient has the benefit that it is applicable in instances where the target class is in the minority (for example, when segmenting small lesions). However, if lesion sizes are mixed, the loss and metric are biased towards performance on large lesions, leading smaller lesions to be missed and harming overall lesion detection. A recently proposed loss function (blob loss [1]) aims to combat this by treating each connected component of a lesion mask separately, and claims improvements over Dice loss on lesion detection scores in a variety of tasks.

Aim: The aim of this thesis is twofold. First, to benchmark blob loss against a simple, potentially superior loss for instance detection: sliding window Dice loss, in which the Dice loss is calculated over a sliding window across the area/volume of the medical image. Second, we will investigate whether a sliding window Dice coefficient is better correlated with lesion-wise detection metrics than the Dice coefficient and may serve as an alternative metric capturing both global and instance-wise detection.
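To make the idea concrete, here is a minimal sketch of a global Dice coefficient next to a sliding window variant on a small 2D binary mask. The window size, the stride, the plain averaging over windows, and the `eps` smoothing (which scores a window that is empty in both masks as 1) are assumptions for illustration; the topic description does not fix any of them.

```python
def dice(pred, truth, eps=1e-6):
    """Dice coefficient of two flat binary sequences; eps keeps empty inputs defined."""
    inter = sum(p * t for p, t in zip(pred, truth))
    return (2 * inter + eps) / (sum(pred) + sum(truth) + eps)

def sliding_window_dice(pred, truth, size, stride):
    """Mean Dice over square windows of a 2D mask given as a list of rows."""
    rows, cols = len(truth), len(truth[0])
    scores = []
    for r in range(0, rows - size + 1, stride):
        for c in range(0, cols - size + 1, stride):
            p = [pred[i][j] for i in range(r, r + size) for j in range(c, c + size)]
            t = [truth[i][j] for i in range(r, r + size) for j in range(c, c + size)]
            scores.append(dice(p, t))
    return sum(scores) / len(scores)

# One large lesion (top left) and one single-pixel lesion (bottom right).
truth = [[1, 1, 0, 0],
         [1, 1, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 1]]
# The prediction finds the large lesion but misses the small one.
pred = [[1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]

flat = lambda mask: [v for row in mask for v in row]
global_score = dice(flat(pred), flat(truth))
window_score = sliding_window_dice(pred, truth, size=2, stride=2)
```

In this toy case the global score stays near 0.89 even though one of two lesions is missed, while the windowed score drops to about 0.75 because the missed lesion dominates its own window.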

Materials and Methods: A large, annotated, heterogeneous dataset of MRI data from MS patients will be available for benchmarking the method, as well as our existing codebases for MS lesion segmentation. Extension of the method to other diseases and datasets (such as covered in the blob loss paper) will make the method more plausible for publication. The student will work alongside clinicians and engineers carrying out research in multiple sclerosis lesion segmentation, in particular in the context of our running project supported by the CAIM grant.

Fig. An annotated MS lesion case, showing the variety of lesion sizes

References: [1] blob loss: instance imbalance aware loss functions for semantic segmentation, Kofler et al, https://arxiv.org/abs/2205.08209

Idempotent and partial skull-stripping in multispectral MRI imaging

Background Skull stripping (or brain extraction) refers to the masking of non-brain tissue from structural MRI imaging. Since 3D MRI sequences allow reconstruction of facial features, many data providers supply data only after skull-stripping, making this a vital tool in data sharing. Furthermore, skull-stripping is an important pre-processing step in many neuroimaging pipelines, even in the deep-learning era: while many methods could now operate on data with skull present, they have been trained only on skull-stripped data and therefore produce spurious results on data with the skull present.

High-quality skull-stripping algorithms based on deep learning are now widely available: the most prominent example is HD-BET [1]. A major downside of HD-BET is its behaviour on datasets to which skull-stripping has already been applied: in this case the algorithm falsely identifies brain tissue as skull and masks it. A skull-stripping algorithm F not exhibiting this behaviour would be idempotent: F(F(x)) = F(x) for any image x. Furthermore, legacy datasets from before the availability of high-quality skull-stripping algorithms may still contain images which have been inadequately skull-stripped: currently the only solution to improve the skull-stripping on this data is to go back to the original datasource or to manually correct the skull-stripping, which is time-consuming and prone to error.
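The idempotence property F(F(x)) = F(x) can be checked empirically by applying a function twice. The sketch below does this for two toy "strippers" acting on a flat list of intensities; both are hypothetical stand-ins chosen for the example and bear no relation to HD-BET's actual behaviour or interface.

```python
def strip_threshold(image, threshold=0.5):
    """Toy stripper: zero every voxel brighter than a fixed threshold."""
    return [0.0 if v > threshold else v for v in image]

def strip_brightest(image):
    """Toy stripper that always zeroes the brightest remaining voxels."""
    peak = max(image)
    return [0.0 if v == peak else v for v in image]

def is_idempotent(f, x):
    """Check F(F(x)) == F(x) for one input x."""
    once = f(x)
    return f(once) == once

image = [0.2, 0.9, 0.4, 0.8]
threshold_ok = is_idempotent(strip_threshold, image)   # second pass changes nothing
brightest_ok = is_idempotent(strip_brightest, image)   # second pass removes more tissue
```

The second function mimics the failure mode described above: run on already-stripped data, it keeps removing content. A check of this kind on one input can only refute idempotence, not prove it for all images.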

Aim: In this project, the student will develop an idempotent skull-stripping network which can also handle partially skull-stripped inputs. In the best case, the network will operate well on a large subset of the data we work with (e.g. structural MRI, diffusion-weighted MRI, Perfusion-weighted MRI, susceptibility-weighted MRI, at a variety of field strengths) to maximize the future applicability of the network across the teams in our group.

Materials and Methods: Multiple datasets, both publicly available and internal (encompassing thousands of 3D volumes) will be available. Silver standard reference data for standard sequences at 1.5T and 3T can be generated using existing tools such as HD-BET: for other sequences and field strengths semi-supervised learning or methods improving robustness to domain shift may be employed. Robustness to partial skull-stripping may be induced by a combination of learning theory and model-based approaches.

Dataset curation: 10%

Idempotent skull-stripping model building: 30%

Modelling of partial skull-stripping: 10%

Extension of model to handle partial skull: 30%

Results analysis: 10%

Fig. An example of failed skull-stripping requiring manual correction

References: [1] Isensee, F, Schell, M, Pflueger, I, et al. Automated brain extraction of multisequence MRI using artificial neural networks. Hum Brain Mapp . 2019; 40: 4952– 4964. https://doi.org/10.1002/hbm.24750

Automated leaf detection and leaf area estimation (for Arabidopsis thaliana)

Correlating plant phenotypes such as leaf area or number of leaves to the genotype (i.e. changes in DNA) is a common goal for plant breeders and molecular biologists. Such data can not only help to understand fundamental processes in nature, but also can help to improve ecotypes, e.g., to perform better under climate change, or reduce fertiliser input. However, collecting data for many plants is very time consuming and automated data acquisition is necessary.

The project aims at building a machine learning model to automatically detect plants in top-view images (see examples below), segment their leaves (see Fig C) and to estimate the leaf area. This information will then be used to determine the leaf area of different Arabidopsis ecotypes. The project will be carried out in collaboration with researchers of the Institute of Plant Sciences at the University of Bern. It will also involve the design and creation of a dataset of plant top-views with the corresponding annotation (provided by experts at the Institute of Plant Sciences).
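Once a leaf segmentation mask is available, the area estimate reduces to counting leaf pixels and scaling by the physical pixel size. The sketch below assumes a binary mask and a known, uniform `mm_per_pixel` calibration for the top-view camera; both the function name and the calibration value are hypothetical.

```python
def leaf_area_mm2(mask, mm_per_pixel):
    """Leaf area in square millimetres from a binary mask (list of rows)."""
    leaf_pixels = sum(sum(row) for row in mask)
    return leaf_pixels * mm_per_pixel ** 2

# Three leaf pixels at 0.5 mm per pixel side length.
mask = [[0, 1, 1],
        [0, 1, 0]]
area = leaf_area_mm2(mask, mm_per_pixel=0.5)  # 3 * 0.25 = 0.75
```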

Contact: Prof. Dr. Paolo Favaro ( [email protected] )

Master Projects at the ARTORG Center

The Gerontechnology and Rehabilitation group at the ARTORG Center for Biomedical Engineering is offering multiple MSc thesis projects to students who are interested in working with real patient data, artificial intelligence and machine learning algorithms. The goal of these projects is to transfer the findings to the clinic in order to solve today’s healthcare problems and thus to improve the quality of life of patients.
Assessment of Digital Biomarkers at Home by Radar. [PDF]
Comparison of Radar, Seismograph and Ballistocardiography to Monitor Sleep at Home. [PDF]
Sentimental Analysis in Speech. [PDF]
Contact: Dr. Stephan Gerber ( [email protected] )

Internship in Computational Imaging at Prophesee

A 6-month internship at Prophesee, Grenoble is offered to a talented Master's student.

The topic of the internship is working on burst imaging following the work of Sam Hasinoff, and exploring ways to improve it using event-based vision.

A compensation to cover the expenses of living in Grenoble is offered. Only students that have legal rights to work in France can apply.

Anyone interested can send an email with the CV to Daniele Perrone ( [email protected] ).

Using machine learning applied to wearables to predict mental health

This Master’s project lies at the intersection of psychiatry and computer science and aims to use machine learning techniques to improve health. Using sensors to detect sleep and waking behavior has as yet unexplored potential to reveal insights into health. In this study, we make use of a watch-like device, called an actigraph, which tracks motion to quantify sleep behavior and waking activity. Participants in the study are healthy and depressed adolescents who wear actigraphs for a year, during which time we query their mental health status monthly using online questionnaires. For this Master’s thesis we aim to make use of machine learning methods to predict mental health based on the data from the actigraph. The ability to predict mental health crises based on sleep and wake behavior would provide an opportunity for intervention, significantly impacting the lives of patients and their families. This Master’s thesis is a collaboration between Professor Paolo Favaro at the Institute of Computer Science ( [email protected] ) and Dr Leila Tarokh at the Universitäre Psychiatrische Dienste (UPD) ( [email protected] ). We are looking for a highly motivated individual interested in bridging disciplines.

Bachelor or Master Projects at the ARTORG Center

The Gerontechnology and Rehabilitation group at the ARTORG Center for Biomedical Engineering is offering multiple BSc and MSc thesis projects to students who are interested in working with real patient data, artificial intelligence and machine learning algorithms. The goal of these projects is to transfer the findings to the clinic in order to solve today’s healthcare problems and thus to improve the quality of life of patients.
Machine Learning Based Gait-Parameter Extraction by Using Simple Rangefinder Technology. [PDF]
Detection of Motion in Video Recordings [PDF]
Home-Monitoring of Elderly by Radar [PDF]
Gait feature detection in Parkinson's Disease [PDF]
Development of an arthroscopic training device using virtual reality [PDF]
Contact: Dr. Stephan Gerber ( [email protected] ), Michael Single ( [email protected] )

Dynamic Transformer

Level: bachelor.

Visual Transformers have obtained state-of-the-art classification accuracies [ViT, DeiT, T2T, BoTNet]. Mixture of experts could be used to increase the capacity of a neural network by learning instance-dependent execution pathways in a network [MoE]. In this research project we aim to push the transformers to their limit and combine their dynamic attention with MoEs. Compared to Switch Transformer [Switch], we will use a much more efficient formulation of mixing [CondConv, DynamicConv], and we will use this idea in the attention part of the transformer, not the fully connected layer.

Input dependent attention kernel generation for better transformer layers.
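As a rough illustration of input-dependent mixing, the sketch below builds one weight vector per input as a weighted sum of several expert weight vectors, so that only a single layer needs to be applied afterwards. The softmax routing, the linear routing scores, and all names are assumptions made for the example; the project's actual formulation for attention kernels is not specified here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mix_experts(x, expert_weights, routing):
    """Mix expert weight vectors with coefficients computed from the input x."""
    # One routing score per expert: a dot product between x and a routing vector.
    scores = [sum(r_i * x_i for r_i, x_i in zip(r, x)) for r in routing]
    alphas = softmax(scores)
    dim = len(expert_weights[0])
    # Mix once in weight space instead of running every expert separately.
    mixed = [sum(a * w[d] for a, w in zip(alphas, expert_weights)) for d in range(dim)]
    return mixed, alphas

experts = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, 0.0]
even_mix, even_alphas = mix_experts(x, experts, routing=[[0.0, 0.0], [0.0, 0.0]])
peaked_mix, peaked_alphas = mix_experts(x, experts, routing=[[50.0, 0.0], [0.0, 0.0]])
```

With all-zero routing the two experts are averaged; with a routing vector that strongly favours the first expert, the mixed weights are effectively that expert's weights.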

Publication Opportunity: Dynamic Neural Networks Meets Computer Vision (a CVPR 2021 Workshop)

Extensions:

The same idea could be extended to other ViT/Transformer based models [DETR, SETR, LSTR, TrackFormer, BERT]

Quantized ViT

Visual Transformers have obtained state of the art classification accuracies [ViT, CLIP, DeiT], but the best ViT models are extremely compute heavy and running them even only for inference (not doing backpropagation) is expensive. Running transformers cheaply by quantization is not a new problem and it has been tackled before for BERT [BERT] in NLP [Q-BERT, Q8BERT, TernaryBERT, BinaryBERT]. In this project we will be trying to quantize pretrained ViT models.

Quantizing ViT models for faster inference and smaller models without losing accuracy
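For orientation, the simplest form of post-training quantization maps floating-point weights to 8-bit integers with one shared scale. The sketch below shows symmetric per-tensor quantization on a plain list of weights. It is a generic baseline chosen for illustration, not the method of any of the cited BERT quantization papers, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0   # avoid dividing by zero
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map the integers back to approximate floating-point weights."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The rounding error per weight is bounded by half the scale, which is why a few large outlier weights (which inflate the scale) tend to hurt accuracy for all the others.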

Publication Opportunity: Binary Networks for Computer Vision 2021 (a CVPR workshop)

Extensions:

Having a fast pipeline for image inference with ViT will allow us to dig deep into the attention of ViT and analyze it. We might be able to prune some attention heads or replace them with static patterns (like local convolution or dilated patterns). We might even be able to replace the transformer with Performer and increase the throughput even more [Performer].
The same idea could be extended to other ViT based models [DETR, SETR, LSTR, TrackFormer, CPTR, BoTNet, T2TViT]
Learning Transferable Visual Models From Natural Language Supervision [CLIP]
Visual Transformers: Token-based Image Representation and Processing for Computer Vision [ViT]
DeiT: Data-efficient Image Transformers [DeiT]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [BERT]
Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT [Q-BERT]
Q8BERT: Quantized 8Bit BERT [Q8BERT]
TernaryBERT: Distillation-aware Ultra-low Bit BERT [TernaryBERT]
BinaryBERT: Pushing the Limit of BERT Quantization [BinaryBERT]
Rethinking Attention with Performers [Performer]
End-to-End Object Detection with Transformers [DETR]
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [SETR]
End-to-end Lane Shape Prediction with Transformers [LSTR]
TrackFormer: Multi-Object Tracking with Transformers [TrackFormer]
CPTR: Full Transformer Network for Image Captioning [CPTR]
Bottleneck Transformers for Visual Recognition [BoTNet]
Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet [T2TViT]

Multimodal Contrastive Learning

Recently contrastive learning has gained a lot of attention for self-supervised image representation learning [SimCLR, MoCo]. Contrastive learning could be extended to multimodal data, like videos (images and audio) [CMC, CoCLR]. Most contrastive methods require large batch sizes (or large memory pools) which makes them expensive for training. In this project we are going to use non batch size dependent contrastive methods [SwAV, BYOL, SimSiam] to train multimodal representation extractors.
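The dependence on batch size comes from how batch-based contrastive losses are built: every other item in the batch acts as a negative. The sketch below is an InfoNCE-style loss between two modalities, written with plain Python lists; it is an illustrative assumption, not the exact objective of any cited method, and the temperature value and function names are chosen for the example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def info_nce(view_a, view_b, temperature=0.1):
    """Mean cross-entropy of matching item i in view_a to item i in view_b."""
    a = [normalize(v) for v in view_a]
    b = [normalize(v) for v in view_b]
    loss = 0.0
    for i, anchor in enumerate(a):
        # Similarity to every item of the other modality; index i is the positive.
        logits = [dot(anchor, other) / temperature for other in b]
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_sum - logits[i]
    return loss / len(a)

# Toy batch of two paired embeddings (e.g. image and audio of the same clip).
images = [[1.0, 0.0], [0.0, 1.0]]
audio_aligned = [[1.0, 0.0], [0.0, 1.0]]
audio_swapped = [[0.0, 1.0], [1.0, 0.0]]
aligned_loss = info_nce(images, audio_aligned)
swapped_loss = info_nce(images, audio_swapped)
```

The loss is small when each item is closest to its own pair and large when the pairs are swapped. Because the negatives are simply the other batch items, a small batch gives few negatives, which is the cost the project wants to avoid.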

Our main goal is to compare the proposed method with the CMC baseline, so we will be working with STL10, ImageNet, UCF101, HMDB51, and NYU Depth-V2 datasets.

Inspired by the recent works on smaller datasets [ConVIRT, CPD], to accelerate the training speed, we could start with two pretrained single-modal models and finetune them with the proposed method.

Extending SwAV to multimodal datasets
Grasping a better understanding of the BYOL

Publication Opportunity: MULA 2021 (a CVPR workshop on Multimodal Learning and Applications)

Most knowledge distillation methods for contrastive learners also use large batch sizes (or memory pools) [CRD, SEED]; the proposed method could be extended to knowledge distillation.
One could easily extend this idea to multiview learning. For example, one could have two different networks working on the same input and train them with contrastive learning; this may lead to better models [DeiT] through the exchange of inductive biases between models.
Self-supervised Co-training for Video Representation Learning [CoCLR]
Learning Spatiotemporal Features via Video and Text Pair Discrimination [CPD]
Audio-Visual Instance Discrimination with Cross-Modal Agreement [AVID-CMA]
Self-Supervised Learning by Cross-Modal Audio-Video Clustering [XDC]
Contrastive Multiview Coding [CMC]
Contrastive Learning of Medical Visual Representations from Paired Images and Text [ConVIRT]
A Simple Framework for Contrastive Learning of Visual Representations [SimCLR]
Momentum Contrast for Unsupervised Visual Representation Learning [MoCo]
Bootstrap your own latent: A new approach to self-supervised Learning [BYOL]
Exploring Simple Siamese Representation Learning [SimSiam]
Unsupervised Learning of Visual Features by Contrasting Cluster Assignments [SwAV]
Contrastive Representation Distillation [CRD]
SEED: Self-supervised Distillation For Visual Representation [SEED]

Robustness of Neural Networks

Neural Networks have been found to achieve surprising performance in several tasks such as classification, detection and segmentation. However, they are also very sensitive to small (controlled) changes to the input. It has been shown that some changes to an image that are not visible to the naked eye may lead the network to output an incorrect label. This thesis will focus on studying recent progress in this area and aim to build a procedure for a trained network to self-assess its reliability in classification or one of the popular computer vision tasks.

Contact: Paolo Favaro

Masters projects at sitem center

The Personalised Medicine Research Group at the sitem Center for Translational Medicine and Biomedical Entrepreneurship is offering multiple MSc thesis projects to biomedical engineering MSc students that may also be of interest to computer science students.

Automated quantification of cartilage quality for hip treatment decision support
Automated quantification of massive rotator cuff tears from MRI
Deep learning-based segmentation and fat fraction analysis of the shoulder muscles using quantitative MRI
Unsupervised Domain Adaption for Cross-Modality Hip Joint Segmentation

Contact: Dr. Kate Gerber

Internships/Master thesis @ Chronocam

3-6 month internships on event-based computer vision. Chronocam is a rapidly growing startup developing event-based technology, with more than 15 PhDs working on problems like tracking, detection, classification, SLAM, etc. Event-based computer vision has the potential to solve many long-standing problems in traditional computer vision, and this is a super exciting time as this potential is becoming more and more tangible in many real-world applications. For next year we are looking for motivated Master and PhD students with good software engineering skills (C++ and/or Python), and preferably a good background in computer vision and deep learning. PhD internships will be more research focused and may lead to a publication. We offer each intern compensation to cover the expenses of living in Paris. List of some of the topics we want to explore:

Photo-realistic image synthesis and super-resolution from event-based data (PhD)
Self-supervised representation learning (PhD)
End-to-end Feature Learning for Event-based Data
Bio-inspired Filtering using Spiking Networks
On-the-fly Compression of Event-based Streams for Low-Power IoT Cameras
Tracking of Multiple Objects with a Dual-Frequency Tracker
Event-based Autofocus
Stabilizing an Event-based Stream using an IMU
Crowd Monitoring for Low-power IoT Cameras
Road Extraction from an Event-based Camera Mounted in a Car for Autonomous Driving
Sign detection from an Event-based Camera Mounted in a Car for Autonomous Driving
High-frequency Eye Tracking

Email with attached CV to Daniele Perrone at [email protected] .

Contact: Daniele Perrone

Object Detection in 3D Point Clouds

Today we have many 3D scanning techniques that allow us to capture the shape and appearance of objects. It is easier than ever to scan real 3D objects and transform them into a digital model for further processing, such as modeling, rendering or animation. However, the output of a 3D scanner is often a raw point cloud with little to no annotations. The unstructured nature of the point cloud representation makes it difficult to process, e.g. for surface reconstruction. One application is the detection and segmentation of an object of interest. In this project, the student is challenged to design a system that takes a point cloud (a 3D scan) as input and outputs the names of objects contained in the scan. This output can then be used to eliminate outliers or points that belong to the background. The approach involves collecting a large dataset of 3D scans and training a neural network on it.

Contact: Adrian Wälchli

Shape Reconstruction from a Single RGB Image or Depth Map

A photograph accurately captures the world in a moment of time and from a specific perspective. Since it is a projection of the 3D space to a 2D image plane, the depth information is lost. Is it possible to restore it, given only a single photograph? In general, the answer is no. This problem is ill-posed, meaning that many different plausible depth maps exist, and there is no way of telling which one is the correct one. However, if we cover one of our eyes, we are still able to recognize objects and estimate how far away they are. This motivates the exploration of an approach where prior knowledge can be leveraged to reduce the ill-posedness of the problem. Such a prior could be learned by a deep neural network, trained with many images and depth maps.

CNN Based Deblurring on Mobile

Deblurring finds many applications in our everyday life. It is particularly useful when taking pictures on handheld devices (e.g. smartphones) where camera shake can degrade important details. Therefore, it is desired to have a good deblurring algorithm implemented directly in the device. In this project, the student will implement and optimize a state-of-the-art deblurring method based on a deep neural network for deployment on mobile phones (Android). The goal is to reduce the number of network weights in order to reduce the memory footprint while preserving the quality of the deblurred images. The result will be a camera app that automatically deblurs the pictures, giving the user a choice of keeping the original or the deblurred image.

Depth from Blur

If an object in front of the camera or the camera itself moves while the shutter is open, the region of motion becomes blurred because the incoming light is accumulated in different positions across the sensor. If there is camera motion, there is also parallax. Thus, a motion-blurred image contains depth information. In this project, the student will tackle the problem of recovering a depth map from a motion-blurred image. This includes the collection of a large dataset of blurred and sharp images or videos using a pair or triplet of GoPro action cameras. Two cameras will be used in stereo to estimate the depth map, and the third captures the blurred frames. This data is then used to train a convolutional neural network that will predict the depth map from the blurry image.

Unsupervised Clustering Based on Pretext Tasks

The idea of this project is that we have two types of neural networks that work together: There is one network A that assigns images to k clusters and k (simple) networks of type B perform a self-supervised task on those clusters. The goal of all the networks is to make the k networks of type B perform well on the task. The assumption is that clustering in semantically similar groups will help the networks of type B to perform well. This could be done on the MNIST dataset with B being linear classifiers and the task being rotation prediction.

Adversarial Data-Augmentation

The student designs a data augmentation network that transforms training images in such a way that image realism is preserved (e.g. with a constrained spatial transformer network) and the transformed images are more difficult to classify (trained via adversarial loss against an image classifier). The model will be evaluated for different data settings (especially in the low data regime), for example on the MNIST and CIFAR datasets.

Unsupervised Learning of Lip-reading from Videos

People with sensory impairment (hearing, speech, vision) depend heavily on assistive technologies to communicate and navigate in everyday life. The mass production of media content today makes it impossible to manually translate everything into a common language for assistive technologies, e.g. captions or sign language. In this project, the student employs a neural network to learn a representation for lip-movement in videos in an unsupervised fashion, possibly with an encoder-decoder structure where the decoder reconstructs the audio signal. This requires collecting a large dataset of videos (e.g. from YouTube) of speakers or conversations where lip movement is visible. The outcome will be a neural network that learns an audio-visual representation of lip movement in videos, which can then be leveraged to generate captions for hearing impaired persons.

Learning to Generate Topographic Maps from Satellite Images

Satellite images have many applications, e.g. in meteorology, geography, education, cartography and warfare. They are an accurate and detailed depiction of the surface of the earth from above. Although it is relatively simple to collect many satellite images in an automated way, challenges arise when processing them for use in navigation and cartography. The idea of this project is to automatically convert an arbitrary satellite image, e.g. of a city, to a map of simple 2D shapes (streets, houses, forests) and label them with colors (semantic segmentation). The student will collect a dataset of satellite images and topographic maps and train a deep neural network that learns to map from one domain to the other. The data could be obtained from a Google Maps database or similar.

New Variables of Brain Morphometry: the Potential and Limitations of CNN Regression

Timo Blattner · Sept. 2022

The calculation of variables of brain morphology is computationally very expensive and time-consuming. A previous work showed the feasibility of extracting the variables directly from T1-weighted brain MRI images using a convolutional neural network. We used significantly more data and extended their model to a new set of neuromorphological variables, which could become interesting biomarkers in the future for the diagnosis of brain diseases. The model shows for nearly all subjects a less than 5% mean relative absolute error. This high relative accuracy can be attributed to the low morphological variance between subjects and the ability of the model to predict the cortical atrophy age trend. The model however fails to capture all the variance in the data and shows large regional differences. We attribute these limitations in part to the moderate to poor reliability of the ground truth generated by FreeSurfer. We further investigated the effects of training data size and model complexity on this regression task and found that the size of the dataset had a significant impact on performance, while deeper models did not perform better. Lack of interpretability and dependence on a silver ground truth are the main drawbacks of this direct regression approach.
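
As a reading aid for the "mean relative absolute error" figure above, the following is a minimal sketch of how such a metric is commonly computed. The exact definition used in the thesis is not given here, so the formula (absolute error divided by the absolute target, averaged over samples) and the toy values are assumptions.

```python
def mean_relative_absolute_error(y_true, y_pred):
    """Average of |prediction - target| / |target| over all samples."""
    return sum(abs(p - t) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical morphometric targets and model predictions (arbitrary units).
targets = [2.0, 2.5, 4.0]
predictions = [2.1, 2.5, 3.8]
error = mean_relative_absolute_error(targets, predictions)  # about 3.3%
```

Because the error is normalized by the target value, a low score can partly reflect low variance between subjects, which is the caveat the abstract raises.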

Home Monitoring by Radar

Lars Ziegler · Sept. 2022

Detection and tracking of humans via UWB radars is a promising and continuously evolving field with great potential for medical technology. This contactless method of acquiring data on a patient's movement patterns is ideal for in-home applications. As irregularities in a patient's movement patterns are an indicator of various health problems, including neurodegenerative diseases, the insight this data could provide may enable earlier detection of such problems. In this thesis, a signal processing pipeline is presented with which a person's movement is modeled. During an experiment, 142 measurements were recorded by two separate radar systems and one lidar system, each of which consisted of multiple sensors. The models that the signal processing pipeline calculated from these measurements were used to predict the times when a person stood up or sat down. The predictions showed an accuracy of 72.2%.

Revisiting non-learning based 3D reconstruction from multiple images

Aaron Sägesser · Oct. 2021

Arthroscopy consists of challenging tasks and requires skills that, even today, young surgeons still acquire directly during surgery. Existing simulators are expensive and rarely available. With the growing potential of (head-mounted) virtual reality (VR) devices for simulation and their applicability in the medical context, these devices have become a promising alternative that would be orders of magnitude cheaper and could be made widely available. The overall aim of our project is to build a VR-based training device for arthroscopy, as this would be of great benefit and might even be applicable to other minimally invasive surgery (MIS). This thesis marks a first step of the project, focusing on exploring and comparing well-known algorithms in a multi-view stereo (MVS) based 3D reconstruction with respect to imagery acquired by an arthroscopic camera. Alongside this reconstruction, we aim to obtain essential measures for comparing the VR environment to the real world, as validation of the realism of future VR tasks. We evaluate 3 different feature extraction algorithms with 3 different matching techniques and 2 different algorithms for the estimation of the fundamental (F) matrix. The evaluation of these 18 different setups is made with a reconstruction pipeline embedded in a Jupyter notebook implemented in Python based on common computer vision libraries, and compared with imagery generated with a mobile phone as well as with the reconstruction results of the state-of-the-art (SOTA) structure-from-motion (SfM) software COLMAP and Multi-View Environment (MVE). Our comparative analysis reveals the challenges posed by the heavy distortion, fish-eye shape and weak image quality of arthroscopic imagery, as all results are substantially worse using this data. However, there are huge differences between the setups.
Scale Invariant Feature Transform (SIFT) and Oriented FAST Rotated BRIEF (ORB) in combination with k-Nearest Neighbour (kNN) matching and Least Median of Squares (LMedS) present the most promising results. Overall, the 3D reconstruction pipeline is a useful tool to foster the process of gaining measurements from the arthroscopic exploration device and to complement the comparative research in this context.

Examination of Unsupervised Representation Learning by Predicting Image Rotations

Eric Lagger · Sept. 2020

In recent years, deep convolutional neural networks have made a lot of progress. Training such a network requires a lot of data, and supervised learning algorithms require that the data be labeled. Labeling data demands a lot of human work, which costs time and money. To avoid these inconveniences, we would like to find systems that do not need labeled data, that is, unsupervised learning algorithms. This is why unsupervised algorithms matter, even though their results are not yet on the same qualitative level as those of supervised algorithms. In this thesis we discuss one such approach and compare the results to other papers. A deep convolutional neural network is trained to recognize the rotations that have been applied to a picture: we take a large number of images, apply some simple rotations, and the task of the network is to discover in which direction each image has been rotated. The data does not need to be labeled with a category or anything else. As long as all the original pictures share the same orientation, we hope the network finds high-dimensional patterns to learn.
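
The rotation pretext task described above can be sketched as a data-generation step that needs no manual labels. The abstract only says "simple rotations"; the choice of multiples of 90 degrees, the helper names, and the toy 2x2 image are assumptions for illustration.

```python
def rotate90(image, k):
    """Rotate a 2D list-of-lists image counter-clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        # Transpose, then reverse the row order: one counter-clockwise quarter turn.
        image = [list(row) for row in zip(*image)][::-1]
    return [list(row) for row in image]

def make_rotation_batch(images):
    """Return (rotated_image, label) pairs; label k means a rotation of k * 90 degrees."""
    return [(rotate90(img, k), k) for img in images for k in range(4)]

toy = [[1, 2],
       [3, 4]]
batch = make_rotation_batch([toy])  # four training samples from one unlabeled image
```

The label is produced by the transformation itself, so a classifier can be trained on `batch` with ordinary supervised machinery while the dataset stays unlabeled.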

StitchNet: Image Stitching using Autoencoders and Deep Convolutional Neural Networks

Maurice Rupp · Sept. 2019

This thesis explores the prospect of artificial neural networks for image processing tasks. More specifically, it aims to stitch multiple overlapping images into a bigger, panoramic picture. Until now, this task has been approached solely with "classical", hardcoded algorithms, while deep learning has at most been used for specific subtasks. This thesis introduces a novel end-to-end neural network approach to image stitching called StitchNet, which uses a pre-trained autoencoder and deep convolutional networks. In addition to presenting several new datasets for the task of supervised image stitching, each with 120’000 training and 5’000 validation samples, this thesis also conducts various experiments with different kinds of existing networks designed for image super-resolution and image segmentation, adapted to the task of image stitching. StitchNet outperforms most of the adapted networks in both quantitative and qualitative results.

Facial Expression Recognition in the Wild

Luca Rolshoven · Sept. 2019

The idea of inferring the emotional state of a subject by looking at their face is nothing new. Neither is the idea of automating this process using computers. Researchers used to computationally extract handcrafted features from face images that had proven effective and then used machine learning techniques to classify the facial expressions using these features. Recently, there has been a trend towards using deep learning, and especially Convolutional Neural Networks (CNNs), for the classification of these facial expressions. Researchers were able to achieve good results on images that were taken in laboratories under the same or at least similar conditions. However, these models do not perform very well on more arbitrary face images with different head poses and illumination. This thesis aims to show the challenges of Facial Expression Recognition (FER) in this wild setting. It presents the currently used datasets and the present state-of-the-art results on one of the biggest facial expression datasets currently available. The contributions of this thesis are twofold. Firstly, I analyze three famous neural network architectures and their effectiveness on the classification of facial expressions. Secondly, I present two modifications of one of these networks that lead to the proposed STN-COV model. While this model does not outperform all of the current state-of-the-art models, it does beat several of them.

A Study of 3D Reconstruction of Varying Objects with Deformable Parts Models

Raoul Grossenbacher · July 2019

This work covers a new approach to 3D reconstruction. In traditional 3D reconstruction, one uses multiple images of the same object to calculate a 3D model by taking information gained from the differences between the images, like camera position, illumination of the images, rotation of the object and so on, to compute a point cloud representing the object. The characteristic trait shared by all these approaches is that one can change almost everything about the image, but it is not possible to change the object itself, because one needs to find correspondences between the images. To be able to use different instances of the same object, we used a 3D DPM model that can find different parts of an object in an image, thereby detecting the correspondences between the different pictures, which we can then use to calculate the 3D model. To put this theory into practice, we gave a 3D DPM model, which was trained to detect cars, pictures of different car brands, where no pair of images showed the same vehicle, and used the detected correspondences and the Factorization Method to compute the 3D point cloud. This technique leads to a completely new approach in 3D reconstruction, because changing the object itself was never done before.

Motion deblurring in the wild replication and improvements

Alvaro Juan Lahiguera · Jan. 2019

Coma Outcome Prediction with Convolutional Neural Networks

Stefan Jonas · Oct. 2018

Automatic Correction of Self-Introduced Errors in Source Code

Sven Kellenberger · Aug. 2018

Neural Face Transfer: Training a Deep Neural Network to Face-Swap

Till Nikolaus Schnabel · July 2018

This thesis explores the field of artificial neural networks with realistic looking visual outputs. It aims at morphing face pictures of a specific identity to look like another individual by only modifying key features, such as eye color, while leaving identity-independent features unchanged. Prior works have covered the topic of symmetric translation between two specific domains but failed to optimize it on faces where only parts of the image may be changed. This work applies a face masking operation to the output at training time, which forces the image generator to preserve colors while altering the face, fitting it naturally inside the unmorphed surroundings. Various experiments are conducted, including an ablation study on the final setting, decreasing the baseline identity switching performance from 81.7% to 75.8% whilst improving the average χ² color distance from 0.551 to 0.434. The provided code-based software gives users easy access to apply this neural face swap to images and videos of arbitrary crop and brings Computer Vision one step closer to replacing Computer Graphics in this specific area.

A Study of the Importance of Parts in the Deformable Parts Model

Sammer Puran · June 2017

Self-Similarity as a Meta Feature

Lucas Husi · April 2017

A Study of 3D Deformable Parts Models for Detection and Pose-Estimation

Simon Jenni · March 2015

Amodal Leaf Segmentation

Nicolas Maier · Nov. 2023

Plant phenotyping is the process of measuring and analyzing various traits of plants. It provides essential information on how genetic and environmental factors affect plant growth and development. Manual phenotyping is highly time-consuming; therefore, many computer vision and machine learning based methods have been proposed in the past years to perform this task automatically based on images of the plants. However, the publicly available datasets (in particular, of Arabidopsis thaliana) are limited in size and diversity, making them unsuitable to generalize to new unseen environments. In this work, we propose a complete pipeline able to automatically extract traits of interest from an image of Arabidopsis thaliana. Our method uses a minimal amount of existing annotated data from a source domain to generate a large synthetic dataset adapted to a different target domain (e.g., different backgrounds, lighting conditions, and plant layouts). In addition, unlike the source dataset, the synthetic one provides ground-truth annotations for the occluded parts of the leaves, which are relevant when measuring some characteristics of the plant, e.g., its total area. This synthetic dataset is then used to train a model to perform amodal instance segmentation of the leaves to obtain the total area, leaf count, and color of each plant. To validate our approach, we create a small dataset composed of manually annotated real images of Arabidopsis thaliana, which is used to assess the performance of the models.

Assessment of movement and pose in a hospital bed by ambient and wearable sensor technology in healthy subjects

Tony Licata · Sept. 2022

Automated systems that describe human motion have become feasible in various domains. Most of the proposed systems are designed to work with people moving around in a standing position. Because such a system could be useful in a medical environment, we propose in this work a pipeline that can effectively predict human motion for people lying in beds. The proposed pipeline is tested with a data set composed of 41 participants executing 7 predefined tasks in a bed. The motion of the participants is measured with video cameras, accelerometers and a pressure mat. Various experiments are carried out with the information retrieved from the data set. Two approaches combining the data from the different measurement technologies are explored. The performance of the different experiments is measured, and the proposed pipeline is composed of the components providing the best results. Finally, we show that the proposed pipeline only needs the video cameras, which makes the proposed environment easier to implement in real-life situations.

Machine Learning Based Prediction of Mental Health Using Wearable-measured Time Series

Seyedeh Sharareh Mirzargar · Sept. 2022

Depression is the second major cause of years spent in disability and has a growing prevalence in adolescents. The recent Covid-19 pandemic has intensified the situation and limited in-person patient monitoring due to distancing measures. Recent advances in wearable devices have made it possible to record the rest/activity cycle remotely with high precision and in real-world contexts. We aim to use machine learning methods to predict an individual's mental health based on wearable-measured sleep and physical activity. Predicting an impending mental health crisis of an adolescent allows for prompt intervention, detection of depression onset or its recurrence, and remote monitoring. To achieve this goal, we train three primary forecasting models (linear regression, random forest, and light gradient boosted machine (LightGBM)) and two deep learning models (block recurrent neural network (block RNN) and temporal convolutional network (TCN)) on Actigraph measurements to forecast mental health in terms of depression, anxiety, sleepiness, stress, sleep quality, and behavioral problems. Our models achieve a high forecasting performance, with the random forest performing best and reaching an accuracy of 98% for forecasting trait anxiety. We perform extensive experiments to evaluate the models' performance in accuracy, generalization, and feature utilization, using a naive forecaster as the baseline. Our analysis shows minimal mental health changes over two months, making the prediction task easily achievable. Due to these minimal changes in mental health, the models tend to primarily use the historical values of mental health evaluation instead of Actigraph features. At the time of this master thesis, the data acquisition step is still in progress. In future work, we plan to train the models on the complete dataset using a longer forecasting horizon to increase the level of mental health changes and perform transfer learning to compensate for the small dataset size.
This interdisciplinary project demonstrates the opportunities and challenges in machine learning based prediction of mental health, paving the way toward using the same techniques to forecast other mental disorders such as internalizing disorder, Parkinson's disease, Alzheimer's disease, etc. and improving the quality of life for individuals who have some mental disorder.
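
The role of the naive forecaster baseline mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not define its naive forecaster; the "repeat the last observed value" rule, the function names, and the toy scores below are assumptions.

```python
def naive_forecast(history):
    """Predict that the next value equals the last observed value."""
    return history[-1]

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical questionnaire scores for one participant over six visits.
scores = [10, 10, 11, 10, 10, 10]
predictions = [naive_forecast(scores[:t]) for t in range(1, len(scores))]
baseline_mae = mean_absolute_error(scores[1:], predictions)
```

When the target barely changes over the forecasting horizon, such a baseline is already hard to beat, which matches the abstract's observation that the models rely mainly on historical mental health values.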

CNN Spike Detector: Detection of Spikes in Intracranial EEG using Convolutional Neural Networks

Stefan Jonas · Oct. 2021

The detection of interictal epileptiform discharges in the visual analysis of electroencephalography (EEG) is an important but very difficult, tedious, and time-consuming task. There have been decades of research on computer-assisted detection algorithms, most recently focused on using Convolutional Neural Networks (CNNs). In this thesis, we present the CNN Spike Detector, a convolutional neural network to detect spikes in intracranial EEG. Our dataset of 70 intracranial EEG recordings from 26 subjects with epilepsy introduces new challenges in this research field. We report cross-validation results with a mean AUC of 0.926 (± 0.04), an area under the precision-recall curve (AUPRC) of 0.652 (± 0.10) and 12.3 (± 7.47) false positive epochs per minute for a sensitivity of 80%. A visual examination of false positive segments is performed to understand the model behavior leading to a relatively high false detection rate. We notice issues with the evaluation measures and highlight a major limitation of the common approach of detecting spikes using short segments, namely that the network is not capable of considering the greater context of the segment with regard to its origin. For this reason, we present the Context Model, an extension in which the CNN Spike Detector is supplied with additional information about the channel. Results show promising but limited performance improvements. This thesis provides important findings about the spike detection task for intracranial EEG and lays out promising future research directions to develop a network capable of assisting experts in real-world clinical applications.
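
For readers unfamiliar with the AUC figure reported above, the following minimal sketch computes the area under the ROC curve through its rank interpretation. It assumes that "AUC" in the abstract refers to ROC AUC (the abstract names AUPRC separately); the function name and the toy labels and scores are illustrative.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical detector outputs for four EEG epochs (1 = spike, 0 = no spike).
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

ROC AUC is insensitive to class imbalance, so with rare spikes a high AUC can coexist with many false positives per minute; this is one reason the abstract also reports AUPRC and the false positive rate at a fixed sensitivity.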

PolitBERT - Deepfake Detection of American Politicians using Natural Language Processing

Maurice Rupp · April 2021

This thesis explores the application of modern Natural Language Processing techniques to the detection of artificially generated videos of popular American politicians. Instead of focusing on detecting anomalies and artifacts in images and sounds, this thesis focuses on detecting irregularities and inconsistencies in the words themselves, opening up a new possibility to detect fake content. A novel, domain-adapted, pre-trained version of the language model BERT combined with several mechanisms to overcome severe dataset imbalances yielded the best quantitative as well as qualitative results. In addition to creating the biggest publicly available dataset of English-speaking politicians, consisting of 1.5 M sentences from over 1000 persons, this thesis conducts various experiments with different kinds of text classification and sequence processing algorithms applied to the political domain. Furthermore, multiple ablations to manage severe data imbalance are presented and evaluated.

A Study on the Inversion of Generative Adversarial Networks

Ramona Beck · March 2021

The desire to use generative adversarial networks (GANs) for real-world tasks such as object segmentation or image manipulation is increasing as synthesis quality improves, which has given rise to an emerging research area called GAN inversion that focuses on exploring methods for embedding real images into the latent space of a GAN. In this work, we investigate different GAN inversion approaches using an existing generative model architecture that takes a completely unsupervised approach to object segmentation and is based on StyleGAN2. In particular, we propose and analyze algorithms for embedding real images into the different latent spaces Z, W, and W+ of StyleGAN following an optimization-based inversion approach, while also investigating a novel approach that allows fine-tuning of the generator during the inversion process. Furthermore, we investigate a hybrid and a learning-based inversion approach, where in the former we train an encoder with embeddings optimized by our best optimization-based inversion approach, and in the latter we define an autoencoder, consisting of an encoder and the generator of our generative model as a decoder, and train it to map an image into the latent space. We demonstrate the effectiveness of our methods as well as their limitations through a quantitative comparison with existing inversion methods and by conducting extensive qualitative and quantitative experiments with synthetic data as well as real images from a complex image dataset. We show that we achieve qualitatively satisfying embeddings in the W and W+ spaces with our optimization-based algorithms, that fine-tuning the generator during the inversion process leads to qualitatively better embeddings in all latent spaces studied, and that the learning-based approach also benefits from a variable generator as well as a pre-training with our hybrid approach. 
Furthermore, we evaluate our approaches on the object segmentation task and show that both our optimization-based and our hybrid and learning-based methods are able to generate meaningful embeddings that achieve reasonable object segmentations. Overall, our proposed methods illustrate the potential that lies in the GAN inversion and its application to real-world tasks, especially in the relaxed version of the GAN inversion where the weights of the generator are allowed to vary.

Multi-scale Momentum Contrast for Self-supervised Image Classification

Zhao Xueqi · Dec. 2020.

With the maturing of supervised learning, research attention has gradually shifted to self-supervised learning. "Momentum Contrast" (MoCo) introduced a new self-supervised learning method and raised the accuracy of self-supervised learning to a new level. Inspired by another article, "Representation Learning by Learning to Count", it may be possible to further improve the accuracy of MoCo by dividing a picture into four parts and passing each through a neural network. Unlike the original MoCo, this variant (Multi-scale MoCo) does not pass the augmented image directly through the encoder. Instead, Multi-scale MoCo crops and resizes the augmented image, passes each of the four resulting parts through the encoder, and sums the outputs (the upsampled version does not resize the input but instead resizes the contrastive samples). This cropping is applied not only to queue q but also to the comparison queue k; otherwise the weights of queue k might be damaged during the momentum update. This is discussed further in the experiments chapter, which compares the downsampled Multi-scale version with the version that downsamples both. Human object recognition follows a similar principle: when people see something familiar, they can still guess the object with high probability even if it is not fully displayed. Multi-scale MoCo applies this idea to the pretext stage of MoCo in the hope of obtaining better feature extraction. This thesis presents three versions of Multi-scale MoCo: one with downsampled input samples, one with downsampled input and contrast samples, and one with upsampled input samples. The differences between these versions are described in more detail later. The network architecture used for comparison is ResNet50, and the test dataset is STL-10.
The weights obtained in the pretext stage are then transferred to the self-supervised learning stage, during which all layers except the final linear layer are kept frozen (their weights come from the pretext stage).
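The four-part encoding described in this abstract can be sketched as follows. This is only a minimal illustration under assumptions: the encoder is a hypothetical stand-in (global average pooling plus a random linear layer) instead of ResNet50, the parts are taken as the four image quadrants, and the resizing step is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 16))  # hypothetical encoder weights: 3 channels -> 16 features

def toy_encoder(img):
    """Stand-in encoder: global average pooling followed by a linear layer."""
    pooled = img.mean(axis=(0, 1))   # shape (3,)
    return pooled @ W                # shape (16,)

def four_crops(img):
    """Split an H x W x C image into its four quadrants."""
    h, w = img.shape[0] // 2, img.shape[1] // 2
    return [img[:h, :w], img[:h, w:], img[h:, :w], img[h:, w:]]

def multi_scale_features(img):
    """Encode each quadrant separately and sum the resulting feature vectors."""
    return sum(toy_encoder(crop) for crop in four_crops(img))

image = rng.random((96, 96, 3))      # STL-10 images are 96 x 96 RGB
features = multi_scale_features(image)
```

Because this toy encoder is linear, the summed features are just a rescaled version of the full-image features; with a real nonlinear encoder the per-part features carry different information, which is the point of the variant.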

Self-Supervised Learning Using Siamese Networks and Binary Classifier

Dušan Mihajlov · March 2020.

In this thesis, we present several approaches for training a convolutional neural network using only unlabeled data. Our autonomously supervised learning algorithms are based on the connection between an image patch, i.e., a zoomed image, and its original. Using a siamese neural network architecture, we aim to recognize whether the image patch fed to the first branch of the network comes from the same image that is presented to the second branch. By applying transformations to both images, and by using different zoom sizes at different positions, we force the network to extract high-level features with its convolutional layers. On top of our siamese architecture, a simple binary classifier measures the difference between the extracted feature maps and makes a decision. Thus, the classifier can solve the task correctly only when our convolutional layers extract useful representations. We can then use those representations to solve many different tasks related to the data used for unsupervised training. As the main benchmark for all of our models, we use the STL10 dataset, training a linear classifier on top of our convolutional layers with a small number of manually labeled images, which is a widely used benchmark for unsupervised learning. We also combine our idea with recent work on the same topic, a network called RotNet, which uses image rotations and thereby forces the network to learn rotation-dependent features from the dataset. This combination yields a new procedure that outperforms the original RotNet.
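The patch-versus-image pretext task described above needs training pairs with binary labels. The sketch below shows one plausible way to generate them; the helper names `random_patch` and `make_pair`, the patch size, and the 50/50 sampling of positive and negative pairs are assumptions for illustration, and the extra transformations mentioned in the abstract are left out.

```python
import numpy as np

def random_patch(img, size, rng):
    """Crop a square patch of side `size` at a random position."""
    top = rng.integers(0, img.shape[0] - size + 1)
    left = rng.integers(0, img.shape[1] - size + 1)
    return img[top:top + size, left:left + size]

def make_pair(images, size, rng):
    """Return (patch, full_image, label); label is 1 if the patch comes from full_image."""
    i = rng.integers(len(images))
    if rng.random() < 0.5:
        j = i                                              # positive pair
    else:
        j = (i + rng.integers(1, len(images))) % len(images)  # a different image
    return random_patch(images[i], size, rng), images[j], int(i == j)

rng = np.random.default_rng(0)
images = rng.random((10, 96, 96, 3))   # random stand-ins for unlabeled images
pairs = [make_pair(images, 32, rng) for _ in range(100)]
labels = [label for _, _, label in pairs]
```

In a full pipeline the patch would be resized to the network input size, and each element of a pair would go through one branch of the siamese network before the binary classifier.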

Learning Object Representations by Mixing Scenes

Lukas Zbinden · May 2019.

In the digital age of ever increasing data amassment and accessibility, the demand for scalable machine learning models effective at refining the new oil is unprecedented. Unsupervised representation learning methods present a promising approach to exploit this invaluable yet unlabeled digital resource at scale. However, a majority of these approaches focuses on synthetic or simplified datasets of images. What if a method could learn directly from natural Internet-scale image data? In this thesis, we propose a novel approach for unsupervised learning of object representations by mixing natural image scenes. Without any human help, our method mixes visually similar images to synthesize new realistic scenes using adversarial training. In this process the model learns to represent and understand the objects prevalent in natural image data and makes them available for downstream applications. For example, it enables the transfer of objects from one scene to another. Through qualitative experiments on complex image data we show the effectiveness of our method along with its limitations. Moreover, we benchmark our approach quantitatively against state-of-the-art works on the STL-10 dataset. Our proposed method demonstrates the potential that lies in learning representations directly from natural image data and reinforces it as a promising avenue for future research.

Representation Learning using Semantic Distances

Markus Roth · May 2019.

Zero-Shot Learning using Generative Adversarial Networks

Hamed Hemati · Dec. 2018.

Dimensionality Reduction via CNNs - Learning the Distance between Images

Ioannis Glampedakis · Sept. 2018.

Learning to Play Othello using Deep Reinforcement Learning and Self Play

Thomas Simon Steinmann · Sept. 2018.

ABA-J Interactive Multi-Modality Tissue Section-to-Volume Alignment: A Brain Atlasing Toolkit for ImageJ

Felix Meyenhofer · March 2018.

Learning Visual Odometry with Recurrent Neural Networks

Adrian Wälchli · Feb. 2018.

In computer vision, Visual Odometry is the problem of recovering the camera motion from a video. It is related to Structure from Motion, the problem of reconstructing the 3D geometry from a collection of images. Decades of research in these areas have brought successful algorithms that are used in applications like autonomous navigation, motion capture, augmented reality and others. Despite the success of these prior works in real-world environments, their robustness is highly dependent on manual calibration and the magnitude of noise present in the images in the form of, e.g., non-Lambertian surfaces, dynamic motion and other forms of ambiguity. This thesis explores an alternative approach to the Visual Odometry problem via Deep Learning, that is, a specific form of machine learning with artificial neural networks. It describes and focuses on the implementation of a recent work that proposes the use of Recurrent Neural Networks to learn dependencies over time due to the sequential nature of the input. Together with a convolutional neural network that extracts motion features from the input stream, the recurrent part accumulates knowledge from the past to make camera pose estimations at each point in time. An analysis of the performance of this system is carried out on real and synthetic data. The evaluation covers several ways of training the network as well as the impact and limitations of the recurrent connection for Visual Odometry.

Crime Location and Timing Prediction

Bernard Swart · Jan. 2018.

From Cartoons to Real Images: An Approach to Unsupervised Visual Representation Learning

Simon Jenni · Feb. 2017.

Automatic and Large-Scale Assessment of Fluid in Retinal OCT Volume

Nina Mujkanovic · Dec. 2016.

Segmentation in 3D using Eye-Tracking Technology

Michele Wyss · July 2016.

Accurate Scale Thresholding via Logarithmic Total Variation Prior

Remo Diethelm · Aug. 2014.

Novel Techniques for Robust and Generalizable Machine Learning

Abdelhak Lemkhenter · Sept. 2023.

Neural networks have transcended their status of powerful proof-of-concept machine learning into the realm of a highly disruptive technology that has revolutionized many quantitative fields such as drug discovery, autonomous vehicles, and machine translation. Today, it is nearly impossible to go a single day without interacting with a neural network-powered application. From search engines to on-device photo-processing, neural networks have become the go-to solution thanks to recent advances in computational hardware and an unprecedented scale of training data. Larger and less curated datasets, typically obtained through web crawling, have greatly propelled the capabilities of neural networks forward. However, this increase in scale amplifies certain challenges associated with training such models. Beyond toy or carefully curated datasets, data in the wild is plagued with biases, imbalances, and various noisy components. Given the larger size of modern neural networks, such models run the risk of learning spurious correlations that fail to generalize beyond their training data. This thesis addresses the problem of training more robust and generalizable machine learning models across a wide range of learning paradigms for medical time series and computer vision tasks. The former is a typical example of a low signal-to-noise ratio data modality with a high degree of variability between subjects and datasets. There, we tailor the training scheme to focus on robust patterns that generalize to new subjects and ignore the noisier and subject-specific patterns. To achieve this, we first introduce a physiologically inspired unsupervised training task and then extend it by explicitly optimizing for cross-dataset generalization using meta-learning. 
In the context of image classification, we address the challenge of training semi-supervised models under class imbalance by designing a novel label refinement strategy with higher local sensitivity to minority class samples while preserving the global data distribution. Lastly, we introduce a new Generative Adversarial Network (GAN) training loss. Such generative models could be applied to improve the training of subsequent models in the low data regime by augmenting the dataset using generated samples. Unfortunately, GAN training relies on a delicate balance between its components, making it prone to mode collapse. Our contribution consists of defining a more principled GAN loss whose gradients incentivize the generator model to seek out missing modes in its distribution. All in all, this thesis tackles the challenge of training more robust machine learning models that can generalize beyond their training data. This necessitates the development of methods specifically tailored to handle the diverse biases and spurious correlations inherent in the data. It is important to note that achieving greater generalizability in models goes beyond simply increasing the volume of data; it requires meticulous consideration of training objectives and model architecture. By tackling these challenges, this research contributes to advancing the field of machine learning and underscores the significance of thoughtful design in obtaining more resilient and versatile models.

Automated Sleep Scoring, Deep Learning and Physician Supervision

Luigi Fiorillo · Oct. 2022.

Sleep plays a crucial role in human well-being. Polysomnography is used in sleep medicine as a diagnostic tool to objectively analyze the quality of sleep. Sleep scoring is the procedure of extracting sleep cycle information from whole-night electrophysiological signals. The scoring is done worldwide by sleep physicians according to the official American Academy of Sleep Medicine (AASM) scoring manual. In the last decades, a wide variety of deep learning based algorithms have been proposed to automatise the sleep scoring task. In this thesis we study the reasons why these algorithms fail to be introduced in the daily clinical routine, with the perspective of bridging the existing gap between the automatic sleep scoring models and the sleep physicians. In this light, the primary step is the design of a simplified sleep scoring architecture that also provides an estimate of the model uncertainty. Besides achieving results on par with the most up-to-date scoring systems, we demonstrate the efficiency of ensemble learning based algorithms, together with label smoothing techniques, in both enhancing the performance and calibrating the simplified scoring model. We introduce an uncertainty estimation procedure to identify the most challenging sleep stage predictions and to quantify the disagreement between the predictions given by the model and the annotations given by the physicians. In this thesis we also propose a novel method to integrate the inter-scorer variability into the training procedure of a sleep scoring model. We clearly show that a deep learning model is able to encode this variability, so as to better adapt to the consensus of a group of scorers (physicians). We finally address the generalization ability of a deep learning based sleep scoring system, further studying its resilience to the sleep complexity and to the AASM scoring rules. We can state that there is no need to train the algorithm strictly following the AASM guidelines.
Most importantly, using data from multiple data centers results in a better performing model compared with training on a single data cohort. The variability among different scorers and data centers needs to be taken into account, more than the variability among sleep disorders.
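Label smoothing and ensembling, both mentioned in this abstract, have simple standard forms that can be sketched in a few lines. The code below uses the common uniform smoothing formula and plain probability averaging; whether the thesis uses exactly these variants is an assumption, and the function names and the value of `eps` are illustrative.

```python
import numpy as np

def smooth_labels(labels, num_classes, eps=0.1):
    """Turn integer labels into smoothed one-hot targets (uniform smoothing)."""
    one_hot = np.eye(num_classes)[labels]
    return (1.0 - eps) * one_hot + eps / num_classes

def ensemble_average(prob_list):
    """Average the class probabilities predicted by several models."""
    return np.mean(prob_list, axis=0)

# Five sleep stages (W, N1, N2, N3, REM) as defined in the AASM manual.
targets = smooth_labels(np.array([0, 2, 4]), num_classes=5, eps=0.1)

# Two hypothetical model outputs for a single epoch of sleep data.
p_model_a = np.array([0.70, 0.10, 0.10, 0.05, 0.05])
p_model_b = np.array([0.50, 0.30, 0.10, 0.05, 0.05])
p_ensemble = ensemble_average([p_model_a, p_model_b])
```

Smoothed targets keep the model from being pushed toward fully confident outputs, which is one reason the technique is associated with better calibration.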

Learning Representations for Controllable Image Restoration

Givi Meishvili · March 2022.

Deep Convolutional Neural Networks have sparked a renaissance in all the sub-fields of computer vision. Tremendous progress has been made in the area of image restoration. The research community has pushed the boundaries of image deblurring, super-resolution, and denoising. However, given a distorted image, most existing methods typically produce a single restored output. The tasks mentioned above are inherently ill-posed, leading to an infinite number of plausible solutions. This thesis focuses on designing image restoration techniques capable of producing multiple restored results and granting users more control over the restoration process. Towards this goal, we demonstrate how one could leverage the power of unsupervised representation learning. Image restoration is vital when applied to distorted images of human faces due to their social significance. Generative Adversarial Networks enable an unprecedented level of generated facial details combined with smooth latent space. We leverage the power of GANs towards the goal of learning controllable neural face representations. We demonstrate how to learn an inverse mapping from image space to these latent representations, tuning these representations towards a specific task, and finally manipulating latent codes in these spaces. For example, we show how GANs and their inverse mappings enable the restoration and editing of faces in the context of extreme face super-resolution and the generation of novel view sharp videos from a single motion-blurred image of a face. This thesis also addresses more general blind super-resolution, denoising, and scratch removal problems, where blur kernels and noise levels are unknown. We resort to contrastive representation learning and first learn the latent space of degradations. We demonstrate that the learned representation allows inference of ground-truth degradation parameters and can guide the restoration process. 
Moreover, it enables control over the amount of deblurring and denoising in the restoration via manipulation of latent degradation features.

Learning Generalizable Visual Patterns Without Human Supervision

Simon Jenni · Oct. 2021.

Owing to the existence of large labeled datasets, Deep Convolutional Neural Networks have ushered in a renaissance in computer vision. However, almost all of the visual data we generate daily - several human lives worth of it - remains unlabeled and thus out of reach of today's dominant supervised learning paradigm. This thesis focuses on techniques that steer deep models towards learning generalizable visual patterns without human supervision. Our primary tool in this endeavor is the design of Self-Supervised Learning tasks, i.e., pretext-tasks for which labels do not involve human labor. Besides enabling the learning from large amounts of unlabeled data, we demonstrate how self-supervision can capture relevant patterns that supervised learning largely misses. For example, we design learning tasks that learn deep representations capturing shape from images, motion from video, and 3D pose features from multi-view data. Notably, these tasks' design follows a common principle: the recognition of data transformations. The strong performance of the learned representations on downstream vision tasks such as classification, segmentation, action recognition, or pose estimation validates this pretext-task design. This thesis also explores the use of Generative Adversarial Networks (GANs) for unsupervised representation learning. Besides leveraging generative adversarial learning to define image transformations for self-supervised learning tasks, we also address training instabilities of GANs through the use of noise. While unsupervised techniques can significantly reduce the burden of supervision, in the end, we still rely on some annotated examples to fine-tune learned representations towards a target task. To improve the learning from scarce or noisy labels, we describe a supervised learning algorithm with improved generalization in these challenging settings.
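The principle of recognizing data transformations can be illustrated with the simplest well-known case, rotation prediction in the style of RotNet. The sketch below only builds the self-labeled training data; rotation is used here as an assumed example transformation and is not claimed to be one of the specific tasks designed in the thesis.

```python
import numpy as np

def rotation_batch(img):
    """Return the four 90-degree rotations of a square image and their labels 0..3."""
    rotated = [np.rot90(img, k=k, axes=(0, 1)) for k in range(4)]
    return np.stack(rotated), np.arange(4)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))        # random stand-in for an unlabeled image
inputs, labels = rotation_batch(image)
```

A classifier trained to predict `labels` from `inputs` receives its supervision from the transformation itself, so no human annotation is involved.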

Learning Interpretable Representations of Images

Attila Szabó · June 2019.

Computers represent images with pixels and each pixel contains three numbers for red, green and blue colour values. These numbers are meaningless for humans and they are mostly useless when used directly with classical machine learning techniques like linear classifiers. Interpretable representations are the attributes that humans understand: the colour of the hair, viewpoint of a car or the 3D shape of the object in the scene. Many computer vision tasks can be viewed as learning interpretable representations, for example a supervised classification algorithm directly learns to represent images with their class labels. In this work we aim to learn interpretable representations (or features) indirectly with lower levels of supervision. This approach has the advantage of cost savings on dataset annotations and the flexibility of using the features for multiple follow-up tasks. We made contributions in three main areas: weakly supervised learning, unsupervised learning and 3D reconstruction. In the weakly supervised case we use image pairs as supervision. Each pair shares a common attribute and differs in a varying attribute. We propose a training method that learns to separate the attributes into separate feature vectors. These features then are used for attribute transfer and classification. We also show theoretical results on the ambiguities of the learning task and the ways to avoid degenerate solutions. We show a method for unsupervised representation learning, that separates semantically meaningful concepts. We explain and show ablation studies how the components of our proposed method work: a mixing autoencoder, a generative adversarial net and a classifier. We propose a method for learning single image 3D reconstruction. It is done using only the images, no human annotation, stereo, synthetic renderings or ground truth depth map is needed. We train a generative model that learns the 3D shape distribution and an encoder to reconstruct the 3D shape. 
For that we exploit the notion of image realism. It means that the 3D reconstruction of the object has to look realistic when it is rendered from different random angles. We prove the efficacy of our method from first principles.

Learning Controllable Representations for Image Synthesis

Qiyang Hu · June 2019.

In this thesis, our focus is learning a controllable representation and applying the learned controllable feature representation to image synthesis, video generation, and even 3D reconstruction. We propose different methods to disentangle the feature representation in neural networks and analyze the challenges in disentanglement, such as reference ambiguity and the shortcut problem, when using weak labels. We use the disentangled feature representation to transfer attributes between images, such as exchanging hairstyles between two face images. Furthermore, we study the problem of how another type of feature, the sketch, works in a neural network. A sketch can provide the shape and contour of an object, such as the silhouette of a side-view face. We leverage the silhouette constraint to improve 3D face reconstruction from 2D images. A sketch can also provide the moving direction of an object, so we investigate how one can manipulate the object to follow the trajectory provided by a user sketch. We propose a method to automatically generate video clips from a single image input, using the sketch as motion and trajectory guidance to animate the object in that image. We demonstrate the efficiency of our approaches on several synthetic and real datasets.

Beyond Supervised Representation Learning

Mehdi Noroozi · Jan. 2019.

The complexity of any information processing task is highly dependent on the space where data is represented. Unfortunately, pixel space is not appropriate for computer vision tasks such as object classification. Traditional computer vision approaches involve a multi-stage pipeline in which images are first transformed to a feature space through a handcrafted function and the task is then solved in that feature space. The challenge with this approach is the complexity of designing handcrafted functions that extract robust features. Deep learning based approaches address this issue by training a neural network end-to-end on some tasks, which lets the network discover the appropriate representation for the training tasks automatically. It turns out that the image classification task on large scale annotated datasets yields a representation transferable to other computer vision tasks. However, supervised representation learning is limited by the availability of annotations. In this thesis we study self-supervised representation learning, where the goal is to alleviate these limitations by substituting the classification task with pseudo tasks where the labels come for free. We discuss self-supervised learning by solving jigsaw puzzles, which uses context as a supervisory signal. The rationale behind this task is that the network needs to extract features about object parts and their spatial configurations to solve the jigsaw puzzles. We also discuss a method for representation learning that uses an artificial supervisory signal based on counting visual primitives. This supervisory signal is obtained from an equivariance relation. We use two image transformations in the context of counting: scaling and tiling. The first transformation exploits the fact that the number of visual primitives should be invariant to scale. The second transformation allows us to equate the total number of visual primitives in each tile to that in the whole image.
The most effective transfer strategy is fine-tuning, which restricts one to use the same model or parts thereof for both pretext and target tasks. We discuss a novel framework for self-supervised learning that overcomes limitations in designing and comparing different tasks, models, and data domains. In particular, our framework decouples the structure of the self-supervised model from the final task-specific fine-tuned model. Finally, we study the problem of multi-task representation learning. A naive approach to enhance the representation learned by a task is to train the task jointly with other tasks that capture orthogonal attributes. Having a diverse set of auxiliary tasks imposes challenges on multi-task training from scratch. We propose a framework that allows us to combine arbitrarily different feature spaces into a single deep neural network. We reduce the auxiliary tasks to classification tasks and, consequently, the multi-task learning to a multi-label classification task. Nevertheless, combining multiple representation spaces without being aware of the target task might be suboptimal. As our second contribution, we show empirically that this is indeed the case and propose to combine multiple tasks after fine-tuning on the target task.
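The jigsaw pretext task mentioned in this abstract can be sketched as a data-generation step: cut an image into tiles, shuffle them with a permutation drawn from a fixed set, and use the index of that permutation as the label. The 3 x 3 grid, the tiny random permutation set, and the helper names below are assumptions chosen for illustration.

```python
import numpy as np

def split_tiles(img, grid=3):
    """Cut an image into grid x grid tiles, listed in row-major order."""
    h, w = img.shape[0] // grid, img.shape[1] // grid
    return [img[r * h:(r + 1) * h, c * w:(c + 1) * w]
            for r in range(grid) for c in range(grid)]

def make_puzzle(img, perm_set, rng, grid=3):
    """Shuffle the tiles with a permutation from perm_set; the label is its index."""
    label = int(rng.integers(len(perm_set)))
    tiles = split_tiles(img, grid)
    shuffled = np.stack([tiles[i] for i in perm_set[label]])
    return shuffled, label

rng = np.random.default_rng(0)
perm_set = [rng.permutation(9) for _ in range(4)]  # hypothetical permutation set
image = rng.random((96, 96, 3))                    # random stand-in for an unlabeled image
puzzle, label = make_puzzle(image, perm_set, rng)
```

A network that predicts `label` from `puzzle` has to reason about which tile belongs where, which is the spatial-configuration signal the abstract refers to.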

Motion Deblurring from a Single Image

Meiguang Jin · Dec. 2018.

With the information explosion, a tremendous number of photos are captured and shared via social media every day. Technically, a photo requires a finite exposure to accumulate light from the scene. Thus, objects moving during the exposure generate motion blur in a photo. Motion blur is an image degradation that makes visual content less interpretable and is therefore often seen as a nuisance. Although motion blur can be reduced by setting a short exposure time, the resulting lack of light has to be compensated for by increasing the sensor's sensitivity, which inevitably introduces a large amount of sensor noise. This motivates removing motion blur computationally. Motion deblurring is an important problem in computer vision and it is challenging due to its ill-posed nature, which means the solution is not well defined. Mathematically, a blurry image caused by uniform motion is formed by the convolution of a blur kernel with a latent sharp image. There are potentially infinitely many pairs of blur kernel and latent sharp image that result in the same blurry image. Hence, some prior knowledge or regularization is required to address this problem. Even if the blur kernel is known, restoring the latent sharp image is still difficult because the high frequency information has been removed. Although we can model the uniform motion deblurring problem mathematically, this model only covers in-plane translational motion of the camera. In practice, motion is more complicated and can be non-uniform. Non-uniform motion blur can come from many sources: camera out-of-plane rotation, scene depth change, object motion and so on. Thus, it is more challenging to remove non-uniform motion blur. In this thesis, our focus is motion blur removal. We aim to address four challenging motion deblurring problems. We start from the noise blind image deblurring scenario, where the blur kernel is known but the noise level is unknown.
We introduce an efficient and robust solution based on a Bayesian framework using a smooth generalization of the 0−1 loss to address this problem. Then we study the blind uniform motion deblurring scenario where both the blur kernel and the latent sharp image are unknown. We explore the relative scale ambiguity between the latent sharp image and blur kernel to address this issue. Moreover, we study the face deblurring problem and introduce a novel deep learning network architecture to solve it. We also address the general motion deblurring problem and particularly we aim at recovering a sequence of 7 frames each depicting some instantaneous motion of the objects in the scene.
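The convolution model of uniform blur and the ambiguities discussed in this abstract can be checked numerically. The sketch below works on a 1D signal as a simplified stand-in for an image (an assumption made to keep the example short) and shows two different kernel and image pairs that produce exactly the same blurry observation as the original pair.

```python
import numpy as np

rng = np.random.default_rng(0)
sharp = rng.random(64)          # 1D stand-in for a latent sharp image
kernel = np.ones(5) / 5.0       # uniform motion blur kernel that sums to 1

# Blur model: the blurry signal is the convolution of kernel and sharp signal.
blurry = np.convolve(sharp, kernel, mode="full")

# Relative scale ambiguity: scaling the kernel by a and the image by 1/a
# leaves the blurry observation unchanged.
a = 3.0
blurry_scaled = np.convolve(sharp / a, a * kernel, mode="full")

# Another valid explanation: a delta kernel with the blurry signal itself
# playing the role of the "sharp" image.
delta = np.array([1.0])
blurry_trivial = np.convolve(blurry, delta, mode="full")
```

Since all three pairs explain the observation equally well, a blind method needs priors or constraints, such as fixing the kernel to sum to one, to select among them.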

Towards a Novel Paradigm in Blind Deconvolution: From Natural to Cartooned Image Statistics

Daniele Perrone · July 2015.

In this thesis we study the blind deconvolution problem. Blind deconvolution consists in the estimation of a sharp image and a blur kernel from an observed blurry image. Because the blur model admits several solutions it is necessary to devise an image prior that favors the true blur kernel and sharp image. Recently it has been shown that a class of blind deconvolution formulations and image priors has the no-blur solution as global minimum. Despite this shortcoming, algorithms based on these formulations and priors can successfully solve blind deconvolution. In this thesis we show that a suitable initialization can exploit the non-convexity of the problem and yield the desired solution. Based on these conclusions, we propose a novel “vanilla” algorithm stripped of any enhancement typically used in the literature. Our algorithm, despite its simplicity, is able to compete with the top performers on several datasets. We have also investigated a remarkable behavior of a 1998 algorithm, whose formulation has the no-blur solution as global minimum: even when initialized at the no-blur solution, it converges to the correct solution. We show that this behavior is caused by an apparently insignificant implementation strategy that makes the algorithm no longer minimize the original cost functional. We also demonstrate that this strategy improves the results of our “vanilla” algorithm. Finally, we present a study of image priors for blind deconvolution. We provide experimental evidence supporting the recent belief that a good image prior is one that leads to a good blur estimate rather than being a good natural image statistical model. By focusing the attention on the blur estimation alone, we show that good blur estimates can be obtained even when using images quite different from the true sharp image. This allows using image priors, such as those leading to “cartooned” images, that avoid the no-blur solution. 
By using an image prior that produces “cartooned” images we achieve state-of-the-art results on different publicly available datasets. We therefore suggest a paradigm shift in blind deconvolution: from modeling natural image statistics to modeling cartooned image statistics.

New Perspectives on Uncalibrated Photometric Stereo

Thoma Papadhimitri · June 2014.

This thesis investigates the problem of 3D reconstruction of a scene from 2D images. In particular, we focus on photometric stereo, a technique that computes the 3D geometry from at least three images taken from the same viewpoint and under different illumination conditions. When the illumination is unknown (uncalibrated photometric stereo) the problem is ambiguous: different combinations of geometry and illumination can generate the same images. First, we solve the ambiguity by exploiting the Lambertian reflectance maxima. These are points defined on curved surfaces where the normals are parallel to the light direction. Then, we propose a solution that can be computed in closed form and thus very efficiently. Our algorithm is also very robust and always yields the same estimate regardless of the initial ambiguity. We validate our method on real world experiments and achieve state-of-the-art results. In this thesis we also solve for the first time the uncalibrated photometric stereo problem under the perspective projection model. We show that unlike in the orthographic case, one can uniquely reconstruct the normals of the object and the lights given only the input images and the camera calibration (focal length and image center). We also propose a very efficient algorithm, which we validate on synthetic and real world experiments, and show that the proposed technique is a generalization of the orthographic case. Finally, we investigate the uncalibrated photometric stereo problem in the case where the lights are distributed near the scene. In this case we propose an alternating minimization technique which converges quickly and overcomes the limitations of prior work that assumes distant illumination. We show experimentally that adopting a near-light model for real world scenes yields very accurate reconstructions.
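The ambiguity described in this abstract follows directly from the Lambertian image model. The sketch below first solves the easier calibrated case (known lights) by least squares and then shows that, without known lights, a different normal and light pair reproduces the same images. It assumes distant lights, unit albedo, and no shadows, and it is not the method proposed in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth unit normals for 5 surface points (albedo assumed to be 1).
N = rng.normal(size=(5, 3))
N /= np.linalg.norm(N, axis=1, keepdims=True)

# Four distant light directions, one per image, stored as columns.
L = rng.normal(size=(3, 4))

# Lambertian image formation without shadows: I[p, k] = n_p . l_k
I = N @ L

# Calibrated case: with known lights, the normals follow by least squares
# from I^T = L^T N^T.
N_est = np.linalg.lstsq(L.T, I.T, rcond=None)[0].T

# Uncalibrated case: for any invertible 3 x 3 matrix A, the pair
# (N A, A^{-1} L) explains exactly the same images.
A = rng.normal(size=(3, 3))
I_alt = (N @ A) @ (np.linalg.inv(A) @ L)
```

Resolving this matrix ambiguity requires additional constraints, and the reflectance maxima mentioned in the abstract are one source of such constraints.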

Title: Towards Unsupervised Representation Learning: Learning, Evaluating and Transferring Visual Representations

Abstract: Unsupervised representation learning aims at finding methods that learn representations from data without annotation-based signals. Abstaining from annotations not only leads to economic benefits but may - and to some extent already does - result in advantages regarding the representation's structure, robustness, and generalizability to different tasks. In the long run, unsupervised methods are expected to surpass their supervised counterparts due to the reduction of human intervention and the inherently more general setup that does not bias the optimization towards an objective originating from specific annotation-based signals. While major advantages of unsupervised representation learning have been recently observed in natural language processing, supervised methods still dominate in vision domains for most tasks. In this dissertation, we contribute to the field of unsupervised (visual) representation learning from three perspectives: (i) Learning representations: We design unsupervised, backpropagation-free Convolutional Self-Organizing Neural Networks (CSNNs) that utilize self-organization- and Hebbian-based learning rules to learn convolutional kernels and masks to achieve deeper backpropagation-free models. (ii) Evaluating representations: We build upon the widely used (non-)linear evaluation protocol to define pretext- and target-objective-independent metrics for measuring and investigating the objective function mismatch between various unsupervised pretext tasks and target tasks. (iii) Transferring representations: We contribute CARLANE, the first 3-way sim-to-real domain adaptation benchmark for 2D lane detection, and a method based on prototypical self-supervised learning. Finally, we contribute a content-consistent unpaired image-to-image translation method that utilizes masks, global and local discriminators, and similarity sampling to mitigate content inconsistencies.



‘Yo Soy la Mamá’: A Migrant Mother’s Struggle to Get Back Her Son

She came to the United States fleeing her abuser. When child welfare got involved, she risked losing her son forever.

Ricardo was separated from his mother, Olga, at the age of 5. They are from Honduras. Credit: Eva Marie Uzcategui for The New York Times

By Deborah Sontag

March 25, 2024

Over the final four months of 2021, Olga, a Honduran immigrant in Hollywood, Fla., grew increasingly panicked. She could not find her 5-year-old son, Ricardo. After she’d fled her homeland to escape her abusive husband, the man also migrated, disappeared with the boy and broke off contact.

By day, Olga lived her life. She cut, colored and styled hair at a Miami salon, chatting with clients as if she hadn’t a care in the world. She mothered her 7-year-old daughter, Dariela, straining to distract her from the fact that her little brother was missing. But the nights were tough. “I cried into my pillow,” Olga said. “Where was my sweet little boy? Was he, at least, safe?”

He was not.

By the time Olga, then 28, tracked her son to Massachusetts, he had been removed from his father over allegations of physical abuse. Calling office after office of the Department of Children and Families, she finally reached a woman who turned out to be Ricardo’s caseworker.

“Who are you?” the woman said.

“Yo soy la mamá,” Olga replied, bursting into tears.

In early January 2022, Olga, who asked that her last name be withheld to protect her children, flew to Boston. It would only be a matter of presenting evidence — Ricardo’s birth certificate, videos of him on her phone, DNA if necessary — before she could take him home, she thought.

But when immigration and child welfare are involved — two contentious issues and their beleaguered systems — nothing is straightforward.

Under an interstate compact, Massachusetts formally asked Florida to approve the relocation. Florida said no. Though a caseworker found Olga to have a clean record, a proper home and sufficient income, she denied the move because Olga was not a legal U.S. resident.

Massachusetts does not consider undocumented status a reason to prevent reunification with a parent. But intensely cautious amid a scandal involving another child’s death, the state’s child protection authorities froze, sending Ricardo on a destabilizing odyssey through the foster care system. In a case that reveals the unique vulnerabilities of unauthorized immigrant parents, Olga risked losing her son forever.

Immigrant family separation did not start or stop with the Trump administration’s thwarted “zero tolerance” policy. Now as before, and with record numbers of new unauthorized immigrants fanning out across the country, it happens more insidiously.

“When people think of family separation, they think of the Southern border and kids in cages,” said Lori Nessel, director of an immigrant rights clinic at Seton Hall Law School in New Jersey. “But people don’t realize how much this occurs every day in the interior of the country.”

Like other poor parents, unauthorized immigrants encounter the chronic fallibility of state-run agencies in which Black and Hispanic children are overrepresented. They also tangle with the antiquated bureaucracy that governs the relocation of children across state lines.

But their status puts them at an additional disadvantage. They confront language and cultural barriers as well as limited access to services and benefits, fear of immigration enforcement, inadequate legal representation and, finally, anti-immigrant bias.

Additionally, many caseworkers and judges harbor the misconception that all unauthorized immigrants are on the brink of deportation, viewing their homes as inherently unstable. Yet fewer than 1 percent were removed by Immigration and Customs Enforcement last year.

Cristina Cooper, senior attorney with the American Bar Association’s Center on Children and the Law, described Florida’s decision in Olga’s case as “shocking and harmful.” Undocumented status alone does not make a parent unfit. And under the 14th Amendment, fit parents, regardless of immigration status, have a protected right to the care, custody and control of their children.

Asked whether it was now Florida’s policy to refuse custody based on immigration status, Miguel Nevarez, press secretary for the state’s Department of Children and Families, neither answered directly nor denied it. “Cases regarding one’s legal or illegal status wouldn’t exist if the federal government enforced our immigration laws,” he said.

In Olga’s case, that line of thinking trickled down to South Florida from Tallahassee, where Gov. Ron DeSantis signed a bill last spring that he proudly called “the strongest anti-illegal-immigration legislation in the country.”

When Olga’s advocates phoned her caseworker’s supervisor, according to Nick Herbold, the boy’s first foster father, the woman told them: “Hey, we’re in Florida. She’s undocumented. There’s no concern about the home. There’s no concern about safety with the mother. It’s just the fact that politically we cannot sign off on it.”

Olga never intended to be an undocumented immigrant. If she ever journeyed north, she thought, she would do so with a visa.

What changed her mind was something that propels many women to leave Honduras, according to a report by the Washington Office on Latin America: “domestic violence and a lack of resources for protections or aid.”

Growing up near the Maya ruins of Copán, daughter of a tailor and a factory worker, Olga set her sights on a professional career. But in her first year studying law at the University of San Pedro Sula, she met Ricardo’s father. (He did not respond to messages from The New York Times.) In her second year, she got pregnant and dropped out.

Even before they married, her boyfriend became volatilely “machista,” she said. After their two children were born in quick succession, he turned physically abusive to her and Dariela. When she finally kicked him out, she didn’t trust him to stay away.

Selling property, Olga raised $10,000, enough to pay the way to the United States for herself and one child. Leave Ricardo with me, her mother said, pledging to travel north with him later.

The journey was harrowing, but once Olga and Dariela were safely ensconced in a relative’s spacious house in Coral Gables, Fla., Olga started to regret leaving the boy behind.

Still, Ricardo was fine with his grandmother — until his father showed up and forcibly reclaimed him, Olga said. The man then traveled with Ricardo to the coffee fields of Hawaii and eventually disappeared with him into the vast U.S. mainland.

In mid-November 2021, Ricardo’s father enrolled him in kindergarten at the Albert F. Argenziano School in Somerville, Mass. Four days later, Ricardo told his teacher that his body hurt. A child who idolizes superheroes, he willed himself not to cry as he revealed a vivid bruise on his leg and confided that his father beat him with a belt when he misbehaved.

Alarmed but not wanting to alarm the rest of the class, the teacher, Ilana Cohen, quietly asked her paraprofessional to take the small child with the soft brown eyes to the health office. Down the hall they went, sneakers squeaking on linoleum past a rainbow-decorated bulletin board that proclaimed: “YOU BELONG HERE! YOU MATTER!”

The nurse observed not only the contusion on Ricardo’s leg but also other, fading bruises. She alerted the principal, Glenda Soto, that she would have to immediately report suspected abuse to the child welfare department.

“O Lord, give me strength,” Ms. Soto said to herself. In her seven years as an administrator at Argenziano, an elementary school with nearly 600 students, she had dealt with only one case in which a child had to be taken from a parent.

Ricardo was whisked away for a forensic examination at Boston Children’s Hospital. Christianne Sharr had just started there as a physician assistant, though she was not at work when her phone rang late that Thursday.

“We have an emergency removal,” a foster care worker said. “The child needs a home tonight.”

Ms. Sharr and her husband, Mr. Herbold, a software engineer, were new foster parents, having been eager during the pandemic “to do something hopeful when the world felt superheavy,” she said.

At 3 a.m., a social worker delivered Ricardo to the porch of their Cambridge home. When he shifted the sleeping child into Ms. Sharr’s arms, she studied his face and thought, “Oh, he’s beautiful.” After tucking him into bed, she kept vigil outside the bedroom until day broke and she heard him stir.

Leaping up, Ricardo ran to the window. “Papá! Papá!” he cried. He had been on the move for much of the year. Now, like his mother, he had no idea where he was.

On that first day, Mr. Herbold and Ms. Sharr whisked Ricardo away to a preplanned family gathering at a farm resort. In their hotel room, when he and Ms. Sharr were building block towers, he blurted, “Oh, you know what I want to tell you? I want to tell you that sometimes my dad scares me.”

Ms. Sharr, fluent in Spanish, answered: “OK, tell me more.”

“You know, he hit me here,” Ricardo said, pointing to his leg. He tapped the nape of his neck. “There was one time he hit me here.”

“Oh, my,” Ms. Sharr said. “I’m so sorry that happened to you.”

The following week, the Department of Children and Families, known as D.C.F., granted Ricardo’s father a supervised visit at the school. When Ricardo saw his father, he collapsed, screaming and crying.

Afterward, the principal, Ms. Soto, who is Puerto Rican and bilingual, intercepted the man. “I need to ask you, Where is Ricardo’s mom?” she said. “Because she needs to be notified.”

The man insisted he had no idea, but the principal did not buy it: “I told D.C.F., ‘This is very fishy — it feels to me that he’s hiding something.’ And D.C.F. assured us that they had their technology specialists trying to find her.”

What precise steps they took are unclear; child welfare cases are confidential. But generally speaking, the agency said, “when a child enters foster care, D.C.F. first tries to safely reunify the child with their biological parents by working collaboratively with immediate and extended family.”

At significant cost to Olga and Ricardo, that initial effort failed.

From that point onward, the school took Ricardo under wing. “I felt, you know, in the absence of his mother, we have to try to replace that here in the building,” Ms. Soto said.

In his English-immersion kindergarten, his teacher understood that he needed “an extra piece of care.” While a quick learner, he would sometimes crumple. If he got frustrated writing his letters, he’d scribble over his work with an angry pencil. When he grew restless during individual work times, Ms. Cohn took him on walks, having planted ninja toys in the hallway that Ricardo could search for along the way.

“We spent a lot of time together, a lot of time just holding hands,” she said, adding that she considered Ricardo “kind of magical,” with a gift for making people “love him and go out of their way to take care of him.”

At home, his foster parents focused on developing a routine.

Every morning, they woke him with a peppy playlist of Latin music and made light of wadding up his sheets when he wet his bed. He dressed himself, with flair, gravitating to dark jeans, dress shoes and a dab of cologne. Following his after-school program, Ricardo walked their dog and did some schoolwork. On weekends, there was family swim at the YMCA, followed by games on the Nintendo Switch with Mr. Herbold.

A month later, shortly before Christmas 2021, the child welfare department announced it was going to move him.

Ricardo’s principal and foster parents worried about inflicting another disruption. But under federal law, states must consider “kinship care” for foster children, and Ricardo’s father proposed his girlfriend’s sister.

In legalese, she was “fictive kin.” Yet fictive kin are supposed to have a relationship with a child, whereas Ricardo didn’t know the woman.

“It really wasn’t right,” Ms. Soto said. “He was so OK with Christy and Nick.”

In mid-January, Olga nervously paced the lobby of a Boston Holiday Inn, waiting for a social worker to arrive with her son. When they walked through the door, she fell to her knees and enveloped him in her arms.

Ricardo wriggled out of her embrace, shouting: “What took you so long? Why didn’t you come find me sooner?”

Olga was at a loss to explain to her 5-year-old just how desperately hard she had tried, how nobody had notified her and how she had tracked him down through a network of relatives only after months of detective work. She showered him with presents and, when he softened, kisses. At their visit’s end, she took his picture in the woolen hat with tasseled ear flaps she had brought him, capturing the sad eyes of a boy about to be separated from his mother again.

Civil rights groups have long accused the D.C.F. of mishandling immigrant families.

Hispanic children, who, like Black children, are more likely to be reported for neglect or abuse, are also more likely to be removed from their homes and more likely to be placed with strangers. They get moved around more and tend to stay longer in foster care.

But language barriers compound things. In a federal civil rights complaint, the Greater Boston Latino Network and other groups have accused the department of failing to provide adequate interpretation services, creating a risk of wrongful family separations.

Olga was appointed a free lawyer who did not speak Spanish. Because her English was still rudimentary, she decided to pay for one who did, and that cost her more than his $2,000 fee: The lawyer specialized in immigration, not family law. And it appears from the docket — the record is impounded — that he failed to make what could have been a crucial early plea.

In his place, lawyers consulted for this article said they would have immediately requested a temporary custody hearing and argued that Olga should be presumed fit absent any proof that she posed an imminent risk to her child. A simple background check could have been done and the judge could have questioned Olga. And then, in the best of circumstances, Olga could have walked out of the courtroom with her child.

But the child protection system was at that very moment embroiled in a cross-border custody scandal.

It involved a 5-year-old girl named Harmony Montgomery, a ward of the state whose father, a New Hampshire resident, had sought her custody. Abiding by its internal regulations, the Massachusetts D.C.F. asked New Hampshire to approve the move under a 62-year-old agreement called the Interstate Compact on the Placement of Children. But the judge disagreed with this request, considering it an infringement on the father’s right to parent his child, and did not wait for New Hampshire to respond.

The interstate compact was created primarily to govern cross-border foster care moves. Whether it applies to fit parents has been widely debated across the country, and high courts in at least a dozen states have said it does not.

The National Association of Counsel for Children agrees. “Applying the compact to parents who simply live out of state, when there is no finding or even allegation of wrongdoing, is unconstitutional and harmful to children,” said Allison Green, its legal director.

But in late 2021, two years after the Massachusetts judge awarded custody to Harmony Montgomery’s father, the authorities in New Hampshire revealed that the girl was missing and presumed to be dead.

Her shadow hung over Ricardo’s case. Nobody in the Massachusetts child-welfare system wanted to take another potentially deadly risk involving the interstate compact.

And so not long after Olga returned to Hollywood, she was fingerprinted, drug-tested and visited by ChildNet, a private agency under contract to Florida. A caseworker found her home “very neatly kept and well maintained,” with a nicely decorated children’s bedroom. Yet in her report, in which she misidentified Olga as Guatemalan, the caseworker concluded that Ricardo would not be safe there.

“The mother is not a legal resident of the United States,” she wrote. “She could be deported at any time.”

This argument comes up frequently in cases where parents are detained by immigration authorities and fighting deportation. Yet Olga was not in deportation proceedings. She was simply one of the hundreds of thousands of unauthorized immigrants in Florida who, posing no threat to national security or public safety, are not an enforcement priority under Biden administration rules.

And she already had a child in her custody, Dariela, then 7, whom Florida had made no effort to remove on the same grounds.

Under the interstate compact, a receiving state’s denial is technically binding. But legal experts said there could have been a quiet agreement among the parties to ignore a decision considered inappropriate.

Back in Massachusetts, the case froze, just as Ricardo’s situation was growing newly turbulent. His father’s girlfriend’s sister wanted him out, and because no foster homes were available, the boy was likely to be placed in a group home and possibly in a different school district, too.

“That’s when all hell broke loose,” his principal said. “I was like: ‘No, this cannot happen. This is the only place he knows as safe since he arrived here.’”

In the end, she saw just one solution: She would take the boy into her own home even if it temporarily upended her life.

Every day, Ms. Soto had hundreds of children in her care. But, with her own children grown, it was something else to mother one — to bathe, feed and discipline one. To be summoned over the loudspeaker to help him in the bathroom. To lie by his side until he was snoring softly because he couldn’t fall asleep alone.

After weeks of observing video calls between Olga and Ricardo, Ms. Soto made it clear to caseworkers that she endorsed a speedy reunion. When spring break arrived with no progress, she asked Mr. Herbold and Ms. Sharr to take in the boy temporarily so she could visit family in San Juan. They leaped at the chance.

It was a tough visit. “He was mad, a mad kid who had been passed around,” Ms. Sharr said. Still, their relationship deepened, and they, too, got to know and trust Olga. When Ricardo returned to the principal, they vowed to do whatever they could to help reunite mother and son — a highly unusual commitment, child welfare experts said.

Their chance came quickly. By the end of the school year, Ricardo was starting to call his principal “Mamá.” But she had summer commitments, and Ricardo was shuttled into a fourth family’s care.

At that point, Ms. Sharr proposed that Olga move into their Cambridge home temporarily so the system could get to know her better. They would support her, they said. Arranging to leave her daughter with the girl’s grandmother, Olga accepted. She was floored by their generosity. “There really are people who are angels,” she said.

She was allotted twice-weekly visits with Ricardo at their house, at the end of which he would beg Olga to let him stay, as if she had a say, and promise to be a good boy, as if his behavior was the issue.

Before long, the child welfare department was proposing a fifth placement. This was not unusual in Massachusetts, which in 2021 ranked 48th among the states in “placement stability” for foster children, according to the Annie E. Casey Foundation. (The agency says it has undertaken new initiatives to minimize moves.)

That was where Ricardo’s advocates drew the line.

By that point, Olga had dismissed her private lawyer and accepted the original court-appointed one, with Ms. Sharr volunteering to serve as interpreter. The lawyer devised a strategy: Mr. Herbold and Ms. Sharr could become Ricardo’s conditional guardians. They would have to surrender their foster care license and assume responsibility for the boy’s health care, but they readily agreed.

In summer 2022, Ricardo joined his mother at their home, and Olga’s new lawyer pushed immediately to schedule a trial to determine permanent custody. But with the government seemingly unable to find an open court date, Mr. Herbold and Ms. Sharr reached out to officials in Florida in hopes of catalyzing a resolution. If the state’s main concern was taking on Ricardo as a ward should Olga be deported, they offered themselves — “citizens of the United States by birth” — as backup.

“To the State of Florida,” Ms. Sharr wrote. “Nick and I are available for Ricardo should the need arise at any point in the future. We are able to care for Ricardo through his 18th birthday (and beyond). He is a part of our family now and we want the best for him, his mother, full biological sister and extended family in Florida.”

In first grade, showing significant improvement behaviorally, academically and in English, Ricardo moved into a mainstream class. “He’s wicked sharp,” Mr. Herbold said. Purple paw prints, awards for class contributions, proliferated on the refrigerator. Spider-Man took over elsewhere: Spider-Man pajamas, LEGOs, shampoo. Still, anxiety lurked beneath this cheery surface; a knock on the door would send him diving under a couch to hide.

Over winter break, with Dariela visiting, they all took a trip to New York City, where they rode a carriage through Central Park, sipping hot chocolate. They saw it, with fingers crossed, as a kind of farewell tour: Finally, six months after Olga’s lawyer requested a trial date, one had been scheduled.

On Jan. 19, 2023, after a four-hour hearing, the judge found that Ricardo was “not in need of care and protection as to mother” and should be returned to Olga’s custody.

Why he felt able to disregard the Florida denial at that point is unclear; juvenile judges in Massachusetts are not allowed to discuss their cases.

But before Olga had a chance to embrace her victory, the judge stayed his order for six days to give the child protection department time to appeal. And as she left the courtroom and returned to Florida to get her daughter back to school, Olga feared the worst.

Her advocates, however, chose optimism. On the eve of the department’s decision, Mr. Herbold flew south with Ricardo. A few months later, under a new Florida law, Mr. Herbold would have been criminally liable for transporting an unauthorized immigrant into the state.

But at that moment, as they checked into a hotel, he was on tenterhooks for a different reason.

“OK, so now we go to Mom’s, right?” Ricardo asked him.

“Oh, dude,” Mr. Herbold replied. “You have to hang out with me for the night, because tomorrow the big boss is going to make a call as to whether you get to live with Mom or if you just get to see Mom and then we have to fly back to Boston.”

The next day, more than a year after Olga first presented herself to the authorities in Massachusetts expecting an imminent reunion with her son, the custody decision became final.

Ten minutes after she got the news, Olga arrived at the hotel in buoyant spirits. She ran toward Ricardo and scooped him up in a fierce hug. As she stared into his eyes and he into hers, she staggered into the future with the boy in her arms, dangling but attached.

DataSpace: Towards Understanding Self-Supervised Representation Learning
While supervised learning sparked the deep learning boom, it has some critical shortcomings: (1) it requires an abundance of expensive labeled data, and (2) it solves tasks from scratch rather than the human-like approach of leveraging knowledge and skills acquired from prior experiences. ... In this thesis we present works that initiate and ...
PDF Enhancing Self-Supervised Learning through Transformations in Higher
data, self-supervised learning allows for the use of larger models trained on more data, with reduced risk of overfitting. As such, self-supervised learning has gained popularity as an effective method for learning high-quality and transferable representations. The understanding of the learning mechanisms employed by NNs has been an ongoing
PDF Self-Supervised Learning
Can self-supervised learning help? • Self-supervised learning (informal definition): supervise using labels generated from the data without any manual or weak label sources. • Idea: Hide or modify part of the input. Ask model to recover input or classify what changed. • Self-supervised task referred to as the pretext task.
PDF Weaker Than You Think: A Critical Look at Weakly Supervised Learning
value of weakly supervised learning, we thoroughly analyze diverse NLP datasets and tasks to ascertain when and why weakly supervised approaches work. Based on our findings, we provide recommendations for future research. 1 Introduction Weakly supervised learning (WSL) is one of the most popular approaches for alleviating the anno-
PDF Model Selection and Evaluation in Supervised Machine Learning
Supervised Machine Learning Author: Max Westphal Supervisor: Prof. Dr. Werner Brannath A thesis submitted in partial fulfilment of the requirements for the degree of Dr. rer. nat. in the Working Group of Applied Statistics and Biometry Faculty 3: Mathematics and Computer Science April 6, 2020
PDF RECURSIVE DEEP LEARNING A DISSERTATION
The main three chapters of the thesis explore three recursive deep learning modeling choices. The first modeling choice I investigate is the overall objective function that crucially guides what the RNNs need to capture. I explore unsupervised, supervised and semi-supervised learning for structure prediction (parsing), structured sentiment
PDF Self-supervised Multi-view Clustering in Computer Vision: A Survey
vision, self-supervised learning has also made substantial research progress and is progressively becoming dominant in MVC methods. It guides the clustering process by designing proxy tasks to mine the representation of image and video data itself as supervisory information. Despite the rapid development of self-supervised MVC, there has yet to ...
Imposing and Uncovering Group Structure in Weakly-Supervised Learning
Our thesis focuses on learning from data characterized by weak supervision, delving into the interrelationships among group members. ... Therefore, in the final section, we shift our focus to minimizing the assumptions required when learning from weakly supervised data and simultaneously deducing the group structure during the learning process ...
A Comparison of Supervised Machine Learning Classification Techniques
Gmyzin, D. (2017) A Comparison of Supervised Machine Learning Classification Techniques and Theory-Driven Approaches for the Prediction of Subjective Mental Workload. Masters dissertation, Technological University Dublin, 2017. doi:10.21427/D7533X
Learning Video Representation from Self-supervision
This thesis investigates the problem of learning video representations for video understanding. Previous works have explored the use of data-driven deep learning approaches, which have been shown to be effective in learning useful video representations. However, obtaining large amounts of labeled data can be costly and time-consuming. We investigate self-supervised approaches for multimodal ...
[2403.13001] Fundamental Components of Deep Learning: A category
Combining them yields parametric weighted optics, a categorical model of artificial neural networks, and more. Part II justifies the abstractions from Part I, applying them to model backpropagation, architectures, and supervised learning. We provide a lens-theoretic axiomatisation of differentiation, covering not just smooth spaces, but ...
Doctoral Thesis: Self-Supervised Learning for Speech Processing
data are intrinsically rare, costly, or time-consuming to collect. In contrast to annotated speech, untranscribed audio is often much cheaper to accumulate. In this thesis, we explore the use of self-supervised learning (a learning paradigm where the learning target is generated from the input itself) for leveraging such easily scalable ...
Supervised machine learning: A brief primer
Supervised learning has been applied to large data structures including demographic, clinical, and social predictors in order to develop risk scores predicting the onset and trajectory of a range of mental disorders (e.g., anxiety, depression, and trauma-related disorders) and suicidal behavior (Galatzer-Levy, 2015; Gradus et al., 2020 ...
PDF Fundamental Limitations of Semi-Supervised Learning
supervised learning paradigm. Outside of supervised learning, however, our current theoretical understanding of two important areas known as unsupervised learning and semi-supervised learning (SSL) leaves a lot to be desired. Unsupervised learning is concerned with discovering meaningful structure in a raw dataset. This may include grouping ...
Semi-supervised learning for natural language
Statistical supervised learning techniques have been successful for many natural language processing tasks, but they require labeled datasets, which can be expensive to obtain. On the other hand, unlabeled data (raw text) is often available "for free" in large quantities. Unlabeled data has shown promise in improving the performance of a number ...
Learning In The Wild With Limited Supervision
Over the past decade, machine visual perception has experienced remarkable progress due to advancements in the field of deep learning. However, the performance of deep learning systems remains far from ideal in real-world tasks that lack large training datasets. In this thesis, we study learning under limited supervision with a focus on unsupervised domain adaptation (no labelled examples) and ...
PDF Semi-Supervised Learning with Graphs
• reinforcement learning. The learning system repeatedly observes the environment x, performs an action a, and receives a reward r. The goal is to choose the actions that maximize the future rewards. This thesis focuses on classification, which is traditionally a supervised learning task.
PDF Self-supervised scene representation learning A dissertation
Preface In this thesis, Self-supervised Scene Representation Learning, we propose novel approaches to enable artificial intelligence models to infer representations of 3D environments conditioned exclusively on posed images. • We propose to exploit 3D-structured feature spaces in the form of voxelgrids of features,
Theses
Self-supervised learning, which relies on unlabeled data, reduces the dependence on extensive manual annotations. ... In this thesis we study self-supervised representation learning where the goal is to alleviate these limitations by substituting the classification task with pseudo tasks where the labels come for free. We discuss self ...
[2312.00101] Towards Unsupervised Representation Learning: Learning
Unsupervised representation learning aims at finding methods that learn representations from data without annotation-based signals. Abstaining from annotations not only leads to economic benefits but may - and to some extent already does - result in advantages regarding the representation's structure, robustness, and generalizability to different tasks. In the long run, unsupervised methods ...
Supervisor and Student Perspectives on Undergraduate Thesis Supervision
Diagnosing teachers are teachers who perceive diagnostic information about students' learning process, interpret these aspects, decide how to respond, and act based on this diagnostic decision. During supervision meetings about the undergraduate thesis supervisors make in-the-moment decisions while interacting with their students.
Self-Supervised Physics-Guided Deep Learning for Solving Inverse
Self-supervised deep learning algorithms split the pixels of each image into two disjoint sets, one used as network input during training and the other to define the loss. In existing self-supervised denoising approaches that are purely data-driven, the set of pixels used as input to the network is not re-utilized in the end-to-end training since the network is only comprised of ...
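The disjoint pixel split can be illustrated with a short sketch. It assumes a uniformly random split and a mean-squared-error loss, neither of which is stated in the snippet; `split_pixels`, `masked_mse`, and `loss_fraction` are hypothetical names chosen for the example.

```python
import random


def split_pixels(height, width, loss_fraction=0.3, seed=0):
    """Partition pixel coordinates into two disjoint sets.

    Returns (input_pixels, loss_pixels). The first set would be shown to
    the network; the second is held out and used only to define the loss.
    The random split is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    coords = [(r, c) for r in range(height) for c in range(width)]
    rng.shuffle(coords)
    n_loss = int(round(loss_fraction * len(coords)))
    loss_pixels = set(coords[:n_loss])
    input_pixels = set(coords[n_loss:])
    return input_pixels, loss_pixels


def masked_mse(prediction, target, loss_pixels):
    """Mean squared error evaluated only on the held-out pixels."""
    total = sum((prediction[r][c] - target[r][c]) ** 2 for r, c in loss_pixels)
    return total / len(loss_pixels)


input_pixels, loss_pixels = split_pixels(4, 5, loss_fraction=0.3, seed=1)
```

Because the two sets never overlap, the loss is measured only on pixels the network did not receive as input, which is what lets the image supervise itself without a clean reference.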
Introduction to Semi-Supervised Learning
This introductory book presents some popular semi-supervised learning models, including self-training, mixture models, co-training and multiview learning, graph-based methods, and semi-supervised support vector machines, and discusses their basic mathematical formulation. Semi-supervised learning is a learning paradigm concerned with the study of how computers and natural systems such as ...