Top Computer Vision Papers of All Time (Updated 2024)


Today’s boom in computer vision (CV) started at the beginning of the 21st century with the breakthrough of deep learning models and convolutional neural networks (CNNs). The main CV methods include image classification, image localization, object detection, and segmentation.

In this article, we dive into some of the most significant research papers that triggered the rapid development of computer vision. We split them into two categories – classical CV approaches and papers based on deep learning. We chose the following papers based on their influence, quality, and applicability:

  • Gradient-based Learning Applied to Document Recognition (1998)
  • Distinctive Image Features from Scale-Invariant Keypoints (2004)
  • Histograms of Oriented Gradients for Human Detection (2005)
  • SURF: Speeded Up Robust Features (2006)
  • ImageNet Classification with Deep Convolutional Neural Networks (2012)
  • Very Deep Convolutional Networks for Large-Scale Image Recognition (2014)
  • GoogLeNet – Going Deeper with Convolutions (2014)
  • ResNet – Deep Residual Learning for Image Recognition (2015)
  • Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (2015)
  • YOLO: You Only Look Once: Unified, Real-Time Object Detection (2016)
  • Mask R-CNN (2017)
  • EfficientNet – Rethinking Model Scaling for Convolutional Neural Networks (2019)


Classic Computer Vision Papers

Gradient-based Learning Applied to Document Recognition (1998)

The authors Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner published the LeNet paper in 1998. They introduced the concept of a trainable Graph Transformer Network (GTN) for handwritten character and word recognition. They researched discriminative and non-discriminative gradient-based techniques for training the recognizer without manual segmentation and labeling.

LeNet CNN architecture digits recognition

Characteristics of the model:

  • LeNet-5 comprises 7 layers; its first convolutional layer has 6 feature maps with 156 trainable parameters.
  • The input is a 32×32 pixel image, and the output layer is composed of Euclidean Radial Basis Function (RBF) units, one for each class.
  • The training set consists of 60,000 examples, and the authors achieved a 0.35% error rate on it (after 19 passes).

Find the LeNet paper here.
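To make the architecture concrete, here is a minimal PyTorch sketch of a LeNet-5-style network. This is our own illustration, not the paper’s original implementation: we use a modern linear classification head in place of the paper’s RBF output units.

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """A minimal LeNet-5-style network: 32x32 grayscale input, 10 classes."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),    # C1: 6 feature maps, 156 params
            nn.Tanh(),
            nn.AvgPool2d(2),                   # S2: subsampling
            nn.Conv2d(6, 16, kernel_size=5),   # C3
            nn.Tanh(),
            nn.AvgPool2d(2),                   # S4
            nn.Conv2d(16, 120, kernel_size=5), # C5
            nn.Tanh(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(120, 84),                # F6
            nn.Tanh(),
            nn.Linear(84, num_classes),        # linear head instead of RBF units
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

logits = LeNet5()(torch.randn(1, 1, 32, 32))  # -> shape (1, 10)
```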

Distinctive Image Features from Scale-Invariant Keypoints (2004)

David Lowe (2004) proposed a method for extracting distinctive invariant features from images. He used them to perform reliable matching between different views of an object or scene. The paper introduced the Scale-Invariant Feature Transform (SIFT), which transforms image data into scale-invariant coordinates relative to local features.

SIFT method keypoints detection

Model characteristics:

  • The method generates large numbers of features that densely cover the image over the full range of scales and locations.
  • The model needs to match at least 3 features from each object to reliably detect small objects in cluttered backgrounds.
  • For image matching and recognition, the model extracts SIFT features from a set of reference images stored in a database.
  • The model matches a new image by individually comparing each of its features to this database (Euclidean distance).

Find the SIFT paper here.
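As a concrete illustration of this matching pipeline, the sketch below uses OpenCV’s built-in SIFT implementation. The image paths are placeholders, and the 0.75 ratio-test threshold is a common tutorial choice rather than a value from the paper.

```python
import cv2

# Two views of the same scene, loaded as grayscale (paths are placeholders).
img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)  # keypoints + 128-d descriptors
kp2, des2 = sift.detectAndCompute(img2, None)

# Match descriptors by Euclidean distance and keep matches passing Lowe's ratio test.
matcher = cv2.BFMatcher(cv2.NORM_L2)
good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
        if m.distance < 0.75 * n.distance]
print(f"{len(good)} reliable matches")
```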

Histograms of Oriented Gradients for Human Detection (2005)

The authors Navneet Dalal and Bill Triggs researched feature sets for robust visual object recognition, using linear SVM-based human detection as a test case. They experimented with grids of Histograms of Oriented Gradients (HOG) descriptors that significantly outperform existing feature sets for human detection.

histogram object detection

The authors’ achievements:

  • The histogram method gave near-perfect separation on the original MIT pedestrian database.
  • For good results, the model requires fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks.
  • The researchers examined a more challenging dataset containing over 1800 annotated human images with many pose variations and backgrounds.
  • In the standard detector, each HOG cell appears four times with different normalizations, and including this redundant information improves performance to 89%.

Find the HOG paper here.
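To see HOG-based human detection in practice, here is a short sketch using OpenCV’s stock HOG descriptor and its bundled pedestrian SVM, which follows the Dalal–Triggs design. The image path, window stride, and pyramid scale are placeholder choices.

```python
import cv2

img = cv2.imread("street.jpg")  # placeholder path

# Default descriptor: 64x128 detection window, 9 orientation bins, 8x8 cells.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

# Slide the detection window over an image pyramid and return person boxes.
boxes, weights = hog.detectMultiScale(img, winStride=(8, 8), scale=1.05)
for (x, y, w, h) in boxes:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("street_detections.jpg", img)
```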

SURF: Speeded Up Robust Features (2006)

Herbert Bay, Tinne Tuytelaars, and Luc Van Gool presented a scale- and rotation-invariant interest point detector and descriptor, called SURF (Speeded Up Robust Features). It outperforms previously proposed schemes in repeatability, distinctiveness, and robustness, while computing much faster. The authors relied on integral images for image convolutions and built on the strengths of the leading existing detectors and descriptors.

surf detecting interest points

  • Applied a Hessian matrix-based measure for the detector, and a distribution-based descriptor, simplifying these methods to the essential.
  • Presented experimental results on a standard evaluation set, as well as on imagery obtained in the context of a real-life object recognition application.
  • SURF showed strong performance – SURF-128 with an 85.7% recognition rate, followed by U-SURF (83.8%) and SURF (82.6%).

Find the SURF paper here.
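A minimal usage sketch follows, assuming an OpenCV build with the non-free xfeatures2d module enabled (SURF is patent-encumbered and lives in opencv-contrib). The Hessian threshold of 400 and the image path are placeholder choices.

```python
import cv2

img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder path

# Requires OpenCV built with OPENCV_ENABLE_NONFREE.
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
surf.setExtended(True)  # 128-d descriptors, as in the SURF-128 variant
kp, des = surf.detectAndCompute(img, None)
print(len(kp), "interest points;", des.shape[1], "dims per descriptor")
```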

Papers Based on Deep-Learning Models

ImageNet Classification with Deep Convolutional Neural Networks (2012)

Alex Krizhevsky and his team won the ImageNet Challenge in 2012 by researching deep convolutional neural networks. They trained one of the largest CNNs at the time on the ImageNet dataset used in the ILSVRC-2010/2012 challenges and achieved the best results reported on these datasets. They wrote a highly optimized GPU implementation of 2D convolution and all the other operations required for CNN training, and published the results.

alexnet CNN architecture

  • The final CNN contained five convolutional and three fully connected layers, and this depth proved essential to its performance.
  • They found that removing any convolutional layer (each containing less than 1% of the model’s parameters) resulted in inferior performance.
  • The same CNN, with an extra sixth convolutional layer, was used to classify the entire ImageNet Fall 2011 release (15M images, 22K categories).
  • After fine-tuning on ImageNet-2012 it gave an error rate of 16.6%.

Find the ImageNet paper here.
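For experimentation today, a pretrained AlexNet ships with torchvision. The sketch below (with a placeholder image path) shows the standard ImageNet inference recipe, not the authors’ original training code.

```python
import torch
from PIL import Image
from torchvision import models, transforms

# Pretrained AlexNet with the standard ImageNet preprocessing pipeline.
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("cat.jpg").convert("RGB")).unsqueeze(0)  # placeholder path
with torch.no_grad():
    probs = model(img).softmax(dim=1)
print("predicted ImageNet class index:", probs.argmax(dim=1).item())
```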

Very Deep Convolutional Networks for Large-Scale Image Recognition (2014)

Karen Simonyan and Andrew Zisserman (Oxford University) investigated the effect of convolutional network depth on accuracy in the large-scale image recognition setting. Their main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3×3) convolution filters, i.e., very deep convolutional networks (VGG). They showed that a significant improvement over prior-art configurations can be achieved by pushing the depth to 16–19 weight layers.

Image classification CNN results on VOC-2007 and VOC-2012

  • Their ImageNet Challenge 2014 submission secured the first and second places in the localization and classification tracks respectively.
  • They showed that their representations generalize well to other datasets, where they achieved state-of-the-art results.
  • They made their two best-performing ConvNet models publicly available to facilitate further research on deep visual representations in CV.

Find the VGG paper here.
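The core design idea – stacking small 3×3 convolutions instead of larger filters – is easy to express in code. Below is our minimal PyTorch sketch of the VGG-16 feature extractor’s block structure (the fully connected head is omitted).

```python
import torch.nn as nn

def vgg_block(in_ch: int, out_ch: int, num_convs: int) -> nn.Sequential:
    """num_convs stacked 3x3 convolutions (with ReLU) followed by 2x2 max-pooling."""
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
        in_ch = out_ch
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# VGG-16's 13 convolutional layers as (num_convs, out_channels) per block;
# two stacked 3x3 convs see a 5x5 region, three see 7x7, with fewer parameters.
cfg = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]
in_channels = [3, 64, 128, 256, 512]
features = nn.Sequential(*[vgg_block(ic, oc, n)
                           for (n, oc), ic in zip(cfg, in_channels)])
```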

GoogLeNet – Going Deeper with Convolutions (2014)

The Google team (Christian Szegedy, Wei Liu, et al.) proposed a deep convolutional neural network architecture codenamed Inception. They intended to set the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of their architecture was the improved utilization of the computing resources inside the network.

Figure: The GoogLeNet Inception CNN

  • A carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant.
  • Their submission for ILSVRC14 was called GoogLeNet, a 22-layer deep network. Its quality was assessed in the context of classification and detection.
  • They added 200 region proposals coming from multi-box, increasing the coverage from 92% to 93%.
  • Lastly, they used an ensemble of 6 ConvNets when classifying each region, which improved accuracy from 40% to 43.9%.
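
The module behind that design runs 1×1, 3×3, and 5×5 convolutions plus pooling in parallel and concatenates the results, with 1×1 "reduction" convolutions keeping the computational budget in check. A minimal sketch; the branch widths follow the paper's inception(3a) stage, but activations are omitted for brevity and this is not the full 22-layer network:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Parallel 1x1, 3x3, 5x5 conv branches plus pooling, concatenated."""

    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.b3 = nn.Sequential(              # 1x1 reduction, then 3x3
            nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c3_red, c3, 3, padding=1))
        self.b5 = nn.Sequential(              # 1x1 reduction, then 5x5
            nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c5_red, c5, 5, padding=2))
        self.bp = nn.Sequential(              # pooling, then 1x1 projection
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1))

    def forward(self, x):
        # All branches keep the spatial size, so their outputs can be
        # concatenated along the channel dimension.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], 1)

# Widths of the paper's inception(3a) stage: 192 -> 256 channels.
m = InceptionModule(192, 64, 96, 128, 16, 32, 32)
print(m(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])
```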

Find the GoogLeNet paper here.

Microsoft researchers Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun presented a residual learning framework (ResNet) to ease the training of networks that are substantially deeper than those used previously. They explicitly reformulated the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions.

Figure: ResNet error rates

  • They evaluated residual nets with a depth of up to 152 layers – 8× deeper than VGG nets, but still having lower complexity.
  • This result won 1st place on the ILSVRC 2015 classification task.
  • The team also ran analyses on CIFAR-10 with networks of 100 and 1000 layers, and the depth of the learned representations yielded a 28% relative improvement on the COCO object detection dataset.
  • Moreover, in the ILSVRC & COCO 2015 competitions, they won 1st place on the tasks of ImageNet detection, ImageNet localization, and COCO detection/segmentation.
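
The reformulation amounts to adding the block's input back to its output, so the stacked layers fit a residual F(x) rather than the full mapping. A minimal basic-block sketch (the deeper nets in the paper use a 3-layer bottleneck variant, which this simplifies):

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 convs with an identity shortcut: out = F(x) + x."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                        # the shortcut connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)    # layers only learn the residual
```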

Find the ResNet paper here.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun introduced the Region Proposal Network (RPN), which shares full-image convolutional features with the detection network, thereby enabling nearly cost-free region proposals. The RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. They trained the RPN end-to-end to generate high-quality region proposals, which were then used by Fast R-CNN for detection.

Figure: Faster R-CNN object detection

  • Merged the RPN and Fast R-CNN into a single network by sharing their convolutional features; in the then-popular terminology of neural networks with "attention" mechanisms, the RPN component tells the unified network where to look.
  • For the very deep VGG-16 model, their detection system had a frame rate of 5 fps on a GPU.
  • Achieved state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image.
  • In ILSVRC and COCO 2015 competitions, faster R-CNN and RPN were the foundations of the 1st-place winning entries in several tracks.
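
A reference implementation ships with torchvision; a minimal inference sketch, assuming torchvision ≥ 0.13 for the weights enum. This loads the library's pretrained ResNet-50-FPN variant, not the authors' original VGG-16 system:

```python
import torch
from torchvision.models.detection import (
    FasterRCNN_ResNet50_FPN_Weights, fasterrcnn_resnet50_fpn)

model = fasterrcnn_resnet50_fpn(
    weights=FasterRCNN_ResNet50_FPN_Weights.DEFAULT)
model.eval()

# The model takes a list of CHW float tensors scaled to [0, 1].
images = [torch.rand(3, 480, 640)]
with torch.no_grad():
    detections = model(images)

# Per image: RPN-proposed, classifier-refined boxes with labels and scores.
print(detections[0]["boxes"].shape, detections[0]["labels"][:5])
```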

Find the Faster R-CNN paper here.

Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi developed YOLO, an innovative approach to object detection. Instead of repurposing classifiers to perform detection, the authors framed object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation, and since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.

Figure: The YOLO CNN architecture

  • The base YOLO model processed images in real-time at 45 frames per second.
  • A smaller version of the network, Fast YOLO, processed 155 frames per second, while still achieving double the mAP of other real-time detectors.
  • Compared to state-of-the-art detection systems, YOLO was making more localization errors, but was less likely to predict false positives in the background.
  • YOLO learned very general representations of objects and outperformed other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains.
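
The regression framing is visible in the shape of the network's output: the image is divided into an S×S grid, and each cell regresses B boxes of (x, y, w, h, confidence) plus C conditional class probabilities. A sketch of slicing such an output tensor, using the paper's S=7, B=2, C=20 (random values stand in for a real forward pass):

```python
import torch

S, B, C = 7, 2, 20                       # grid size, boxes per cell, classes
pred = torch.randn(1, S, S, B * 5 + C)   # stand-in for one forward pass

cell = pred[0, 3, 3]                 # predictions from one grid cell
boxes = cell[: B * 5].view(B, 5)     # (x, y, w, h, confidence) per box
class_probs = cell[B * 5:]           # P(class | object) for this cell
print(boxes.shape, class_probs.shape)    # (2, 5) and (20,)
```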

Find the YOLO paper here.

Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick (Facebook) presented a conceptually simple, flexible, and general framework for object instance segmentation. Their approach could detect objects in an image, while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN , extended Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition.

Figure: The Mask R-CNN framework

  • Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.
  • Showed great results in all three tracks of the COCO suite of challenges: instance segmentation, bounding-box object detection, and person keypoint detection.
  • Mask R-CNN outperformed all existing, single-model entries on every task, including the COCO 2016 challenge winners.
  • The model served as a solid baseline and eased future research in instance-level recognition.
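
torchvision's pretrained Mask R-CNN exposes the parallel mask branch directly in its outputs; a minimal sketch, again assuming torchvision ≥ 0.13 and a dummy input:

```python
import torch
from torchvision.models.detection import (
    MaskRCNN_ResNet50_FPN_Weights, maskrcnn_resnet50_fpn)

model = maskrcnn_resnet50_fpn(
    weights=MaskRCNN_ResNet50_FPN_Weights.DEFAULT)
model.eval()

with torch.no_grad():
    out = model([torch.rand(3, 480, 640)])[0]

# The Faster R-CNN outputs are still there...
print(out["boxes"].shape)   # (N, 4) boxes, plus labels and scores
# ...and the extra branch adds one soft mask per detected instance.
print(out["masks"].shape)   # (N, 1, 480, 640), values in [0, 1]
```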

Find the Mask R-CNN paper here.

The authors of EfficientNet (Mingxing Tan and Quoc V. Le) studied model scaling and identified that carefully balancing network depth, width, and resolution can lead to better performance. They proposed a new scaling method that uniformly scales all three dimensions using a simple but effective compound coefficient, and they demonstrated its effectiveness by scaling up MobileNets and ResNet.

Figure: EfficientNet model scaling

  • Designed a new baseline network and scaled it up to obtain a family of models, called EfficientNets. It had much better accuracy and efficiency than previous ConvNets.
  • EfficientNet-B7 achieved state-of-the-art 84.3% top-1 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet.
  • It also transferred well and achieved state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with much fewer parameters.
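
Compound scaling ties the three dimensions to a single coefficient φ: depth scales by α^φ, width by β^φ, and resolution by γ^φ, with α·β²·γ² ≈ 2 so that FLOPs roughly double per unit of φ. A sketch using the constants the paper reports for the B0 baseline (α = 1.2, β = 1.1, γ = 1.15); the released B1-B7 models round these numbers in practice:

```python
# Compound scaling: alpha * beta**2 * gamma**2 ~= 2, so FLOPs grow
# roughly as 2**phi when all three dimensions scale together.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # grid-searched on the B0 baseline

def compound_scale(phi: float, base_resolution: int = 224):
    depth_mult = ALPHA ** phi        # more layers
    width_mult = BETA ** phi         # more channels per layer
    resolution = round(base_resolution * GAMMA ** phi)  # larger inputs
    return depth_mult, width_mult, resolution

for phi in range(5):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, input {r}px")
```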

Find the EfficientNet paper here.



A curated list of the top 10 computer vision papers in 2021 with video demos, articles, code and paper reference.

louisfb01/top-10-cv-papers-2021

The Top 10 Computer Vision Papers of 2021

The top 10 computer vision papers in 2021 with video demos, articles, code, and paper reference.

While the world is still recovering, research hasn't slowed its frenetic pace, especially in the field of artificial intelligence. Moreover, many important aspects were highlighted this year, such as ethics, bias, governance, and transparency. Artificial intelligence and our understanding of the human brain and its link to AI are constantly evolving, showing promising applications that could improve our quality of life in the near future. Still, we ought to be careful with which technology we choose to apply.

"Science cannot tell us what we ought to do, only what we can do." - Jean-Paul Sartre, Being and Nothingness

Here are my top 10 of the most interesting research papers of the year in computer vision, in case you missed any of them. In short, it is basically a curated list of the latest breakthroughs in AI and CV with a clear video explanation, link to a more in-depth article, and code (if applicable). Enjoy the read, and let me know if I missed any important papers in the comments, or by contacting me directly on LinkedIn!

The complete reference to each paper is listed at the end of this repository.

Maintainer: louisfb01

Subscribe to my newsletter - The latest updates in AI explained every week.

Feel free to message me any interesting paper I may have missed to add to this repository.

Tag me on Twitter @Whats_AI or LinkedIn @Louis (What's AI) Bouchard if you share the list!

Watch the 2021 CV rewind


Missed last year? Check this out: 2020: A Year Full of Amazing AI papers- A Review

👀 If you'd like to support my work and use W&B (for free) to track your ML experiments and make your work reproducible or collaborate with a team, you can try it out by following this guide ! Since most of the code here is PyTorch-based, we thought that a QuickStart guide for using W&B on PyTorch would be most interesting to share.

👉Follow this quick guide , use the same W&B lines in your code or any of the repos below, and have all your experiments automatically tracked in your w&b account! It doesn't take more than 5 minutes to set up and will change your life as it did for me! Here's a more advanced guide for using Hyperparameter Sweeps if interested :)

🙌 Thank you to Weights & Biases for sponsoring this repository and the work I've been doing, and thanks to any of you using this link and trying W&B!


If you are interested in AI research, here is another great repository for you:

A curated list of the latest breakthroughs in AI by release date with a clear video explanation, link to a more in-depth article, and code.

2021: A Year Full of Amazing AI papers- A Review

The Full List

  • DALL·E: Zero-Shot Text-to-Image Generation from OpenAI [1]
  • Taming Transformers for High-Resolution Image Synthesis [2]
  • Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows [3]
  • Deep Nets: What Have They Ever Done for Vision? [bonus]
  • Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image [4]
  • Total Relighting: Learning to Relight Portraits for Background Replacement [5]
  • Animating Pictures with Eulerian Motion Fields [6]
  • CVPR 2021 Best Paper Award: GIRAFFE - Controllable Image Generation [7]
  • TimeLens: Event-based Video Frame Interpolation [8]
  • (Style)CLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis [9]
  • CityNeRF: Building NeRF at City Scale [10]
  • Paper References

DALL·E: Zero-Shot Text-to-Image Generation from OpenAI [1]

OpenAI successfully trained a network able to generate images from text captions. It is very similar to GPT-3 and Image GPT and produces amazing results.


  • Short read: OpenAI’s DALL·E: Text-to-Image Generation Explained
  • Paper: Zero-Shot Text-to-Image Generation
  • Code: Code & more information for the discrete VAE used for DALL·E

Taming Transformers for High-Resolution Image Synthesis [2]

TL;DR: They combined the efficiency of GANs and convolutional approaches with the expressivity of transformers to produce a powerful and time-efficient method for semantically-guided high-quality image synthesis.


  • Short read: Combining the Transformers Expressivity with the CNNs Efficiency for High-Resolution Image Synthesis
  • Paper: Taming Transformers for High-Resolution Image Synthesis
  • Code: Taming Transformers

Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows [3]

Will Transformers Replace CNNs in Computer Vision? In less than 5 minutes, you will know how the transformer architecture can be applied to computer vision with a new paper called the Swin Transformer.


  • Short read: Will Transformers Replace CNNs in Computer Vision?
  • Paper: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
  • Click here for the code

"I will openly share everything about deep nets for vision applications, their successes, and the limitations we have to address."


  • Short read: What is the state of AI in computer vision?
  • Paper: Deep nets: What have they ever done for vision?

Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image [4]

The next step for view synthesis: Perpetual View Generation, where the goal is to take an image, fly into it, and explore the landscape!


  • Short read: Infinite Nature: Fly into an image and explore the landscape
  • Paper: Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image

Total Relighting: Learning to Relight Portraits for Background Replacement [5]

Properly relight any portrait based on the lighting of the new background you add. Have you ever wanted to change the background of a picture but have it look realistic? If you've already tried that, you already know that it isn't simple. You can't just take a picture of yourself in your home and swap the background for a beach. It just looks bad and not realistic. Anyone will say "that's photoshopped" in a second. For movies and professional videos, you need perfect lighting and artists to reproduce a high-quality image, and that's super expensive. There's no way you can do that with your own pictures. Or can you?


  • Short read: Realistic Lighting on Different Backgrounds
  • Paper: Total Relighting: Learning to Relight Portraits for Background Replacement
If you’d like to read more research papers as well, I recommend you read my article where I share my best tips for finding and reading more research papers.

Animating Pictures with Eulerian Motion Fields [6]

This model takes a picture, understands which particles are supposed to be moving, and realistically animates them in an infinite loop while keeping the rest of the picture entirely still, creating amazing-looking videos like this one...


  • Short read: Create Realistic Animated Looping Videos from Pictures
  • Paper: Animating Pictures with Eulerian Motion Fields

CVPR 2021 Best Paper Award: GIRAFFE - Controllable Image Generation [7]

Using a modified GAN architecture, they can move objects in the image without affecting the background or the other objects!


  • Short read: CVPR 2021 Best Paper Award: GIRAFFE - Controllable Image Generation
  • Paper: GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields

TimeLens: Event-based Video Frame Interpolation [8]

TimeLens can understand the movement of the particles in between the frames of a video to reconstruct what really happened at a speed even our eyes cannot see. In fact, it achieves results that neither our smartphones nor any other model could reach before!


  • Short read: How to Make Slow Motion Videos With AI!
  • Paper: TimeLens: Event-based Video Frame Interpolation
Subscribe to my weekly newsletter and stay up-to-date with new publications in AI for 2022!

(Style)CLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis [9]

Have you ever dreamed of taking the style of a picture, like this cool TikTok drawing style on the left, and applying it to a new picture of your choice? Well, I did, and it has never been easier to do. In fact, you can even achieve that from only text and can try it right now with this new method and their Google Colab notebook available for everyone (see references). Simply take a picture of the style you want to copy, enter the text you want to generate, and this algorithm will generate a new picture out of it! Just look back at the results above, such a big step forward! The results are extremely impressive, especially if you consider that they were made from a single line of text!


  • Short read: Text-to-Drawing Synthesis With Artistic Control | CLIPDraw & StyleCLIPDraw
  • Paper (CLIPDraw): CLIPDraw: exploring text-to-drawing synthesis through language-image encoders
  • Paper (StyleCLIPDraw): StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis
  • CLIPDraw Colab demo
  • StyleCLIPDraw Colab demo

CityNeRF: Building NeRF at City Scale [10]

The model is called CityNeRF and grows from NeRF, which I previously covered on my channel. NeRF is one of the first models using radiance fields and machine learning to construct 3D models out of images. But NeRF is not that efficient and works for a single scale. Here, CityNeRF is applied to satellite and ground-level images at the same time to produce various 3D model scales for any viewpoint. In simple words, they bring NeRF to city-scale. But how?


  • Short read: CityNeRF: 3D Modelling at City Scale!
  • Paper: CityNeRF: Building NeRF at City Scale
  • Click here for the code (will be released soon)
If you would like to read more papers and have a broader view, here is another great repository for you covering 2020: 2020: A Year Full of Amazing AI papers- A Review and feel free to subscribe to my weekly newsletter and stay up-to-date with new publications in AI for 2022!

[1] Ramesh, A. et al., 2021. Zero-Shot Text-to-Image Generation. arXiv:2102.12092

[2] Esser, P. et al., 2020. Taming Transformers for High-Resolution Image Synthesis.

[3] Liu, Z. et al., 2021. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. arXiv preprint, https://arxiv.org/abs/2103.14030v1

[bonus] Yuille, A.L. and Liu, C., 2021. Deep Nets: What Have They Ever Done for Vision? International Journal of Computer Vision, 129(3), pp. 781-802. https://arxiv.org/abs/1805.04025

[4] Liu, A., Tucker, R., Jampani, V., Makadia, A., Snavely, N. and Kanazawa, A., 2020. Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image. https://arxiv.org/pdf/2012.09855.pdf

[5] Pandey et al., 2021. Total Relighting: Learning to Relight Portraits for Background Replacement. doi: 10.1145/3450626.3459872. https://augmentedperception.github.io/total_relighting/total_relighting_paper.pdf

[6] Holynski, A. et al., 2021. Animating Pictures with Eulerian Motion Fields. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[7] Niemeyer, M. and Geiger, A., 2021. GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields. CVPR 2021.

[8] Tulyakov, S.*, Gehrig, D.*, Georgoulis, S., Erbach, J., Gehrig, M., Li, Y. and Scaramuzza, D., 2021. TimeLens: Event-based Video Frame Interpolation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville. http://rpg.ifi.uzh.ch/docs/CVPR21_Gehrig.pdf

[9] a) CLIPDraw: Exploring Text-to-Drawing Synthesis Through Language-Image Encoders, 2021. b) Schaldenbrand, P., Liu, Z. and Oh, J., 2021. StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis.

[10] Xiangli, Y., Xu, L., Pan, X., Zhao, N., Rao, A., Theobalt, C., Dai, B. and Lin, D., 2021. CityNeRF: Building NeRF at City Scale.


Computer Vision: Recently Published Documents


2D Computer Vision

A Survey on Generative Adversarial Networks: Variants, Applications, and Training

Generative models have gained considerable attention in unsupervised learning via a new and practical framework called Generative Adversarial Networks (GANs), owing to their outstanding data generation capability. Many GAN models have been proposed, and several practical applications have emerged in various domains of computer vision and machine learning. Despite GANs' excellent success, there are still obstacles to stable training, including reaching a Nash equilibrium, internal covariate shift, mode collapse, vanishing gradients, and a lack of proper evaluation metrics. Stable training is therefore a crucial issue for the success of GANs across applications. Herein, we survey several training solutions proposed by different researchers to stabilize GAN training. We discuss (I) the original GAN model and its modified versions, (II) a detailed analysis of various GAN applications in different domains, and (III) a detailed study of the various GAN training obstacles as well as training solutions. Finally, we highlight several open issues and outline future research directions on the topic.

Efficient Channel Attention Based Encoder–Decoder Approach for Image Captioning in Hindi

Image captioning refers to the process of generating a textual description of the objects and activities present in a given image. It connects two fields of artificial intelligence: computer vision, which deals with image understanding, and natural language processing, which deals with language modeling. In the existing literature, most work has addressed image captioning in English. This article presents a novel method for image captioning in Hindi using an encoder-decoder deep learning architecture with efficient channel attention. The key contribution of this work is the deployment of an efficient channel attention (ECA) mechanism, together with Bahdanau attention and a gated recurrent unit, to build an image captioning model for Hindi. Color images usually consist of three channels: red, green, and blue. The channel attention mechanism focuses on an image's most informative channels while performing the convolution, essentially assigning higher importance to some channels than to others, and has been shown to have great potential for improving the efficiency of deep convolutional neural networks (CNNs). The proposed encoder-decoder architecture utilizes the recently introduced ECA-Net CNN to integrate the channel attention mechanism. Hindi is the fourth most spoken language globally and India's official language, widely spoken in India and South Asia. By translating the well-known MSCOCO dataset from English to Hindi, a dataset for image captioning in Hindi was manually created. The efficiency of the proposed method was compared with other baselines in terms of Bilingual Evaluation Understudy (BLEU) scores, and the results illustrate that the proposed method outperforms the baselines, with improvements of 0.59%, 2.51%, 4.38%, and 3.30% in BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores, respectively, over the state of the art. The quality of the generated captions was further assessed manually in terms of adequacy and fluency to illustrate the method's efficacy.
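
To make the channel-attention idea concrete, here is a minimal ECA-style module in PyTorch: global average pooling squeezes each channel to one number, a small 1D convolution models local cross-channel interaction, and a sigmoid produces per-channel weights. This is a sketch of the mechanism, not the authors' captioning model:

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient channel attention: pool, 1D conv across channels, gate."""

    def __init__(self, k_size: int = 3):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size,
                              padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # (B, C, H, W) -> one descriptor per channel: (B, C, 1, 1)
        y = self.avg_pool(x)
        # Treat the channels as a 1D sequence and let a small conv
        # model local cross-channel interaction: (B, 1, C)
        y = self.conv(y.squeeze(-1).transpose(-1, -2))
        # Back to (B, C, 1, 1) gating weights in [0, 1]
        y = self.sigmoid(y.transpose(-1, -2).unsqueeze(-1))
        return x * y  # reweight each channel of the input

feats = torch.randn(2, 64, 32, 32)
print(ECA()(feats).shape)  # torch.Size([2, 64, 32, 32])
```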

Feature Matching-based Approaches to Improve the Robustness of Android Visual GUI Testing

In automated Visual GUI Testing (VGT) for Android devices, the available tools often suffer from low robustness to mobile fragmentation, leading to incorrect results when running the same tests on different devices. To mitigate these issues, we evaluate two feature matching-based approaches for widget detection in VGT scripts, which use, respectively, the complete full-screen snapshot of the application (Fullscreen) and the cropped images of its widgets (Cropped) as visual locators to match on emulated devices. Our analysis includes validating the portability of different feature-based visual locators over various apps and devices and evaluating their robustness in terms of cross-device portability and correctly executed interactions. We assessed our results through a comparison with two state-of-the-art tools, EyeAutomate and Sikuli. Despite a limited increase in the computational burden, our Fullscreen approach outperformed state-of-the-art tools in terms of correctly identified locators across a wide range of devices and led to a 30% increase in passing tests. Our work shows that the dependability of VGT tools can be improved by bridging the testing and computer vision communities. This connection enables the design of algorithms targeted to domain-specific needs that are thus inherently more usable and robust.

Computer Vision to Recognize Construction Waste Compositions: A Novel Boundary-Aware Transformer (BAT) Model

Computer Vision for Autonomous UAV Flight Safety: An Overview and a Vision-Based Safe Landing Pipeline Example

Recent years have seen an unprecedented spread of Unmanned Aerial Vehicles (UAVs, or “drones”), which are highly useful for both civilian and military applications. Flight safety is a crucial issue in UAV navigation, having to ensure accurate compliance with recently legislated rules and regulations. The emerging use of autonomous drones and UAV swarms raises additional issues, making it necessary to transfuse safety- and regulations-awareness to relevant algorithms and architectures. Computer vision plays a pivotal role in such autonomous functionalities. Although the main aspects of autonomous UAV technologies (e.g., path planning, navigation control, landing control, mapping and localization, target detection/tracking) are already mature and well-covered, ensuring safe flying in the vicinity of crowds, avoidance of passing over persons, or guaranteed emergency landing capabilities in case of malfunctions, are generally treated as an afterthought when designing autonomous UAV platforms for unstructured environments. This fact is reflected in the fragmentary coverage of the above issues in current literature. This overview attempts to remedy this situation, from the point of view of computer vision. It examines the field from multiple aspects, including regulations across the world and relevant current technologies. Finally, since very few attempts have been made so far towards a complete UAV safety flight and landing pipeline, an example computer vision-based UAV flight safety pipeline is introduced, taking into account all issues present in current autonomous drones. The content is relevant to any kind of autonomous drone flight (e.g., for movie/TV production, news-gathering, search and rescue, surveillance, inspection, mapping, wildlife monitoring, crowd monitoring/management), making this a topic of broad interest.

Automatic Recognition and Classification of Microseismic Waveforms Based on Computer Vision

Promises and Pitfalls of Using Computer Vision to Make Inferences About Landscape Preferences: Evidence From an Urban-Proximate Park System

Weight-Sharing Neural Architecture Search: A Battle to Shrink the Optimization Gap

Neural architecture search (NAS) has attracted increasing attention. In recent years, individual search methods have been replaced by weight-sharing search methods for higher search efficiency, but the latter often suffer from lower stability. This article provides a literature review of these methods and attributes this issue to the optimization gap. From this perspective, we summarize existing approaches into several categories according to their efforts in bridging the gap, and we analyze both the advantages and disadvantages of these methodologies. Finally, we share our opinions on the future directions of NAS and AutoML. Owing to the expertise of the authors, this article mainly focuses on the application of NAS to computer vision problems.

Assessing Surface Drainage Conditions at the Street and Neighborhood Scale: A Computer Vision and Flow Direction Method Applied to Lidar Data


MIT News | Massachusetts Institute of Technology


When computer vision works more like a brain, it sees more like people do

Image: a human eye with graphic representations of a computer network superimposed

From cameras to self-driving cars, many of today’s technologies depend on artificial intelligence to extract meaning from visual information. Today’s AI technology has artificial neural networks at its core, and most of the time we can trust these AI computer vision systems to see things the way we do — but sometimes they falter. According to MIT and IBM research scientists, one way to improve computer vision is to instruct the artificial neural networks that they rely on to deliberately mimic the way the brain’s biological neural network processes visual images.

Researchers led by MIT Professor James DiCarlo , the director of MIT’s Quest for Intelligence and member of the MIT-IBM Watson AI Lab, have made a computer vision model more robust by training it to work like a part of the brain that humans and other primates rely on for object recognition. This May, at the International Conference on Learning Representations, the team reported that when they trained an artificial neural network using neural activity patterns in the brain’s inferior temporal (IT) cortex, the artificial neural network was more robustly able to identify objects in images than a model that lacked that neural training. And the model’s interpretations of images more closely matched what humans saw, even when images included minor distortions that made the task more difficult.

Comparing neural circuits

Many of the artificial neural networks used for computer vision already resemble the multilayered brain circuits that process visual information in humans and other primates. Like the brain, they use neuron-like units that work together to process information. As they are trained for a particular task, these layered components collectively and progressively process the visual information to complete the task — determining, for example, that an image depicts a bear or a car or a tree.

DiCarlo and others previously found that when such deep-learning computer vision systems establish efficient ways to solve visual problems, they end up with artificial circuits that work similarly to the neural circuits that process visual information in our own brains. That is, they turn out to be surprisingly good scientific models of the neural mechanisms underlying primate and human vision.

That resemblance is helping neuroscientists deepen their understanding of the brain. By demonstrating ways visual information can be processed to make sense of images, computational models suggest hypotheses about how the brain might accomplish the same task. As developers continue to refine computer vision models, neuroscientists have found new ideas to explore in their own work.

“As vision systems get better at performing in the real world, some of them turn out to be more human-like in their internal processing. That’s useful from an understanding-biology point of view,” says DiCarlo, who is also a professor of brain and cognitive sciences and an investigator at the McGovern Institute for Brain Research.

Engineering a more brain-like AI

While their potential is promising, computer vision systems are not yet perfect models of human vision. DiCarlo suspected one way to improve computer vision may be to incorporate specific brain-like features into these models.

To test this idea, he and his collaborators built a computer vision model using neural data previously collected from vision-processing neurons in the monkey IT cortex — a key part of the primate ventral visual pathway involved in the recognition of objects — while the animals viewed various images. More specifically, Joel Dapello, a Harvard University graduate student and former MIT-IBM Watson AI Lab intern; and Kohitij Kar, assistant professor and Canada Research Chair (Visual Neuroscience) at York University and visiting scientist at MIT; in collaboration with David Cox, IBM Research’s vice president for AI models and IBM director of the MIT-IBM Watson AI Lab; and other researchers at IBM Research and MIT asked an artificial neural network to emulate the behavior of these primate vision-processing neurons while the network learned to identify objects in a standard computer vision task.

“In effect, we said to the network, ‘please solve this standard computer vision task, but please also make the function of one of your inside simulated “neural” layers be as similar as possible to the function of the corresponding biological neural layer,’” DiCarlo explains. “We asked it to do both of those things as best it could.” This forced the artificial neural circuits to find a different way to process visual information than the standard, computer vision approach, he says.

After training the artificial model with biological data, DiCarlo’s team compared its activity to a similarly-sized neural network model trained without neural data, using the standard approach for computer vision. They found that the new, biologically informed model IT layer was — as instructed — a better match for IT neural data.  That is, for every image tested, the population of artificial IT neurons in the model responded more similarly to the corresponding population of biological IT neurons.

The researchers also found that the model IT was a better match to IT neural data collected from another monkey, even though the model had never seen data from that animal, and even when that comparison was evaluated on that monkey’s IT responses to new images. This indicated that the team’s new, “neurally aligned” computer model may be an improved model of the neurobiological function of the primate IT cortex — an interesting finding, given that it was previously unknown whether the amount of neural data that can currently be collected from the primate visual system is capable of directly guiding model development.

With their new computer model in hand, the team asked whether the “IT neural alignment” procedure also leads to any changes in the overall behavioral performance of the model. Indeed, they found that the neurally-aligned model was more human-like in its behavior — it tended to succeed in correctly categorizing objects in images for which humans also succeed, and it tended to fail when humans also fail.

Adversarial attacks

The team also found that the neurally aligned model was more resistant to “adversarial attacks” that developers use to test computer vision and AI systems. In computer vision, adversarial attacks introduce small distortions into images that are meant to mislead an artificial neural network.

“Say that you have an image that the model identifies as a cat. Because you have the knowledge of the internal workings of the model, you can then design very small changes in the image so that the model suddenly thinks it’s no longer a cat,” DiCarlo explains.

These minor distortions don’t typically fool humans, but computer vision models struggle with these alterations. A person who looks at the subtly distorted cat still reliably and robustly reports that it’s a cat. But standard computer vision models are more likely to mistake the cat for a dog, or even a tree.
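
As a concrete illustration of how such distortions are crafted, here is a minimal sketch of the fast gradient sign method (FGSM), one standard adversarial attack. It is a generic example, not the specific attack used in the study described here:

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, label, eps=0.01):
    """Nudge x by eps in the direction that most increases the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), label)
    loss.backward()
    # The sign of the input gradient gives the worst-case direction;
    # a tiny step along it can flip the prediction while remaining
    # nearly invisible to a human observer.
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()
```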

“There must be some internal differences in the way our brains process images that lead to our vision being more resistant to those kinds of attacks,” DiCarlo says. And indeed, the team found that when they made their model more neurally aligned, it became more robust, correctly identifying more images in the face of adversarial attacks. The model could still be fooled by stronger “attacks,” but so can people, DiCarlo says. His team is now exploring the limits of adversarial robustness in humans.

A few years ago, DiCarlo’s team found they could also improve a model’s resistance to adversarial attacks by designing the first layer of the artificial network to emulate the early visual processing layer in the brain. One key next step is to combine such approaches — making new models that are simultaneously neurally aligned at multiple visual processing layers.

The new work is further evidence that an exchange of ideas between neuroscience and computer science can drive progress in both fields. “Everybody gets something out of the exciting virtuous cycle between natural/biological intelligence and artificial intelligence,” DiCarlo says. “In this case, computer vision and AI researchers get new ways to achieve robustness, and neuroscientists and cognitive scientists get more accurate mechanistic models of human vision.”

This work was supported by the MIT-IBM Watson AI Lab, Semiconductor Research Corporation, the U.S. Defense Advanced Research Projects Agency, the MIT Shoemaker Fellowship, the U.S. Office of Naval Research, the Simons Foundation, and the Canada Research Chair Program.





Published on 12.4.2024 in Vol 26 (2024)

Application of AI in Multilevel Pain Assessment Using Facial Images: Systematic Review and Meta-Analysis

Authors of this article:


  • Jian Huo 1 *, MSc
  • Yan Yu 2 *, MMS
  • Wei Lin 3, MMS
  • Anmin Hu 2, 3, 4, MMS
  • Chaoran Wu 2, MD, PhD

1 Boston Intelligent Medical Research Center, Shenzhen United Scheme Technology Company Limited, Boston, MA, United States

2 Department of Anesthesia, Shenzhen People's Hospital, The First Affiliated Hospital of Southern University of Science and Technology, Shenzhen Key Medical Discipline, Shenzhen, China

3 Shenzhen United Scheme Technology Company Limited, Shenzhen, China

4 The Second Clinical Medical College, Jinan University, Shenzhen, China

*these authors contributed equally

Corresponding Author:

Chaoran Wu, MD, PhD

Department of Anesthesia

Shenzhen People's Hospital, The First Affiliated Hospital of Southern University of Science and Technology

Shenzhen Key Medical Discipline

No 1017, Dongmen North Road

Shenzhen, 518020

Phone: 86 18100282848

Email: [email protected]

Background: The continuous monitoring and recording of patients’ pain status is a major problem in current research on postoperative pain management. In the large number of original or review articles focusing on different approaches for pain assessment, many researchers have investigated how computer vision (CV) can help by capturing facial expressions. However, there is a lack of proper comparison of results between studies to identify current research gaps.

Objective: The purpose of this systematic review and meta-analysis was to investigate the diagnostic performance of artificial intelligence models for multilevel pain assessment from facial images.

Methods: The PubMed, Embase, IEEE, Web of Science, and Cochrane Library databases were searched for related publications before September 30, 2023. Studies that used facial images alone to estimate multiple pain values were included in the systematic review. A study quality assessment was conducted using the Quality Assessment of Diagnostic Accuracy Studies, 2nd edition tool. The performance of these studies was assessed by metrics including sensitivity, specificity, log diagnostic odds ratio (LDOR), and area under the curve (AUC). The intermodal variability was assessed and presented by forest plots.

Results: A total of 45 reports were included in the systematic review. The reported test accuracies ranged from 0.27-0.99, and the other metrics, including the mean squared error (MSE), mean absolute error (MAE), intraclass correlation coefficient (ICC), and Pearson correlation coefficient (PCC), ranged from 0.31-4.61, 0.24-2.8, 0.19-0.83, and 0.48-0.92, respectively. In total, 6 studies were included in the meta-analysis. Their combined sensitivity was 98% (95% CI 96%-99%), specificity was 98% (95% CI 97%-99%), LDOR was 7.99 (95% CI 6.73-9.31), and AUC was 0.99 (95% CI 0.99-1). The subgroup analysis showed that the diagnostic performance was acceptable, although imbalanced data were still emphasized as a major problem. All studies had at least one domain with a high risk of bias, and for 20% (9/45) of studies, there were no applicability concerns.
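
For reference, the pooled metrics above are derived from 2×2 contingency tables; a minimal sketch of how sensitivity, specificity, and the log diagnostic odds ratio are computed from one such table (the counts are illustrative, not data from any included study):

```python
import math

def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int):
    sensitivity = tp / (tp + fn)      # true positive rate
    specificity = tn / (tn + fp)      # true negative rate
    dor = (tp * tn) / (fp * fn)       # diagnostic odds ratio
    return sensitivity, specificity, math.log(dor)

# Illustrative counts only.
print(diagnostic_metrics(tp=96, fp=3, fn=2, tn=97))
```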

Conclusions: This review summarizes recent evidence in automatic multilevel pain estimation from facial expressions and compared the test accuracy of results in a meta-analysis. Promising performance for pain estimation from facial images was established by current CV algorithms. Weaknesses in current studies were also identified, suggesting that larger databases and metrics evaluating multiclass classification performance could improve future studies.

Trial Registration: PROSPERO CRD42023418181; https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=418181

Introduction

The definition of pain was revised to “an unpleasant sensory and emotional experience associated with, or resembling that associated with, actual or potential tissue damage” in 2020 [ 1 ]. Acute postoperative pain management is important, as pain intensity and duration are critical influencing factors for the transition of acute pain to chronic postsurgical pain [ 2 ]. To avoid the development of chronic pain, guidelines were promoted and discussed to ensure safe and adequate pain relief for patients, and clinicians were recommended to use a validated pain assessment tool to track patients’ responses [ 3 ]. However, these tools, to some extent, depend on communication between physicians and patients, and continuous data cannot be provided [ 4 ]. The continuous assessment and recording of patient pain intensity will not only reduce caregiver burden but also provide data for chronic pain research. Therefore, automatic and accurate pain measurements are necessary.

Researchers have proposed different approaches to measuring pain intensity. Physiological signals, for example, electroencephalography and electromyography, have been used to estimate pain [ 5 - 7 ]. However, it was reported that current pain assessment from physiological signals has difficulties isolating stress and pain with machine learning techniques, as they share conceptual and physiological similarities [ 8 ]. Recent studies have also investigated pain assessment tools for certain patient subgroups. For example, people with deafness or an intellectual disability may not be able to communicate well with nurses, and an objective pain evaluation would be a better option [ 9 , 10 ]. Measuring pain intensity from patient behaviors, such as facial expressions, is also promising for most patients [ 4 ]. As the most comfortable and convenient method, computer vision techniques require no attachments to patients and can monitor multiple participants using 1 device [ 4 ]. However, pain intensity, which is important for pain research, is often not reported.

With the growing trend of assessing pain intensity using artificial intelligence (AI), it is necessary to summarize current publications to determine the strengths and gaps of existing studies. Existing research has reviewed machine learning applications for acute postoperative pain prediction, continuous pain detection, and pain intensity estimation [10-14]. Input modalities, including facial recordings and physiological signals such as electroencephalography and electromyography, have also been reviewed [5,8]. Other studies have focused on deep learning approaches [11], and AI has been applied to pain evaluation in children and infants as well [15,16]. However, no review has focused specifically on multilevel pain intensity measurement from facial images, and no comparison of test accuracy results has been made.

Current AI applications in pain research can be categorized into 3 types: pain assessment, pain prediction and decision support, and pain self-management [14]. We consider accurate and automatic pain assessment to be the most important area and the foundation of future pain research. In this study, we performed a systematic review and meta-analysis to assess the diagnostic performance of current publications for multilevel pain evaluation.

Methods

This study was registered with PROSPERO (International Prospective Register of Systematic Reviews; CRD42023418181) and carried out strictly following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines [17].

Study Eligibility

Studies that reported AI techniques for multiclass pain intensity classification were eligible. Records involving nonhuman or infant participants or 2-class pain detection were excluded. Only studies using facial images of the test participants were accepted. For the meta-analysis, studies whose reference standard was a clinically used pain assessment tool, such as the visual analog scale (VAS) or numerical rating scale (NRS), or another pain intensity indicator were excluded. Textbox 1 presents the eligibility criteria.

Study characteristics and inclusion criteria

  • Participants: children and adults aged 12 months or older
  • Setting: no restrictions
  • Index test: artificial intelligence models that measure pain intensity from facial images
  • Reference standard: no restrictions for systematic review; Prkachin and Solomon pain intensity score for meta-analysis
  • Study design: no need to specify

Study characteristics and exclusion criteria

  • Participants: infants aged 12 months or younger and animal subjects
  • Setting: no need to specify
  • Index test: studies that use other information such as physiological signals
  • Reference standard: studies using other pain evaluation tools (e.g., NRS, VAS) were excluded from the meta-analysis
  • Study design: reviews

Report characteristics and inclusion criteria

  • Year: published between January 1, 2012, and September 30, 2023
  • Language: English only
  • Publication status: published
  • Test accuracy metrics: no restrictions for systematic reviews; studies that reported contingency tables were included for meta-analysis

Report characteristics and exclusion criteria

  • Year: no need to specify
  • Language: no need to specify
  • Publication status: preprints not accepted
  • Test accuracy metrics: studies that reported insufficient metrics were excluded from meta-analysis

Search Strategy

In this systematic review, the PubMed, Embase, IEEE, Web of Science, and Cochrane Library databases were searched for publications up to September 30, 2023, and no restrictions were applied. The keywords were “artificial intelligence” AND “pain recognition.” Multimedia Appendix 1 shows the detailed search strategy.

Data Extraction

Two reviewers independently screened titles and abstracts and selected eligible records, and disagreements were resolved by discussion with a third collaborator. A prespecified data extraction sheet was used to summarize study characteristics independently. Table S5 in Multimedia Appendix 1 shows the detailed items and explanations for data extraction. Diagnostic accuracy data were extracted into contingency tables, including true positives, false positives, false negatives, and true negatives. These data were used to calculate the pooled diagnostic performance of the different models. Some studies included multiple models, and these models were considered independent of each other.
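As a concrete illustration of how each extracted contingency table maps to the metrics pooled later, the following minimal Python sketch computes sensitivity, specificity, and the LDOR from one 2x2 table. The counts are hypothetical, and the 0.5 continuity correction is our assumption for zero-cell safety, not a step stated by the authors.

```python
import math

def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Sensitivity, specificity, and log diagnostic odds ratio from a 2x2 table."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    # 0.5 continuity correction (an assumption here) guards against zero cells
    ldor = math.log(((tp + 0.5) * (tn + 0.5)) / ((fp + 0.5) * (fn + 0.5)))
    return {"sensitivity": sensitivity, "specificity": specificity, "ldor": ldor}

# Hypothetical counts for one model on one binarized pain task
print(diagnostic_metrics(tp=480, fp=10, fn=10, tn=490))
```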

Study Quality Assessment

All included studies were independently assessed by 2 reviewers using the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) tool [18]. QUADAS-2 assesses the risk of bias across 4 domains: patient selection, index test, reference standard, and flow and timing. The first 3 domains are also assessed for applicability concerns. A specific extension of QUADAS-2, QUADAS-AI, was used to specify the signaling questions [19].

Meta-Analysis

Meta-analyses were conducted between different AI models. Models with different algorithms or training data were considered different models. To evaluate the performance differences between models, contingency tables from model validation were extracted. Studies that did not report sufficient diagnostic accuracy data were excluded from the meta-analysis.

Hierarchical summary receiver operating characteristic (SROC) curves were fitted to evaluate the diagnostic performance of the AI models. These curves were plotted with 95% CIs and prediction regions around the averaged sensitivity, specificity, and area under the curve (AUC) estimates. Heterogeneity was assessed visually with forest plots, and a funnel plot was constructed to evaluate the risk of small-study bias.

Subgroup meta-analyses were conducted to evaluate the performance differences at both the model level and task level, and subgroups were created based on different tasks and the proportion of positive and negative samples.

All statistical analyses and plots were produced using R (version 4.2.2; R Core Team) and the R package meta4diag (version 2.1.1; Guo J and Riebler A) [20].
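To make the input format concrete: a bivariate diagnostic meta-analysis consumes one (TP, FP, FN, TN) table per model and task, and each table contributes one point in SROC space. The Python sketch below computes those logit-transformed coordinates for a few invented tables; the actual Bayesian pooling in this review was done with the meta4diag package in R, so this is only an illustration of the data layout, not of the fitted model.

```python
import math

def logit(p: float) -> float:
    """Log-odds transform used for the SROC axes."""
    return math.log(p / (1 - p))

# Hypothetical (TP, FP, FN, TN) contingency tables, one per model/task pair
tables = [
    (480, 10, 10, 490),
    (300, 25, 15, 310),
    (150, 8, 12, 140),
]

for tp, fp, fn, tn in tables:
    sens = (tp + 0.5) / (tp + fn + 1.0)  # continuity-corrected proportions
    spec = (tn + 0.5) / (tn + fp + 1.0)
    # SROC space: x = logit(1 - specificity), y = logit(sensitivity)
    print(f"sens={sens:.3f} spec={spec:.3f} "
          f"x={logit(1 - spec):+.3f} y={logit(sens):+.3f}")
```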

Results

Study Selection and Included Study Characteristics

A flow diagram representing the study selection process is shown in Figure 1. After removing 1039 duplicates, the titles and abstracts of 5653 papers were screened, and the percentage agreement for title and abstract screening was 97%. After screening, 51 full-text reports were assessed for eligibility, and 45 reports were included in the systematic review [21-65]. The percentage agreement for the full-text review was 87%. Contingency tables could not be constructed for 40 of the included studies, so the meta-analyses were conducted on 8 AI models extracted from 6 studies. The characteristics of the studies included in the systematic review are provided in Tables 1 and 2. Facial feature extraction methods can be categorized into 2 classes: geometric features (GFs) and deep features (DFs). One typical method of extracting GFs is to calculate distances between facial landmarks, whereas DFs are usually extracted by convolution operations. A total of 20 studies included temporal information, and most of them (18) extracted temporal information through the 3D convolution of video sequences. Feature transformation was also commonly applied to reduce training time or to fuse features extracted by different methods before inputting them into the classifier. The most frequently used classifiers were support vector machines (SVMs) and convolutional neural networks (CNNs). Table 1 presents the model designs of the included studies.
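To make the geometric-feature idea concrete, the sketch below computes pairwise Euclidean distances between a handful of facial landmark coordinates. The landmark names and values are invented for illustration; a real system would obtain them from a face alignment model.

```python
import itertools
import math

# Hypothetical (x, y) coordinates of a few detected facial landmarks
landmarks = {
    "left_eye":  (120.0, 95.0),
    "right_eye": (180.0, 96.0),
    "nose_tip":  (150.0, 140.0),
    "mouth_left":  (130.0, 175.0),
    "mouth_right": (170.0, 176.0),
}

# Geometric features: pairwise Euclidean distances between landmarks
features = {
    (a, b): math.dist(landmarks[a], landmarks[b])
    for a, b in itertools.combinations(landmarks, 2)
}
for pair, distance in features.items():
    print(pair, round(distance, 1))
```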

[Figure 1. Flow diagram of the study selection process. Tables 1 and 2, summarizing model designs and training/validation characteristics, appear at this point in the original publication; their footnotes are reproduced below.]

a Temporal features: – indicates no temporal features; + indicates time information extracted from 2 images at different times; ++ indicates deep temporal features extracted through the convolution of video sequences.

b SVM: support vector machine.

c GF: geometric feature.

d GMM: Gaussian mixture model.

e TPS: thin plate spline.

f DML: distance metric learning.

g MDML: multiview distance metric learning.

h AAM: active appearance model.

i RVR: relevance vector regressor.

j PSPI: Prkachin and Solomon pain intensity.

k I-FES: individual facial expressiveness score.

l LSTM: long short-term memory.

m HCRF: hidden conditional random field.

n GLMM: generalized linear mixed model.

o VLAD: vector of locally aggregated descriptor.

p SVR: support vector regression.

q MDS: multidimensional scaling.

r ELM: extreme learning machine.

s Labeled to distinguish different architectures of ensembled deep learning models.

t DCNN: deep convolutional neural network.

u GSM: Gaussian scale mixture.

v DOML: distance ordering metric learning.

w LIAN: locality and identity aware network.

x BiLSTM: bidirectional long short-term memory.

a UNBC: University of Northern British Columbia-McMaster shoulder pain expression archive database.

b LOSO: leave-one-subject-out cross-validation.

c ICC: intraclass correlation coefficient.

d CT: contingency table.

e AUC: area under the curve.

f MSE: mean squared error.

g PCC: Pearson correlation coefficient.

h RMSE: root mean squared error.

i MAE: mean absolute error.

j ICC: intraclass correlation coefficient.

k CCC: concordance correlation coefficient.

l Reported both external and internal validation results and summarized as intervals.

Table 2 summarizes the characteristics of model training and validation. Most studies used publicly available databases, for example, the University of Northern British Columbia-McMaster shoulder pain expression archive database [57]. Table S4 in Multimedia Appendix 1 summarizes the public databases. A total of 7 studies used self-prepared databases. Frames from video sequences were the most common test objects, as 37 studies output frame-level pain intensity, while few measured pain intensity from whole video sequences or photographs. It was common for a study to redefine pain levels into fewer classes than the ground-truth labels. For model validation, cross-validation and leave-one-subject-out validation were commonly used, and only 3 studies performed external validation. Reported test accuracies used different evaluation metrics, including sensitivity, specificity, mean absolute error (MAE), mean squared error (MSE), Pearson correlation coefficient (PCC), and intraclass correlation coefficient (ICC).

Methodological Quality of Included Studies

Table S2 in Multimedia Appendix 1 presents the study quality summary, as assessed by QUADAS-2. All studies carried a risk of bias in patient selection, caused by 2 issues. First, the training data were highly imbalanced, and any method used to adjust the data distribution may introduce bias. Second, the QUADAS-AI correspondence letter [19] specifies that preprocessing of images that changes the image size or resolution may introduce bias. However, the applicability concern was low, as the images properly represent the feeling of pain. Studies that used k-fold cross-validation or leave-one-out cross-validation were considered to have a low risk of bias. Although the Prkachin and Solomon pain intensity (PSPI) score was used by most of the studies, its ability to represent individual pain levels has not been clinically validated; as such, the risk of bias and applicability concerns were considered high when the PSPI score was used as the index test. As an advantage of computer vision techniques, the time interval between the index tests was short and was assessed as having a low risk of bias. Risk proportions are shown in Figure 2. Of all 315 entries, 39% (124) were assessed as high risk. In total, 5 studies had the lowest risk of bias, with 6 domains assessed as low risk [26,27,31,32,59].

[Figure 2. Risk-of-bias proportions of the included studies across QUADAS-2 domains.]

Pooled Performance of Included Models

The 6 studies included in the meta-analysis contributed 8 different models, whose characteristics are summarized in Table S1 in Multimedia Appendix 2 [23,24,26,32,41,57]. Classification of PSPI scores greater than 0, 2, 3, 6, and 9 was selected and treated as a set of distinct tasks for creating contingency tables; 27 contingency tables were extracted from the 8 models. The test performance is shown in Figure 3 as hierarchical SROC curves. The combined sensitivity was 98% (95% CI 96%-99%), the specificity was 98% (95% CI 97%-99%), the LDOR was 7.99 (95% CI 6.73-9.31), and the AUC was 0.99 (95% CI 0.99-1).

[Figure 3. Hierarchical summary receiver operating characteristic (SROC) curves of the pooled models.]

Subgroup Analysis

In this study, subgroup analysis was conducted to investigate the performance differences within models. A total of 8 models were separated and summarized as a forest plot in Multimedia Appendix 3 [23,24,26,32,41,57]. For model 1, the pooled sensitivity, specificity, and LDOR were 95% (95% CI 86%-99%), 99% (95% CI 98%-100%), and 8.38 (95% CI 6.09-11.19), respectively. For model 2, they were 94% (95% CI 84%-99%), 95% (95% CI 88%-99%), and 6.23 (95% CI 3.52-9.04); for model 3, 100% (95% CI 99%-100%), 100% (95% CI 99%-100%), and 11.55 (95% CI 8.82-14.43); for model 4, 83% (95% CI 43%-99%), 94% (95% CI 79%-99%), and 5.14 (95% CI 0.93-9.31); for model 5, 92% (95% CI 68%-99%), 94% (95% CI 78%-99%), and 6.12 (95% CI 1.82-10.16); for model 6, 94% (95% CI 74%-100%), 94% (95% CI 78%-99%), and 6.59 (95% CI 2.21-11.13); for model 7, 98% (95% CI 90%-100%), 97% (95% CI 87%-100%), and 8.31 (95% CI 4.3-12.29); and for model 8, 98% (95% CI 93%-100%), 97% (95% CI 88%-100%), and 8.65 (95% CI 4.84-12.67).

Heterogeneity Analysis

The meta-analysis results indicated that AI models are applicable for estimating pain intensity from facial images. However, extreme heterogeneity existed within the models, except for models 3 and 5, which were proposed by Rathee and Ganotra [24] and Semwal and Londhe [32]. A funnel plot is presented in Figure 4, and a high risk of small-study bias was observed.

[Figure 4. Funnel plot of the models included in the meta-analysis.]

Discussion

Pain management has long been a critical problem in clinical practice, and the use of AI may be a solution. For acute pain management, automatic measurement of pain can reduce the burden on caregivers and provide timely warnings. For chronic pain management, as specified by Glare et al [2], further research is needed, and measuring pain presence, intensity, and quality is one of the issues to be solved in chronic pain studies. Computer vision could improve pain monitoring through real-time detection for clinical use and through data recording for prospective pain studies. To our knowledge, this is the first meta-analysis dedicated to AI performance in multilevel pain classification.

In this study, a model’s performance at specific pain levels was described by stacking multiple classes into one, making each task a binary classification problem. After careful selection in both medical and engineering databases, we observed promising results of AI in evaluating multilevel pain intensity from facial images, with high sensitivity (98%), specificity (98%), LDOR (7.99), and AUC (0.99). It is reasonable to believe that AI can accurately evaluate pain intensity from facial images. Moreover, the study quality and risk of bias were evaluated using an adapted QUADAS-2 assessment tool, which is a strength of this study.

To investigate the source of heterogeneity, we assumed that a well-designed model should show similar effect sizes across different pain levels, and a subgroup meta-analysis was conducted. The funnel and forest plots exhibited extreme heterogeneity. Each model’s performance at specific pain levels was summarized in a forest plot, and within-model heterogeneity was observed in Multimedia Appendix 3 [23,24,26,32,41,57] for all but 2 models. Models 3 and 5 differed in many aspects, including their algorithms and validation methods, but both were trained on a relatively small data set in which the ratio of positive to negative classes was relatively close to 1. Training with imbalanced data is a critical problem in computer vision studies [66]; for example, in the University of Northern British Columbia-McMaster pain data set, fewer than 10 of 48,398 frames had a PSPI score greater than 13. We therefore emphasize that imbalanced data sets are one major cause of heterogeneity, resulting in the poorer performance of AI algorithms.

We tentatively propose minimizing the effect of training with imbalanced data by stacking multiple classes into one class (sketched in code below), an approach already present in studies included in the systematic review [26,32,42,57]. Common methods to minimize bias include resampling and data augmentation [66]. The stacking method was also used in this meta-analysis to compare the test results of different studies, and it is applicable when classes differ only in intensity. A disadvantage of combining classes is that the resulting model may be insufficient for clinical practice when the number of remaining classes is low. Commonly used pain evaluation tools, such as the VAS, have 10 discrete levels, so we recommend that future studies set the number of pain levels to at least 10 for model training.
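The following Python sketch illustrates the stacking idea on hypothetical frame-level PSPI labels and predictions: every class above a threshold is collapsed into the positive class, yielding one contingency table per threshold. The thresholds match the binarized tasks used in the meta-analysis; the label values are invented.

```python
# Hypothetical frame-level ground-truth PSPI labels and model predictions
y_true = [0, 0, 1, 2, 4, 6, 0, 3, 9, 12]
y_pred = [0, 1, 1, 2, 3, 6, 0, 4, 8, 11]

def binarize_task(y_true, y_pred, threshold):
    """Stack every PSPI class above `threshold` into a single positive class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        actual, predicted = t > threshold, p > threshold
        if actual and predicted:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Thresholds matching the binarized tasks used in the meta-analysis
for thr in (0, 2, 3, 6, 9):
    print(f"PSPI > {thr}: TP, FP, FN, TN = {binarize_task(y_true, y_pred, thr)}")
```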

This study has several limitations. First, insufficient data could be included because most studies reported performance metrics (such as the mean squared error and mean absolute error) that cannot be summarized into a contingency table. To create a contingency table that can be included in a meta-analysis, a study should report the number of objects used in each pain class for model validation together with the accuracy, sensitivity, specificity, and F1-score for each pain class (a sketch follows); such a table cannot be created if a study reports only the MAE, PCC, and other metrics commonly used in AI development. Second, a small-study effect was observed in the funnel plot, and the heterogeneity could not be minimized. Third, the PSPI score is not clinically validated and is not the only tool that assesses pain from facial expressions; other clinically validated pain intensity assessment methods exist, such as the Faces Pain Scale-Revised, the Wong-Baker FACES Pain Rating Scale, and the Oucher Scale [3], and more databases could be created based on these tools. Finally, AI-assisted pain assessment is expected to cover larger populations, including patients who cannot communicate, for example, patients with dementia or patients with masked faces. However, only 1 study considered patients with dementia, which again reflects the limited availability of databases [50].
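To show why that reporting suffices, here is a hypothetical Python sketch that recovers one-vs-rest contingency tables from exactly the items recommended above: per-class test counts plus per-class sensitivity and specificity. All numbers are invented for illustration.

```python
# Hypothetical per-class validation report: the items recommended above
# (number of test objects per pain class, per-class sensitivity/specificity)
report = {
    0: {"n": 500, "sensitivity": 0.99, "specificity": 0.97},
    1: {"n": 120, "sensitivity": 0.91, "specificity": 0.98},
    2: {"n": 60,  "sensitivity": 0.85, "specificity": 0.99},
}
total = sum(entry["n"] for entry in report.values())

for label, stats in report.items():
    pos, neg = stats["n"], total - stats["n"]
    tp = round(stats["sensitivity"] * pos)  # one-vs-rest counts recovered
    tn = round(stats["specificity"] * neg)
    fn, fp = pos - tp, neg - tn
    print(f"class {label}: TP={tp} FP={fp} FN={fn} TN={tn}")
```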

AI is a promising tool that can help in pain research in the future. In this systematic review and meta-analysis, a computer vision approach to measuring pain intensity from facial images was investigated. Despite some risk of bias and applicability concerns, CV models can achieve excellent test accuracy. More CV studies in pain estimation, reporting accuracy in contingency tables, and more pain databases are encouraged. Specifically, the creation of a balanced public database that contains not only healthy but also nonhealthy participants should be prioritized, and recording would ideally take place in a clinical environment. We further recommend that researchers report validation results as accuracy, sensitivity, specificity, or contingency tables, together with the number of objects in each pain class, to enable inclusion in future meta-analyses.

Acknowledgments

WL, AH, and CW contributed to the literature search and data extraction. JH and YY wrote the first draft of the manuscript. All authors contributed to the conception and design of the study, the risk of bias evaluation, and data analysis and interpretation, and all authors approved the final version of the manuscript.

Data Availability

The data sets generated and analyzed during this study are available in the Figshare repository [67].

Conflicts of Interest

None declared.

Multimedia Appendix 1: PRISMA checklist, risk of bias summary, search strategy, database summary, and reported items and explanations.

Multimedia Appendix 2: Study performance summary.

Multimedia Appendix 3: Forest plot presenting the pooled performance of subgroups in the meta-analysis.

References

1. Raja SN, Carr DB, Cohen M, Finnerup NB, Flor H, Gibson S, et al. The revised International Association for the Study of Pain definition of pain: concepts, challenges, and compromises. Pain. 2020;161(9):1976-1982.
2. Glare P, Aubrey KR, Myles PS. Transition from acute to chronic pain after surgery. Lancet. 2019;393(10180):1537-1546.
3. Chou R, Gordon DB, de Leon-Casasola OA, Rosenberg JM, Bickler S, Brennan T, et al. Management of postoperative pain: a clinical practice guideline from the American Pain Society, the American Society of Regional Anesthesia and Pain Medicine, and the American Society of Anesthesiologists' Committee on Regional Anesthesia, Executive Committee, and Administrative Council. J Pain. 2016;17(2):131-157.
4. Hassan T, Seus D, Wollenberg J, Weitz K, Kunz M, Lautenbacher S, et al. Automatic detection of pain from facial expressions: a survey. IEEE Trans Pattern Anal Mach Intell. 2021;43(6):1815-1831.
5. Mussigmann T, Bardel B, Lefaucheur JP. Resting-state electroencephalography (EEG) biomarkers of chronic neuropathic pain: a systematic review. Neuroimage. 2022;258:119351.
6. Moscato S, Cortelli P, Chiari L. Physiological responses to pain in cancer patients: a systematic review. Comput Methods Programs Biomed. 2022;217:106682.
7. Thiam P, Hihn H, Braun DA, Kestler HA, Schwenker F. Multi-modal pain intensity assessment based on physiological signals: a deep learning perspective. Front Physiol. 2021;12:720464.
8. Rojas RF, Brown N, Waddington G, Goecke R. A systematic review of neurophysiological sensing for the assessment of acute pain. NPJ Digit Med. 2023;6(1):76.
9. Mansutti I, Tomé-Pires C, Chiappinotto S, Palese A. Facilitating pain assessment and communication in people with deafness: a systematic review. BMC Public Health. 2023;23(1):1594.
10. El-Tallawy SN, Ahmed RS, Nagiub MS. Pain management in the most vulnerable intellectual disability: a review. Pain Ther. 2023;12(4):939-961.
11. Gkikas S, Tsiknakis M. Automatic assessment of pain based on deep learning methods: a systematic review. Comput Methods Programs Biomed. 2023;231:107365.
12. Borna S, Haider CR, Maita KC, Torres RA, Avila FR, Garcia JP, et al. A review of voice-based pain detection in adults using artificial intelligence. Bioengineering (Basel). 2023;10(4):500.
13. De Sario GD, Haider CR, Maita KC, Torres-Guzman RA, Emam OS, Avila FR, et al. Using AI to detect pain through facial expressions: a review. Bioengineering (Basel). 2023;10(5):548.
14. Zhang M, Zhu L, Lin SY, Herr K, Chi CL, Demir I, et al. Using artificial intelligence to improve pain assessment and pain management: a scoping review. J Am Med Inform Assoc. 2023;30(3):570-587.
15. Hughes JD, Chivers P, Hoti K. The clinical suitability of an artificial intelligence-enabled pain assessment tool for use in infants: feasibility and usability evaluation study. J Med Internet Res. 2023;25:e41992.
16. Fang J, Wu W, Liu J, Zhang S. Deep learning-guided postoperative pain assessment in children. Pain. 2023;164(9):2029-2035.
17. Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372:n71.
18. Whiting PF, Rutjes AWS, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med. 2011;155(8):529-536.
19. Sounderajah V, Ashrafian H, Rose S, Shah NH, Ghassemi M, Golub R, et al. A quality assessment tool for artificial intelligence-centered diagnostic test accuracy studies: QUADAS-AI. Nat Med. 2021;27(10):1663-1665.
20. Guo J, Riebler A. meta4diag: Bayesian bivariate meta-analysis of diagnostic test studies for routine practice. J Stat Soft. 2018;83(1):1-31.
21. Hammal Z, Cohn JF. Automatic detection of pain intensity. Proc ACM Int Conf Multimodal Interact. 2012;2012:47-52.
22. Adibuzzaman M, Ostberg C, Ahamed S, Povinelli R, Sindhu B, Love R, et al. Assessment of pain using facial pictures taken with a smartphone. Presented at: 2015 IEEE 39th Annual Computer Software and Applications Conference; July 01-05, 2015; Taichung, Taiwan. p. 726-731.
23. Majumder A, Dutta S, Behera L, Subramanian VK. Shoulder pain intensity recognition using Gaussian mixture models. Presented at: 2015 IEEE International WIE Conference on Electrical and Computer Engineering (WIECON-ECE); December 19-20, 2015; Dhaka, Bangladesh. p. 130-134.
24. Rathee N, Ganotra D. A novel approach for pain intensity detection based on facial feature deformations. J Vis Commun Image Represent. 2015;33:247-254.
25. Sikka K, Ahmed AA, Diaz D, Goodwin MS, Craig KD, Bartlett MS, et al. Automated assessment of children's postoperative pain using computer vision. Pediatrics. 2015;136(1):e124-e131.
26. Rathee N, Ganotra D. Multiview distance metric learning on facial feature descriptors for automatic pain intensity detection. Comput Vis Image Und. 2016;147:77-86.
27. Zhou J, Hong X, Su F, Zhao G. Recurrent convolutional neural network regression for continuous pain intensity estimation in video. Presented at: 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); June 26-July 01, 2016; Las Vegas, NV.
28. Egede J, Valstar M, Martinez B. Fusing deep learned and hand-crafted features of appearance, shape, and dynamics for automatic pain estimation. Presented at: 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017); May 30-June 03, 2017; Washington, DC. p. 689-696.
29. Martinez DL, Rudovic O, Picard R. Personalized automatic estimation of self-reported pain intensity from facial expressions. Presented at: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); July 21-26, 2017; Honolulu, HI. p. 2318-2327.
30. Bourou D, Pampouchidou A, Tsiknakis M, Marias K, Simos P. Video-based pain level assessment: feature selection and inter-subject variability modeling. Presented at: 2018 41st International Conference on Telecommunications and Signal Processing (TSP); July 04-06, 2018; Athens, Greece. p. 1-6.
31. Haque MA, Bautista RB, Noroozi F, Kulkarni K, Laursen C, Irani R. Deep multimodal pain recognition: a database and comparison of spatio-temporal visual modalities. Presented at: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018); May 15-19, 2018; Xi'an, China. p. 250-257.
32. Semwal A, Londhe ND. Automated pain severity detection using convolutional neural network. Presented at: 2018 International Conference on Computational Techniques, Electronics and Mechanical Systems (CTEMS); December 21-22, 2018; Belgaum, India. p. 66-70.
33. Tavakolian M, Hadid A. Deep binary representation of facial expressions: a novel framework for automatic pain intensity recognition. Presented at: 2018 25th IEEE International Conference on Image Processing (ICIP); October 07-10, 2018; Athens, Greece. p. 1952-1956.
34. Tavakolian M, Hadid A. Deep spatiotemporal representation of the face for automatic pain intensity estimation. Presented at: 2018 24th International Conference on Pattern Recognition (ICPR); August 20-24, 2018; Beijing, China. p. 350-354.
35. Wang J, Sun H. Pain intensity estimation using deep spatiotemporal and handcrafted features. IEICE Trans Inf Syst. 2018;E101.D(6):1572-1580.
36. Bargshady G, Soar J, Zhou X, Deo RC, Whittaker F, Wang H. A joint deep neural network model for pain recognition from face. Presented at: 2019 IEEE 4th International Conference on Computer and Communication Systems (ICCCS); February 23-25, 2019; Singapore. p. 52-56.
37. Casti P, Mencattini A, Comes MC, Callari G, Di Giuseppe D, Natoli S, et al. Calibration of vision-based measurement of pain intensity with multiple expert observers. IEEE Trans Instrum Meas. 2019;68(7):2442-2450.
38. Lee JS, Wang CW. Facial pain intensity estimation for ICU patient with partial occlusion coming from treatment. Presented at: BIBE 2019, The Third International Conference on Biological Information and Biomedical Engineering; June 20-22, 2019; Hangzhou, China. p. 1-4.
39. Saha AK, Ahsan GMT, Gani MO, Ahamed SI. Personalized pain study platform using evidence-based continuous learning tool. Presented at: 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC); July 15-19, 2019; Milwaukee, WI. p. 490-495.
40. Tavakolian M, Hadid A. A spatiotemporal convolutional neural network for automatic pain intensity estimation from facial dynamics. Int J Comput Vis. 2019;127(10):1413-1425.
41. Bargshady G, Zhou X, Deo RC, Soar J, Whittaker F, Wang H. Ensemble neural network approach detecting pain intensity from facial expressions. Artif Intell Med. 2020;109:101954.
42. Bargshady G, Zhou X, Deo RC, Soar J, Whittaker F, Wang H. Enhanced deep learning algorithm development to detect pain intensity from facial expression images. Expert Syst Appl. 2020;149:113305.
43. Dragomir MC, Florea C, Pupezescu V. Automatic subject independent pain intensity estimation using a deep learning approach. Presented at: 2020 International Conference on e-Health and Bioengineering (EHB); October 29-30, 2020; Iasi, Romania. p. 1-4.
44. Huang D, Xia Z, Mwesigye J, Feng X. Pain-attentive network: a deep spatio-temporal attention model for pain estimation. Multimed Tools Appl. 2020;79(37-38):28329-28354.
45. Mallol-Ragolta A, Liu S, Cummins N, Schuller B. A curriculum learning approach for pain intensity recognition from facial expressions. Presented at: 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020); November 16-20, 2020; Buenos Aires, Argentina. p. 829-833.
46. Peng X, Huang D, Zhang H. Pain intensity recognition via multi-scale deep network. IET Image Process. 2020;14(8):1645-1652.
47. Tavakolian M, Lopez MB, Liu L. Self-supervised pain intensity estimation from facial videos via statistical spatiotemporal distillation. Pattern Recognit Lett. 2020;140:26-33.
48. Xu X, de Sa VR. Exploring multidimensional measurements for pain evaluation using facial action units. Presented at: 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020); November 16-20, 2020; Buenos Aires, Argentina. p. 786-792.
49. Pikulkaew K, Boonchieng W, Boonchieng E, Chouvatut V. 2D facial expression and movement of motion for pain identification with deep learning methods. IEEE Access. 2021;9:109903-109914.
50. Rezaei S, Moturu A, Zhao S, Prkachin KM, Hadjistavropoulos T, Taati B. Unobtrusive pain monitoring in older adults with dementia using pairwise and contrastive training. IEEE J Biomed Health Inform. 2021;25(5):1450-1462.
51. Semwal A, Londhe ND. S-PANET: a shallow convolutional neural network for pain severity assessment in uncontrolled environment. Presented at: 2021 IEEE 11th Annual Computing and Communication Workshop and Conference (CCWC); January 27-30, 2021; Las Vegas, NV. p. 0800-0806.
52. Semwal A, Londhe ND. ECCNet: an ensemble of compact convolution neural network for pain severity assessment from face images. Presented at: 2021 11th International Conference on Cloud Computing, Data Science & Engineering (Confluence); January 28-29, 2021; Noida, India. p. 761-766.
53. Szczapa B, Daoudi M, Berretti S, Pala P, Del Bimbo A, Hammal Z. Automatic estimation of self-reported pain by interpretable representations of motion dynamics. Presented at: 2020 25th International Conference on Pattern Recognition (ICPR); January 10-15, 2021; Milan, Italy. p. 2544-2550.
54. Ting J, Yang YC, Fu LC, Tsai CL, Huang CH. Distance ordering: a deep supervised metric learning for pain intensity estimation. Presented at: 2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA); December 13-16, 2021; Pasadena, CA. p. 1083-1088.
55. Xin X, Li X, Yang S, Lin X, Zheng X. Pain expression assessment based on a locality and identity aware network. IET Image Process. 2021;15(12):2948-2958.
56. Alghamdi T, Alaghband G. Facial expressions based automatic pain assessment system. Appl Sci. 2022;12(13):6423.
57. Barua PD, Baygin N, Dogan S, Baygin M, Arunkumar N, Fujita H, et al. Automated detection of pain levels using deep feature extraction from shutter blinds-based dynamic-sized horizontal patches with facial images. Sci Rep. 2022;12(1):17297.
58. Fontaine D, Vielzeuf V, Genestier P, Limeux P, Santucci-Sivilotto S, Mory E, et al. Artificial intelligence to evaluate postoperative pain based on facial expression recognition. Eur J Pain. 2022;26(6):1282-1291.
59. Hosseini E, Fang R, Zhang R, Chuah CN, Orooji M, Rafatirad S, et al. Convolution neural network for pain intensity assessment from facial expression. Presented at: 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC); July 11-15, 2022; Glasgow, Scotland. p. 2697-2702.
60. Huang Y, Qing L, Xu S, Wang L, Peng Y. HybNet: a hybrid network structure for pain intensity estimation. Vis Comput. 2021;38(3):871-882.
61. Islamadina R, Saddami K, Oktiana M, Abidin TF, Muharar R, Arnia F. Performance of deep learning benchmark models on thermal imagery of pain through facial expressions. Presented at: 2022 IEEE International Conference on Communication, Networks and Satellite (COMNETSAT); November 03-05, 2022; Solo, Indonesia. p. 374-379.
62. Swetha L, Praiscia A, Juliet S. Pain assessment model using facial recognition. Presented at: 2022 6th International Conference on Intelligent Computing and Control Systems (ICICCS); May 25-27, 2022; Madurai, India. p. 1-5.
63. Wu CL, Liu SF, Yu TL, Shih SJ, Chang CH, Mao SFY, et al. Deep learning-based pain classifier based on the facial expression in critically ill patients. Front Med (Lausanne). 2022;9:851690.
64. Ismail L, Waseem MD. Towards a deep learning pain-level detection deployment at UAE for patient-centric-pain management and diagnosis support: framework and performance evaluation. Procedia Comput Sci. 2023;220:339-347.
65. Vu MT, Beurton-Aimar M. Learning to focus on region-of-interests for pain intensity estimation. Presented at: 2023 IEEE 17th International Conference on Automatic Face and Gesture Recognition (FG); January 05-08, 2023; Waikoloa Beach, HI. p. 1-6.
66. Kaur H, Pannu HS, Malhi AK. A systematic review on imbalanced data challenges in machine learning: applications and solutions. ACM Comput Surv. 2019;52(4):1-36.
67. Data for meta-analysis of pain assessment from facial images. Figshare. 2023. URL: https://figshare.com/articles/dataset/Data_for_Meta-Analysis_of_Pain_Assessment_from_Facial_Images/24531466/1 [accessed 2024-03-22]

Abbreviations

AI: artificial intelligence
AUC: area under the curve
CNN: convolutional neural network
CV: computer vision
DF: deep feature
GF: geometric feature
ICC: intraclass correlation coefficient
LDOR: log diagnostic odds ratio
MAE: mean absolute error
MSE: mean squared error
NRS: numerical rating scale
PCC: Pearson correlation coefficient
PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses
PROSPERO: International Prospective Register of Systematic Reviews
PSPI: Prkachin and Solomon pain intensity
QUADAS-2: Quality Assessment of Diagnostic Accuracy Studies 2
SROC: summary receiver operating characteristic
SVM: support vector machine
VAS: visual analog scale

Edited by A Mavragani; submitted 26.07.23; peer-reviewed by M Arab-Zozani, M Zhang; comments to author 18.09.23; revised version received 08.10.23; accepted 28.02.24; published 12.04.24.

©Jian Huo, Yan Yu, Wei Lin, Anmin Hu, Chaoran Wu. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 12.04.2024.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.


Computer Science > Computer Vision and Pattern Recognition

Title: Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

Abstract: Recent advancements in multimodal large language models (MLLMs) have been noteworthy, yet, these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with user interface (UI) screens. In this paper, we present Ferret-UI, a new MLLM tailored for enhanced understanding of mobile UI screens, equipped with referring, grounding, and reasoning capabilities. Given that UI screens typically exhibit a more elongated aspect ratio and contain smaller objects of interest (e.g., icons, texts) than natural images, we incorporate "any resolution" on top of Ferret to magnify details and leverage enhanced visual features. Specifically, each screen is divided into 2 sub-images based on the original aspect ratio (i.e., horizontal division for portrait screens and vertical division for landscape screens). Both sub-images are encoded separately before being sent to LLMs. We meticulously gather training samples from an extensive range of elementary UI tasks, such as icon recognition, find text, and widget listing. These samples are formatted for instruction-following with region annotations to facilitate precise referring and grounding. To augment the model's reasoning ability, we further compile a dataset for advanced tasks, including detailed description, perception/interaction conversations, and function inference. After training on the curated datasets, Ferret-UI exhibits outstanding comprehension of UI screens and the capability to execute open-ended instructions. For model evaluation, we establish a comprehensive benchmark encompassing all the aforementioned tasks. Ferret-UI excels not only beyond most open-source UI MLLMs, but also surpasses GPT-4V on all the elementary UI tasks.
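As a rough illustration of the "any resolution" splitting rule described in the abstract, the hypothetical Python sketch below divides a screen image into 2 sub-images according to its aspect ratio. The function name and image size are invented; in the actual Ferret-UI pipeline, each sub-image is encoded separately before being passed to the LLM.

```python
from PIL import Image

def split_screen(img: Image.Image) -> tuple[Image.Image, Image.Image]:
    """Divide a UI screenshot into 2 sub-images along its longer axis."""
    w, h = img.size
    if h >= w:
        # Portrait screen: horizontal division into top and bottom halves
        return img.crop((0, 0, w, h // 2)), img.crop((0, h // 2, w, h))
    # Landscape screen: vertical division into left and right halves
    return img.crop((0, 0, w // 2, h)), img.crop((w // 2, 0, w, h))

# Usage with a hypothetical portrait-sized blank screenshot
top, bottom = split_screen(Image.new("RGB", (1170, 2532)))
print(top.size, bottom.size)
```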

