Computer Vision
Semantic Segmentation
Tumor Segmentation
Panoptic Segmentation
3D Semantic Segmentation
Weakly-Supervised Semantic Segmentation
Representation Learning
Disentanglement
Graph Representation Learning
Sentence Embeddings
Network Embedding
Classification
Text Classification
Graph Classification
Audio Classification
Medical Image Classification
Object Detection
3D Object Detection
Real-Time Object Detection
RGB Salient Object Detection
Few-Shot Object Detection
Image Classification
Out of Distribution (OOD) Detection
Few-Shot Image Classification
Fine-Grained Image Classification
Semi-Supervised Image Classification
2D Object Detection
Edge Detection
Thermal Image Segmentation
Open Vocabulary Object Detection
Reinforcement Learning (RL)
Off-Policy Evaluation
Multi-Objective Reinforcement Learning
3D Point Cloud Reinforcement Learning
Deep Hashing
Table Retrieval
Domain Adaptation
Unsupervised Domain Adaptation
Domain Generalization
Test-time Adaptation
Source-Free Domain Adaptation
Image Generation
Image-to-Image Translation
Image Inpainting
Text-to-Image Generation
Conditional Image Generation
Data Augmentation
Image Augmentation
Text Augmentation
Autonomous Vehicles
Autonomous Driving
Self-Driving Cars
Simultaneous Localization and Mapping
Autonomous Navigation
Image Denoising
Color Image Denoising
SAR Image Despeckling
Grayscale Image Denoising
Meta-Learning
Few-Shot Learning
Sample Probing
Universal Meta-Learning
Contrastive Learning
Super-Resolution
Image Super-Resolution
Video Super-Resolution
Multi-Frame Super-Resolution
Reference-based Super-Resolution
Pose Estimation
3D Human Pose Estimation
Keypoint Detection
3D Pose Estimation
6D Pose Estimation
Self-Supervised Learning
Point Cloud Pre-training
Unsupervised Video Clustering
2D Semantic Segmentation
Image Segmentation
Text Style Transfer
Scene Parsing
Reflection Removal
Visual Question Answering (VQA)
Visual Question Answering
Machine Reading Comprehension
Chart Question Answering
Embodied Question Answering
Depth Estimation
3D Reconstruction
Neural Rendering
3D Face Reconstruction
3D Shape Reconstruction
Sentiment Analysis
Aspect-Based Sentiment Analysis (ABSA)
Multimodal Sentiment Analysis
Aspect Sentiment Triplet Extraction
Twitter Sentiment Analysis
Anomaly Detection
Unsupervised Anomaly Detection
One-Class Classification
Supervised Anomaly Detection
Anomaly Detection in Surveillance Videos
Temporal Action Localization
Video Understanding
Video Object Segmentation
Video Generation
Action Classification
Activity Recognition
Action Recognition
Human Activity Recognition
Egocentric Activity Recognition
Group Activity Recognition
One-Shot Learning
Few-Shot Semantic Segmentation
Cross-Domain Few-Shot
Unsupervised Few-Shot Learning
3D Object Super-Resolution
Medical Image Segmentation
Lesion Segmentation
Brain Tumor Segmentation
Cell Segmentation
Brain Segmentation
Monocular Depth Estimation
Stereo Depth Estimation
Depth and Camera Motion
3D Depth Estimation
Exposure Fairness
Optical Character Recognition (OCR)
Active Learning
Handwriting Recognition
Handwritten Digit Recognition
Irregular Text Recognition
Instance Segmentation
Referring Expression Segmentation
3D Instance Segmentation
Real-time Instance Segmentation
Unsupervised Object Segmentation
Facial Recognition and Modelling
Face Recognition
Face Swapping
Face Detection
Facial Expression Recognition (FER)
Face Verification
Object Tracking
Multi-Object Tracking
Visual Object Tracking
Multiple Object Tracking
Cell Tracking
Zero-Shot Learning
Generalized Zero-Shot Learning
Compositional Zero-Shot Learning
Multi-Label Zero-Shot Learning
Quantization
Data-Free Quantization
UNet Quantization
Action Recognition In Videos
3D Action Recognition
Self-Supervised Action Recognition
Few-Shot Action Recognition
Continual Learning
Class Incremental Learning
Continual Named Entity Recognition
Unsupervised Class-Incremental Learning
Scene Understanding
Scene Text Recognition
Scene Graph Generation
Scene Recognition
Adversarial Attack
Backdoor Attack
Adversarial Text
Adversarial Attack Detection
Real-World Adversarial Attack
Image Retrieval
Sketch-Based Image Retrieval
Content-Based Image Retrieval
Composed Image Retrieval (CoIR)
Medical Image Retrieval
Active Object Detection
Dimensionality Reduction
Supervised Dimensionality Reduction
Online Nonnegative CP Decomposition
Emotion Recognition
Speech Emotion Recognition
Emotion Recognition in Conversation
Multimodal Emotion Recognition
Emotion-Cause Pair Extraction
Monocular 3D Object Detection
3D Object Detection From Stereo Images
Multiview Detection
Robust 3D Object Detection
Style Transfer
Image Stylization
Font Style Transfer
Style Generalization
Face Transfer
Optical Flow Estimation
Video Stabilization
Image Reconstruction
MRI Reconstruction
Action Localization
Action Segmentation
Spatio-Temporal Action Localization
Person Re-Identification
Unsupervised Person Re-Identification
Video-Based Person Re-Identification
Generalizable Person Re-Identification
Cloth-Changing Person Re-Identification
Image Captioning
3D Dense Captioning
Controllable Image Captioning
Aesthetic Image Captioning
Relational Captioning
Visual Relationship Detection
Lighting Estimation
3D Room Layouts From A Single RGB Panorama
Road Scene Understanding
Image Restoration
Demosaicking
Spectral Reconstruction
Underwater Image Restoration
JPEG Artifact Correction
Action Detection
Skeleton Based Action Recognition
Online Action Detection
Audio-Visual Active Speaker Detection
Metric Learning
Object Recognition
3D Object Recognition
Continuous Object Recognition
Depiction Invariant Object Recognition
Monocular 3D Human Pose Estimation
Pose Prediction
3D Multi-Person Pose Estimation
3D Human Pose and Shape Estimation
Multi-Label Classification
Missing Labels
Extreme Multi-Label Classification
Medical Code Prediction
Hierarchical Multi-Label Classification
Image Enhancement
Low-Light Image Enhancement
Image Relighting
De-Aliasing
Continuous Control
Steering Control
Drone Controller
Semi-Supervised Video Object Segmentation
Unsupervised Video Object Segmentation
Referring Video Object Segmentation
Video Salient Object Detection
3D Face Modelling
Trajectory Prediction
Trajectory Forecasting
Human Motion Prediction
Out-of-Sight Trajectory Prediction
Multivariate Time Series Imputation
Object Localization
Weakly-Supervised Object Localization
Image-Based Localization
Unsupervised Object Localization
Monocular 3D Object Localization
Novel View Synthesis
Novel LiDAR View Synthesis
Ground Video Synthesis from Satellite Image
Image Quality Assessment
No-Reference Image Quality Assessment
Blind Image Quality Assessment
Aesthetics Quality Assessment
Stereoscopic Image Quality Assessment
Blind Image Deblurring
Single-Image Blind Deblurring
Out-of-Distribution Detection
Video Semantic Segmentation
Camera Shot Segmentation
Cloud Removal
Facial Inpainting
Fine-Grained Image Inpainting
10-Shot Image Generation
GAN Image Forensics
Instruction Following
Visual Instruction Following
Saliency Detection
Saliency Prediction
Co-Salient Object Detection
Video Saliency Detection
Unsupervised Saliency Detection
Change Detection
Semi-Supervised Change Detection
Image Compression
Feature Compression
JPEG Compression Artifact Reduction
Lossy-Compression Artifact Reduction
Color Image Compression Artifact Reduction
Explainable Artificial Intelligence
Explainable Models
Explanation Fidelity Evaluation
FAD Curve Analysis
Image Registration
Unsupervised Image Registration
Visual Reasoning
Visual Commonsense Reasoning
Ensemble Learning
Salient Object Detection
Saliency Ranking
Prompt Engineering
Visual Prompting
Visual Tracking
Point Tracking
RGB-T Tracking
Real-Time Visual Tracking
RF-based Visual Tracking
2D Classification
Neural Network Compression
Music Source Separation
Cell Detection
Plant Phenotyping
Open-Set Classification
Motion Estimation
3D Point Cloud Classification
3D Object Classification
Few-Shot 3D Point Cloud Classification
Zero-Shot Transfer 3D Point Cloud Classification
Image Manipulation Detection
Generalized Zero-Shot Skeletal Action Recognition
Zero-Shot Skeletal Action Recognition
Activity Prediction
Motion Prediction
Cyber Attack Detection
Sequential Skip Prediction
Point Cloud Registration
Image to Point Cloud Registration
Whole Slide Images
Robust 3D Semantic Segmentation
Real-Time 3D Semantic Segmentation
Unsupervised 3D Semantic Segmentation
Furniture Segmentation
Gesture Recognition
Hand Gesture Recognition
Hand-Gesture Recognition
RF-based Gesture Recognition
Text Detection
3D Point Cloud Interpolation
Video Captioning
Dense Video Captioning
Boundary Captioning
Visual Text Correction
Audio-Visual Video Captioning
Medical Diagnosis
Alzheimer's Disease Detection
Retinal OCT Disease Classification
Blood Cell Count
Thoracic Disease Classification
Video Question Answering
Zero-Shot Video Question Answer
Few-Shot Video Question Answering
Visual Grounding
Person-centric Visual Grounding
Phrase Extraction and Grounding (PEG)
Visual Odometry
Face Anti-Spoofing
Monocular Visual Odometry
Hand Pose Estimation
Hand Segmentation
Gesture-to-Gesture Translation
Rain Removal
Single Image Deraining
Image Clustering
Online Clustering
Face Clustering
Multi-View Subspace Clustering
Multi-Modal Subspace Clustering
Colorization
Line Art Colorization
Point-interactive Image Colorization
Color Mismatch Correction
Image Dehazing
Single Image Dehazing
Robot Navigation
PointGoal Navigation
Social Navigation
Sequential Place Learning
Image Manipulation
Unsupervised Image-To-Image Translation
Synthetic-to-Real Translation
Multimodal Unsupervised Image-To-Image Translation
Cross-View Image-to-Image Translation
Fundus to Angiography Generation
Stereo Matching
Visual Localization
Visual Place Recognition
Indoor Localization
3D Place Recognition
Image Editing
Rolling Shutter Correction
Shadow Removal
Joint Deblur and Frame Interpolation
Multimodal Fashion Image Editing
Multimodal-Guided Image Editing
Conformal Prediction
Crowd Counting
Visual Crowd Analysis
Group Detection in Crowds
Human-Object Interaction Detection
Affordance Recognition
Object Reconstruction
3D Object Reconstruction
Deepfake Detection
Synthetic Speech Detection
Human Detection of Deepfakes
Multimodal Forgery Detection
Point Cloud Classification
Jet Tagging
Few-Shot Point Cloud Classification
Image Matching
Semantic Correspondence
Patch Matching
Set Matching
Matching Disparate Images
Image Deblurring
Low-Light Image Deblurring and Enhancement
Document Text Classification
Learning with Noisy Labels
Multi-Label Classification of Biomedical Texts
Political Salient Issue Orientation Detection
Weakly Supervised Action Localization
Weakly-Supervised Temporal Action Localization
Temporal Action Proposal Generation
Activity Recognition in Videos
Earth Observation
Hyperspectral
Hyperspectral Image Classification
Hyperspectral Unmixing
Hyperspectral Image Segmentation
Classification of Hyperspectral Images
Video Quality Assessment
Video Alignment
Temporal Sentence Grounding
Long-Video Activity Recognition
2D Human Pose Estimation
Action Anticipation
3D Face Animation
Semi-Supervised Human Pose Estimation
Scene Classification
Point Cloud Generation
Point Cloud Completion
Referring Expression
Compressive Sensing
Keyword Spotting
Small-Footprint Keyword Spotting
Visual Keyword Spotting
Reconstruction
3D Human Reconstruction
Single-View 3D Reconstruction
4D Reconstruction
Single-Image-Based HDR Reconstruction
Scene Text Detection
Curved Text Detection
Multi-Oriented Scene Text Detection
Boundary Detection
Junction Detection
Image Matting
Semantic Image Matting
Camera Calibration
Video Retrieval
Video-Text Retrieval
Video Grounding
Video-Adverb Retrieval
Replay Grounding
Composed Video Retrieval (CoVR)
Emotion Classification
Superpixels
Remote Sensing
Remote Sensing Image Classification
Change Detection for Remote Sensing Images
Building Change Detection for Remote Sensing Images
Segmentation of Remote Sensing Imagery
Semantic Segmentation of Remote Sensing Imagery
Motion Synthesis
Motion Style Transfer
Temporal Human Motion Composition
Video Summarization
Unsupervised Video Summarization
Supervised Video Summarization
Document AI
Document Understanding
Point Cloud Segmentation
Sensor Fusion
3D Anomaly Detection
Video Anomaly Detection
Artifact Detection
Document Layout Analysis
Point Cloud Reconstruction
3D Semantic Scene Completion
3D Semantic Scene Completion from a single RGB image
Garment Reconstruction
Few-Shot Transfer Learning for Saliency Prediction
Aerial Video Saliency Prediction
Face Generation
Talking Head Generation
Talking Face Generation
Face Age Editing
Facial Expression Generation
Kinship Face Generation
Cross-Modal Retrieval
Image-Text Matching
Multilingual Cross-Modal Retrieval
Zero-Shot Composed Person Retrieval
Cross-Modal Retrieval on RSITMD
Video Instance Segmentation
Human Detection
Privacy Preserving Deep Learning
Membership Inference Attack
Virtual Try-On
Generalized Few-Shot Semantic Segmentation
Scene Flow Estimation
Self-Supervised Scene Flow Estimation
Video Editing
Video Temporal Consistency
Face Reconstruction
Motion Forecasting
Multi-Person Pose forecasting
Multiple Object Forecasting
3D Classification
Generalized Referring Expression Segmentation
Depth Completion
Object Discovery
CARLA Map Leaderboard
Dead-Reckoning Prediction
Gaze Estimation
Texture Synthesis
Image Recognition
Fine-Grained Image Recognition
License Plate Recognition
Material Recognition
Text-based Image Editing
Text-Guided Image Editing
Zero-Shot Text-to-Image Generation
Concept Alignment
Conditional Text-to-Image Synthesis
Human Parsing
Multi-Human Parsing
Multi-View Learning
Incomplete Multi-View Clustering
Sign Language Recognition
3D Multi-Person Pose Estimation (absolute)
3D Multi-Person Pose Estimation (root-relative)
3D Multi-Person Mesh Recovery
Gait Recognition
Multiview Gait Recognition
Gait Recognition in the Wild
Facial Landmark Detection
Unsupervised Facial Landmark Detection
3D Facial Landmark Localization
Pose Tracking
3D Human Pose Tracking
3D Character Animation from a Single Photo
3D Hand Pose Estimation
Interactive Segmentation
Scene Segmentation
Weakly Supervised Segmentation
Dichotomous Image Segmentation
Interest Point Detection
Homography Estimation
Activity Detection
Inverse Rendering
Event-Based Vision
Event-based Optical Flow
Event-Based Video Reconstruction
Event-Based Motion Estimation
Disease Prediction
Disease Trajectory Forecasting
Scene Generation
Breast Cancer Detection
Skin Cancer Classification
Breast Cancer Histology Image Classification
Lung Cancer Diagnosis
Classification of Breast Cancer Histology Images
Object Counting
Training-Free Object Counting
Open-Vocabulary Object Counting
Machine Unlearning
Continual Forgetting
Temporal Localization
Language-Based Temporal Localization
Temporal Defect Localization
Template Matching
3D Object Tracking
3D Single Object Tracking
Multi-Label Image Classification
Multi-label Image Recognition with Partial Labels
Relation Network
Visual Dialog
Text-to-Video Generation
Text-to-Video Editing
Subject-Driven Video Generation
Intelligent Surveillance
Vehicle Re-Identification
LiDAR Semantic Segmentation
Motion Segmentation
Camera Localization
Camera Relocalization
Disparity Estimation
Text Spotting
Few-Shot Class-Incremental Learning
Class-Incremental Semantic Segmentation
Non-Exemplar-Based Class Incremental Learning
Handwritten Text Recognition
Handwritten Document Recognition
Unsupervised Text Recognition
Knowledge Distillation
Data-free Knowledge Distillation
Self-Knowledge Distillation
Text-to-Video Retrieval
Partially Relevant Video Retrieval
Person Search
Decision Making Under Uncertainty
Uncertainty Visualization
Moment Retrieval
Zero-Shot Moment Retrieval
Shadow Detection
Shadow Detection And Removal
Semi-Supervised Object Detection
Unconstrained Lip-Synchronization
Mixed Reality
Video Inpainting
Cross-Corpus
Micro-Expression Recognition
Micro-Expression Spotting
3D Facial Expression Recognition
Smile Recognition
Future Prediction
Video Enhancement
3D Multi-Object Tracking
Real-Time Multi-Object Tracking
Multi-Animal Tracking with Identification
Trajectory Long-Tail Distribution for Multi-Object Tracking
Grounded Multiple Object Tracking
Human Mesh Recovery
Overlapped 10-1
Overlapped 15-1
Overlapped 15-5
Disjoint 10-1
Disjoint 15-1
Face Image Quality Assessment
Lightweight Face Recognition
Age-Invariant Face Recognition
Synthetic Face Recognition
Face Quality Assessment
Image Categorization
Fine-Grained Visual Categorization
Open Vocabulary Semantic Segmentation
Zero-Guidance Segmentation
Deep Attention
Stereo Image Super-Resolution
Burst Image Super-Resolution
Satellite Image Super-Resolution
Multispectral Image Super-Resolution
Physics-Informed Machine Learning
Soil Moisture Estimation
Line Detection
Zero-Shot Segmentation
Color Constancy
Few-Shot Camera-Adaptive Color Constancy
Visual Recognition
Fine-Grained Visual Recognition
Image Cropping
Stereo Matching Hand
Video Reconstruction
3D Absolute Human Pose Estimation
Text-to-Face Generation
HDR Reconstruction
Multi-Exposure Image Fusion
Zero-Shot Action Recognition
Video Restoration
Analog Video Restoration
Sign Language Translation
Tone Mapping
Natural Language Transduction
Surface Normals Estimation
Transparent Object Detection
Transparent Objects
Cross-Domain Few-Shot Learning
Image Forensics
Novel Class Discovery
Vision-Language Navigation
Grasp Generation
Hand-Object Pose
3D Canonical Hand Pose Estimation
Image Animation
Breast Cancer Histology Image Classification (20% labels)
Infrared and Visible Image Fusion
Probabilistic Deep Learning
Unsupervised Few-Shot Image Classification
Generalized Few-Shot Classification
Abnormal Event Detection in Video
Semi-supervised Anomaly Detection
Steganalysis
Texture Classification
Spoof Detection
Face Presentation Attack Detection
Detecting Image Manipulation
Cross-Domain Iris Presentation Attack Detection
Finger Dorsal Image Spoof Detection
Computer Vision Techniques Adopted in 3D Cryogenic Electron Microscopy
Single Particle Analysis
Cryogenic Electron Tomography
Sketch Recognition
Face Sketch Synthesis
Drawing Pictures
Photo-To-Caricature Translation
Iris Recognition
Pupil Dilation
Highlight Detection
Pedestrian Attribute Recognition
One-Shot Visual Object Segmentation
Automatic Post-Editing
Image to 3D
Multi-View 3D Reconstruction
Object Categorization
Person Retrieval
Universal Domain Adaptation
Unbiased Scene Graph Generation
Panoptic Scene Graph Generation
Action Understanding
Blind Face Restoration
Document Image Classification
Face Reenactment
Geometric Matching
Image Stitching
Text-Based Person Retrieval
Human Dynamics
3D Human Dynamics
Meme Classification
Hateful Meme Classification
Image-to-Video Generation
Unconditional Video Generation
Severity Prediction
Intubation Support Prediction
Dense Captioning
Human Action Generation
Action Generation
Text-to-Image
Story Visualization
Complex Scene Breaking and Synthesis
Action Quality Assessment
Cloud Detection
Image Outpainting
Object Segmentation
Camouflaged Object Segmentation
Landslide Segmentation
Text-Line Extraction
Surgical Phase Recognition
Online Surgical Phase Recognition
Offline Surgical Phase Recognition
Image Fusion
Pansharpening
Semantic SLAM
Object SLAM
Image Deconvolution
Intrinsic Image Decomposition
Diffusion Personalization
Diffusion Personalization Tuning Free
Efficient Diffusion Personalization
Point Clouds
Point Cloud Video Understanding
Point Cloud Representation Learning
Situation Recognition
Grounded Situation Recognition
Line Segment Detection
Multi-Target Domain Adaptation
Table Recognition
Camouflaged Object Segmentation with a Single Task-generic Prompt
Image Morphing
Image Shadow Removal
Visual Prompt Tuning
Weakly-Supervised Instance Segmentation
Image Smoothing
Fake Image Detection
Fake Image Attribution
Robot Pose Estimation
Image Steganography
Motion Detection
Person Identification
Rotated MNIST
Sports Analytics
Lane Detection
3D Lane Detection
Layout Design
License Plate Detection
Video Panoptic Segmentation
Viewpoint Estimation
Drone Navigation
Drone-View Target Localization
Contour Detection
Multi-Object Tracking and Segmentation
Occlusion Handling
Zero-Shot Transfer Image Classification
3D Object Reconstruction From A Single Image
CAD Reconstruction
Value Prediction
Body Mass Index (BMI) Prediction
3D Point Cloud Linear Classification
Crop Classification
Face Image Quality
Photo Retouching
Motion Retargeting
Shape Representation of 3D Point Clouds
3D Point Cloud Reconstruction
Bird's-Eye View Semantic Segmentation
Crop Yield Prediction
Dense Pixel Correspondence Estimation
Human Part Segmentation
Multiview Learning
Person Recognition
Document Shadow Removal
Symmetry Detection
Traffic Sign Detection
Video Style Transfer
Referring Image Matting
Referring Image Matting (Expression-based)
Referring Image Matting (Keyword-based)
Referring Image Matting (RefMatte-RW100)
Referring Image Matting (Prompt-Based)
Human Interaction Recognition
One-Shot 3D Action Recognition
Mutual Gaze
Affordance Detection
Image Instance Retrieval
Amodal Instance Segmentation
Image Quality Estimation
Road Damage Detection
Space-time Video Super-resolution
Video Matting
Hand Detection
Image Forgery Detection
Image Similarity Search
Material Classification
Precipitation Forecasting
Referring Expression Generation
Inverse Tone Mapping
Image/Document Clustering
Self-Organized Clustering
Open-World Semi-Supervised Learning
Semi-Supervised Image Classification (Cold Start)
3D Shape Modeling
Action Analysis
Facial Editing
Food Recognition
Holdout Set
Motion Magnification
Open Vocabulary Attribute Detection
Semi-Supervised Instance Segmentation
Video Segmentation
Camera Shot Boundary Detection
Open-Vocabulary Video Segmentation
Open-World Video Segmentation
Instance Search
Audio Fingerprint
Art Analysis
Event Segmentation
Generic Event Boundary Detection
Gaze Prediction
Image Retouching
Image-Variation
JPEG Artifact Removal
Point Cloud Super Resolution
Skills Assessment
Sensor Modeling
Binary Classification
LLM-Generated Text Detection
Cancer-No Cancer per Breast Classification
Cancer-No Cancer per Image Classification
Suspicious (BIRADS 4,5)-No Suspicious (BIRADS 1,2,3) per Image Classification
Cancer-No Cancer per View Classification
Lung Nodule Classification
Lung Nodule 3D Classification
Lung Nodule Detection
Lung Nodule 3D Detection
Video Prediction
Earth Surface Forecasting
Predict Future Video Frames
3D Scene Reconstruction
Zero-Shot Composed Image Retrieval (ZS-CIR)
Handwriting Generation
Multispectral Object Detection
Pose Retrieval
Scanpath Prediction
Scene Change Detection
Sketch-to-Image Translation
Skills Evaluation
Highlight Removal
3D Shape Reconstruction from a Single 2D Image
Shape from Texture
Deception Detection
Deception Detection in Videos
Handwriting Verification
Bangla Spelling Error Correction
3D Shape Representation
3D Dense Shape Correspondence
Audio-Visual Synchronization
Bird's-Eye View Object Detection
Multiple People Tracking
RGB-D Reconstruction
Seeing Beyond the Visible
Semi-Supervised Domain Generalization
Unsupervised Semantic Segmentation
Unsupervised Semantic Segmentation with Language-Image Pre-training
Multiple Object Tracking with Transformer
Multiple Object Track and Segmentation
Constrained Lip-Synchronization
Face Dubbing
Vietnamese Visual Question Answering
Explanatory Visual Question Answering
Video Visual Relation Detection
Human-Object Relationship Detection
3D Open-Vocabulary Instance Segmentation
Ad-Hoc Video Search
Defocus Blur Detection
Event Data Classification
Image Comprehension
Image Manipulation Localization
Instance Shadow Detection
Kinship Verification
Medical Image Enhancement
Network Interpretation
Open Vocabulary Panoptic Segmentation
Single-Object Discovery
Training-Free 3D Point Cloud Classification
Sequential Place Recognition
Autonomous Flight (Dense Forest)
Autonomous Web Navigation
Multimodal Machine Translation
Face to Face Translation
Multimodal Lexical Translation
2D Semantic Segmentation Task 3 (25 Classes)
Document Enhancement
Bokeh Effect Rendering
Drivable Area Detection
Face Anonymization
Font Recognition
Horizon Line Estimation
Image Imputation
Long Video Retrieval (Background Removed)
Medical Image Denoising
Occlusion Estimation
Physiological Computing
Lake Ice Monitoring
Short-Term Object Interaction Anticipation
Spatio-Temporal Video Grounding
Unsupervised 3D Point Cloud Linear Evaluation
Video Forensics
Wireframe Parsing
Single-Image Generation
Unsupervised Anomaly Detection with Specified Settings -- 30% Anomaly
Root Cause Ranking
Anomaly Detection at 30% Anomaly
Anomaly Detection at Various Anomaly Percentages
Unsupervised Contextual Anomaly Detection
2D Pose Estimation
Category-Agnostic Pose Estimation
Overlapping Pose Estimation
Facial Expression Recognition
Cross-Domain Facial Expression Recognition
Zero-Shot Facial Expression Recognition
Landmark Tracking
Muscle Tendon Junction Identification
Action Assessment
Animated GIF Generation
Generalized Referring Expression Comprehension
Image Deblocking
Motion Disentanglement
Persuasion Strategies
Scene Text Editing
Synthetic Image Detection
Traffic Accident Detection
Accident Anticipation
Unsupervised Landmark Detection
Visual Speech Recognition
Lip to Speech Synthesis
Continual Anomaly Detection
Gaze Redirection
Weakly Supervised Action Segmentation (Transcript)
Weakly Supervised Action Segmentation (Action Set)
Calving Front Delineation in Synthetic Aperture Radar Imagery
Calving Front Delineation in Synthetic Aperture Radar Imagery with Fixed Training Amount
Handwritten Line Segmentation
Handwritten Word Segmentation
General Action Video Anomaly Detection
Physical Video Anomaly Detection
Monocular Cross-View Road Scene Parsing (Road)
Monocular Cross-View Road Scene Parsing (Vehicle)
Transparent Object Depth Estimation
3D Semantic Occupancy Prediction
3D Scene Editing
4D Panoptic Segmentation
Age and Gender Estimation
Data Ablation
Occluded Face Detection
Gait Identification
Historical Color Image Dating
Stochastic Human Motion Prediction
Image Retargeting
Image and Video Forgery Detection
Infrared Image Super-Resolution
Motion Captioning
Personality Trait Recognition
Personalized Segmentation
Scene-Aware Dialogue
Spatial Relation Recognition
Spatial Token Mixer
Steganographics
Story Continuation
Unsupervised Anomaly Detection with Specified Settings -- 0.1% Anomaly
Unsupervised Anomaly Detection with Specified Settings -- 1% Anomaly
Unsupervised Anomaly Detection with Specified Settings -- 10% Anomaly
Unsupervised Anomaly Detection with Specified Settings -- 20% Anomaly
Vehicle Speed Estimation
Visual Social Relationship Recognition
Zero-Shot Text-to-Video Generation
Text-Guided Generation
Video Frame Interpolation
3D Video Frame Interpolation
Unsupervised Video Frame Interpolation
eXtreme-Video-Frame-Interpolation
Continual Semantic Segmentation
Overlapped 5-3
Overlapped 25-25
Evolving Domain Generalization
Source-Free Domain Generalization
Micro-Expression Generation
Micro-Expression Generation (MEGC2021)
Mistake Detection
Online Mistake Detection
Unsupervised Panoptic Segmentation
Unsupervised Zero-Shot Panoptic Segmentation
3D Rotation Estimation
Camera Auto-Calibration
Defocus Estimation
Derendering
Fingertip Detection
Hierarchical Text Segmentation
Human-Object Interaction Concept Discovery
One-Shot Face Stylization
Speaker-Specific Lip to Speech Synthesis
Multi-Person Pose Estimation
Neural Stylization
Part-aware Panoptic Segmentation
Population Mapping
Pornography Detection
Prediction of Occupancy Grid Maps
Raw Reconstruction
SVBRDF Estimation
Semi-Supervised Video Classification
Spectrum Cartography
Supervised Image Retrieval
Synthetic Image Attribution
Training-Free 3D Part Segmentation
Unsupervised Image Decomposition
Video Propagation
Visual Analogies
Weakly Supervised 3D Point Cloud Segmentation
Weakly-Supervised Panoptic Segmentation
Drone-Based Object Tracking
Brain Visual Reconstruction
Brain Visual Reconstruction from fMRI
Human-Object Interaction Generation
Image-Guided Composition
Fashion Understanding
Semi-Supervised Fashion Compatibility
Intensity Image Denoising
Lifetime Image Denoising
Observation Completion
Active Observation Completion
Boundary Grounding
Video Narrative Grounding
3D Inpainting
3D Scene Graph Alignment
4D Spatio-Temporal Semantic Segmentation
Age Estimation
Few-shot Age Estimation
BRDF Estimation
Camouflage Segmentation
Clothing Attribute Recognition
Damaged Building Detection
Depth Image Estimation
Detecting Shadows
Dynamic Texture Recognition
Disguised Face Verification
Few-Shot Open-Set Object Detection
Gaze Target Estimation
Generalized Zero-Shot Learning - Unseen
HD Semantic Map Learning
Human-Object Interaction Anticipation
Image Deep Networks
Keypoint Detection and Image Matching
Manufacturing Quality Control
Materials Imaging
Multi-Person Pose Estimation and Tracking
Multi-Modal Image Segmentation
Multi-Object Discovery
Neural Radiance Caching
Parking Space Occupancy
Partial Video Copy Detection
Multimodal Patch Matching
Perpetual View Generation
Procedure Learning
Prompt-Driven Zero-Shot Domain Adaptation
Repetitive Action Counting
Single-Shot HDR Reconstruction
On-the-Fly Sketch-Based Image Retrieval
Thermal Image Denoising
Trademark Retrieval
Unsupervised Instance Segmentation
Unsupervised Zero-Shot Instance Segmentation
Vehicle Key-Point and Orientation Estimation
Video Individual Counting
Video-Adverb Retrieval (Unseen Compositions)
Video-to-Image Affordance Grounding
Visual Sentiment Prediction
Human-Scene Contact Detection
Localization in Video Forgery
3D Canonicalization
Cube Engraving Classification
3D Surface Generation
Visibility Estimation from Point Cloud
Amodal Layout Estimation
Blink Estimation
Camera Absolute Pose Regression
Change Data Generation
Constrained Diffeomorphic Image Registration
Continuous Affect Estimation
Deep Feature Inversion
Document Image Skew Estimation
Earthquake Prediction
Fashion Compatibility Learning
Film Removal
Displaced People Recognition
Finger Vein Recognition
Flooded Building Segmentation
Future Hand Prediction
Generative Temporal Nursing
House Generation
Human fMRI Response Prediction
Hurricane Forecasting
IFC Entity Classification
Image Declipping
Image Similarity Detection
Image Text Removal
Image-to-GPS Verification
Image-Based Automatic Meter Reading
Dial Meter Reading
Indoor Scene Reconstruction
JPEG Decompression
Kiss Detection
Laminar-Turbulent Flow Localisation
Landmark Recognition
Brain Landmark Detection
Corpus Video Moment Retrieval
MLLM Evaluation: Aesthetics
Medical Image Deblurring
Mental Workload Estimation
Meter Reading
Micro-Gesture Recognition
Motion Expressions Guided Video Segmentation
Natural Image Orientation Angle Detection
Multi-Object Colocalization
Multilingual Text-to-Image Generation
Video Emotion Detection
NWP Post-Processing
Occluded 3D Object Symmetry Detection
Open Set Video Captioning
PSO-ConvNets Dynamics 1
PSO-ConvNets Dynamics 2
Partial Point Cloud Matching
Partially View-aligned Multi-view Learning
Pedestrian Detection
Thermal Infrared Pedestrian Detection
Personality trait recognition by face, physical attribute prediction, point cloud semantic completion, point cloud classification dataset, point- of-no-return (pnr) temporal localization, pose contrastive learning, potrait generation, prostate zones segmentation, pulmorary vessel segmentation, pulmonary artery–vein classification, reference expression generation, safety perception recognition, interspecies facial keypoint transfer, specular reflection mitigation, specular segmentation, state change object detection, surface normals estimation from point clouds, transform a video into a comics, transparency separation, typeface completion.
Unbalanced Segmentation
Unsupervised Long Term Person Re-Identification
Video correspondence flow.
Key-Frame-based Video Super-Resolution (K = 15)
Vietnamese multimodal learning, zero-shot single object tracking, yield mapping in apple orchards, lidar absolute pose regression, opd: single-view 3d openable part detection, self-supervised scene text recognition, video narration captioning, spectral estimation, spectral estimation from a single rgb image, 3d prostate segmentation, aggregate xview3 metric, atomic action recognition, composite action recognition, calving front delineation from synthetic aperture radar imagery, computer vision transduction, crosslingual text-to-image generation, zero-shot dense video captioning, document to image conversion, frame duplication detection, geometrical view, hyperview challenge.
Image Operation Chain Detection
Kinematic based workflow recognition, logo recognition.
MLLM Aesthetic Evaluation
Motion detection in non-stationary scenes, open-set video tagging, satellite orbit determination.
Segmentation Based Workflow Recognition
2d particle picking, small object detection.
Rice Grain Disease Detection
Sperm morphology classification, video & kinematic base workflow recognition, video based workflow recognition, video, kinematic & segmentation base workflow recognition, animal pose estimation.
Top Computer Vision Papers of All Time (Updated 2024)
Today’s boom in computer vision (CV) started at the beginning of the 21st century with the breakthrough of deep learning models and convolutional neural networks (CNNs). The main CV methods include image classification, image localization, object detection, and segmentation.
In this article, we dive into some of the most significant research papers that triggered the rapid development of computer vision. We split them into two categories – classical CV approaches, and papers based on deep learning. We chose the following papers based on their influence, quality, and applicability.
Gradient-based Learning Applied to Document Recognition (1998)
Distinctive Image Features from Scale-Invariant Keypoints (2004)
Histograms of Oriented Gradients for Human Detection (2005)
SURF: Speeded Up Robust Features (2006)
ImageNet Classification with Deep Convolutional Neural Networks (2012)
Very Deep Convolutional Networks for Large-Scale Image Recognition (2014)
GoogLeNet – Going Deeper with Convolutions (2014)
ResNet – Deep Residual Learning for Image Recognition (2015)
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks (2015)
YOLO: You Only Look Once: Unified, Real-Time Object Detection (2016)
Mask R-CNN (2017)
EfficientNet – Rethinking Model Scaling for Convolutional Neural Networks (2019)
Classic Computer Vision Papers
The authors Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner published the LeNet paper in 1998. They introduced the concept of a trainable Graph Transformer Network (GTN) for handwritten character and word recognition. They researched discriminative and non-discriminative gradient-based techniques for training the recognizer as a whole, without manual segmentation and labeling.
Characteristics of the model:
- The LeNet-5 CNN comprises seven trainable layers – convolutional layers with multiple feature maps, subsampling layers, and fully connected layers; the first convolutional layer alone has 156 trainable parameters.
- The input is a 32×32 pixel image, and the output layer is composed of Euclidean Radial Basis Function (RBF) units, one for each class.
- The training set consists of 30,000 examples, and the authors achieved a 0.35% error rate on the training set (after 19 passes).
Find the LeNet paper here .
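The core operation behind LeNet’s feature maps is the 2D convolution: a small kernel slides over the image and produces one response value per position. As a rough illustration (a pure-Python toy sketch, not the paper’s implementation), with a hand-picked vertical-edge kernel:

```python
def conv2d_valid(image, kernel):
    """Minimal 'valid' 2D convolution: slide the kernel over the image
    and sum elementwise products, producing one feature-map value per
    kernel position (no padding, stride 1)."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(s)
        out.append(row)
    return out

# A 4x4 image with a vertical edge, filtered by a 3x3 vertical-edge kernel.
img = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 1, 1]]
edge = [[-1, 0, 1],
        [-1, 0, 1],
        [-1, 0, 1]]
fmap = conv2d_valid(img, edge)  # fmap == [[3, 3], [3, 3]]
```

Each output value responds strongly wherever the kernel’s pattern (here, a left-dark/right-bright edge) appears in the input; learned versions of such kernels are exactly what LeNet’s training discovers.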
David Lowe (2004) proposed a method for extracting distinctive invariant features from images, and used them to perform reliable matching between different views of an object or scene. The paper introduced the Scale-Invariant Feature Transform (SIFT), which transforms image data into scale-invariant coordinates relative to local features.
Model characteristics:
- The method generates large numbers of features that densely cover the image over the full range of scales and locations.
- The model needs to match at least 3 features from each object in order to reliably detect small objects in cluttered backgrounds.
- For image matching and recognition, the model extracts SIFT features from a set of reference images stored in a database.
- The SIFT model matches a new image by individually comparing each feature from the new image to this database of stored features (using Euclidean distance).
Find the SIFT paper here .
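The nearest-neighbour matching step can be sketched in a few lines of Python. This is a toy illustration with 4-D vectors (real SIFT descriptors are 128-D), using Lowe’s ratio test – a match is kept only if its nearest neighbour is clearly closer than the second nearest; the 0.8 threshold is the value suggested in the paper:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_descriptors(query, database, ratio=0.8):
    """Match each query descriptor to its nearest database descriptor,
    keeping the match only if the nearest is clearly closer than the
    second nearest (Lowe's ratio test)."""
    matches = []
    for qi, q in enumerate(query):
        dists = sorted((euclidean(q, d), di) for di, d in enumerate(database))
        best, second = dists[0], dists[1]
        if best[0] < ratio * second[0]:
            matches.append((qi, best[1]))
    return matches

# Toy 4-D descriptors standing in for a reference database and a new image.
db = [[0, 0, 0, 0], [10, 10, 10, 10], [5, 5, 5, 5]]
query = [[0.1, 0, 0, 0], [4, 4, 4, 4]]
matches = match_descriptors(query, db)  # [(0, 0), (1, 2)]
```

The ratio test is what makes the database lookup robust: an ambiguous descriptor that lies roughly equidistant from two stored features is discarded rather than matched arbitrarily.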
The authors Navneet Dalal and Bill Triggs researched feature sets for robust visual object recognition, using linear SVM-based human detection as a test case. They experimented with grids of Histograms of Oriented Gradient (HOG) descriptors that significantly outperform existing feature sets for human detection.
The authors’ achievements:
- The histogram method gave near-perfect separation on the original MIT pedestrian database.
- For good results, the model requires fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks.
- The researchers also examined a more challenging dataset containing over 1,800 annotated human images with many pose variations and backgrounds.
- In the standard detector, each HOG cell appears four times with different normalizations, which improves performance to 89%.
Find the HOG paper here .
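To make “fine orientation binning” concrete, here is a pure-Python toy sketch of the histogram computed for a single HOG cell: central-difference gradients, unsigned orientation binned over 0–180°, votes weighted by gradient magnitude. (Real HOG adds the coarse spatial binning and overlapping block normalization on top.)

```python
import math

def cell_hog(patch, n_bins=9):
    """Orientation histogram for one HOG cell: central-difference
    gradients, unsigned orientation binned into n_bins over 0-180
    degrees, with votes weighted by gradient magnitude."""
    h, w = len(patch), len(patch[0])
    hist = [0.0] * n_bins
    for y in range(1, h - 1):          # skip the border pixels
        for x in range(1, w - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            mag = math.hypot(gx, gy)
            ang = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(ang / 180.0 * n_bins) % n_bins] += mag
    return hist

# A patch with a vertical edge: all gradient energy is horizontal,
# so every vote lands in the first (0-20 degree) bin.
patch = [[0, 0, 9, 9] for _ in range(4)]
hist = cell_hog(patch)
```

For this edge patch all the magnitude accumulates in bin 0 and the other eight bins stay empty, which is exactly the kind of sharp orientation signature the detector’s SVM learns to recognize.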
Herbert Bay, Tinne Tuytelaars, and Luc Van Gool presented a scale- and rotation-invariant interest point detector and descriptor, called SURF (Speeded Up Robust Features). It outperforms previously proposed schemes concerning repeatability, distinctiveness, and robustness, while being much faster to compute. The authors relied on integral images for image convolutions, building on the strengths of the leading existing detectors and descriptors.
- Applied a Hessian matrix-based measure for the detector, and a distribution-based descriptor, simplifying these methods to the essential.
- Presented experimental results on a standard evaluation set, as well as on imagery obtained in the context of a real-life object recognition application.
- SURF showed strong performance – SURF-128 with an 85.7% recognition rate, followed by U-SURF (83.8%) and SURF (82.6%).
Find the SURF paper here .
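Integral images are what make SURF’s box filters cheap: after one pass over the image, the sum over any rectangle costs four array lookups, independent of the rectangle’s size. A minimal pure-Python sketch:

```python
def integral_image(img):
    """Summed-area table: ii[y][x] = sum of img over the rectangle
    from (0, 0) to (y, x), built in a single pass."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def box_sum(ii, y0, x0, y1, x1):
    """Sum of the original image over the inclusive rectangle
    (y0, x0)-(y1, x1), using only four table lookups."""
    total = ii[y1][x1]
    if y0 > 0:
        total -= ii[y0 - 1][x1]
    if x0 > 0:
        total -= ii[y1][x0 - 1]
    if y0 > 0 and x0 > 0:
        total += ii[y0 - 1][x0 - 1]
    return total

ii = integral_image([[1, 2], [3, 4]])  # ii == [[1, 3], [4, 10]]
```

Because box sums are constant-time, the Hessian-based detector can evaluate its box-filter approximations at any scale without rescaling the image – the key to SURF’s speed.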
Papers Based on Deep-Learning Models
Alex Krizhevsky and his team won the ImageNet Challenge in 2012 with a deep convolutional neural network. They trained one of the largest CNNs of the time on the ImageNet dataset used in the ILSVRC-2010 and ILSVRC-2012 challenges and achieved the best results reported on these datasets. They implemented a highly optimized GPU implementation of 2D convolution, including all the other operations required for CNN training, and published the results.
- The final CNN contained five convolutional and three fully connected layers, and this depth proved to be important.
- They found that removing any convolutional layer (each containing less than 1% of the model’s parameters) resulted in inferior performance.
- The same CNN, with an extra sixth convolutional layer, was used to classify the entire ImageNet Fall 2011 release (15M images, 22K categories).
- After fine-tuning on ImageNet-2012 it gave an error rate of 16.6%.
Find the ImageNet paper here .
Karen Simonyan and Andrew Zisserman (Oxford University) investigated the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Their main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3×3) convolution filters, specifically focusing on very deep convolutional networks (VGG) . They proved that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16–19 weight layers.
- Their ImageNet Challenge 2014 submission secured the first and second places in the localization and classification tracks respectively.
- They showed that their representations generalize well to other datasets, where they achieved state-of-the-art results.
- They made two best-performing ConvNet models publicly available, in addition to the deep visual representations in CV.
Find the VGG paper here .
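The key arithmetic behind VGG’s design is that stacked 3×3 convolutions emulate a larger filter: two of them see a 5×5 region and three see 7×7, with fewer parameters and more non-linearities than a single large filter. A small helper (an illustrative sketch, not from the paper) makes this concrete:

```python
def receptive_field(num_layers, kernel=3, stride=1):
    """Effective receptive field of a stack of identical conv layers:
    each layer grows the field by (kernel - 1) * jump, where jump is
    the product of the strides of all preceding layers."""
    rf, jump = 1, 1
    for _ in range(num_layers):
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Two stacked 3x3 layers cover 5x5; three cover 7x7.
assert receptive_field(2) == 5
assert receptive_field(3) == 7
```

For C input/output channels, three 3×3 layers cost 27C² weights versus 49C² for one 7×7 layer – the parameter saving that let VGG push depth to 16–19 weight layers.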
The Google team (Christian Szegedy, Wei Liu, et al.) proposed a deep convolutional neural network architecture codenamed Inception. They intended to set the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of their architecture was the improved utilization of the computing resources inside the network.
- A carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant.
- Their submission for ILSVRC14 was called GoogLeNet, a 22-layer deep network. Its quality was assessed in the context of classification and detection.
- They added 200 region proposals coming from multi-box, increasing the coverage from 92% to 93%.
- Lastly, they used an ensemble of 6 ConvNets when classifying each region, which improved results from 40% to 43.9% accuracy.
Find the GoogLeNet paper here .
Microsoft researchers Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun presented a residual learning framework (ResNet) to ease the training of networks that are substantially deeper than those used previously. They reformulated the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions.
- They evaluated residual nets with a depth of up to 152 layers – 8× deeper than VGG nets, but still having lower complexity.
- This result won 1st place on the ILSVRC 2015 classification task.
- The team also presented analysis on CIFAR-10 with 100 and 1,000 layers; thanks to the deep residual representations, they additionally obtained a 28% relative improvement on the COCO object detection dataset.
- Moreover, in the ILSVRC & COCO 2015 competitions, they won 1st place on the tasks of ImageNet detection, ImageNet localization, and COCO detection/segmentation.
Find the ResNet paper here .
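The residual idea itself fits in one line: a block outputs F(x) + x instead of F(x), so the stacked layers only need to learn the residual. A toy sketch (vectors instead of tensors, illustrative only) showing why identity mappings become trivial to represent:

```python
def residual_block(x, layer):
    """Residual learning: the block outputs layer(x) + x, so the
    stacked layers only need to learn F(x) = H(x) - x. If the
    identity mapping is optimal, driving F toward zero is easy."""
    return [f + xi for f, xi in zip(layer(x), x)]

# A toy 'layer' whose weights have been driven to zero: the block then
# degenerates to the identity, which is what keeps very deep residual
# nets trainable where plain stacks degrade.
zero_layer = lambda x: [0.0 for _ in x]
out = residual_block([1.0, 2.0, 3.0], zero_layer)  # [1.0, 2.0, 3.0]
```

In the real network the shortcut is an elementwise addition across feature maps (with a projection when dimensions change), and it costs essentially nothing extra in parameters or computation.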
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun introduced the Region Proposal Network (RPN), which shares full-image convolutional features with the detection network, thereby enabling nearly cost-free region proposals. Their RPN was a fully convolutional network that simultaneously predicted object bounds and objectness scores at each position. They trained the RPN end-to-end to generate high-quality region proposals, which were then used by Fast R-CNN for detection.
- Merged the RPN and Fast R-CNN into a single network by sharing their convolutional features; in the terminology of neural networks with “attention” mechanisms, the RPN component tells the unified network where to look.
- For the very deep VGG-16 model, their detection system had a frame rate of 5fps on a GPU.
- Achieved state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image.
- In ILSVRC and COCO 2015 competitions, faster R-CNN and RPN were the foundations of the 1st-place winning entries in several tracks.
Find the Faster R-CNN paper here .
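The RPN’s per-position predictions are made relative to a fixed set of reference boxes called anchors. A sketch of the standard 3-scales × 3-aspect-ratios anchor generation at one feature-map location (the scale and ratio values below follow the paper’s defaults; the exact set is configurable):

```python
def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Anchor boxes centred on one feature-map position, RPN-style:
    one box per (scale, aspect-ratio) pair, here 3 x 3 = 9 anchors,
    returned as (x0, y0, x1, y1)."""
    boxes = []
    for s in scales:
        for r in ratios:
            w = s * (r ** 0.5)  # width/height chosen so the box
            h = s / (r ** 0.5)  # area stays approximately s * s
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

b = anchors_at(0, 0)  # 9 anchors; b[1] is the square 128x128 anchor
```

The RPN then regresses offsets from each anchor to a nearby ground-truth box and scores how object-like it is – which is how a fully convolutional head produces proposals of many shapes from a single feature map.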
Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi developed YOLO, an innovative approach to object detection. Instead of repurposing classifiers to perform detection, the authors framed object detection as a regression problem. In addition, they spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance .
- The base YOLO model processed images in real-time at 45 frames per second.
- A smaller version of the network, Fast YOLO, processed 155 frames per second, while still achieving double the mAP of other real-time detectors.
- Compared to state-of-the-art detection systems, YOLO made more localization errors but was less likely to predict false positives in the background.
- YOLO learned very general representations of objects and outperformed other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains.
Find the YOLO paper here .
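The “detection as regression” framing rests on a simple grid assignment: the image is divided into an S × S grid, and the cell containing a box’s centre is responsible for predicting that object. A sketch with the paper’s S = 7:

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """YOLO-style grid assignment: return the (col, row) of the grid
    cell responsible for a box centred at (cx, cy), plus the centre's
    offset within that cell, each offset in [0, 1)."""
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    x_off = cx / img_w * S - col
    y_off = cy / img_h * S - row
    return col, row, x_off, y_off

# A box centred in the middle of a 448x448 image lands in cell (3, 3),
# exactly halfway across that cell.
cell = responsible_cell(224, 224, 448, 448)  # (3, 3, 0.5, 0.5)
```

Each cell then regresses its boxes’ offsets, sizes, confidences, and class probabilities in a single forward pass – which is why the whole pipeline runs at video frame rates.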
Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick (Facebook) presented a conceptually simple, flexible, and general framework for object instance segmentation. Their approach could detect objects in an image, while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN , extended Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition.
- Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.
- It showed great results in all three tracks of the COCO suite of challenges: instance segmentation, bounding-box object detection, and person keypoint detection.
- Mask R-CNN outperformed all existing, single-model entries on every task, including the COCO 2016 challenge winners.
- The model served as a solid baseline and eased future research in instance-level recognition.
Find the Mask R-CNN paper here .
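All three COCO tracks score predictions by their overlap with ground truth, so Intersection over Union (IoU) is the metric worth knowing here. A minimal implementation for axis-aligned boxes (an illustrative helper, not code from the paper):

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x0, y0, x1, y1):
    the overlap criterion behind COCO-style detection metrics."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))  # 1/7
```

For the mask track the same ratio is computed over pixel sets rather than rectangles, which is why a sharper predicted mask directly translates into a higher score.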
The authors (Mingxing Tan, Quoc V. Le) of EfficientNet studied model scaling and identified that carefully balancing network depth, width, and resolution can lead to better performance. They proposed a new scaling method that uniformly scales all dimensions of depth, width, and resolution using a simple but effective compound coefficient. They demonstrated the effectiveness of this method by scaling up MobileNets and ResNet.
- Designed a new baseline network and scaled it up to obtain a family of models, called EfficientNets. It had much better accuracy and efficiency than previous ConvNets.
- EfficientNet-B7 achieved state-of-the-art 84.3% top-1 accuracy on ImageNet, while being 8.4x smaller and 6.1x faster on inference than the best existing ConvNet.
- It also transferred well and achieved state-of-the-art accuracy on CIFAR-100 (91.7%), Flowers (98.8%), and 3 other transfer learning datasets, with much fewer parameters.
Find the EfficientNet paper here .
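The compound scaling rule can be written in a few lines. The constants below are the α, β, γ values reported for the EfficientNet-B0 grid search, chosen so that FLOPs roughly double per unit of the compound coefficient φ:

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """EfficientNet-style compound scaling: one coefficient phi scales
    depth, width, and input resolution together. The defaults are the
    values reported for the B0 grid search; they approximately satisfy
    alpha * beta**2 * gamma**2 ~= 2, so FLOPs roughly double per unit
    increase in phi."""
    return {
        "depth": alpha ** phi,       # multiplier on number of layers
        "width": beta ** phi,        # multiplier on number of channels
        "resolution": gamma ** phi,  # multiplier on input image size
    }

m = compound_scale(1)  # one step up from the B0 baseline
```

Scaling all three dimensions together is the whole point: the paper shows that growing depth, width, or resolution alone saturates quickly, while the balanced rule keeps improving accuracy per FLOP.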
A curated list of the top 10 computer vision papers in 2021 with video demos, articles, code and paper reference.
louisfb01/top-10-cv-papers-2021
The Top 10 Computer Vision Papers of 2021
The top 10 computer vision papers in 2021 with video demos, articles, code, and paper references.
While the world is still recovering, research hasn't slowed its frenetic pace, especially in the field of artificial intelligence. Moreover, many important aspects were highlighted this year, like the ethical aspects, important biases, governance, transparency, and much more. Artificial intelligence and our understanding of the human brain and its link to AI are constantly evolving, showing promising applications that could improve our quality of life in the near future. Still, we ought to be careful with which technology we choose to apply.
"Science cannot tell us what we ought to do, only what we can do." - Jean-Paul Sartre, Being and Nothingness
Here are my top 10 of the most interesting research papers of the year in computer vision, in case you missed any of them. In short, it is basically a curated list of the latest breakthroughs in AI and CV with a clear video explanation, link to a more in-depth article, and code (if applicable). Enjoy the read, and let me know if I missed any important papers in the comments, or by contacting me directly on LinkedIn!
The complete reference to each paper is listed at the end of this repository.
Maintainer: louisfb01
Subscribe to my newsletter - The latest updates in AI explained every week.
Feel free to message me any interesting paper I may have missed to add to this repository.
Tag me on Twitter @Whats_AI or LinkedIn @Louis (What's AI) Bouchard if you share the list!
Watch the 2021 CV rewind
Missed last year? Check this out: 2020: A Year Full of Amazing AI papers- A Review
👀 If you'd like to support my work and use W&B (for free) to track your ML experiments and make your work reproducible or collaborate with a team, you can try it out by following this guide ! Since most of the code here is PyTorch-based, we thought that a QuickStart guide for using W&B on PyTorch would be most interesting to share.
👉Follow this quick guide , use the same W&B lines in your code or any of the repos below, and have all your experiments automatically tracked in your w&b account! It doesn't take more than 5 minutes to set up and will change your life as it did for me! Here's a more advanced guide for using Hyperparameter Sweeps if interested :)
🙌 Thank you to Weights & Biases for sponsoring this repository and the work I've been doing, and thanks to any of you using this link and trying W&B!
If you are interested in AI research, here is another great repository for you:
A curated list of the latest breakthroughs in AI by release date with a clear video explanation, link to a more in-depth article, and code.
2021: A Year Full of Amazing AI papers- A Review
The Full List
- DALL·E: Zero-Shot Text-to-Image Generation from OpenAI [1]
- Taming Transformers for High-Resolution Image Synthesis [2]
- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [3]
- Deep Nets: What Have They Ever Done for Vision? [bonus]
- Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image [4]
- Total Relighting: Learning to Relight Portraits for Background Replacement [5]
- Animating Pictures with Eulerian Motion Fields [6]
- CVPR 2021 Best Paper Award: GIRAFFE - Controllable Image Generation [7]
- TimeLens: Event-based Video Frame Interpolation [8]
- (Style)CLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis [9]
- CityNeRF: Building NeRF at City Scale [10]
Paper references
OpenAI successfully trained a network able to generate images from text captions. It is very similar to GPT-3 and Image GPT and produces amazing results.
- Short read: OpenAI’s DALL·E: Text-to-Image Generation Explained
- Paper: Zero-Shot Text-to-Image Generation
- Code: Code & more information for the discrete VAE used for DALL·E
TL;DR: They combined the efficiency of GANs and convolutional approaches with the expressivity of transformers to produce a powerful and time-efficient method for semantically guided, high-quality image synthesis.
- Short read: Combining the Transformers Expressivity with the CNNs Efficiency for High-Resolution Image Synthesis
- Paper: Taming Transformers for High-Resolution Image Synthesis
- Code: Taming Transformers
Will Transformers Replace CNNs in Computer Vision? In less than 5 minutes, you will know how the transformer architecture can be applied to computer vision with a new paper called the Swin Transformer.
- Short read: Will Transformers Replace CNNs in Computer Vision?
- Paper: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
- Click here for the code
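The shifted-window idea at the heart of Swin restricts self-attention to local windows, then cyclically shifts the token grid between layers so that successive layers exchange information across window boundaries. A toy pure-Python sketch of the partitioning (illustrative only, far from the paper's implementation):

```python
def shift_and_partition(grid, window, shift):
    """Swin-style shifted windows on a toy 2-D token grid: cyclically
    shift the grid by 'shift' positions, then split it into
    non-overlapping window x window tiles. Alternating shift between 0
    and window // 2 across layers connects neighbouring windows."""
    n = len(grid)
    shifted = [[grid[(y + shift) % n][(x + shift) % n] for x in range(n)]
               for y in range(n)]
    tiles = []
    for wy in range(0, n, window):
        for wx in range(0, n, window):
            tiles.append([row[wx:wx + window]
                          for row in shifted[wy:wy + window]])
    return tiles

# A 4x4 grid of token ids split into 2x2 windows, unshifted and shifted.
grid = [[4 * y + x for x in range(4)] for y in range(4)]
tiles0 = shift_and_partition(grid, 2, 0)  # tiles0[0] == [[0, 1], [4, 5]]
tiles1 = shift_and_partition(grid, 2, 1)  # tiles1[0] == [[5, 6], [9, 10]]
```

Attention is then computed only within each tile, so cost grows linearly with image size rather than quadratically, while the alternating shift restores cross-window connections over depth.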
"I will openly share everything about deep nets for vision applications, their successes, and the limitations we have to address."
- Short read: What is the state of AI in computer vision?
- Paper: Deep nets: What have they ever done for vision?
Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image [4]
The next step for view synthesis: Perpetual View Generation, where the goal is to take an image to fly into it and explore the landscape!
- Short read: Infinite Nature: Fly into an image and explore the landscape
- Paper: Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image
Properly relight any portrait based on the lighting of the new background you add. Have you ever wanted to change the background of a picture but have it look realistic? If you’ve already tried that, you already know that it isn’t simple. You can’t just take a picture of yourself in your home and change the background for a beach. It just looks bad and not realistic. Anyone will just say “that’s photoshopped” in a second. For movies and professional videos, you need the perfect lighting and artists to reproduce a high-quality image, and that’s super expensive. There’s no way you can do that with your own pictures. Or can you?
- Short read: Realistic Lighting on Different Backgrounds
- Paper: Total Relighting: Learning to Relight Portraits for Background Replacement
If you’d like to read more research papers as well, I recommend you read my article where I share my best tips for finding and reading more research papers.
Animating Pictures with Eulerian Motion Fields [6]
This model takes a picture, understands which particles are supposed to be moving, and realistically animates them in an infinite loop while keeping the rest of the picture entirely still, creating amazing-looking videos like this one...
- Short read: Create Realistic Animated Looping Videos from Pictures
- Paper: Animating Pictures with Eulerian Motion Fields
CVPR 2021 Best Paper Award: GIRAFFE - Controllable Image Generation [7]
Using a modified GAN architecture, they can move objects in the image without affecting the background or the other objects!
- Short read: CVPR 2021 Best Paper Award: GIRAFFE - Controllable Image Generation
- Paper: GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields
TimeLens can understand the movement of the particles in between the frames of a video to reconstruct what really happened at a speed even our eyes cannot see. In fact, it achieves results that neither our smartphones nor any other model could reach before!
- Short read: How to Make Slow Motion Videos With AI!
- Paper: TimeLens: Event-based Video Frame Interpolation
Subscribe to my weekly newsletter and stay up-to-date with new publications in AI for 2022!
(Style)CLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis [9]
Have you ever dreamed of taking the style of a picture, like this cool TikTok drawing style on the left, and applying it to a new picture of your choice? Well, I did, and it has never been easier to do. In fact, you can even achieve that from only text and can try it right now with this new method and their Google Colab notebook available for everyone (see references). Simply take a picture of the style you want to copy, enter the text you want to generate, and this algorithm will generate a new picture out of it! Just look back at the results above, such a big step forward! The results are extremely impressive, especially if you consider that they were made from a single line of text!
- Short read: Text-to-Drawing Synthesis With Artistic Control | CLIPDraw & StyleCLIPDraw
- Paper (CLIPDraw): CLIPDraw: exploring text-to-drawing synthesis through language-image encoders
- Paper (StyleCLIPDraw): StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis
- CLIPDraw Colab demo
- StyleCLIPDraw Colab demo
CityNeRF: Building NeRF at City Scale [10]
The model is called CityNeRF and grows from NeRF, which I previously covered on my channel. NeRF is one of the first models using radiance fields and machine learning to construct 3D models out of images. But NeRF is not that efficient and works for a single scale. Here, CityNeRF is applied to satellite and ground-level images at the same time to produce various 3D model scales for any viewpoint. In simple words, they bring NeRF to city-scale. But how?
- Short read: CityNeRF: 3D Modelling at City Scale!
- Paper: CityNeRF: Building NeRF at City Scale
- Click here for the code (will be released soon)
If you would like to read more papers and have a broader view, here is another great repository for you covering 2020: 2020: A Year Full of Amazing AI papers- A Review and feel free to subscribe to my weekly newsletter and stay up-to-date with new publications in AI for 2022!
[1] Ramesh, A. et al., "Zero-Shot Text-to-Image Generation", 2021, arXiv:2102.12092.
[2] Esser, P. et al., "Taming Transformers for High-Resolution Image Synthesis", 2020.
[3] Liu, Z. et al., "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows", 2021, arXiv preprint, https://arxiv.org/abs/2103.14030v1.
[bonus] Yuille, A.L. and Liu, C., "Deep Nets: What Have They Ever Done for Vision?", International Journal of Computer Vision, 129(3), pp. 781–802, 2021, https://arxiv.org/abs/1805.04025.
[4] Liu, A., Tucker, R., Jampani, V., Makadia, A., Snavely, N. and Kanazawa, A., "Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image", 2020, https://arxiv.org/pdf/2012.09855.pdf.
[5] Pandey et al., "Total Relighting: Learning to Relight Portraits for Background Replacement", 2021, doi: 10.1145/3450626.3459872, https://augmentedperception.github.io/total_relighting/total_relighting_paper.pdf.
[6] Holynski, A. et al., "Animating Pictures with Eulerian Motion Fields", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[7] Niemeyer, M. and Geiger, A., "GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields", CVPR, 2021.
[8] Tulyakov, S.*, Gehrig, D.*, Georgoulis, S., Erbach, J., Gehrig, M., Li, Y. and Scaramuzza, D., "TimeLens: Event-based Video Frame Interpolation", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, 2021, http://rpg.ifi.uzh.ch/docs/CVPR21_Gehrig.pdf.
[9] a) "CLIPDraw: Exploring Text-to-Drawing Synthesis Through Language-Image Encoders"; b) Schaldenbrand, P., Liu, Z. and Oh, J., "StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis", 2021.
[10] Xiangli, Y., Xu, L., Pan, X., Zhao, N., Rao, A., Theobalt, C., Dai, B. and Lin, D., "CityNeRF: Building NeRF at City Scale", 2021.
Computer Vision: Recently Published Documents
A Survey on Generative Adversarial Networks: Variants, Applications, and Training
Generative models have gained considerable attention in unsupervised learning via a new and practical framework called Generative Adversarial Networks (GANs), due to their outstanding data generation capability. Many GAN models have been proposed, and several practical applications have emerged in various domains of computer vision and machine learning. Despite GANs' excellent success, there are still obstacles to stable training. The problems include reaching Nash equilibrium, internal covariate shift, mode collapse, vanishing gradients, and the lack of proper evaluation metrics. Therefore, stable training is a crucial issue for the success of GANs in different applications. Herein, we survey several training solutions proposed by different researchers to stabilize GAN training. We discuss (I) the original GAN model and its modified versions, (II) a detailed analysis of various GAN applications in different domains, and (III) a detailed study of the various GAN training obstacles as well as training solutions. Finally, we highlight several open issues and outline research directions for the topic.
Efficient Channel Attention Based Encoder–Decoder Approach for Image Captioning in Hindi
Image captioning refers to the process of generating a textual description that describes objects and activities present in a given image. It connects two fields of artificial intelligence: computer vision and natural language processing, which deal with image understanding and language modeling, respectively. In the existing literature, most of the work has been carried out for image captioning in the English language. This article presents a novel method for image captioning in the Hindi language using an encoder–decoder based deep learning architecture with efficient channel attention. The key contribution of this work is the deployment of an efficient channel attention mechanism with Bahdanau attention and a gated recurrent unit for developing an image captioning model in the Hindi language. Color images usually consist of three channels, namely red, green, and blue. The channel attention mechanism focuses on an image's important channels while performing the convolution, essentially assigning higher importance to specific channels over others. The channel attention mechanism has been shown to have great potential for improving the efficiency of deep convolutional neural networks (CNNs). The proposed encoder–decoder architecture utilizes the recently introduced ECA-Net CNN to integrate the channel attention mechanism. Hindi is the fourth most spoken language globally, widely spoken in India and South Asia, and is India's official language. By translating the well-known MSCOCO dataset from English to Hindi, a dataset for image captioning in Hindi was manually created. The efficiency of the proposed method is compared with other baselines in terms of Bilingual Evaluation Understudy (BLEU) scores, and the results obtained illustrate that the proposed method outperforms other baselines.
The proposed method attained improvements of 0.59%, 2.51%, 4.38%, and 3.30% in terms of BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores, respectively, with respect to the state-of-the-art. The quality of the generated captions is further assessed manually in terms of adequacy and fluency to illustrate the proposed method's efficacy.
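As a rough illustration of the channel-attention idea the abstract describes, here is a pure-Python squeeze-and-gate sketch: each channel is pooled to a scalar, gated through a sigmoid, and used to rescale that channel. (ECA-Net's actual contribution, a small 1-D convolution across the channel descriptors, is omitted here for brevity.)

```python
import math

def channel_attention(feature_maps):
    """Channel-attention sketch: squeeze each channel to a scalar by
    global average pooling, turn the scalars into per-channel weights
    with a sigmoid, and rescale every value in that channel. Channels
    with stronger average activation receive higher importance."""
    weights = []
    for ch in feature_maps:
        pooled = sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        weights.append(1.0 / (1.0 + math.exp(-pooled)))  # sigmoid gate
    return [[[v * w for v in row] for row in ch]
            for ch, w in zip(feature_maps, weights)]

# Two toy 1x2 channels: the zero channel is damped (weight 0.5), the
# strongly activated channel passes almost unchanged.
fm = [[[0.0, 0.0]], [[4.0, 4.0]]]
out = channel_attention(fm)
```

The reweighting is cheap – one scalar per channel – which is why channel attention improves CNN efficiency at almost no computational cost.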
Feature Matching-based Approaches to Improve the Robustness of Android Visual GUI Testing
In automated Visual GUI Testing (VGT) for Android devices, the available tools often suffer from low robustness to mobile fragmentation, leading to incorrect results when running the same tests on different devices. To mitigate these issues, we evaluate two feature matching-based approaches for widget detection in VGT scripts, which use, respectively, the complete full-screen snapshot of the application (Fullscreen) and the cropped images of its widgets (Cropped) as visual locators to match on emulated devices. Our analysis includes validating the portability of different feature-based visual locators over various apps and devices and evaluating their robustness in terms of cross-device portability and correctly executed interactions. We assessed our results through a comparison with two state-of-the-art tools, EyeAutomate and Sikuli. Despite a limited increase in the computational burden, our Fullscreen approach outperformed the state-of-the-art tools in terms of correctly identified locators across a wide range of devices and led to a 30% increase in passing tests. Our work shows that the dependability of VGT tools can be improved by bridging the testing and computer vision communities. This connection enables the design of algorithms targeted to domain-specific needs, and thus inherently more usable and robust.
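The core operation in such tools is locating a visual locator inside a screenshot. As a minimal, dependency-free stand-in for the feature-matching locators discussed above (real tools use keypoint descriptors such as SIFT/ORB, not this brute-force scan), here is a normalized cross-correlation sketch that slides a cropped widget image over a full-screen capture:

```python
import numpy as np

def locate_widget(screen, widget):
    """Brute-force normalized cross-correlation: slide the cropped widget
    over the (grayscale) screenshot and return the top-left (row, col)
    offset of the best-matching window."""
    sh, sw = screen.shape
    wh, ww = widget.shape
    w = widget - widget.mean()
    best, best_pos = -np.inf, (0, 0)
    for y in range(sh - wh + 1):
        for x in range(sw - ww + 1):
            patch = screen[y:y + wh, x:x + ww]
            p = patch - patch.mean()
            denom = np.sqrt((p * p).sum() * (w * w).sum()) or 1.0
            score = (p * w).sum() / denom        # correlation in [-1, 1]
            if score > best:
                best, best_pos = score, (y, x)
    return best_pos
```

Keypoint-based matching improves on this sketch precisely where VGT needs it: it tolerates the scaling and layout shifts introduced by device fragmentation, which exact template correlation does not.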
Computer vision to recognize construction waste compositions: A novel boundary-aware transformer (BAT) model
Computer Vision for Autonomous UAV Flight Safety: An Overview and a Vision-Based Safe Landing Pipeline Example
Recent years have seen an unprecedented spread of Unmanned Aerial Vehicles (UAVs, or “drones”), which are highly useful for both civilian and military applications. Flight safety is a crucial issue in UAV navigation, which must ensure accurate compliance with recently legislated rules and regulations. The emerging use of autonomous drones and UAV swarms raises additional issues, making it necessary to build safety and regulations awareness into the relevant algorithms and architectures. Computer vision plays a pivotal role in such autonomous functionalities. Although the main aspects of autonomous UAV technologies (e.g., path planning, navigation control, landing control, mapping and localization, target detection/tracking) are already mature and well-covered, ensuring safe flying in the vicinity of crowds, avoidance of passing over persons, or guaranteed emergency landing capabilities in case of malfunctions are generally treated as an afterthought when designing autonomous UAV platforms for unstructured environments. This fact is reflected in the fragmentary coverage of these issues in the current literature. This overview attempts to remedy the situation from the point of view of computer vision. It examines the field from multiple aspects, including regulations across the world and relevant current technologies. Finally, since very few attempts have been made so far towards a complete UAV flight safety and landing pipeline, an example computer vision-based UAV flight safety pipeline is introduced, taking into account all issues present in current autonomous drones. The content is relevant to any kind of autonomous drone flight (e.g., for movie/TV production, news-gathering, search and rescue, surveillance, inspection, mapping, wildlife monitoring, crowd monitoring/management), making this a topic of broad interest.
Automatic recognition and classification of microseismic waveforms based on computer vision
Promises and pitfalls of using computer vision to make inferences about landscape preferences: evidence from an urban-proximate park system
Weight-Sharing Neural Architecture Search: A Battle to Shrink the Optimization Gap
Neural architecture search (NAS) has attracted increasing attention. In recent years, individual search methods have been replaced by weight-sharing search methods for higher search efficiency, but the latter often suffer from lower stability. This article provides a literature review on these methods and attributes this issue to the optimization gap. From this perspective, we summarize existing approaches into several categories according to their efforts in bridging the gap, and we analyze both the advantages and disadvantages of these methodologies. Finally, we share our opinions on the future directions of NAS and AutoML. Due to the expertise of the authors, this article mainly focuses on the application of NAS to computer vision problems.
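The weight-sharing idea the review surveys can be caricatured in a few lines. In this toy sketch (all operation names and scores are illustrative, not any real search space), every candidate architecture is a path through one shared parameter table, so ranking candidates requires no per-candidate training — which is exactly where the "optimization gap" arises, since the shared weights are a proxy for each candidate's true stand-alone performance:

```python
import random

# One shared table of operation "weights", reused by every sampled architecture.
shared_weights = {"conv3x3": 0.9, "conv5x5": 0.8, "skip": 1.0}

def sample_architecture(num_layers=3, rng=random):
    """Sample a candidate: one operation choice per layer."""
    return [rng.choice(sorted(shared_weights)) for _ in range(num_layers)]

def evaluate(arch, x=1.0):
    """Stand-in for running the sampled subnetwork with the shared weights."""
    for op in arch:
        x *= shared_weights[op]
    return x

rng = random.Random(0)
candidates = [sample_architecture(rng=rng) for _ in range(10)]
best = max(candidates, key=evaluate)   # rank candidates via the shared proxy
```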
Assessing surface drainage conditions at the street and neighborhood scale: A computer vision and flow direction method applied to lidar data
MIT News | Massachusetts Institute of Technology
When computer vision works more like a brain, it sees more like people do
From cameras to self-driving cars, many of today’s technologies depend on artificial intelligence to extract meaning from visual information. Today’s AI technology has artificial neural networks at its core, and most of the time we can trust these AI computer vision systems to see things the way we do — but sometimes they falter. According to MIT and IBM research scientists, one way to improve computer vision is to instruct the artificial neural networks that they rely on to deliberately mimic the way the brain’s biological neural network processes visual images.
Researchers led by MIT Professor James DiCarlo , the director of MIT’s Quest for Intelligence and member of the MIT-IBM Watson AI Lab, have made a computer vision model more robust by training it to work like a part of the brain that humans and other primates rely on for object recognition. This May, at the International Conference on Learning Representations, the team reported that when they trained an artificial neural network using neural activity patterns in the brain’s inferior temporal (IT) cortex, the artificial neural network was more robustly able to identify objects in images than a model that lacked that neural training. And the model’s interpretations of images more closely matched what humans saw, even when images included minor distortions that made the task more difficult.
Comparing neural circuits
Many of the artificial neural networks used for computer vision already resemble the multilayered brain circuits that process visual information in humans and other primates. Like the brain, they use neuron-like units that work together to process information. As they are trained for a particular task, these layered components collectively and progressively process the visual information to complete the task — determining, for example, that an image depicts a bear or a car or a tree.
DiCarlo and others previously found that when such deep-learning computer vision systems establish efficient ways to solve visual problems, they end up with artificial circuits that work similarly to the neural circuits that process visual information in our own brains. That is, they turn out to be surprisingly good scientific models of the neural mechanisms underlying primate and human vision.
That resemblance is helping neuroscientists deepen their understanding of the brain. By demonstrating ways visual information can be processed to make sense of images, computational models suggest hypotheses about how the brain might accomplish the same task. As developers continue to refine computer vision models, neuroscientists have found new ideas to explore in their own work.
“As vision systems get better at performing in the real world, some of them turn out to be more human-like in their internal processing. That’s useful from an understanding-biology point of view,” says DiCarlo, who is also a professor of brain and cognitive sciences and an investigator at the McGovern Institute for Brain Research.
Engineering a more brain-like AI
While their potential is promising, computer vision systems are not yet perfect models of human vision. DiCarlo suspected one way to improve computer vision may be to incorporate specific brain-like features into these models.
To test this idea, he and his collaborators built a computer vision model using neural data previously collected from vision-processing neurons in the monkey IT cortex — a key part of the primate ventral visual pathway involved in the recognition of objects — while the animals viewed various images. More specifically, Joel Dapello, a Harvard University graduate student and former MIT-IBM Watson AI Lab intern; and Kohitij Kar, assistant professor and Canada Research Chair (Visual Neuroscience) at York University and visiting scientist at MIT; in collaboration with David Cox, IBM Research’s vice president for AI models and IBM director of the MIT-IBM Watson AI Lab; and other researchers at IBM Research and MIT asked an artificial neural network to emulate the behavior of these primate vision-processing neurons while the network learned to identify objects in a standard computer vision task.
“In effect, we said to the network, ‘please solve this standard computer vision task, but please also make the function of one of your inside simulated “neural” layers be as similar as possible to the function of the corresponding biological neural layer,’” DiCarlo explains. “We asked it to do both of those things as best it could.” This forced the artificial neural circuits to find a different way to process visual information than the standard, computer vision approach, he says.
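The dual objective DiCarlo describes can be sketched as a combined loss. This is a generic illustration, not the authors' code; the function names and the weighting term `alpha` are assumptions for the sketch. The network minimizes its standard task loss plus a penalty for mismatch between a chosen model layer's responses and the recorded biological IT responses to the same images:

```python
import numpy as np

def combined_loss(task_loss, model_layer, neural_target, alpha=0.5):
    """Sketch of the dual objective: standard task loss plus a mean-squared
    penalty pulling one model layer's responses toward recorded IT activity.
    `alpha` trades off the two terms (value here is illustrative)."""
    neural_loss = np.mean((model_layer - neural_target) ** 2)
    return task_loss + alpha * neural_loss
```

When the model layer already matches the neural recordings, the penalty vanishes and only the task loss remains; otherwise the gradient nudges the layer toward the biological responses while the network still learns the classification task.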
After training the artificial model with biological data, DiCarlo’s team compared its activity to a similarly-sized neural network model trained without neural data, using the standard approach for computer vision. They found that the new, biologically informed model IT layer was — as instructed — a better match for IT neural data. That is, for every image tested, the population of artificial IT neurons in the model responded more similarly to the corresponding population of biological IT neurons.
The researchers also found that the model IT was also a better match to IT neural data collected from another monkey, even though the model had never seen data from that animal, and even when that comparison was evaluated on that monkey’s IT responses to new images. This indicated that the team’s new, “neurally aligned” computer model may be an improved model of the neurobiological function of the primate IT cortex — an interesting finding, given that it was previously unknown whether the amount of neural data that can be currently collected from the primate visual system is capable of directly guiding model development.
With their new computer model in hand, the team asked whether the “IT neural alignment” procedure also leads to any changes in the overall behavioral performance of the model. Indeed, they found that the neurally-aligned model was more human-like in its behavior — it tended to succeed in correctly categorizing objects in images for which humans also succeed, and it tended to fail when humans also fail.
Adversarial attacks
The team also found that the neurally aligned model was more resistant to “adversarial attacks” that developers use to test computer vision and AI systems. In computer vision, adversarial attacks introduce small distortions into images that are meant to mislead an artificial neural network.
“Say that you have an image that the model identifies as a cat. Because you have the knowledge of the internal workings of the model, you can then design very small changes in the image so that the model suddenly thinks it’s no longer a cat,” DiCarlo explains.
These minor distortions don’t typically fool humans, but computer vision models struggle with these alterations. A person who looks at the subtly distorted cat still reliably and robustly reports that it’s a cat. But standard computer vision models are more likely to mistake the cat for a dog, or even a tree.
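A common way to generate such distortions (a generic fast-gradient-sign sketch, not necessarily the specific attacks used in the study) is to nudge every pixel a tiny step in the direction that increases the model's loss, using the gradient of the loss with respect to the input:

```python
import numpy as np

def fgsm_perturb(image, grad, eps=0.01):
    """Fast-gradient-sign-style adversarial perturbation: step each pixel
    by eps in the sign of the loss gradient, then clip to valid range.
    `grad` is assumed to be d(loss)/d(image) from some model."""
    adv = image + eps * np.sign(grad)
    return np.clip(adv, 0.0, 1.0)
```

Because the step is only `eps` per pixel, the perturbed image looks unchanged to a human, yet the accumulated shift in the model's internal features can flip its prediction.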
“There must be some internal differences in the way our brains process images that lead to our vision being more resistant to those kinds of attacks,” DiCarlo says. And indeed, the team found that when they made their model more neurally aligned, it became more robust, correctly identifying more images in the face of adversarial attacks. The model could still be fooled by stronger “attacks,” but so can people, DiCarlo says. His team is now exploring the limits of adversarial robustness in humans.
A few years ago, DiCarlo’s team found they could also improve a model’s resistance to adversarial attacks by designing the first layer of the artificial network to emulate the early visual processing layer in the brain. One key next step is to combine such approaches — making new models that are simultaneously neurally aligned at multiple visual processing layers.
The new work is further evidence that an exchange of ideas between neuroscience and computer science can drive progress in both fields. “Everybody gets something out of the exciting virtuous cycle between natural/biological intelligence and artificial intelligence,” DiCarlo says. “In this case, computer vision and AI researchers get new ways to achieve robustness, and neuroscientists and cognitive scientists get more accurate mechanistic models of human vision.”
This work was supported by the MIT-IBM Watson AI Lab, Semiconductor Research Corporation, the U.S. Defense Research Projects Agency, the MIT Shoemaker Fellowship, U.S. Office of Naval Research, the Simons Foundation, and Canada Research Chair Program.
Computer Vision Technology Based on Deep Learning
Top 10 Computer Vision Papers of 2021
The top 10 computer vision papers in 2021 with video demos, articles, code, and paper reference.
Louis Bouchard
While the world is still recovering, research hasn’t slowed its frenetic pace, especially in the field of artificial intelligence. Moreover, many important aspects were highlighted this year, such as ethical considerations, important biases, governance, and transparency. Artificial intelligence and our understanding of the human brain and its link to AI are constantly evolving, showing promising applications that could improve our quality of life in the near future. Still, we ought to be careful about which technologies we choose to apply.
"Science cannot tell us what we ought to do, only what we can do." - Jean-Paul Sartre, Being and Nothingness
Here are my top 10 of the most interesting research papers of the year in computer vision, in case you missed any of them. In short, it is basically a curated list of the latest breakthroughs in AI and CV with a clear video explanation, link to a more in-depth article, and code (if applicable). Enjoy the read, and let me know if I missed any important papers in the comments, or by contacting me directly on LinkedIn!
The complete reference to each paper is listed at the end of this article.
Subscribe to my newsletter — The latest updates in AI explained every week and please feel free to message me any interesting paper I may have missed!
Tag me on Twitter @Whats_AI or LinkedIn @Louis (What’s AI) Bouchard if you share the list!
Missed last year? Check this out: 2020: A Year Full of Amazing AI papers- A Review
👀 If you’d like to support my work and use W&B (for free) to track your ML experiments and make your work reproducible or collaborate with a team, you can try it out by following this guide ! Since most of the code here is PyTorch-based, we thought that a QuickStart guide for using W&B on PyTorch would be most interesting to share.
👉 Follow this quick guide, use the same W&B lines in your code or any of the repos below, and have all your experiments automatically tracked in your W&B account! It doesn’t take more than 5 minutes to set up and will change your life as it did for me! Here’s a more advanced guide for using Hyperparameter Sweeps if interested :)
🙌 Thank you to Weights & Biases for sponsoring this repository and the work I’ve been doing, and thanks to any of you using this link and trying W&B!
Access the complete list in a GitHub repository
Watch the 2021 CV rewind
Table of contents
- DALL·E: Zero-Shot Text-to-Image Generation from OpenAI [1]
- Taming Transformers for High-Resolution Image Synthesis [2]
- Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [3]
- Deep Nets: What Have They Ever Done for Vision? [bonus]
- Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image [4]
- Total Relighting: Learning to Relight Portraits for Background Replacement [5]
- Animating Pictures with Eulerian Motion Fields [6]
- CVPR 2021 Best Paper Award: GIRAFFE — Controllable Image Generation [7]
- TimeLens: Event-based Video Frame Interpolation [8]
- (Style)CLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis [9]
- CityNeRF: Building NeRF at City Scale [10]
- Paper references
OpenAI successfully trained a network able to generate images from text captions. It is very similar to GPT-3 and Image GPT and produces amazing results.
Short Video Explanation
- Paper: Zero-Shot Text-to-Image Generation
- Code: Code & more information for the discrete VAE used for DALL·E
TL;DR: They combined the efficiency of GANs and convolutional approaches with the expressivity of transformers to produce a powerful and time-efficient method for semantically guided, high-quality image synthesis.
- Paper: Taming Transformers for High-Resolution Image Synthesis
- Code: Taming Transformers
Will Transformers Replace CNNs in Computer Vision? In less than 5 minutes, you will know how the transformer architecture can be applied to computer vision with a new paper called the Swin Transformer.
- Paper: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
- Click here for the code
“I will openly share everything about deep nets for vision applications, their successes, and the limitations we have to address.”
- Paper: Deep nets: What have they ever done for vision?
The next step for view synthesis: Perpetual View Generation, where the goal is to take an image to fly into it and explore the landscape!
Short Video Explanation:
- Paper: Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image
Properly relight any portrait based on the lighting of the new background you add. Have you ever wanted to change the background of a picture but have it look realistic? If you’ve already tried that, you already know that it isn’t simple. You can’t just take a picture of yourself in your home and change the background for a beach. It just looks bad and not realistic. Anyone will just say “that’s photoshopped” in a second. For movies and professional videos, you need the perfect lighting and artists to reproduce a high-quality image, and that’s super expensive. There’s no way you can do that with your own pictures. Or can you?
- Paper: Total Relighting: Learning to Relight Portraits for Background Replacement
If you’d like to read more research papers as well, I recommend you read my article where I share my best tips for finding and reading more research papers.
This model takes a picture, understands which particles are supposed to be moving, and realistically animates them in an infinite loop while keeping the rest of the picture entirely still, creating amazing-looking videos like this one…
- Paper: Animating Pictures with Eulerian Motion Fields
Using a modified GAN architecture, they can move objects in the image without affecting the background or the other objects!
- Paper: GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields
TimeLens can understand the movement of the particles in between the frames of a video to reconstruct what really happened at a speed even our eyes cannot see. In fact, it achieves results that neither our smartphones nor any previous model could reach!
- Paper: TimeLens: Event-based Video Frame Interpolation
Subscribe to my weekly newsletter and stay up-to-date with new publications in AI for 2022!
Have you ever dreamed of taking the style of a picture, like this cool TikTok drawing style on the left, and applying it to a new picture of your choice? Well, I did, and it has never been easier to do. In fact, you can even achieve that from only text and can try it right now with this new method and their Google Colab notebook available for everyone (see references). Simply take a picture of the style you want to copy, enter the text you want to generate, and this algorithm will generate a new picture out of it! Just look back at the results above, such a big step forward! The results are extremely impressive, especially if you consider that they were made from a single line of text!
- Paper (CLIPDraw): CLIPDraw: exploring text-to-drawing synthesis through language-image encoders
- Paper (StyleCLIPDraw): StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis
- CLIPDraw Colab demo
- StyleCLIPDraw Colab demo
The model is called CityNeRF and grows from NeRF, which I previously covered on my channel. NeRF is one of the first models using radiance fields and machine learning to construct 3D models out of images. But NeRF is not that efficient and works for a single scale. Here, CityNeRF is applied to satellite and ground-level images at the same time to produce various 3D model scales for any viewpoint. In simple words, they bring NeRF to city-scale. But how?
- Paper: CityNeRF: Building NeRF at City Scale
- Click here for the code (will be released soon)
If you would like to read more papers and have a broader view, here is another great repository for you covering 2020: 2020: A Year Full of Amazing AI papers- A Review and feel free to subscribe to my weekly newsletter and stay up-to-date with new publications in AI for 2022!
[1] Ramesh, A., et al., 2021. Zero-Shot Text-to-Image Generation. arXiv:2102.12092.
[2] Esser, P., et al., 2020. Taming Transformers for High-Resolution Image Synthesis.
[3] Liu, Z., et al., 2021. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv preprint, https://arxiv.org/abs/2103.14030v1.
[bonus] Yuille, A.L., and Liu, C., 2021. Deep Nets: What Have They Ever Done for Vision? International Journal of Computer Vision, 129(3), pp. 781–802, https://arxiv.org/abs/1805.04025.
[4] Liu, A., Tucker, R., Jampani, V., Makadia, A., Snavely, N., and Kanazawa, A., 2020. Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image. https://arxiv.org/pdf/2012.09855.pdf
[5] Pandey, et al., 2021. Total Relighting: Learning to Relight Portraits for Background Replacement. doi: 10.1145/3450626.3459872, https://augmentedperception.github.io/total_relighting/total_relighting_paper.pdf
[6] Holynski, A., et al., 2021. Animating Pictures with Eulerian Motion Fields. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[7] Niemeyer, M., and Geiger, A., 2021. GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields. CVPR 2021.
[8] Tulyakov, S.*, Gehrig, D.*, Georgoulis, S., Erbach, J., Gehrig, M., Li, Y., and Scaramuzza, D., 2021. TimeLens: Event-based Video Frame Interpolation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville. http://rpg.ifi.uzh.ch/docs/CVPR21_Gehrig.pdf
[9] a) CLIPDraw: Exploring Text-to-Drawing Synthesis Through Language-Image Encoders. b) Schaldenbrand, P., Liu, Z., and Oh, J., 2021. StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis.
[10] Xiangli, Y., Xu, L., Pan, X., Zhao, N., Rao, A., Theobalt, C., Dai, B., and Lin, D., 2021. CityNeRF: Building NeRF at City Scale.
Published on 12.4.2024 in Vol 26 (2024)
Application of AI in Multilevel Pain Assessment Using Facial Images: Systematic Review and Meta-Analysis
Authors of this article:
- Jian Huo 1 * , MSc ;
- Yan Yu 2 * , MMS ;
- Wei Lin 3 , MMS ;
- Anmin Hu 2, 3, 4 , MMS ;
- Chaoran Wu 2 , MD, PhD
1 Boston Intelligent Medical Research Center, Shenzhen United Scheme Technology Company Limited, Boston, MA, United States
2 Department of Anesthesia, Shenzhen People's Hospital, The First Affiliated Hospital of Southern University of Science and Technology, Shenzhen Key Medical Discipline, Shenzhen, China
3 Shenzhen United Scheme Technology Company Limited, Shenzhen, China
4 The Second Clinical Medical College, Jinan University, Shenzhen, China
*these authors contributed equally
Corresponding Author:
Chaoran Wu, MD, PhD
Department of Anesthesia
Shenzhen People's Hospital, The First Affiliated Hospital of Southern University of Science and Technology
Shenzhen Key Medical Discipline
No 1017, Dongmen North Road
Shenzhen, 518020
Phone: 86 18100282848
Email: [email protected]
Background: The continuous monitoring and recording of patients’ pain status is a major problem in current research on postoperative pain management. In the large number of original or review articles focusing on different approaches for pain assessment, many researchers have investigated how computer vision (CV) can help by capturing facial expressions. However, there is a lack of proper comparison of results between studies to identify current research gaps.
Objective: The purpose of this systematic review and meta-analysis was to investigate the diagnostic performance of artificial intelligence models for multilevel pain assessment from facial images.
Methods: The PubMed, Embase, IEEE, Web of Science, and Cochrane Library databases were searched for related publications before September 30, 2023. Studies that used facial images alone to estimate multiple pain values were included in the systematic review. A study quality assessment was conducted using the Quality Assessment of Diagnostic Accuracy Studies, 2nd edition tool. The performance of these studies was assessed by metrics including sensitivity, specificity, log diagnostic odds ratio (LDOR), and area under the curve (AUC). The intermodal variability was assessed and presented by forest plots.
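As a generic illustration of the diagnostic metrics named above (not the review's own analysis code, which pools estimates across studies), the per-study values follow directly from a 2×2 confusion matrix:

```python
import math

def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, and log diagnostic odds ratio (LDOR)
    from a 2x2 confusion matrix: true/false positives and negatives.
    Assumes no cell is zero (meta-analyses apply continuity corrections)."""
    sensitivity = tp / (tp + fn)          # recall on truly painful cases
    specificity = tn / (tn + fp)          # recall on truly pain-free cases
    dor = (tp * tn) / (fp * fn)           # diagnostic odds ratio
    return sensitivity, specificity, math.log(dor)
```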
Results: A total of 45 reports were included in the systematic review. The reported test accuracies ranged from 0.27-0.99, and the other metrics, including the mean squared error (MSE), mean absolute error (MAE), intraclass correlation coefficient (ICC), and Pearson correlation coefficient (PCC), ranged from 0.31-4.61, 0.24-2.8, 0.19-0.83, and 0.48-0.92, respectively. In total, 6 studies were included in the meta-analysis. Their combined sensitivity was 98% (95% CI 96%-99%), specificity was 98% (95% CI 97%-99%), LDOR was 7.99 (95% CI 6.73-9.31), and AUC was 0.99 (95% CI 0.99-1). The subgroup analysis showed that the diagnostic performance was acceptable, although imbalanced data were still emphasized as a major problem. All studies had at least one domain with a high risk of bias, and for 20% (9/45) of studies, there were no applicability concerns.
Conclusions: This review summarizes recent evidence on automatic multilevel pain estimation from facial expressions and compares the test accuracy of results in a meta-analysis. Promising performance for pain estimation from facial images was established by current CV algorithms. Weaknesses in current studies were also identified, suggesting that larger databases and metrics evaluating multiclass classification performance could improve future studies.
Trial Registration: PROSPERO CRD42023418181; https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=418181
Introduction
The definition of pain was revised to “an unpleasant sensory and emotional experience associated with, or resembling that associated with, actual or potential tissue damage” in 2020 [ 1 ]. Acute postoperative pain management is important, as pain intensity and duration are critical influencing factors for the transition of acute pain to chronic postsurgical pain [ 2 ]. To avoid the development of chronic pain, guidelines were promoted and discussed to ensure safe and adequate pain relief for patients, and clinicians were recommended to use a validated pain assessment tool to track patients’ responses [ 3 ]. However, these tools, to some extent, depend on communication between physicians and patients, and continuous data cannot be provided [ 4 ]. The continuous assessment and recording of patient pain intensity will not only reduce caregiver burden but also provide data for chronic pain research. Therefore, automatic and accurate pain measurements are necessary.
Researchers have proposed different approaches to measuring pain intensity. Physiological signals, for example, electroencephalography and electromyography, have been used to estimate pain [ 5 - 7 ]. However, it was reported that current pain assessment from physiological signals has difficulties isolating stress and pain with machine learning techniques, as they share conceptual and physiological similarities [ 8 ]. Recent studies have also investigated pain assessment tools for certain patient subgroups. For example, people with deafness or an intellectual disability may not be able to communicate well with nurses, and an objective pain evaluation would be a better option [ 9 , 10 ]. Measuring pain intensity from patient behaviors, such as facial expressions, is also promising for most patients [ 4 ]. As the most comfortable and convenient method, computer vision techniques require no attachments to patients and can monitor multiple participants using 1 device [ 4 ]. However, pain intensity, which is important for pain research, is often not reported.
With the growing trend of assessing pain intensity using artificial intelligence (AI), it is necessary to summarize current publications to determine the strengths and gaps of current studies. Existing research has reviewed machine learning applications for acute postoperative pain prediction, continuous pain detection, and pain intensity estimation [ 10 - 14 ]. Input modalities, including facial recordings and physiological signals such as electroencephalography and electromyography, have also been reviewed [ 5 , 8 ]. Other studies have focused on deep learning approaches [ 11 ], and AI has been applied to pain evaluation in children and infants as well [ 15 , 16 ]. However, no review has focused specifically on pain intensity measurement, and no comparison of test accuracy results has been made.
Current AI applications in pain research can be categorized into 3 types: pain assessment, pain prediction and decision support, and pain self-management [ 14 ]. We consider accurate and automatic pain assessment to be the most important area and the foundation of future pain research. In this study, we performed a systematic review and meta-analysis to assess the diagnostic performance of current publications for multilevel pain evaluation.
This study was registered with PROSPERO (International Prospective Register of Systematic Reviews; CRD42023418181) and carried out strictly following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines [ 17 ].
Study Eligibility
Studies that reported AI techniques for multiclass pain intensity classification were eligible. Records including nonhuman or infant participants or 2-class pain detection were excluded. Only studies using facial images of the test participants were accepted. Studies using clinically applied pain assessment tools, such as the visual analog scale (VAS) and numerical rating scale (NRS), or other pain intensity indicators as the reference standard were excluded from the meta-analysis. Textbox 1 presents the eligibility criteria.
Study characteristics and inclusion criteria
- Participants: children and adults aged 12 months or older
- Setting: no restrictions
- Index test: artificial intelligence models that measure pain intensity from facial images
- Reference standard: no restrictions for systematic review; Prkachin and Solomon pain intensity score for meta-analysis
- Study design: no need to specify
Study characteristics and exclusion criteria
- Participants: infants aged 12 months or younger and animal subjects
- Setting: no need to specify
- Index test: studies that use other information such as physiological signals
- Reference standard: studies using other pain evaluation tools (eg, the NRS and VAS) were excluded from the meta-analysis
- Study design: reviews
Report characteristics and inclusion criteria
- Year: published between January 1, 2012, and September 30, 2023
- Language: English only
- Publication status: published
- Test accuracy metrics: no restrictions for systematic reviews; studies that reported contingency tables were included for meta-analysis
Report characteristics and exclusion criteria
- Year: no need to specify
- Language: no need to specify
- Publication status: preprints not accepted
- Test accuracy metrics: studies that reported insufficient metrics were excluded from meta-analysis
Search Strategy
In this systematic review, databases including PubMed, Embase, IEEE, Web of Science, and the Cochrane Library were searched until December 2022, and no restrictions were applied. Keywords were “artificial intelligence” AND “pain recognition.” Multimedia Appendix 1 shows the detailed search strategy.
Data Extraction
A total of 2 reviewers independently screened titles and abstracts and selected eligible records, and disagreements were resolved by discussion with a third collaborator. A prespecified consensus data extraction sheet was used to summarize study characteristics independently. Table S5 in Multimedia Appendix 1 shows the detailed items and explanations for data extraction. Diagnostic accuracy data were extracted into contingency tables, including true positives, false positives, false negatives, and true negatives. These data were used to calculate the pooled diagnostic performance of the different models. Some studies included multiple models, and these models were considered independent of each other.
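The step from an extracted contingency table to the diagnostic accuracy metrics pooled in this review can be sketched as follows. This is a minimal illustration with hypothetical counts, not the authors' analysis code (which was written in R):

```python
# Illustrative sketch (not the authors' code): computing the diagnostic
# accuracy metrics used in this review from one extracted contingency table.
from math import log

def diagnostic_metrics(tp, fp, fn, tn):
    """Return sensitivity, specificity, and the log diagnostic odds ratio
    (LDOR) from true/false positives and negatives."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Diagnostic odds ratio; 0.5 is added to each cell (a common continuity
    # correction) so the ratio stays defined when any cell is zero.
    dor = ((tp + 0.5) * (tn + 0.5)) / ((fp + 0.5) * (fn + 0.5))
    return sensitivity, specificity, log(dor)

# Hypothetical table for one model on one binary task
sens, spec, ldor = diagnostic_metrics(tp=90, fp=5, fn=10, tn=95)
print(round(sens, 2), round(spec, 2), round(ldor, 2))
```

Each model-task pair in the meta-analysis contributes one such table, and the per-table metrics are then pooled across tables.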
Study Quality Assessment
All included studies were independently assessed by 2 reviewers using the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2) tool [ 18 ]. QUADAS-2 assesses risk of bias across 4 domains: patient selection, index test, reference standard, and flow and timing. The first 3 domains are also assessed for applicability concerns. In the systematic review, a specific extension of QUADAS-2, namely QUADAS-AI, was used to specify the signaling questions [ 19 ].
Meta-Analysis
Meta-analyses were conducted between different AI models. Models with different algorithms or training data were considered different. To evaluate the performance differences between models, the contingency tables during model validation were extracted. Studies that did not report enough diagnostic accuracy data were excluded from meta-analysis.
Hierarchical summary receiver operating characteristic (SROC) curves were fitted to evaluate the diagnostic performance of AI models. These curves were plotted with 95% CIs and prediction regions around averaged sensitivity, specificity, and area under the curve estimates. Heterogeneity was assessed visually by forest plots. A funnel plot was constructed to evaluate the risk of bias.
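As an aside on what the SROC analysis summarizes: each extracted contingency table corresponds to one point in receiver operating characteristic (ROC) space, and the SROC curve is fitted across those points. A minimal sketch with hypothetical tables (the actual curves were fitted with the hierarchical Bayesian bivariate model in meta4diag, not this simple mapping):

```python
# Illustrative only: each contingency table (TP, FP, FN, TN) becomes one
# point in ROC space; the fitted SROC curve summarizes these points.
# The review itself used a hierarchical Bayesian model (meta4diag in R).
tables = [  # hypothetical (tp, fp, fn, tn) tables from several model-task pairs
    (90, 5, 10, 95),
    (80, 12, 20, 88),
    (97, 2, 3, 98),
]

roc_points = [
    (fp / (fp + tn), tp / (tp + fn))  # (1 - specificity, sensitivity)
    for tp, fp, fn, tn in tables
]
for x, y in roc_points:
    print(f"FPR={x:.2f}, TPR={y:.2f}")
```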
Subgroup meta-analyses were conducted to evaluate the performance differences at both the model level and task level, and subgroups were created based on different tasks and the proportion of positive and negative samples.
All statistical analyses and plots were produced using R (version 4.2.2; R Core Team) in RStudio and the R package meta4diag (version 2.1.1; Guo J and Riebler A) [ 20 ].
Study Selection and Included Study Characteristics
A flow diagram representing the study selection process is shown in Figure 1 . After removing 1039 duplicates, the titles and abstracts of a total of 5653 papers were screened, and the percentage agreement for title and abstract screening was 97%. After screening, 51 full-text reports were assessed for eligibility, among which 45 reports were included in the systematic review [ 21 - 65 ]. The percentage agreement for the full-text review was 87%. Contingency tables could not be constructed for 40 of the included studies. Meta-analyses were conducted based on 8 AI models extracted from 6 studies. The characteristics of the individual studies included in the systematic review are provided in Tables 1 and 2 . The facial feature extraction methods can be categorized into 2 classes: geometrical features (GFs) and deep features (DFs). One typical method of extracting GFs is to calculate distances between facial landmarks; DFs are usually extracted by convolution operations. A total of 20 studies included temporal information, and most of them (18) extracted it through 3D convolution of video sequences. Feature transformation was also commonly applied to reduce training time or to fuse features extracted by different methods before inputting them into the classifier. Support vector machines (SVMs) and convolutional neural networks (CNNs) were the most frequently used classifiers. Table 1 presents the model designs of the included studies.
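The geometric-feature idea mentioned above (distances between facial landmarks) can be sketched as follows. The landmark names and coordinates are hypothetical; a real pipeline would obtain them from a facial landmark detector:

```python
# Illustrative sketch of the geometric-feature (GF) idea: painful facial
# actions (eg, brow lowering, eyelid tightening) change the distances
# between landmark pairs, so those distances can serve as features.
# Coordinates below are hypothetical, not from any real detector.
from math import dist  # Euclidean distance, Python 3.8+

landmarks = {
    "left_brow": (30.0, 40.0),
    "left_eye": (32.0, 52.0),
    "mouth_left": (35.0, 80.0),
    "mouth_right": (65.0, 80.0),
}

# Distances between chosen landmark pairs form the feature vector
# that is then passed to a classifier such as an SVM.
pairs = [("left_brow", "left_eye"), ("mouth_left", "mouth_right")]
features = [dist(landmarks[a], landmarks[b]) for a, b in pairs]
print(features)
```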
a Temporal features: – indicates no temporal features; + indicates time information extracted from 2 images at different times; ++ indicates deep temporal features extracted through the convolution of video sequences.
b SVM: support vector machine.
c GF: geometric feature.
d GMM: Gaussian mixture model.
e TPS: thin plate spline.
f DML: distance metric learning.
g MDML: multiview distance metric learning.
h AAM: active appearance model.
i RVR: relevance vector regressor.
j PSPI: Prkachin and Solomon pain intensity.
k I-FES: individual facial expressiveness score.
l LSTM: long short-term memory.
m HCRF: hidden conditional random field.
n GLMM: generalized linear mixed model.
o VLAD: vector of locally aggregated descriptor.
p SVR: support vector regression.
q MDS: multidimensional scaling.
r ELM: extreme learning machine.
s Labeled to distinguish different architectures of ensembled deep learning models.
t DCNN: deep convolutional neural network.
u GSM: Gaussian scale mixture.
v DOML: distance ordering metric learning.
w LIAN: locality and identity aware network.
x BiLSTM: bidirectional long short-term memory.
a UNBC: University of Northern British Columbia-McMaster shoulder pain expression archive database.
b LOSO: leave one subject out cross-validation.
c ICC: intraclass correlation coefficient.
d CT: contingency table.
e AUC: area under the curve.
f MSE: mean squared error.
g PCC: Pearson correlation coefficient.
h RMSE: root mean squared error.
i MAE: mean absolute error.
j ICC: intraclass correlation coefficient.
k CCC: concordance correlation coefficient.
l Reported both external and internal validation results and summarized as intervals.
Table 2 summarizes the characteristics of model training and validation. Most studies used publicly available databases, for example, the University of Northern British Columbia-McMaster shoulder pain expression archive database [ 57 ]. Table S4 in Multimedia Appendix 1 summarizes the public databases. A total of 7 studies used self-prepared databases. Frames from video sequences were the most common test objects: 37 studies output frame-level pain intensity, while few measured pain intensity from whole video sequences or photographs. Studies commonly redefined pain levels into fewer classes than the ground-truth labels. For model validation, cross-validation and leave-one-subject-out validation were commonly used. Only 3 studies performed external validation. For reporting test accuracies, different evaluation metrics were used, including sensitivity, specificity, mean absolute error (MAE), mean squared error (MSE), Pearson correlation coefficient (PCC), and intraclass correlation coefficient (ICC).
Methodological Quality of Included Studies
Table S2 in Multimedia Appendix 1 presents the study quality summary, as assessed by QUADAS-2. All studies carried a risk of bias in patient selection, caused by 2 issues. First, the training data are highly imbalanced, and any method used to adjust the data distribution may introduce bias. Second, the QUADAS-AI correspondence letter [ 19 ] specifies that preprocessing of images that changes the image size or resolution may introduce bias. However, the applicability concern is low, as the images properly represent the feeling of pain. Studies that used k-fold cross-validation or leave-one-out cross-validation were considered to have a low risk of bias. Although the Prkachin and Solomon pain intensity (PSPI) score was used by most of the studies, its ability to represent individual pain levels has not been clinically validated; as such, the risk of bias and applicability concerns were considered high when the PSPI score was used as the index test. As an advantage of computer vision techniques, the time interval between the index tests was short and was assessed as having a low risk of bias. Risk proportions are shown in Figure 2 . Of all 315 entries, 124 (39%) were assessed as high risk. In total, 5 studies had the lowest risk of bias, with 6 domains assessed as low risk [ 26 , 27 , 31 , 32 , 59 ].
Pooled Performance of Included Models
In the 6 studies included in the meta-analysis, there were 8 different models. The characteristics of these models are summarized in Table S1 in Multimedia Appendix 2 [ 23 , 24 , 26 , 32 , 41 , 57 ]. Classifications of PSPI scores greater than 0, 2, 3, 6, and 9 were selected and treated as different tasks to create contingency tables. The test performance is shown in Figure 3 as hierarchical SROC curves; 27 contingency tables were extracted from the 8 models. The sensitivity, specificity, and log diagnostic odds ratio (LDOR) were calculated: the combined sensitivity was 98% (95% CI 96%-99%), the specificity was 98% (95% CI 97%-99%), the LDOR was 7.99 (95% CI 6.73-9.31), and the AUC was 0.99 (95% CI 0.99-1).
Subgroup Analysis
In this study, subgroup analysis was conducted to investigate the performance differences within models. A total of 8 models were separated and summarized as a forest plot in Multimedia Appendix 3 [ 23 , 24 , 26 , 32 , 41 , 57 ]. For model 1, the pooled sensitivity, specificity, and LDOR were 95% (95% CI 86%-99%), 99% (95% CI 98%-100%), and 8.38 (95% CI 6.09-11.19), respectively. For model 2, the pooled sensitivity, specificity, and LDOR were 94% (95% CI 84%-99%), 95% (95% CI 88%-99%), and 6.23 (95% CI 3.52-9.04), respectively. For model 3, the pooled sensitivity, specificity, and LDOR were 100% (95% CI 99%-100%), 100% (95% CI 99%-100%), and 11.55 (95% CI 8.82-14.43), respectively. For model 4, the pooled sensitivity, specificity, and LDOR were 83% (95% CI 43%-99%), 94% (95% CI 79%-99%), and 5.14 (95% CI 0.93-9.31), respectively. For model 5, the pooled sensitivity, specificity, and LDOR were 92% (95% CI 68%-99%), 94% (95% CI 78%-99%), and 6.12 (95% CI 1.82-10.16), respectively. For model 6, the pooled sensitivity, specificity, and LDOR were 94% (95% CI 74%-100%), 94% (95% CI 78%-99%), and 6.59 (95% CI 2.21-11.13), respectively. For model 7, the pooled sensitivity, specificity, and LDOR were 98% (95% CI 90%-100%), 97% (95% CI 87%-100%), and 8.31 (95% CI 4.3-12.29), respectively. For model 8, the pooled sensitivity, specificity, and LDOR were 98% (95% CI 93%-100%), 97% (95% CI 88%-100%), and 8.65 (95% CI 4.84-12.67), respectively.
Heterogeneity Analysis
The meta-analysis results indicated that AI models are applicable for estimating pain intensity from facial images. However, extreme heterogeneity existed within the models, except for models 3 and 5, which were proposed by Rathee and Ganotra [ 24 ] and Semwal and Londhe [ 32 ]. A funnel plot is presented in Figure 4 and indicates a high risk of bias.
Pain management has long been a critical problem in clinical practice, and the use of AI may provide a solution. For acute pain management, automatic measurement of pain can reduce the burden on caregivers and provide timely warnings. For chronic pain management, as specified by Glare et al [ 2 ], further research is needed, and measurement of pain presence, intensity, and quality is among the issues to be solved for chronic pain studies. Computer vision could improve pain monitoring through real-time detection for clinical use and data recording for prospective pain studies. To our knowledge, this is the first meta-analysis dedicated to AI performance in multilevel pain classification.
In this study, one model’s performance at specific pain levels was described by stacking multiple classes into one to make each task a binary classification problem. After careful selection in both the medical and engineering databases, we observed promising results of AI in evaluating multilevel pain intensity through facial images, with high sensitivity (98%), specificity (98%), LDOR (7.99), and AUC (0.99). It is reasonable to believe that AI can accurately evaluate pain intensity from facial images. Moreover, the study quality and risk of bias were evaluated using an adapted QUADAS-2 assessment tool, which is a strength of this study.
To investigate the source of heterogeneity, it was assumed that a well-designed model should show similar effect sizes across different pain levels, and a subgroup meta-analysis was conducted. The funnel and forest plots exhibited extreme heterogeneity. Each model’s performance at specific pain levels was described and summarized in a forest plot. Within-model heterogeneity was observed in Multimedia Appendix 3 [ 23 , 24 , 26 , 32 , 41 , 57 ] for all but 2 models. Models 3 and 5 differed in many aspects, including their algorithms and validation methods, but both were trained on a relatively small data set in which the proportion of positive to negative classes was relatively close to 1. Training with imbalanced data is a critical problem in computer vision studies [ 66 ]; for example, in the University of Northern British Columbia-McMaster pain data set, fewer than 10 of the 48,398 frames have a PSPI score greater than 13. We therefore emphasize that imbalanced data sets are one major cause of heterogeneity, resulting in the poorer performance of AI algorithms.
We tentatively propose minimizing the effect of training with imbalanced data by stacking multiple classes into one, a method already present in studies included in the systematic review [ 26 , 32 , 42 , 57 ]; other common methods for minimizing this bias include resampling and data augmentation [ 66 ]. The stacking method is also used in this meta-analysis to compare the test results of different studies, and it is applicable only when classes differ solely in intensity. A disadvantage of combining classes is that a model with few classes would be insufficient for clinical practice. Commonly used pain evaluation tools, such as the VAS, have 10 discrete levels; it is therefore recommended that future studies use at least 10 pain levels for model training.
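The stacking method described above can be sketched as follows, assuming hypothetical frame-level PSPI labels:

```python
# Illustrative sketch of the class-stacking method: multilevel PSPI labels
# are collapsed into a binary task ("pain above threshold" vs not), which
# both reduces class imbalance and lets results from different studies be
# compared as binary contingency tables. Labels below are hypothetical.
from collections import Counter

def stack(labels, threshold):
    """Collapse multilevel pain labels into binary classes."""
    return [int(score > threshold) for score in labels]

pspi_labels = [0, 0, 1, 2, 0, 4, 6, 0, 3, 0]  # hypothetical frame labels
binary = stack(pspi_labels, threshold=3)       # task: PSPI > 3 vs PSPI <= 3
print(Counter(binary))
```

Repeating this at several thresholds (eg, PSPI > 0, > 2, > 3) yields the set of binary tasks used to build the contingency tables in this meta-analysis.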
This study has several limitations. First, insufficient data could be included because most studies used performance metrics (eg, mean squared error and mean absolute error) that cannot be summarized into a contingency table. To create a contingency table suitable for meta-analysis, a study should report the number of samples used in each pain class for model validation, as well as the accuracy, sensitivity, specificity, and F1-score for each pain class. This table cannot be created if a study reports only the MAE, PCC, and other metrics commonly used in AI development. Second, a small-study effect was observed in the funnel plot, and the heterogeneity could not be minimized. Another limitation is that the PSPI score is not clinically validated and is not the only tool that assesses pain from facial expressions. There are other clinically validated pain intensity assessment methods, such as the Faces Pain Scale-revised, the Wong-Baker Faces Pain Rating Scale, and the Oucher Scale [ 3 ], and more databases could be created based on these tools. Finally, AI-assisted pain assessment is expected to cover larger populations, including patients who cannot communicate, for example, patients with dementia or patients with masked faces. However, only 1 study considered patients with dementia, which again reflects the limited databases available [ 50 ].
AI is a promising tool for future pain research. In this systematic review and meta-analysis, approaches using computer vision (CV) to measure pain intensity from facial images were investigated. Despite some risk of bias and applicability concerns, CV models can achieve excellent test accuracy. More CV studies in pain estimation, reporting accuracy in contingency tables, and more pain databases are encouraged for future research. Specifically, the creation of a balanced public database that contains not only healthy but also nonhealthy participants should be prioritized, and recording would ideally take place in a clinical environment. Researchers are also encouraged to report validation results as accuracy, sensitivity, specificity, or contingency tables, together with the number of samples in each pain class, to enable inclusion in meta-analyses.
Acknowledgments
WL, AH, and CW contributed to the literature search and data extraction. JH and YY wrote the first draft of the manuscript. All authors contributed to the conception and design of the study, the risk of bias evaluation, and data analysis and interpretation, and approved the final version of the manuscript.
Data Availability
The data sets generated and analyzed during this study are available in the Figshare repository [ 67 ].
Conflicts of Interest
None declared.
PRISMA checklist, risk of bias summary, search strategy, database summary, and reported items and explanations.
Study performance summary.
Forest plot presenting pooled performance of subgroups in meta-analysis.
- Raja SN, Carr DB, Cohen M, Finnerup NB, Flor H, Gibson S, et al. The revised International Association for the Study of Pain definition of pain: concepts, challenges, and compromises. Pain. 2020;161(9):1976-1982. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Glare P, Aubrey KR, Myles PS. Transition from acute to chronic pain after surgery. Lancet. 2019;393(10180):1537-1546. [ CrossRef ] [ Medline ]
- Chou R, Gordon DB, de Leon-Casasola OA, Rosenberg JM, Bickler S, Brennan T, et al. Management of postoperative pain: a clinical practice guideline from the American Pain Society, the American Society of Regional Anesthesia and Pain Medicine, and the American Society of Anesthesiologists' Committee on Regional Anesthesia, Executive Committee, and Administrative Council. J Pain. 2016;17(2):131-157. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Hassan T, Seus D, Wollenberg J, Weitz K, Kunz M, Lautenbacher S, et al. Automatic detection of pain from facial expressions: a survey. IEEE Trans Pattern Anal Mach Intell. 2021;43(6):1815-1831. [ CrossRef ] [ Medline ]
- Mussigmann T, Bardel B, Lefaucheur JP. Resting-State Electroencephalography (EEG) biomarkers of chronic neuropathic pain. A systematic review. Neuroimage. 2022;258:119351. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Moscato S, Cortelli P, Chiari L. Physiological responses to pain in cancer patients: a systematic review. Comput Methods Programs Biomed. 2022;217:106682. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Thiam P, Hihn H, Braun DA, Kestler HA, Schwenker F. Multi-modal pain intensity assessment based on physiological signals: a deep learning perspective. Front Physiol. 2021;12:720464. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Rojas RF, Brown N, Waddington G, Goecke R. A systematic review of neurophysiological sensing for the assessment of acute pain. NPJ Digit Med. 2023;6(1):76. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Mansutti I, Tomé-Pires C, Chiappinotto S, Palese A. Facilitating pain assessment and communication in people with deafness: a systematic review. BMC Public Health. 2023;23(1):1594. [ FREE Full text ] [ CrossRef ] [ Medline ]
- El-Tallawy SN, Ahmed RS, Nagiub MS. Pain management in the most vulnerable intellectual disability: a review. Pain Ther. 2023;12(4):939-961. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Gkikas S, Tsiknakis M. Automatic assessment of pain based on deep learning methods: a systematic review. Comput Methods Programs Biomed. 2023;231:107365. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Borna S, Haider CR, Maita KC, Torres RA, Avila FR, Garcia JP, et al. A review of voice-based pain detection in adults using artificial intelligence. Bioengineering (Basel). 2023;10(4):500. [ FREE Full text ] [ CrossRef ] [ Medline ]
- De Sario GD, Haider CR, Maita KC, Torres-Guzman RA, Emam OS, Avila FR, et al. Using AI to detect pain through facial expressions: a review. Bioengineering (Basel). 2023;10(5):548. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Zhang M, Zhu L, Lin SY, Herr K, Chi CL, Demir I, et al. Using artificial intelligence to improve pain assessment and pain management: a scoping review. J Am Med Inform Assoc. 2023;30(3):570-587. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Hughes JD, Chivers P, Hoti K. The clinical suitability of an artificial intelligence-enabled pain assessment tool for use in infants: feasibility and usability evaluation study. J Med Internet Res. 2023;25:e41992. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Fang J, Wu W, Liu J, Zhang S. Deep learning-guided postoperative pain assessment in children. Pain. 2023;164(9):2029-2035. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372:n71. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Whiting PF, Rutjes AWS, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med. 2011;155(8):529-536. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Sounderajah V, Ashrafian H, Rose S, Shah NH, Ghassemi M, Golub R, et al. A quality assessment tool for artificial intelligence-centered diagnostic test accuracy studies: QUADAS-AI. Nat Med. 2021;27(10):1663-1665. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Guo J, Riebler A. meta4diag: Bayesian bivariate meta-analysis of diagnostic test studies for routine practice. J Stat Soft. 2018;83(1):1-31. [ CrossRef ]
- Hammal Z, Cohn JF. Automatic detection of pain intensity. Proc ACM Int Conf Multimodal Interact. 2012;2012:47-52. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Adibuzzaman M, Ostberg C, Ahamed S, Povinelli R, Sindhu B, Love R, et al. Assessment of pain using facial pictures taken with a smartphone. 2015. Presented at: 2015 IEEE 39th Annual Computer Software and Applications Conference; July 01-05, 2015;726-731; Taichung, Taiwan. [ CrossRef ]
- Majumder A, Dutta S, Behera L, Subramanian VK. Shoulder pain intensity recognition using Gaussian mixture models. 2015. Presented at: 2015 IEEE International WIE Conference on Electrical and Computer Engineering (WIECON-ECE); December 19-20, 2015;130-134; Dhaka, Bangladesh. [ CrossRef ]
- Rathee N, Ganotra D. A novel approach for pain intensity detection based on facial feature deformations. J Vis Commun Image Represent. 2015;33:247-254. [ CrossRef ]
- Sikka K, Ahmed AA, Diaz D, Goodwin MS, Craig KD, Bartlett MS, et al. Automated assessment of children's postoperative pain using computer vision. Pediatrics. 2015;136(1):e124-e131. [ FREE Full text ] [ CrossRef ] [ Medline ]
- Rathee N, Ganotra D. Multiview distance metric learning on facial feature descriptors for automatic pain intensity detection. Comput Vis Image Und. 2016;147:77-86. [ CrossRef ]
- Zhou J, Hong X, Su F, Zhao G. Recurrent convolutional neural network regression for continuous pain intensity estimation in video. 2016. Presented at: 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); June 26-July 01, 2016; Las Vegas, NV. [ CrossRef ]
- Egede J, Valstar M, Martinez B. Fusing deep learned and hand-crafted features of appearance, shape, and dynamics for automatic pain estimation. 2017. Presented at: 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017); May 30-June 03, 2017;689-696; Washington, DC. [ CrossRef ]
- Martinez DL, Rudovic O, Picard R. Personalized automatic estimation of self-reported pain intensity from facial expressions. 2017. Presented at: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); July 21-26, 2017;2318-2327; Honolulu, HI. [ CrossRef ]
- Bourou D, Pampouchidou A, Tsiknakis M, Marias K, Simos P. Video-based pain level assessment: feature selection and inter-subject variability modeling. 2018. Presented at: 2018 41st International Conference on Telecommunications and Signal Processing (TSP); July 04-06, 2018;1-6; Athens, Greece. [ CrossRef ]
- Haque MA, Bautista RB, Noroozi F, Kulkarni K, Laursen C, Irani R. Deep multimodal pain recognition: a database and comparison of spatio-temporal visual modalities. 2018. Presented at: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018); May 15-19, 2018;250-257; Xi'an, China. [ CrossRef ]
- Semwal A, Londhe ND. Automated pain severity detection using convolutional neural network. 2018. Presented at: 2018 International Conference on Computational Techniques, Electronics and Mechanical Systems (CTEMS); December 21-22, 2018;66-70; Belgaum, India. [ CrossRef ]
- Tavakolian M, Hadid A. Deep binary representation of facial expressions: a novel framework for automatic pain intensity recognition. 2018. Presented at: 2018 25th IEEE International Conference on Image Processing (ICIP); October 07-10, 2018;1952-1956; Athens, Greece. [ CrossRef ]
- Tavakolian M, Hadid A. Deep spatiotemporal representation of the face for automatic pain intensity estimation. 2018. Presented at: 2018 24th International Conference on Pattern Recognition (ICPR); August 20-24, 2018;350-354; Beijing, China. [ CrossRef ]
- Wang J, Sun H. Pain intensity estimation using deep spatiotemporal and handcrafted features. IEICE Trans Inf & Syst. 2018;E101.D(6):1572-1580. [ CrossRef ]
- Bargshady G, Soar J, Zhou X, Deo RC, Whittaker F, Wang H. A joint deep neural network model for pain recognition from face. 2019. Presented at: 2019 IEEE 4th International Conference on Computer and Communication Systems (ICCCS); February 23-25, 2019;52-56; Singapore. [ CrossRef ]
- Casti P, Mencattini A, Comes MC, Callari G, Di Giuseppe D, Natoli S, et al. Calibration of vision-based measurement of pain intensity with multiple expert observers. IEEE Trans Instrum Meas. 2019;68(7):2442-2450. [ CrossRef ]
- Lee JS, Wang CW. Facial pain intensity estimation for ICU patient with partial occlusion coming from treatment. 2019. Presented at: BIBE 2019; The Third International Conference on Biological Information and Biomedical Engineering; June 20-22, 2019;1-4; Hangzhou, China.
- Saha AK, Ahsan GMT, Gani MO, Ahamed SI. Personalized pain study platform using evidence-based continuous learning tool. 2019. Presented at: 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC); July 15-19, 2019;490-495; Milwaukee, WI. [ CrossRef ]
- Tavakolian M, Hadid A. A spatiotemporal convolutional neural network for automatic pain intensity estimation from facial dynamics. Int J Comput Vis. 2019;127(10):1413-1425. [ FREE Full text ] [ CrossRef ]
- Bargshady G, Zhou X, Deo RC, Soar J, Whittaker F, Wang H. Ensemble neural network approach detecting pain intensity from facial expressions. Artif Intell Med. 2020;109:101954. [ CrossRef ] [ Medline ]
Edited by A Mavragani; submitted 26.07.23; peer-reviewed by M Arab-Zozani, M Zhang; comments to author 18.09.23; revised version received 08.10.23; accepted 28.02.24; published 12.04.24.
©Jian Huo, Yan Yu, Wei Lin, Anmin Hu, Chaoran Wu. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 12.04.2024.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
Computer Science > Computer Vision and Pattern Recognition
Title: Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Abstract: Recent advancements in multimodal large language models (MLLMs) have been noteworthy; yet these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with user interface (UI) screens. In this paper, we present Ferret-UI, a new MLLM tailored for enhanced understanding of mobile UI screens, equipped with referring, grounding, and reasoning capabilities. Given that UI screens typically exhibit a more elongated aspect ratio and contain smaller objects of interest (e.g., icons, texts) than natural images, we incorporate "any resolution" on top of Ferret to magnify details and leverage enhanced visual features. Specifically, each screen is divided into 2 sub-images based on the original aspect ratio (i.e., horizontal division for portrait screens and vertical division for landscape screens). Both sub-images are encoded separately before being sent to LLMs. We meticulously gather training samples from an extensive range of elementary UI tasks, such as icon recognition, text finding, and widget listing. These samples are formatted for instruction-following with region annotations to facilitate precise referring and grounding. To augment the model's reasoning ability, we further compile a dataset for advanced tasks, including detailed description, perception/interaction conversations, and function inference. After training on the curated datasets, Ferret-UI exhibits outstanding comprehension of UI screens and the capability to execute open-ended instructions. For model evaluation, we establish a comprehensive benchmark encompassing all the aforementioned tasks. Ferret-UI excels not only beyond most open-source UI MLLMs, but also surpasses GPT-4V on all the elementary UI tasks.
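The "any resolution" division described in the abstract can be sketched as a simple crop rule: split the screenshot into two sub-images along its longer axis, so each half is encoded separately at higher effective resolution. This is a minimal illustrative sketch; the function name, the exact midpoint split, and the `(left, top, right, bottom)` box format are assumptions for illustration, not details taken from the paper.

```python
def split_screen(width: int, height: int):
    """Return two (left, top, right, bottom) crop boxes covering the screen.

    Portrait screens (height >= width) get a horizontal division into
    top/bottom halves; landscape screens get a vertical division into
    left/right halves, as described in the abstract.
    """
    if height >= width:
        # Portrait: horizontal cut at the vertical midpoint.
        mid = height // 2
        return [(0, 0, width, mid), (0, mid, width, height)]
    else:
        # Landscape: vertical cut at the horizontal midpoint.
        mid = width // 2
        return [(0, 0, mid, height), (mid, 0, width, height)]

# Example: a 1170 x 2532 portrait screenshot splits into top and bottom halves.
print(split_screen(1170, 2532))
# [(0, 0, 1170, 1266), (0, 1266, 1170, 2532)]
```

Each box could then be passed to an image library's crop routine (e.g., Pillow's `Image.crop`, which accepts the same box tuple) before the two sub-images are encoded separately.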