This research investigates how immersive computing systems unintentionally expose private user information through passive side channels that arise from the way wearable, virtual reality, augmented reality, and mixed reality devices render, mediate, and externalize user interaction. Instead of treating privacy leakage as a result of device compromise, malware, or direct sensor access, this work studies the broader attack surface created by observable signals in shared physical and virtual spaces, including body motion, avatar animation, display behavior, and optical artifacts. The research develops realistic adversary models and end-to-end inference pipelines that combine data collection, signal preprocessing, computer vision, and domain-adapted machine learning to recover sensitive information from noisy, indirect observations. Across immersive environments, these signals can reveal private inputs, authentication behavior, spoken content, application context, and on-screen information, even when the adversary has no privileged access to the device or platform. This line of work spans IRB-approved human-subject studies, synthetic and real-world dataset construction, and evaluations across commercial AR, VR, and MR platforms to characterize how physical, visual, behavioral, and perceptual design choices can become privacy leakage channels. By identifying these risks as systemic properties of immersive interaction rather than isolated vulnerabilities, this research aims to guide the design of architectural, software-level, and user-interface defenses that reduce information exposure while preserving usability, realism, and social presence in future immersive systems.

Avatar-Based Side-Channel Attacks: Inferring Speech from Lip Motion in Social VR Platforms


Overview

Social Virtual Reality (VR) platforms increasingly rely on expressive avatar facial animation to enhance presence and communication. While these features improve user experience, they may unintentionally expose sensitive information through visual side channels. This project investigates whether avatar lip motion in multi-user VR environments constitutes a viable side channel for inferring spoken content. We propose a passive visual adversary model in which an attacker observes only the rendered lip movements of a target avatar, without access to audio, internal telemetry, or network traffic. We construct a synthetic dataset of 402 avatar videos in which six AI-generated avatars speak 17 passages and 50 phonetically optimized sentences, and fine-tune AV-HuBERT, a self-supervised audio-visual speech recognition model, on video-only input.

Attack Model

The steps involved in the end-to-end avatar lip-motion inference pipeline follow.

  • Synthetic Dataset Construction (6 AI avatars, 402 videos, 17 passages + 50 sentences)
  • VR Data Collection via Passive Screen Capture in the Immersed Platform
  • Face Detection and Mouth ROI Extraction (MediaPipe, 96x96 grayscale patch)
  • Temporal Stabilization and Landmark Smoothing
  • Fine-Tuned AV-HuBERT Visual Speech Recognition (video-only, audio masked)
  • Evaluation under Three Conditions: Unseen Avatar, Unseen Content, Real VR Speakers
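
The stabilization and ROI steps above can be sketched in a few lines. This is a minimal illustration and not the project's implementation: it assumes mouth landmarks are already available as pixel coordinates (for example, from a face-landmark detector), uses an exponential moving average as one possible smoothing choice, and takes a fixed 96x96 crop centred on the mouth. The function names and the `alpha` parameter are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates


def smooth_landmarks(frames: List[List[Point]], alpha: float = 0.5) -> List[List[Point]]:
    """Exponential moving average over per-frame landmark lists.

    alpha close to 1 follows the raw landmarks; alpha close to 0 smooths more.
    (Assumed smoothing scheme; the source does not name one.)
    """
    smoothed: List[List[Point]] = []
    for pts in frames:
        if not smoothed:
            smoothed.append(list(pts))  # first frame has no history
            continue
        prev = smoothed[-1]
        smoothed.append([
            (alpha * x + (1 - alpha) * px, alpha * y + (1 - alpha) * py)
            for (x, y), (px, py) in zip(pts, prev)
        ])
    return smoothed


def mouth_roi(pts: List[Point], frame_w: int, frame_h: int,
              size: int = 96) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of a size x size box centred on
    the mean of the mouth landmarks, clamped to stay inside the frame.
    Assumes the frame is at least size pixels in each dimension."""
    cx = sum(x for x, _ in pts) / len(pts)
    cy = sum(y for _, y in pts) / len(pts)
    half = size // 2
    left = min(max(int(round(cx)) - half, 0), frame_w - size)
    top = min(max(int(round(cy)) - half, 0), frame_h - size)
    return left, top, left + size, top + size
```

Smoothing before cropping keeps the patch from jittering between frames, which matters because the recognition model sees only the cropped mouth region.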

LightLeaks: Optical Side-Channel Attack on Mixed Reality Waveguide Displays


Overview

Optical see-through (OST) mixed reality glasses, such as the Microsoft HoloLens 2, Magic Leap 2, and Snap Spectacles, use transparent waveguide-based near-eye displays that project virtual imagery into the wearer's field of view. LightLeaks identifies and exploits an inherent physical side channel of this architecture: light unintentionally leaking outward from the waveguide during normal operation. A passive bystander equipped with only a consumer camera, positioned at approximately 1 meter under ambient indoor lighting, can capture these optical emissions and recover private on-screen content, identify active applications, and infer authentication inputs, all without device compromise, malware, or specialized equipment. This vulnerability is architectural and affects the entire class of diffractive waveguide-based AR/MR displays.

LightLeaks Model

The steps involved in the end-to-end LightLeaks attack pipeline follow.

  • Video Capture via Consumer Camera (~1 m standoff)
  • Facial Landmark-Based Region of Interest (ROI) Extraction
  • Domain-Adapted Scene Text Detection (ESTextSpotter, fine-tuned on waveguide imagery)
  • Fragment-Aware LLM Reconstruction (temporal voting, cross-frame stitching, noise rejection)
  • Content Recovery, Application Fingerprinting, and Credential Inference
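
The temporal voting idea in the reconstruction step can be illustrated with a much simpler stand-in than an LLM. The sketch below assumes that the per-frame readings of one text line have already been aligned to the same character positions and that unreadable characters are marked with a placeholder; both are assumptions made for illustration, and `temporal_vote`, `UNKNOWN`, and `min_votes` are hypothetical names. It shows only majority voting with a basic noise threshold, not cross-frame stitching.

```python
from collections import Counter
from typing import List

UNKNOWN = "?"  # hypothetical placeholder for a character the detector could not read


def temporal_vote(readings: List[str], min_votes: int = 2) -> str:
    """Combine per-frame readings of the same text line by per-position
    majority vote. A character is kept only if at least min_votes frames
    agree on it; otherwise the position is left as UNKNOWN."""
    if not readings:
        return ""
    width = max(len(r) for r in readings)
    out = []
    for i in range(width):
        # Count only frames that actually read something at position i.
        votes = Counter(r[i] for r in readings if i < len(r) and r[i] != UNKNOWN)
        if not votes:
            out.append(UNKNOWN)
            continue
        char, count = votes.most_common(1)[0]
        out.append(char if count >= min_votes else UNKNOWN)
    return "".join(out)
```

For example, four partial readings such as "PIN: 4?21", "P?N: 4721", "PIN: 4721", and "PIN: 47?1" combine into "PIN: 4721", while a character seen in only one frame is rejected as noise.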

Hidden Reality: Video-Based Side-Channel Attack on Wearable (VR) Devices


Overview

Hidden Reality (HR) is a video-based side-channel attack showing that, although the virtual screen of a VR device is not directly visible to an adversary, indirect observations such as the user's hand gestures can be exploited to steal private information. On average, the Hidden Reality model correctly deciphers over 75% of text inputs.

Hidden Reality Model

The steps involved in implementing the Hidden Reality attack model across the attack scenarios follow.

  • Video Preprocessing
  • Localization and Hand Landmark Tracking
  • Click Detection
  • Character Inference
  • Word Prediction
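
The last three steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the Hidden Reality implementation: it assumes hand tracking has already produced, for each frame, a scalar "press depth" for the fingertip and a 2-D fingertip position in the keyboard plane, treats a click as a local depth peak above a threshold, maps each click to the nearest key centre, and corrects the result against a small vocabulary by character mismatches. All function names, the `layout` format, and the threshold are hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def detect_clicks(depth: List[float], threshold: float) -> List[int]:
    """Frame indices where press depth is a local maximum above threshold.
    A flat peak spanning several frames is reported once, at its first frame."""
    clicks = []
    for i in range(1, len(depth) - 1):
        if depth[i] >= threshold and depth[i] > depth[i - 1] and depth[i] >= depth[i + 1]:
            clicks.append(i)
    return clicks


def infer_char(point: Point, layout: Dict[str, Point]) -> str:
    """Return the key whose centre is closest to the fingertip position."""
    return min(
        layout,
        key=lambda ch: (layout[ch][0] - point[0]) ** 2 + (layout[ch][1] - point[1]) ** 2,
    )


def decode(depth: List[float], positions: List[Point],
           layout: Dict[str, Point], threshold: float) -> str:
    """Click detection followed by per-click character inference."""
    return "".join(infer_char(positions[i], layout) for i in detect_clicks(depth, threshold))


def predict_word(chars: str, vocabulary: List[str]) -> str:
    """Pick the same-length vocabulary word with the fewest mismatched
    characters; fall back to the raw characters if none has that length."""
    candidates = [w for w in vocabulary if len(w) == len(chars)]
    if not candidates:
        return chars
    return min(candidates, key=lambda w: sum(a != b for a, b in zip(w, chars)))
```

Separating character inference from word prediction lets the same click detector serve both free-text entry, where a vocabulary helps, and PIN or password entry, where the raw characters are the result.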

Datasets

This research uses a combination of real-world and synthetic data collection to study privacy and security risks in AR/VR and mixed-reality environments. All human-subject data collection was conducted with university IRB approval and involved registered volunteer participants performing realistic interaction tasks across multiple immersive platforms.

For the Meta Quest 2 study, videos were recorded while participants entered information on a virtual screen. A total of 368 short video clips were collected across several attack scenarios, including password entry, PIN entry, graphical-lock pattern entry, text entry, and email entry.

For the Microsoft HoloLens 2 study, 10 volunteer participants were recorded from a 1-meter distance using a Nikon Coolpix P950 camera and an iPhone SE. Participants performed naturalistic tasks such as email reading, web browsing on platforms including Google, Wikipedia, YouTube, and Amazon, as well as credential entry involving PINs and passwords. A preliminary evaluation was also conducted on the Magic Leap 2 from a 3-meter standoff distance to assess generalizability across waveguide-based mixed-reality platforms.

To complement the real-world recordings, a synthetic dataset of 402 videos was generated using HeyGen, a commercial AI avatar generation platform. The dataset included six avatars with variation in facial geometry, skin tone, and lip shape. Additional real-world VR data was collected from five participants in the Immersed platform, recorded from an adversarial viewpoint at approximately 1–2 meters in a shared virtual meeting room. Participants spoke a novel passage not included in synthetic training, enabling a fully out-of-distribution evaluation.
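
The unseen-avatar and unseen-content conditions can be expressed as a split over a clip manifest. The sketch below assumes that every avatar speaks every item, which matches the reported count (6 x (17 + 50) = 402) but is not stated explicitly; the identifiers and the choice of held-out avatar and items are hypothetical.

```python
from itertools import product
from typing import List, Set, Tuple

Clip = Tuple[str, str]  # (avatar_id, item_id)


def build_manifest(n_avatars: int = 6, n_passages: int = 17,
                   n_sentences: int = 50) -> List[Clip]:
    """One clip per (avatar, item) pair, assuming a full cross product."""
    avatars = [f"avatar{i}" for i in range(n_avatars)]
    items = ([f"passage{i}" for i in range(n_passages)]
             + [f"sentence{i}" for i in range(n_sentences)])
    return list(product(avatars, items))


def split(manifest: List[Clip], held_out_avatars: Set[str],
          held_out_items: Set[str]) -> Tuple[List[Clip], List[Clip], List[Clip]]:
    """Return (train, unseen_avatar, unseen_content).

    Clips that are held out on both axes are dropped so that each test
    set isolates exactly one source of novelty."""
    train = [c for c in manifest
             if c[0] not in held_out_avatars and c[1] not in held_out_items]
    unseen_avatar = [c for c in manifest
                     if c[0] in held_out_avatars and c[1] not in held_out_items]
    unseen_content = [c for c in manifest
                      if c[0] not in held_out_avatars and c[1] in held_out_items]
    return train, unseen_avatar, unseen_content
```

The real-VR condition needs no split logic of this kind, since the Immersed recordings use different speakers and a passage outside the synthetic set.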

Related Work

I Know What You Enter on Gear VR

The attack model by Ling et al. predicts passwords typed by users from 3D video recordings of the headset and videos of fingertip taps on the touchpad of the Samsung Gear VR headset.

Face-Mic: Inferring Live Speech and Speaker Identity via Subtle Facial Dynamics Captured by AR/VR Motion Sensors

Face-Mic is a motion-sensor-based speech eavesdropping attack that infers highly sensitive information from live human speech, such as speaker gender, identity, and speech content.

A Keylogging Inference Attack on Air-Tapping Keyboards in Virtual Environments

A key inference attack on in-air typing on AR devices that uses data from the device's built-in motion sensors to track the user's hand movements.

LipNet: End-to-End Sentence-Level Lipreading

One of the first models to achieve sentence-level lipreading using spatiotemporal convolutions with recurrent sequence modeling and CTC loss, establishing end-to-end visual speech recognition as feasible.

AV-HuBERT: Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction

A self-supervised audio-visual speech model pretrained on LRS3 that achieves near-human performance on visual speech recognition benchmarks and serves as the core model in this attack pipeline.

GAZEploit: Remote Keystroke Inference Attack by Gaze Estimation from Avatar Views in VR/MR Devices

Exploits gaze information from avatar renderings during video streams to infer keystrokes, demonstrating that avatar-rendered visual cues constitute a practical side channel for input inference.

Remote Keylogging Attacks in Multi-User VR Applications

Demonstrates keystroke inference via motion data leaked through network packets in multi-user VR platforms, achieving over 97% accuracy from a co-present passive adversary.

Hidden Reality: Caution, Your Hand Gesture Inputs in the Immersive Virtual World are Visible to All!

Demonstrates that hand gestures in immersive VR can reveal user inputs to external cameras, establishing video-based side channels as a practical threat in VR environments.

Holologger: Keystroke Inference on Mixed Reality Head Mounted Displays

Exploits inertial and spatial telemetry from MR headsets to infer keystrokes without any visual access to the display.

Going Through the Motions: AR/VR Keylogging from User Head Motions

Infers keystrokes from head-movement patterns recorded by headset sensors during text entry in AR/VR environments.

It's All in Your Head(set): Side-Channel Attacks on AR/VR Systems

Combines multimodal sensor data from AR/VR headsets to recover sensitive user actions without any visual access to the display.