Integrated Eye Tracking System

Eye tracking is an essential human-machine interaction modality for VR/AR, enabling technologies like foveated rendering that require high throughput (>240 FPS), a small form-factor, and visual privacy. Existing systems fall short of these goals because of three bottlenecks: (1) bulky lens-based cameras impose a large form-factor and a high communication cost between camera and processor; (2) captured images contain a lot of redundancy, since only a small portion shows the human eye; and (3) state-of-the-art segmentation and gaze estimation DNNs require up to 16G FLOPs. EyeCoD (ISCA’22; IEEE Micro Top Picks’23) tackles all three with a camera-algorithm-accelerator co-design.

Overview of the EyeCoD system

Fig.1 - Overview of the EyeCoD system: (a) a five-chip distributed prototype, (b) the FlatCam imaging workflow, © the predict-then-focus algorithm, and (d) the accelerator.

We zoom into each of these three levels—system, algorithm, and accelerator—below.

Lens-based camera vs. lensless FlatCam

Fig.2 - (a) A lens-based camera needs 10-20mm between lens and sensor. (b) The lensless FlatCam replaces the lens with a coded mask under 2mm thick, leaving room to attach the backend chip directly behind the sensor.

System level. EyeCoD replaces the lens-based camera with a lensless FlatCam, which is 5-10× thinner and >10× lighter. Removing the lens lets the backend accelerator sit right behind the sensor, cutting the camera-processor distance and communication cost, while the coded mask also makes captured measurements unrecognizable to the human eye, improving visual privacy.

Algorithm level. EyeCoD uses a predict-then-focus pipeline: after reconstructing the image from the FlatCam’s measurements, a segmentation model (RITNet) predicts the region-of-interest (ROI) around the pupil, and a gaze-estimation model (FBNet-style) estimates gaze only from that ROI. Since eyes move much slower than gaze direction, ROI prediction only needs to run once every 50 frames, while gaze estimation runs every frame on the latest ROI.

EyeCoD predict-then-focus pipeline

Fig.3 - EyeCoD’s predict-then-focus pipeline: (1) image reconstruction from FlatCam measurements, (2) ROI prediction via segmentation, and (3) gaze estimation from the cropped ROI.

Hardware level. A dedicated accelerator further exploits this pipeline. EyeCoD’s sensing-processing interface folds the segmentation model’s first layer directly into the FlatCam’s coded mask, saving both compute and sensor-to-chip bandwidth. The accelerator also introduces a partial time-multiplexing workload orchestration mode that lets the gaze estimation model either fully own the MAC lanes or run concurrently with segmentation, avoiding the resource overhead of pure time-multiplexing and the reuse loss of pure concurrent execution. A dedicated activation-buffer arrangement and an intra-channel reuse dataflow for depthwise convolutions (which cost 33.6% of runtime for only 7.9% of FLOPs) round out the design.

EyeCoD accelerator architecture

Fig.4 - The EyeCoD accelerator: (a) the proposed partial time-multiplexing mode vs. classical orchestration; (b) the activation global-buffer storage arrangement; © intra-channel reuse dataflows for depthwise layers; and (d) the overall accelerator block diagram.

Results. Benchmarked with a cycle-accurate simulator built from a silicon-validated chip, EyeCoD achieves comparable gaze estimation error (~3.2°) with 69.2% fewer FLOPs, and comparable segmentation mIOU with 16× fewer FLOPs (32× after 8-bit quantization). End to end, EyeCoD delivers 2965.8×, 12.7×, 14.8×, 2.6×, and 12.9× throughput improvements over EdgeCPU, CPU, EdgeGPU, GPU, and the prior-art CIS-GEP processor, respectively, with 8.8× better energy efficiency than CIS-GEP; a five-chip distributed version (EyeCoD_Dist) still keeps 7.7× the energy efficiency.

EyeCoD throughput and energy efficiency vs. baselines

Fig.5 - Overall comparison between EyeCoD and the baselines in terms of energy efficiency and throughput.

Follow-up: a single-chip demo. In the spirit of EyeCoD, we also built i-FlatCam, a version dedicated solely to eye tracking that fits on a single chip for live, integrated demonstration. The board below is that on-silicon test chip in action—this demo won first place in the University Demo competition at DAC 2022.

i-FlatCam single-chip demo

Fig.6 - The i-FlatCam single-chip eye tracking demo board, DAC 2022 University Demo first-place winner.

By co-designing the camera, algorithm, and accelerator together, EyeCoD shows that replacing lens-based cameras with lensless ones—while exploiting the resulting opportunities in the algorithm and hardware—can meet both the high-throughput and small-form-factor requirements that next-generation VR/AR eye tracking needs.