Second Sight: Using brain-optimized encoding models to align image distributions with human brain activity

Reese Kneeland (rek@umn.edu)

University of Minnesota Department of Computer Science

Ghislain St-Yves (gstyves@umn.edu)

University of Minnesota Department of Neuroscience

Jordyn Ojeda (ojeda040@umn.edu)

University of Minnesota Department of Computer Science

Thomas Naselaris (nase0005@umn.edu)

University of Minnesota Department of Neuroscience

Abstract

Two recent developments have accelerated progress in image reconstruction from human brain activity: large datasets that offer samples of brain activity in response to many thousands of natural scenes, and the open-sourcing of powerful stochastic image generators that accept both low- and high-level guidance. Most work in this space has focused on obtaining point estimates of the target image, with the ultimate goal of approximating literal pixel-wise reconstructions of target images from the brain activity patterns they evoke. This emphasis belies the fact that there is always a family of images equally compatible with any evoked brain activity pattern, and the fact that many image generators are inherently stochastic and do not by themselves offer a method for selecting the single best reconstruction from among the samples they generate. We introduce a novel reconstruction procedure (Second Sight) that iteratively refines an image distribution to explicitly maximize the alignment between the predictions of a voxel-wise encoding model and the brain activity pattern evoked by any target image. We use an ensemble of brain-optimized deep neural networks trained on the Natural Scenes Dataset (NSD) as our encoding model, and a latent diffusion model as our image generator. At each iteration, we generate a small library of images and select those that best approximate the measured brain activity when passed through our encoding model. We then extract semantic and structural guidance from the selected images and use it to generate the next library. We show that this process converges on a distribution of high-quality reconstructions by refining both semantic content and low-level image details across iterations. Images sampled from these converged image distributions are competitive with state-of-the-art reconstruction algorithms. Interestingly, the time to convergence varies systematically across visual cortex, with earlier visual areas generally taking longer and converging on narrower image distributions than higher-level brain areas. Second Sight thus offers a succinct and novel method for exploring the diversity of representations across visual brain areas.

Introduction

In the quest to bridge the gap between human cognition and computational models, the challenge of reconstructing visual stimuli from measured brain activity has rapidly gained scientific attention. Recent advances have propelled the field forward, notably the availability of large datasets capturing brain responses to thousands of natural scenes and the advent of powerful, stochastic image generators. Current reconstruction approaches aim to generate point estimates of target images, striving for pixel-wise accuracy. However, this goal overlooks the inherent ambiguity in brain activity patterns, each of which can correspond to a family of plausible images.

Addressing this nuance, we introduce Second Sight, a novel reconstruction procedure that iteratively refines a distribution of images to better align with brain activity patterns. This method leverages an ensemble of brain-optimized deep neural networks, trained on the Natural Scenes Dataset (NSD), and a latent diffusion model for image generation. By selecting images that closely approximate the measured brain activity and extracting semantic and structural guidance from them, Second Sight iteratively enhances the quality of reconstructions. This process not only refines semantic content but also improves low-level image details, converging on a distribution of high-quality reconstructions.

The significance of Second Sight lies in its ability to produce images that are competitive with state-of-the-art reconstruction algorithms without a traditional decoding model: its encoding-only design offers a more nuanced way to understand and replicate the complex relationship between visual stimuli and brain activity.

Methods

Our methodology centers on "brain-optimized" encoding models that minimize loss in the space of brain activity, rather than in the space of high- or low-level feature representations. We train deep neural networks to predict brain activity patterns in response to visual stimuli, and use their predictions to guide the selection of image distributions that align with the target brain activity. Specifically, we begin by finding the most "brain-aligned" images in a large image library (COCO) and use them as the starting point for sequentially smaller synthetic libraries generated by a diffusion model. By incrementally narrowing the scope of these image distributions and optimizing them against the target pattern of brain activity over multiple search iterations, we minimize uncertainty about structural details while enhancing the semantic fidelity of the reconstructions.
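To make the search loop concrete, here is a minimal sketch of one way such an iterative refinement could be implemented. The interfaces `encode_to_brain`, `generate_library`, and `extract_guidance` are hypothetical stand-ins for the ensemble encoder, the diffusion generator, and the guidance-extraction step described above; they are assumptions for illustration, not the released code.

```python
# A minimal sketch of the Second Sight stochastic-search loop. All interfaces
# here are assumptions: `encode_to_brain` maps an image to predicted voxel
# activity, `generate_library` samples images from the diffusion model given
# structural (z) and semantic (c) guidance, and `extract_guidance` distills
# new (z, c) seeds from the selected images.
import numpy as np

def brain_correlation(predicted, measured):
    """Pearson correlation between predicted and measured voxel activity."""
    p = predicted - predicted.mean()
    m = measured - measured.mean()
    return float(p @ m / (np.linalg.norm(p) * np.linalg.norm(m) + 1e-8))

def stochastic_search(measured_activity, z0, c0, encode_to_brain,
                      generate_library, extract_guidance,
                      n_iterations=6, library_size=100, top_k=5):
    z, c = z0, c0  # seeds produced by the Library Assembler stage
    best_image = None
    for i in range(n_iterations):
        # Reduce the sampling strength each iteration so the synthetic
        # library narrows around the measured brain activity.
        strength = 1.0 - i / n_iterations
        library = generate_library(z, c, n=library_size, strength=strength)
        # Score every candidate by its alignment with the measured activity.
        scores = [brain_correlation(encode_to_brain(img), measured_activity)
                  for img in library]
        top = [library[j] for j in np.argsort(scores)[-top_k:]]
        best_image = top[-1]  # highest brain correlation this iteration
        # Extract new semantic and structural guidance for the next library.
        z, c = extract_guidance(top)
    return best_image
```

In practice the candidate score would come from an ensemble of brain-optimized encoders rather than a single model, but the selection logic is the same.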

Second Sight pipeline diagram. The pipeline consists of two stages. In the first stage (Library Assembler) we assemble the feature vectors $z_0$ and $c_0$ that seed the second stage (Stochastic Search), in which we iteratively align an image distribution to brain activity. The numbered and lettered components of each stage are detailed in the full paper.
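As an illustration of how the Library Assembler might produce the seeds $z_0$ and $c_0$, the sketch below ranks a natural-image library by brain correlation and pools the latents of the best matches. The `vae_encode` and `clip_embed` names, and the simple averaging step, are assumptions for illustration rather than the paper's exact procedure.

```python
# Hypothetical sketch of the Library Assembler: rank a large natural-image
# library (e.g., COCO) by brain correlation and distill the top matches into
# the structural seed z_0 and semantic seed c_0. `vae_encode` and `clip_embed`
# are assumed stand-ins for the diffusion model's two guidance spaces.
import numpy as np

def assemble_seeds(measured_activity, library_images, encode_to_brain,
                   vae_encode, clip_embed, top_k=16):
    scores = np.array([
        np.corrcoef(encode_to_brain(img), measured_activity)[0, 1]
        for img in library_images
    ])
    top = [library_images[i] for i in np.argsort(scores)[-top_k:]]
    # Averaging the latents of the most brain-aligned images is one simple
    # way to pool their shared structure and semantics into a single seed.
    z0 = np.mean([vae_encode(img) for img in top], axis=0)
    c0 = np.mean([clip_embed(img) for img in top], axis=0)
    return z0, c0
```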

Results

Our results demonstrate that Second Sight produces semantically meaningful reconstructions that reflect low-level details of target images, such as pose, and that it matches or surpasses current state-of-the-art (SOTA) systems in proximity to ground-truth images and in alignment with brain activity patterns. We achieve this without any traditional decoding model, instead leveraging the diversity of neural representations in the brain through an encoding-only approach. Our analysis also reveals a notable degeneracy in brain activity patterns across visual cortex: a single pattern of brain activity can correspond to a distribution of highly aligned images. This finding underscores the complexity of visual processing in the brain and the capability of Second Sight to navigate that complexity and produce a fine-tuned distribution of reconstructed images. Finally, Second Sight's ability to quantify invariance across different regions of visual cortex provides insight into the structure and function of cortical areas: by examining the rate at which reconstructions converge to parity with the ground-truth image, Second Sight sorts brain areas in a manner that agrees with complementary measures of invariance, highlighting the role image reconstruction algorithms can play in advancing our understanding of activity in visual cortex.

Comparative assessment of reconstructions. The first row shows the ground truth image, indicated by the red background; the second row shows Second Sight reconstructions; the third row shows the "Best Library Image" approach; the remaining rows show subject 1 results from the previous works of Ozcelik et al., Takagi et al., Lu et al., and Gu et al.

(A) Correlation of predicted and actual brain activity (brain correlation) for the top image at each iteration, for different ROIs (curves). Dashed lines indicate the score of the target ground truth image in each ROI. (B) Percentage of samples in the test set for which reconstructions achieve a brain correlation score at or above parity with the score of the ground truth image (y-axis), for each of the indicated ROIs (x-axis) and across iterations (colors). (C) Cumulative percentage of reconstructions that achieve a brain correlation score at or above parity with the score of the ground truth image (y-axis) across iterations (x-axis), compared with "unrefined" reconstruction methods (shapes). This plot uses only samples that reach parity with the score of the ground truth image in all brain areas.
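Panels (B) and (C) rest on a simple parity criterion: a reconstruction reaches parity in an ROI when its brain correlation matches or exceeds that of the ground truth image itself. A minimal sketch of that bookkeeping, with assumed names and array shapes, is given below.

```python
# Parity bookkeeping for one ROI: the fraction of test samples whose
# reconstruction scores at or above the ground truth image's own brain
# correlation. Names and shapes are illustrative assumptions.
import numpy as np

def parity_rate(recon_corrs, gt_corrs):
    """recon_corrs, gt_corrs: brain correlations of shape (n_test_samples,)."""
    recon_corrs = np.asarray(recon_corrs)
    gt_corrs = np.asarray(gt_corrs)
    return float(np.mean(recon_corrs >= gt_corrs))

# Illustrative values only, not results from the paper:
print(parity_rate([0.41, 0.29, 0.37], [0.38, 0.31, 0.33]))  # -> 0.666...
```

Applying this per ROI and per iteration yields the percentages plotted in (B); accumulating across iterations yields the curves in (C).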

Conclusion

The Second Sight methodology, underpinned by brain-optimized encoding models, offers a new approach to image reconstruction from brain activity. By iteratively refining image distributions, it enhances the semantic content and structural details of reconstructions, aligning them more closely with human brain activity. Using brain-optimized encoding models trained on the Natural Scenes Dataset (NSD) and a latent diffusion model, our process progressively optimizes generated images against brain activity patterns, refining reconstructions to more accurately reflect the original visual stimuli. Second Sight's ability to produce reconstructions that rival state-of-the-art (SOTA) algorithms across various feature spaces, and to set new standards in alignment with brain activity patterns without a decoding model, underscores the impact of this novel approach. This nuanced understanding of the complex relationship between visual stimuli and brain activity, achieved without explicitly minimizing loss in any specific feature space, represents a significant leap forward in our ability to bridge the gap between human cognition and computational models.