Transcript
Pangilinan: My name is Erin Pangilinan. The title of the session originally was multidimensionality. I know that's a really big word. It's really just about creating XR with AI. That's me with my book, which I published in 2019. I spent most of 2018 writing it. Some of the information is a little dated, but there's still a lot of content that's a pretty solid foundation, so you can find it on Amazon and on Safari, which is the platform from O'Reilly.
Timeline (AR and VR) – 2016 to the Present
I just want to start: how many folks have ever developed for AR or VR before? How many have developed for AI? I'm going to rewind back about 8 years to 2016. One of my favorite experiences actually wasn't a favorite because it was in the Microsoft HoloLens, which didn't have the greatest field of view; it was a data visualization by IBM that I saw in the HoloLens. This is my friend Rostin on my left; you could see his hands doing a lot of the hand tracking, and you could interact with the data visualization. If you look at the very edge of the screen, you can see a split screen, and there it's a Jupyter Notebook. In order to change the data visualization and manipulate the data, you couldn't do it directly in the headset. I want to fast forward, but I published a whole pipeline on how that's engineered in the book, so you can check it out. It's really just the ETL pipeline: how you would take data or a machine learning visualization, that process, and put it into different headsets.
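Just to make that concrete, here is a minimal, hypothetical sketch of that kind of ETL step: extract tabular data, transform it into 3D point positions, and load it into a payload a headset client could render. The column names and output schema are assumptions for illustration, not the exact pipeline from the book.

```python
# Hypothetical ETL sketch: tabular data -> normalized 3D points -> JSON payload
# that a headset client (Unity app, WebXR scene, etc.) could fetch and render.
import json
import pandas as pd

def extract(csv_path: str) -> pd.DataFrame:
    """Extract: read the raw dataset."""
    return pd.read_csv(csv_path)

def transform(df: pd.DataFrame) -> list[dict]:
    """Transform: map three numeric columns onto x/y/z and normalize to [0, 1]."""
    cols = ["feature_a", "feature_b", "feature_c"]  # assumed column names
    scaled = (df[cols] - df[cols].min()) / (df[cols].max() - df[cols].min())
    return [
        {"x": float(r.feature_a), "y": float(r.feature_b), "z": float(r.feature_c)}
        for r in scaled.itertuples()
    ]

def load(points: list[dict], out_path: str) -> None:
    """Load: write the payload the headset visualization reads."""
    with open(out_path, "w") as f:
        json.dump({"points": points}, f)

# load(transform(extract("data.csv")), "pointcloud.json")
```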
Fast forward to 2024, this is me at a hackathon with Apple Vision Pro. The advances in hardware that we see from both Meta and Apple are really about pass-through. That's what we call it, mixed reality or AR. You're actually able to code in the editor. How many people have an Apple Vision Pro? You can code with Runestone, with Xcode, with any IDE of choice. Ergonomically it's not the best because it's still a little bit heavy, but it is possible now. This is a startup my friend Jason Marsh founded, Flow Immersive, where he was actually utilizing the ChatGPT APIs and Whisper, so speech to text, to be able to query a data visualization you could create within VR. You can also do it in AR now. A lot has changed in the last couple of years that you weren't able to do before.
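As a rough sketch of that voice-to-query flow, assuming the OpenAI Python SDK (v1 client), with the model names and prompt as illustrative assumptions rather than Flow Immersive's actual pipeline:

```python
# Speech-to-text with Whisper, then a chat model turns the question into an
# instruction for the 3D data visualization. Model names are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def transcribe(audio_path: str) -> str:
    """The user speaks a question in the headset; Whisper transcribes it."""
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(model="whisper-1", file=f)
    return result.text

def query_visualization(question: str, dataset_summary: str) -> str:
    """Ask the chat model to turn the question into an update for the chart."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed; any chat-capable model works
        messages=[
            {"role": "system",
             "content": "You turn user questions into filters and highlights for a 3D data visualization."},
            {"role": "user",
             "content": f"Dataset: {dataset_summary}\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

# command = query_visualization(transcribe("question.wav"), "sales by region, 2016-2024")
```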
I also want to end this opening part with the keynote from Fei-Fei Li. How many people know Dr. Fei-Fei Li, professor at Stanford, known for ImageNet? Her next startup, which she has co-founded with the authors of the NeRF paper, how many people know about NeRFs? We'll talk about that a little bit later. Fei-Fei was previously the chief scientist of Google Cloud. Her new startup, World Labs, raised about half a billion dollars to focus on foundation models for what we call spatial intelligence. That's really important. Her TED Talk, which I suggest you check out, talks about how we're really utilizing anything that we categorize as AR, VR, MR, XR, spatial computing, IoT, Internet of Things, including BCI. The layers of abstraction are being removed.
You couldn't do this before, but there are now Apple Vision Pro users with ALS, similar to the person with the EEG cap, also with ALS, able to directly manipulate objects through brainwaves, which they wouldn't otherwise be able to do. There's a lot that's changing now and merging between AI and the spectrum, for the sake of the terminology, XR for short. There's me and Fei-Fei.
My thesis is that the present moment and the future of frontend engineering, anything to do with design and human-computer interaction, is XR. The future of backend, for a lot of the backend developers, to me, is AI and automation, or AI that is assisting humans, not displacing them, but augmenting human intelligence. What's really interesting to me is that maybe about 8 years ago, I was talking to people about this concept for my book.
Back then, a lot of people still didn't have an understanding of AI, saying, VR is really separate. It actually really isn't; it's pretty central. How many people play Pokémon GO? What's enabling that is computer vision algorithms, or what we call SLAM, Simultaneous Localization and Mapping. AI extends the human body. What's really big, because of the advent of ChatGPT, obviously, is that it's expanding human capability for what I like to call collaboration, generation, and automation. Collaboration's been established for quite a long time. While you're in a VR headset or an AR headset, you want to achieve telepresence: you and another individual 3,000 miles away can really feel like you're there together. That's been pretty established. It's been the use case for quite some time. What is very much new in terms of our tooling for content creation is generation and generative AI.
Scope
Before I get to the rest of this talk, I'm just going to go over some terminology. This is a figure from my book that talks about the spectrum of what we have. You'll notice different companies, whether it's Meta or Apple, debate: is it spatial computing? Is it XR? Is it AR? Is it MR? It's changed over the years. Microsoft and Intel used to call it MR. It's everything in between. I'm just going to say XR as a catch-all, all-inclusive term. You can see this is me: how people know me as Erin in reality, with my headsets on, and in the middle is a picture of me in 2014 with style transfer algorithms applied. If you remember the initial paper, I think it was CycleGAN that was applied to this. In 2017, you would take a photo of yourself, take a style, like Monet, and apply it to the image so that your picture could look more like a Monet. That's in the middle.
Then in VR, or in a simulated reality, in Web3, there's my Ready Player Me avatar on my left. To my right, my Pokémon GO one. In the middle, that's my work as a fellow for Oculus Launchpad, which was a diverse content creator program by Meta, at the time called Oculus, specifically for the Oculus Rift before the Oculus Quest or Meta Quest had come out. My project was actually to create a dataset for precolonial Philippine writing systems. How many people have ever tried Google Tilt Brush, the painting app? Apple Vision Pro had that as well. We were doing that at the time, but trying to figure out, could we teach people our language, similar to Chinese and Korean, and have it be statistically significant enough to create datasets for this? My book co-author Steve Lukas, who's actually the former director of product at Qualcomm, helped me create this, along with our other co-editor, Vasanth Mohan, who many people know as FusedVR; he has like 20K subscribers on YouTube and tutorials on how to actually do this.
Back then it was still pretty tough, and they didn't really understand the piece with AI. Only now, after Steve has left his job, he tells me he's doing hackathons recognizing Chinese. It just goes to show you that a lot of people, typically game designers and people who are in the VR space, didn't have a very deep understanding of AI unless you had trained models yourself, or you knew AlphaGo and AlphaZero beating a professional poker player, a chess player, WoW, Dota. Unless you were in that class of people, a lot of people still really didn't understand AI, but that is now changing.
How I like to think people know me now is by my new emojis. If you don't have Apple Intelligence yet, I encourage you to get it. You can actually do this in a couple of different ways. You can use your Apple Vision Pro with a scan of yourself. Here it just took a photo of myself and did image recognition to apply it to an emoji. You can see, with an LLM, I'm able to query OpenAI's ChatGPT and create my own emoji, which is great.
The book, like people have said before, is also in Chinese and Korean. If you don't want to pay for it on Amazon, it's mostly on Google Scholar, though not my chapters. There's a ton of great information there. I'm really proud to say that it was adopted as a curriculum by Univision for their initial virtual reality experiences. I am Filipino-American. I've also done a lot of development in that community, not only for Filipino-Americans in tech with my own nonprofit, but I've also spoken to Oculus and Meta several times with their employee resource groups, the Asian-American and Filipino ones. I've taught a lot of courses between Cisco and a lot of women in tech groups over the years. I previously was leading the Women in Machine Learning and Data Science paper reading group.
To kick off the rest of the talk, I'll go over the history of AR and VR and the different development stacks that we've had, how it's changed with AI, and different use cases. I'll briefly touch on privacy and what we want to call AI safety or responsible AI, and I want to preface this here: have zero expectations. It's a really difficult space. There are some initial philosophies and frameworks used, but there's a lot of ground that still needs to be covered in creating new resources, new tools for people to actually develop on.
History
Briefly on the history, this is from NVIDIA's GTC with Jensen Huang, one of my favorite CEOs. I know it's a little bit harder to see here, but if you look to the left, it's the history of computing, computers in the 1960s, and then I'm going to fast forward to where he goes through all the different types of AI and algorithms that are working in tandem with the hardware. What's missing in this graphic, though, obviously, is all of the years that we had what we call VR winter, just like AI winter. There were decades where stuff wasn't being developed much, or it just died. Some people say that VR actually started with theater and cave painting. My friend Kent Bye, who's the host of the Voices of VR podcast, talks about that. Most people like to point towards the 1960s with the Sword of Damocles. NASA likes to claim a lot of their work in the 1980s and 1990s, and then there was the advent of the Virtual Boy.
For me, most famously, there's the $2 billion acquisition of Oculus by Meta, at the time Facebook. We've had so many headsets since then. Then, obviously, the new ones I've shown. I like to think, and this is referencing another talk regarding Google Maps and AR, take us back to the '90s: you're going to take a ride with your family on a road trip, how would you get there? You'd look up the location and, if you're lucky, get a good printout of several maps on MapQuest, if it wasn't old-school paper maps before that. Now you can use Google Maps, Google AR. Google Earth VR was one of my favorite experiences. How many people have tried that, who don't work for Google? You should try it. It's great.
The question for me, which is like the Holy Grail people have been begging for: Michael Abrash, who was the chief scientist of Oculus at the time, and is still there, now Meta Reality Labs, as we call it, was saying we could probably have gloves and glasses in 10 years. The glasses came sooner than I thought. The gloves, we actually don't need, because they have hand tracking, which is great. I published this in 2018, and I'm so happy that it's finally out. Now we're seeing a bunch of different headset manufacturers also competing in the space. It's not Spectacles, maybe not the greatest field of view, but it works untethered, unlike the Magic Leap.
My co-author, Steve, also used to work at Magic Leap, so I'd make fun of him, like, your field of view sucks, not the greatest, just like the HoloLens. How many people have tried out Microsoft HoloLens before? How about the Apple Vision Pro? How about Snap Spectacles? Meta glasses, not Orion, just the glasses themselves? You can try out a demo at your local Ray-Ban store, or you can actually book a demo at the Burlingame office.
Then, to the right, which is rumored, there's Samsung's partnership with Google and Qualcomm for their glasses, and they're calling them AI glasses. Obviously, XR and AI are much more intersecting and together, not separate. This actually points towards the very last piece that I think we're trying to get to with AI. Some people call it AGI, Artificial General Intelligence, or ambient computing, where there's just AI, or whatever you think of with XR, a robot everywhere.
The way that I like to think of this as a designer: how many people know Don Norman's work, The Design of Everyday Things? When we think about his work, it's like, how do we design a doorknob? How do you design something that is so ubiquitous that it's just always there? A pair of glasses? Will it be contacts? I will say that Google acquired a company called Eyefluence, this was maybe 8 years ago. There's a lot being developed secretly. I'd say with Google, there's no more Google Glass. I don't think folks can talk about it publicly when I talk to the head of Google UX labs, but things are being worked on. If you're not a fan of the headsets or the glasses, maybe it's contacts, and if it's not, maybe it's public holograms.
XR and AI Tech Stack: Evolution and Use Cases
How has the design and development stack for XR and AI changed over time? It has changed dramatically. This is a slide from my friend Jasmine Roberts, who's now at Microsoft Research. She used to be at Unity. You had to master all of these things. You're looking at frontend development, backend development. You could do web and native. It's a ton to do. The way that I like to think of it is four layers of the stack. If you're not a hardware developer, and you're like, I don't really care about the environment, you're just going to do really basic stuff in programming. How many people here are C#, C++ developers? Python, JavaScript? More WebXR people. That's great for accessibility. Swift, Objective-C? I think a lot of these things have changed over time.
Some products are not as common anymore, and different formats have changed over the years. The way I like to think of it comes from a phrase from my friend Charlie Fink, who many people know as one of the lead animators of The Lion King. My favorite quote is basically from his book: how the world will be painted with data. When you're doing augmented and virtual reality work, you're basically just restructuring data all the time. That's actually closer to AI. Most people don't think of it that way.
In terms of XR, it's maybe far more complicated because there are so many layers of AI, actually. I used to think the hardest thing was NLP. There were actually three distinct classes of algorithms, and probably more, but I'll categorize it into three here just for the sake of simplicity. Computer vision, and that ecosystem didn't necessarily talk to reinforcement learning, or RL, so anyone that's doing game AI. Then NLP was this thing that I just thought, this is so hard.
There were a lot of developments over the years that I could go through of how hard that work was that is now immensely easier. When we look at the stack now, instead of just Unity and Unreal, you have so many different agents. How many people use Microsoft Copilot? Do you love it? Is it cool? Is it error-prone? Like, hallucinations everywhere? It's tough. Could I use a Cursor or a Copilot to create my entire AR or VR app? I've probably tried at least six foundation models trying to create a data visualization app or even a productivity app, and it is so hard. Why? Because it is error-prone. There are hallucinations. Still, it is aiding the process: a lot of people can now access what they couldn't do before, with low-code tools the development process is much faster, and we're more productive.
The way that I like to see it, the software creation cycle has now changed, where it isn't just, you must learn Unity and Unreal to do this. I have a video that I'll show later, at least a screencap, of an 8-year-old who is able to use React-three-fiber and Claude to create his own game. That's all web-based, which you can actually port onto a Meta Quest 3. There are a lot of different ways that this has changed.
Normally what you'll do is select your favorite operating system. Not everyone is a fan of Apple, not everyone's a fan of Google or Meta. Pick your poison, pick your language, and pick your game engine. Now I think the entire layer of game engines is being completely abstracted away and changing, as well as two things that are very different. Not just SLAM, but we're also looking at human data as input and how that's feeding into AI and AGI, which I know is a stretch. I'll say more about that later, about how I think robotics is changing everything, especially with NVIDIA. All of that is to be able to create an XR app. What's different, as you can see in purple, is foundation models. How many people here use ChatGPT every single day? Claude, maybe? Other models, small models, which ones? People don't love LLMs as much as I do here.
The way that I see this is that instead of just the software programming layer, you have AI agents as the next layer. This is how I actually think it's changed over the years and what we're trying to achieve: ambient computing, AGI, and the last frontier to me is reasoning. Why reasoning? Because AIs have to actually think. If you've heard of terms like chain of thought, it's really hard to get a robot to think like a human. Dylan had mentioned how hard just walking and balancing really is. That's something that NVIDIA is actively working on, just trying to do the basics of motion and robotics and how we perceive objects. What we're starting with is the manual programming layer.
Then, going back to our keynote, and this is subjective for me: is prompt engineering really engineering? Subjective. It is much more accessible to do AI development now than it was 10 years ago, when I was told that in order to be a data scientist or a machine learning engineer, you'd have to have a PhD. That is not the case. You have an 8-year-old playing with Cursor and Claude and actually able to write Python. That wasn't the case maybe 10 years ago.
Now we've advanced a lot, but there's still a lot of work to be done. These are the other classes of algorithms, if people care to look up the papers. I'll briefly go over a little bit on foundation models. I met Sam Altman many years ago, and Ilya Sutskever and Andrej Karpathy many different times, but in a very different era of my life. Back then, I would say in 2018, I was in a room at a party with Andrej Karpathy. I asked him, why did you decide to work at Tesla? He said, because I believe Tesla is a force for good in the world. Really funny. I had also asked Ilya Sutskever, I said, there's this book by Ian Goodfellow called "Deep Learning", and a class of algorithms called GANs, that was the big thing. I loved everything about style transfer and generative adversarial networks. He said, it's a thing, but is it really our thing? That was their response. They really had no idea about the wave that was coming once they actually released ChatGPT into the wild. The paper that started it all was 'Attention Is All You Need,' in 2017.
Then fast forward with OpenAI, which was founded in 2015. They finally came out with GPT-3, and then GPT-3.5 in 2022. The last few years, it's grown, obviously, with every major competitor: Google and Meta with Llama, and Claude from Anthropic, which I like to call the fork of OpenAI over their issues around safety, and which has done really groundbreaking work with constitutional AI. Then you also have, late to the game, Google Gemini. Obviously, they should have capitalized on this earlier, because the actual paper was originally Google's. Before Gemini, I think it was Bard, actually, right in the middle. I'm sure all of you have maybe read this if you work in AI. I definitely encourage all the XR developers who don't know about AI, please read this paper. It's a great paper, 'Attention Is All You Need.' It talks about the transformer architecture.
Multidimensionality and Multimodality
The one I mentioned earlier was CycleGAN. For people who don't know what that is, I'll actually show you. This was in my class, I think, at fast.ai in 2017. This was called pix2pix, in PyTorch. You had, to my left, a horse, and then: can I make it look like a zebra? That's the goal.
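As a toy, hypothetical sketch of that horse-to-zebra idea: run an image through a generator network and save the translated result. The tiny network below is a stand-in only; a real CycleGAN or pix2pix generator is a ResNet- or U-Net-style model whose trained weights you would load from a checkpoint.

```python
# Minimal image-to-image inference skeleton in PyTorch (untrained placeholder generator).
import torch
import torch.nn as nn
from PIL import Image
from torchvision import transforms

class TinyGenerator(nn.Module):
    """Placeholder for a trained CycleGAN/pix2pix generator."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Tanh(),
        )
    def forward(self, x):
        return self.net(x)

to_tensor = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
    transforms.Normalize([0.5] * 3, [0.5] * 3),  # map to [-1, 1] to match Tanh output
])

def translate(image_path: str, generator: nn.Module) -> Image.Image:
    x = to_tensor(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        y = generator(x)[0]
    y = (y * 0.5 + 0.5).clamp(0, 1)              # back to [0, 1] for saving
    return transforms.ToPILImage()(y)

gen = TinyGenerator()
# gen.load_state_dict(torch.load("horse2zebra_generator.pt"))  # hypothetical checkpoint
# translate("horse.jpg", gen).save("zebra.jpg")
```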
Now we've seen that the classes of algorithms between NLP, computer vision, and RL have really merged. Now, getting back to programming with XR. In my book, a former head of emerging design at Adobe, Silka Miesnieks, talks about this: in order to achieve telepresence, or what we want as the Holy Grail within anything that you do in XR, whether you're trying to teleport yourself to another realm and escape into another reality, what you're actually trying to do is the same thing that I think AI is trying to do, which is to be more human, to be more connected. She coined this term called sensory design. We think about multimodality, multimodal models. It's all about what I think Dylan talked about, which is multiple senses. You have, to my left, sight, in the middle, hearing, and to the right, touch. Minus smell, which I don't know how many of you have ever tried in VR, and taste. We haven't been able to crack that. It's a little bit fuzzy.
With these different senses, we've managed to capture everything that we know about computer vision, about NLP, speech to text, audio. That has been achieved. If you look to the bottom, I think about the problems in robotics and AGI that we're actually trying to solve. Movement is something that is really hard to do. An example that NVIDIA is working on is the robot hand. It's really hard for a robot to grasp and pick something up with five fingers. It's not exactly the same. They're now using XR to train a bunch of models to be able to do that and create more synthetic data, so you don't need an entire workforce, similar to what Elon Musk does at SpaceX, to train the next generation of AI, so that you get closer to creating a fleet of robots.
Then you have proprioception, which is spatial awareness. For folks that study perception deeply, this idea of perceiving is something that is inherently human. What makes an AI understand, like computer vision, which is why we call it that? How does it understand an object, a scene, an environment? Then the last word, which is really interesting to me, is agency. Why do we say that? There are AI agents everywhere. It's about control and choice, which is one of the reasons why, in the rest of this talk, I'll also talk about privacy, because there are a lot of guardrails that haven't been developed even in XR yet, and some that were not there by default, which I think also needs to change.
This is a graphic talking about all of the different multimodalities. The most interesting thing to me, because most people have worked with text, imaging, audio, a little bit more 3D, and there's a boom in video now with the advent of Sora or even Meta's new papers, which have really great editing capabilities, is actually protein structures and DNA sequences. How many people know the paper AlphaFold from Google DeepMind? Really great stuff. My initial startup many years ago was advancing Alzheimer's research with AR, VR, and AI. I was trying to figure out protein folding and 3D reconstruction of brain imaging. Hundreds of patients, that's petabytes of data, and trying to do that within a headset so that researchers could gain more insight and advance research in the field. Obviously not as great as DeepMind; they have definitely boomed in that. Just the idea of human data as input has now really evolved and changed. With that comes a lot of responsibility with AI ethics as well.
Of the modalities, the one that is probably the next frontier, motion, is going to be worked on heavily by NVIDIA. This startup called Runway, based out of New York, actually shows how they're applying video to video with Apple Vision Pro and style transfer. It's not like you can query an LLM and have it go create your AR or VR application right away; you'll have to do a lot more work than that. But you can take a video within your Apple Vision Pro and now apply a style to it. That was not any part of our development cycle before. Really interesting stuff is coming out. I'll show more of those videos at the end.
Before that, I will also talk about the terminology within XR of how AI and human input as data make up what we like to call in AI the human in the loop. This is how it actually works in XR. This graphic is by my friend Avi, formerly at Microsoft, who did work on the HoloLens as well as the Apple Vision Pro. He talks about the layers of reality on the bottom, interacting with people. Where we like to focus the majority of this talk is really the applications and experiences, the OSes, and the models. We as designers and developers have a huge responsibility to think about the ethical consequences of the data that we're using to feed these pipelines in these applications and experiences, whether it's in a headset, on a pair of glasses, and so forth. I mentioned a little bit more about perception earlier. I also encourage you to read a great book called "The VR Book", by Jason Jerald. This is the cycle of perception to actual real human understanding.
The goals of VR, as he's laid them out, are actually the same goals that you have in AI. You start with perception after you've done a lot of manual work in pre-processing, like with your data in Python or JavaScript or whatnot. Then the agent, just like a human, recognizes that data and goes through the rest of the loop. Where I think we stop is getting back into processing and transaction. That loop is being closed right now because of input data and because of foundation models. It's not perfect yet.
One of the other great graphics that I'll mention, just because I did reference Kent Bye earlier about how this all began with VR. Is it cave painting? Is it theater? He has a really great graphic on how we understand perception and how memory actually develops. I actually did a hackathon with IBM many years ago where we were trying to recreate something very similar to the film "Ready Player One", of how we develop memory.
The goal of XR with AI is the same thing. It's really about the human brain: how do we develop memory? How do we capture the reality that we have so that you can reference it later? Which also gets really scary, because if you put on a pair of AR glasses, it'll remember now, like, where did I put my keys? It'll know where they are. It's capturing environment data. I'll also reference Percy Liang, who's doing some great work. Not at UC Berkeley, that was the class I'm taking on reasoning at Berkeley; he's actually a professor at Stanford. He talks about the feedback loop that you have with perception, storing memory and data, and retrieving that data. How many of you are developing RAG applications right now? That's the thing that we're also trying to do within AR and VR, along with the action, reflection, and planning. For people who are XR developers that have no idea how AI works, it's a really basic process of what you're trying to achieve when you're creating an AI agent. It's the same philosophy and concept as what you're trying to do. XR and AI are much more related.
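For the XR developers who haven't built one, here is a minimal retrieval sketch of that store-and-retrieve loop: embed "memories" of what was perceived, pull back the most relevant ones for a new question, then hand them to a foundation model for the action and planning step. The embedding model name and the in-memory store are assumptions for illustration.

```python
# Minimal RAG-style retrieval over embedded "memories" using cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

memories = [
    "User placed their keys on the kitchen counter at 9:02am.",
    "User's water bottle is on the desk next to the monitor.",
    "Meeting notes from Tuesday are pinned above the whiteboard.",
]
memory_vectors = model.encode(memories, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Retrieve step: rank stored memories by cosine similarity to the question."""
    q = model.encode([question], normalize_embeddings=True)[0]
    scores = memory_vectors @ q
    top = np.argsort(-scores)[:k]
    return [memories[i] for i in top]

context = retrieve("Where did I put my keys?")
# The retrieved context would then be passed to a foundation model
# ("Given these memories: ... answer the question") for the action/planning step.
print(context)
```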
Generative AI
How do we create XR with the explosion of AI? Like I said before, it's modifying how we design and develop so that it's more naturally human. Now you can query with natural language, and with NLP you're able to do much more in the interfaces than you were able to do before. Our tool stack is completely changing. I've updated the stack a little bit. Maybe back in the day you had a mocap suit for human data; now we have much more real-world environment data that's being blended into that entire spectrum of XR.
What's really new, though, obviously, is avatars: characters used to be just NPCs, but now they're becoming active AI agents. Like I said before, XR and AI together are by default multimodal, multisensory, and multidimensional. It's actually quite difficult, but much easier to do now. It increases collaboration and engagement with users. It makes us more human and expressive. Automation makes tasks that are really mundane much easier. I'll start with assets, environments, and generation.
This is the video from Runway I mentioned earlier, taken with an Apple Vision Pro. Imagine an office in New York, and now you have a style that's basically transferred onto the entire video and environment while you're experiencing it.
Germanidis: It's really amazing work done with video to video. Video to video allows you to take an input video, which might be something you shoot with your camera, and then maintain the overall structure of the video but completely transform the style. In this case, you have a video that's taken with a Vision Pro from our office in New York. You can see how you can completely transform it. The use cases for that are so vast, like the ability to, I like to call it actually augmented reality. You can really create something extremely compelling and completely transform what you're seeing in real time.
Pangilinan: Then, as I mentioned, Ben Mildenhall is a co-founder of World Labs with Fei-Fei Li. He and the other co-founders had authored this paper on NeRFs.
Mildenhall: You can visualize the view dependent effects captured by our representation, such as the semi-transparent appearance of the olive oil bottle and specularities on the stove, reflections on this room’s television, specularities on the conference table, and sharp reflections and specularities on this red car’s body and windshield.
Pangilinan: You'll see in the NeRF paper, or what's known as radiance fields, the idea of depth and perception being developed through computer vision. That paper came out how many years ago? There are so many iterations of it now. That's definitely something to follow if you are working in computer vision.
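For reference, the core of that representation is the volume rendering integral from the NeRF paper, which composites a learned density and view-dependent color along each camera ray; in LaTeX:

```latex
% Expected color C(r) of a camera ray r(t) = o + t d, rendered from the learned
% radiance field (density \sigma and view-dependent color c), as in the NeRF paper:
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt,
\qquad
T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)
```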
Spatial Intelligence
Spatial intelligence. Do people understand what spatial intelligence means when I say that phrase? Some people do. Who really doesn't understand it? Imagine you have an Apple Vision Pro and I'm playing with the other Erin in her Apple Vision Pro, and I have this object, my water bottle, in the scene. There's a startup called Remix that is able to scan my entire room and environment, and she's able to digitally pick up my water bottle in her environment. There's a lot of innovation and development going on when we talk about computer vision and AI as they intersect with XR. That's something to look out for in terms of emerging startups.
Now we're going to talk about characters, avatars, and agents. How many people have read the VFusion3D paper from Meta AI Research, or FAIR? It's a very new paper, but a lot of people tend to know it. You can talk to an LLM and say, "LLM or foundation model of your choice, I would like to create a Pikachu with a hat on". They're able to do that, create a 2D image, and from that 2D image create a 3D character and mesh, which you can see here. That's really cool work. Then, how many people have tried Inworld? Also very cool stuff. My friend Kevin Lin, who is a co-founder of Twitch, is also backing it. You're able to talk to your NPC and have it talk back to you in a more photorealistic way. The latency of the response from the LLM is much faster than most. There are a lot of people developing this right now. NVIDIA is actually more at the forefront there, and they have a partnership with Inworld for developing this.
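As a hedged sketch of that text to 2D image to 3D mesh flow: the image step below uses the OpenAI SDK, while `image_to_mesh` is a hypothetical placeholder standing in for a single-image-to-3D model such as VFusion3D, whose actual API differs and lives in its research repo.

```python
# Text -> 2D concept image -> (hypothetical) 3D mesh reconstruction.
from openai import OpenAI

client = OpenAI()

def generate_concept_image(prompt: str) -> str:
    """Step 1: ask an image model for a 2D concept image; returns an image URL."""
    result = client.images.generate(model="dall-e-3", prompt=prompt, size="1024x1024")
    return result.data[0].url

def image_to_mesh(image_url: str, out_path: str) -> str:
    """Step 2 (hypothetical): feed the image to an image-to-3D model and save a mesh."""
    raise NotImplementedError("Stand-in for a VFusion3D-style reconstruction step.")

# url = generate_concept_image("a cartoon electric mouse character wearing a hat")
# image_to_mesh(url, "character.obj")
```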
“Warms up to the player. He will lean forward to indicate interest. I have some insider information about one of your competitors. I’d love to meet up later and discuss”.
Diego: Sounds intriguing. Unfortunately, my schedule is quite packed with conference events. Maybe we can discuss over dinner?
Pangilinan: Maybe not the most fun experience, but they're developing a lot more games. I would say, for all of the script writing when you're doing a narrative: every single person that I know who is an artist, and yes, some of them are members of the Authors Guild, the ones that are most excited about AI are using tools like this for their games and their films. I'm actually working confidentially with another filmmaker right now, who used to work in AI, who's using Runway.
Code Generation
My last favorite topic: code generation. How many people are using Claude? I'm not sure if people have seen this demo, but my friend Meng To, who's known for his work as a computational designer, started off as a frontend developer in Singapore. He couldn't get his visa in the United States, so he taught himself mobile development and released a really great book called "Design+Code", which became a really popular MOOC.
Right now, about 90% of the Python code he's using to create a web video editor is created by Claude. It's really amazing stuff. He's able to generate, I think, voices, and create images that are the assets for the video, the entire pipeline. He probably built this in only a couple of weeks since the introduction of Claude. It just goes to show you how powerful the foundation models are, even for someone who, again, maybe doesn't have 10 years of experience, or isn't a core game developer or even a video editor, but is able to quickly create tools with foundation models. How can this be applied in AR and VR? I challenge people to actually create this, and confidentially, I'm still trying to prototype a lot of it: what if you could create Unreal Engine, or recreate your own game engine, from scratch? You're not having to use Unity. That's great for cross-platform development. Unreal is really great for film.
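As a small sketch of the kind of pipeline being described, stitching generated image assets and a generated voiceover into a clip; this assumes moviepy 1.x, and the file names are placeholders for whatever your generation step produces.

```python
# Assemble pre-generated image assets and a voiceover track into a video (moviepy 1.x assumed).
from moviepy.editor import AudioFileClip, ImageSequenceClip

def assemble_video(frame_paths: list[str], voiceover_path: str, out_path: str) -> None:
    narration = AudioFileClip(voiceover_path)                 # generated voice track
    per_frame = narration.duration / len(frame_paths)         # spread frames across the narration
    clip = ImageSequenceClip(frame_paths, durations=[per_frame] * len(frame_paths))
    clip = clip.set_audio(narration)
    clip.write_videofile(out_path, fps=24, codec="libx264", audio_codec="aac")

# assemble_video(["frame_001.png", "frame_002.png", "frame_003.png"],
#                "voiceover.mp3", "scene.mp4")
```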
Sometimes you just want something really quick and dirty on the web. You can actually do that now. His 8-year-old was able to use Claude and React-three-fiber, which is one of the open-source frameworks. How many people are React developers or are on the frontend? I started off as a React developer many years ago. That's gone through many iterations with React VR, and since that time, Three.js. How many people use Three.js? Some of my computer graphics people on the web. Some people. You're able to use those classes and different frameworks to create a game. If an 8-year-old can do that, I'm pretty sure you can apply this to AR and VR. It's just going to take some time.
Automation
In the last few slides, I'm going to talk about automation. To my left, you can see this user using an Apple Vision Pro to capture his hand movement, and he's actually training a robot with that. It's very similar to the work that we saw in Dylan's talk. Now, it's actually really manual to do that, so at NVIDIA they've taken that data, been able to feed it into AI and foundation models, and created their own synthetic data, and it's bringing us much closer to AGI. I know AGI is a very broad term, and I don't think it's going to be achieved in maybe thousands of days, as Sam says. I think the human brain is actually much more complex, if you're going to solve how the human brain operates and stores memory; just think about Alzheimer's research and how hard that is. Maybe not quite yet, but will it capture what we understand in terms of perception and motion? Absolutely. It's being done right now, and it's being done in tandem between XR and AI, so it's really exciting stuff that they're doing. The last frontier is reasoning.
Challenges and Solutions
I'll end with challenges and solutions. With XR, a lot of people I know don't develop for it because the headset is too heavy. Are there people here who've never developed for XR? What are some of the other reasons that you don't like it? Affordability. It's really expensive. How many people own an Apple Vision Pro? It's like $3,500. Is it accessible? I can at least say that Oculus used to have rental programs with libraries for the Oculus Rift. I hope they expand that. I don't think I'm going to pay $10,000 for Meta's AR glasses, Orion. At least for the purposes of this track, it's much more about designing inclusive UI.
You all being here is actually really important: educate yourselves on how to create new applications and experiences that put the human first. Design should be first, not last. Ethics should be thought of first when you're selecting your foundation model and your data, not last. That's something very different between the XR community and AI that I hope will change. There's going to be much more culturally relevant content, which I think will invite more diverse developers and designers into the space as well.
Some challenges with AI. We've got a lot of work to do; even with the advent of ChatGPT, it's not perfect. A lot of people are worried about the displacement of jobs through automation for a lot of manual tasks and skills. People are afraid. There's definitely a lot of data regulation and guardrails that aren't there. As many people know, it's highly inaccurate. You'll probably talk to an LLM and then it won't stop unless you tell it to. It often hallucinates. It'll invent another framework that didn't exist. It's a little bit tougher. Fear not, there's still a lot of hope. Obviously, with the displacement of jobs, there will also be new jobs created with AI.
As you know, many tracks focus on RAG, Retrieval-Augmented Generation, as well as fine-tuning. Are those techniques solid? Subjective. Does it have to be a large language model we all have to use, locked in to OpenAI, or Microsoft Copilot, or AutoGen, or whatever? People are coming out with small models. Are they as accurate, as perfect? No. But the fact that Mistral, for example, has done amazing work is still really impressive. Andrej would disagree with me.
If you go back to the transformer architecture and really get under the hood of things, it is an evolution of what we can actually do. Many people don't have AI products fully in production. It's a lot of prototype phase. With XR, we're in that phase as well. We're still in the early prototype phase where it's not being launched everywhere, but it'll eventually get there. I also have faith in humanity. We're smarter than robots. There's actually a startup called Guardrails AI. Shout out to them and many other people working in this space. I like to not go into the deep, dark territory of "Black Mirror". How many people have ever developed for Web3, blockchain, or crypto? There are some good people there that are actually working actively on privacy.
This actually comes from a PhD student of my friend, Professor Dawn Song, who's in the Department of Computer Science at UC Berkeley. She's known for decentralized finance, federated AI, and things of that nature. Her PhD student created this UI on the left where you're able to toggle all the different types of data that you're actually inputting into your dataset, like eye gaze and hand tracking data. It's really funny that it came from the blockchain space. I did not expect that, but the title of their paper was, 'Going Incognito in the Metaverse.' This is something that I think can be built on as an example of how we apply a philosophical framework: I really want to have my own data, maybe locally stored on a headset, and maybe only opt in a certain amount of data. You can do that now.
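As a conceptual sketch of that kind of opt-in toggling (not the paper's actual implementation): the user consents to specific sensor streams, and anything not opted in is dropped before it ever leaves the headset.

```python
# Conceptual consent gate for XR sensor streams: default-private, explicit opt-in per channel.
from dataclasses import dataclass

@dataclass
class PrivacySettings:
    # Default to the most private setting: nothing is shared until opted in.
    share_eye_gaze: bool = False
    share_hand_tracking: bool = False
    share_room_mesh: bool = False
    share_audio: bool = False

CHANNEL_FLAGS = {
    "eye_gaze": "share_eye_gaze",
    "hand_tracking": "share_hand_tracking",
    "room_mesh": "share_room_mesh",
    "audio": "share_audio",
}

def filter_outgoing(samples: dict, settings: PrivacySettings) -> dict:
    """Keep only the sensor channels the user has explicitly opted in to share."""
    return {
        channel: data
        for channel, data in samples.items()
        if getattr(settings, CHANNEL_FLAGS.get(channel, ""), False)
    }

settings = PrivacySettings(share_hand_tracking=True)  # user opts in to hands only
outgoing = filter_outgoing(
    {"eye_gaze": "<gaze frames>", "hand_tracking": "<hand poses>", "room_mesh": "<scan>"},
    settings,
)
print(list(outgoing))  # only 'hand_tracking' survives
```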
Open-source models. As much as I love Sam, and Ilya before he left — I still love him regardless — closed models for a lot of people don't work. This graphic comes from Percy Liang of Stanford. This is his talk and lecture at UC Berkeley about why foundation models are important. Why? Because when you get underneath the hood of these different things, you actually can't do really core computer science. You're at the level of a cognitive scientist, just being able to query an API. You're not getting the full model, or even its weights, to be able to do meaningful work. For XR developers, what does it mean? It just means you need to be very thoughtful about the foundation model that you choose to develop your AI agent on.
Philosophies. I'm going to end with these slides by my friend Kent Bye. If you have never heard of his podcast, he's amazing. He wrote a praise quote for my book, and he's interviewed hundreds of people over the years, all about VR. This is a very extensive list of the biometric data that you're actually opting into with a headset. He references Meta's papers on things like eye gaze. I've probably surrendered my eyeballs to Apple Vision Pro and to Meta to some degree; you can actually do a lot of sentiment analysis and a lot more than I would like in terms of privacy. Some people find it invasive and don't like that. They actively choose not to use or develop for a Meta Quest headset. They'll use PICO or something else.
This is just an FYI, if you're going to develop for those platforms, on some of the privacy concerns that people have. With Apple Vision Pro, they do have a partnership with ChatGPT, but I honestly can't even get Core ML to work for handwriting recognition on a paper planner for my productivity app. Why? Because you have to opt in to so many different things to be able to surrender your data. Apple does still care about privacy. I know that's subjective for some folks here, because they're not always the greatest on open source, but there are guardrails in place. Yes, maybe they have my eye tracking data, but they also have a lot of other guardrails for third-party developers that are much more conscious of human-centered design. This is the other paper I mentioned from Meta.
How do we think about privacy for the future? Kent continues with the four elements, but it's all about what we call presence. How am I really here, but could also really be somewhere else within VR? That's a part of mental presence. What's being messed with for some people, and why they won't use a headset, is emotional presence. It's really important to think about privacy, but also to develop something practical out of a philosophical framework for how you create for that. Lastly, this is harder, because I used to work in public policy. I used to work for the founder of Tech for Obama and Congressman O'Connor, who's the co-chair of the Bernie Sanders campaign. Economy, laws, culture, and earth: these are the things that maybe in these rooms you don't talk about very much with regard to privacy.
As designers and developers, we really focus on design guidelines all the way through user experience. They're not divorced from these other areas. You actually do have to think about those things when you're selecting your particular development stack of choice, whether you're choosing a headset, a language, an application, or your audience. I'd like to remind you, you have agency. We're not just creating AI agents. You as a developer should feel empowered to create the future with XR and AI.
There are some resources. Fast.ai, in that initial class when I took it, was a little bit in doomer territory, where maybe 90% of the class was "we hate Facebook", was the message I got. At the time they had model cards. Constitutional AI hadn't been developed yet. Anthropic hadn't even been created yet. There are a lot of other resources that people have created for responsible AI and ethics that I think the XR community really needs to take note of. Many people here work on standards with W3C and Khronos. There's a lot of work to be done.
Questions and Answers
Participant: I'd love to hear a little bit more about the challenges of creating XR content. We talked about there being a big learning curve for Unity and some of this other software, as opposed to the 8-year-old being able to make something. What do you see as the current limitations of AI when it comes to empowering people to make XR content? What are going to be the limits of somebody using an AI to try to speak their idea into existence as a full XR application, as opposed to somebody going through a more traditional design and development cycle?
Pangilinan: We talk about this a lot. The word here that is actually really key is interoperability. Like I said, headset of choice. If you're going to choose C#, you're thinking about maybe writing once and distributing everywhere. That's their tagline. With C++, a lot of folks, like I know in film, are optimizing for 3D rendering. My co-founder produced Star Wars episodes 1 and 2. They're used to very expensive Hollywood film budgets. A lot of people who have that money are spending it on Unreal Engine. Most people don't necessarily want to learn C# and C++ to do that. A lot of our audience today are web developers. Meta, actually, I was really surprised at Meta Connect, they put in a good amount of time for WebXR. That's actually really exciting.
If JavaScript's your first language, I know for some people, they don't think it's a language; it's a scripting language, just like CSS. It's just the UI. That's actually really important, because, like I said with that 8-year-old, you can do things you weren't able to do many years ago. It's becoming easier. The question then is, does my headset support WebXR? Apple's done a lot of work on that now. In my book, actually, the chapters that were cut were 360-degree video, blockchain, and WebXR. On WebXR, at that time, this is when Mozilla, the creators of Firefox, created A-Frame. One of our book reviewers who created that wasn't able to write this. I think they were banking on my book; their book was actually supposed to be about WebXR, and they were banking on the success of mine. Most people on our review panel said that WebXR was not developed enough. It's come leaps and bounds since then.
If you're targeting just the web, and then it can work on any headset, that's probably the lowest barrier to entry. Once you get to more advanced stuff in native development, like, I want this to be part of a major Hollywood film, it's C++, unless you're using Blueprints in Unreal, and that's a little more inaccessible to some people.
Then some people are like, I've got to learn Unity and C#. Right now, you're requiring an LLM to be able to create an asset and then get to that next part, motion. That's actively being solved right now. The deployment of that into an application, like an APK file for anything that works in Google's Android universe, that's going to come. All of these things are still early stage. They're not perfect. I'm hopeful, with some of the layers of the barrier to entry being dropped. There's still a lot of work to do for native application development. That's much more difficult and challenging. It's easier for Swift developers. We're not there yet with the number of developers in the ecosystem.