ChatGPT with Senses Begins to Invade the Physical World
OpenAI rolls out a major update giving ChatGPT the full range of seeing, hearing, and speaking capabilities.
Recently, OpenAI announced on its official blog the biggest feature update to its large language model since the launch of GPT-4. According to the use cases in the release, people can now solve real-world problems through ChatGPT using their phone's camera and microphone. For example, a user can photograph a shared bicycle and then ask the AI assistant how to adjust the seat. With this rapid iteration of technology, the generative AI race has entered a new stage: the multimodal battle. In this phase, major tech companies are racing to launch new products and features that use AI to break through the limitations of traditional search engines and chatbots, bringing users a richer and more accurate interactive experience.
ChatGPT grows eyes and a mouth.
Multimodal competition: from text to images, the next frontier in AI technology
As technology continues to evolve, we are gradually entering a new era of AI - the multimodal AI race. Whether it's Meta's AudioCraft program or the upgrades to Google Bard and Microsoft Bing's chat function, they are all announcing to the world the arrival of the multimodal era.
Recently, Meta launched a new project called AudioCraft, which extends AI's capabilities from text to music: through generative models, AudioCraft can produce entirely new musical compositions. Meanwhile, Google's Bard and Microsoft's Bing have introduced multimodal functionality to their chat experiences, letting users communicate with these AI assistants not only through text but also through images, audio, and more.
Amazon isn't lagging behind either: it is leveraging large language models (LLMs) to enhance its Alexa digital assistant, and to get a head start in the multimodal race it just announced a $4 billion investment in OpenAI competitor Anthropic. Apple, too, is experimenting with AI-driven voice generation through a feature it calls Personal Voice.
In image generation, OpenAI's DALL-E 3, released last week, can render legible text and typography within its generated images. And on Monday night local time, OpenAI announced that ChatGPT can now analyze images and respond to them in text conversations. In addition, the ChatGPT mobile app will add speech synthesis options that, paired with its existing speech recognition capabilities, enable fully verbal conversations with the AI assistant.
In this new multimodal AI era, the intersection and integration of technologies keep pushing past the boundaries of what we thought possible. Future AI assistants will foreseeably grow more intelligent and better able to understand and respond to users' varied needs, driven by continued progress in generative AI and the steady expansion of its application scenarios. In this competition, we can expect more innovations and breakthroughs, and wider application of AI technology across fields.
ChatGPT is now speech-enabled. The feature is driven by a new text-to-speech model that needs only text and a few seconds of sample speech to generate human-like audio, so ChatGPT can not only understand and generate text but also deliver its responses as speech, further enhancing its interactivity and naturalness.
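The announcement does not document a public interface for this text-to-speech model, but the request such a feature implies can be sketched as a plain payload builder. The model and voice identifiers below ("tts-1", "alloy") and the field names are assumptions for illustration, not a documented API:

```python
# Sketch of a text-to-speech request payload, assuming an endpoint that
# takes a model name, a voice preset, and the text to speak.
# All identifiers here are hypothetical placeholders.

def build_tts_request(text: str, voice: str = "alloy", model: str = "tts-1") -> dict:
    """Return the JSON body for a hypothetical text-to-speech endpoint."""
    if not text:
        raise ValueError("text must be non-empty")
    return {"model": model, "voice": voice, "input": text}

payload = build_tts_request("Hello! How can I help you today?")
print(payload["voice"])  # -> alloy
```

In a real client, this payload would be POSTed to the speech endpoint and the response body saved as an audio file.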
OpenAI also mentioned in the announcement that it worked with professional voice actors to create a set of voices, which means ChatGPT's synthesized speech sounds more realistic and natural. In addition, ChatGPT uses Whisper, OpenAI's open-source speech recognition system, to transcribe users' speech into text, rounding out its voice-interaction capabilities.
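Putting the two halves together, the voice loop described here is: speech in, Whisper-style transcription, a chat-model reply, then TTS out. A minimal sketch of that pipeline follows; the `transcribe`, `chat`, and `synthesize` functions are stand-in stubs (running real Whisper would require the model weights and an actual audio file):

```python
# Sketch of one voice-interaction turn: audio -> text -> reply -> audio.
# All three stages are stubs standing in for the real models.

def transcribe(audio: bytes) -> str:
    """Stub: a real version would run Whisper speech recognition."""
    return audio.decode("utf-8")  # pretend the audio bytes are the transcript

def chat(prompt: str) -> str:
    """Stub chat model: returns a canned echo of the prompt."""
    return f"You said: {prompt}"

def synthesize(text: str) -> bytes:
    """Stub: a real version would return rendered speech audio."""
    return text.encode("utf-8")

def voice_turn(audio_in: bytes) -> bytes:
    """One full voice turn: transcribe, answer, speak."""
    return synthesize(chat(transcribe(audio_in)))

reply_audio = voice_turn(b"how do I raise a bike seat?")
print(reply_audio.decode("utf-8"))  # -> You said: how do I raise a bike seat?
```

The point of the sketch is the data flow: speech recognition and speech synthesis bracket the same text-only chat model that already existed.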
Using the GPT-3.5 or GPT-4 models, ChatGPT can now process and parse uploaded images just as it does text input. Users can add an image to a chat, and ChatGPT will analyze its contents, including any text it contains, and give an answer or response.
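The article does not specify how an image is packaged alongside a question, but a chat turn mixing text and an image can be sketched as a message whose content is a list of typed parts, with the image embedded as a base64 data URL. The field names below are assumptions for illustration:

```python
import base64

# Sketch of attaching an image to a chat turn. Assumes a message format
# where "content" is a list of typed parts: one text part, one image part.
# The exact field names are hypothetical, not a documented API.

def build_image_message(question: str, image_bytes: bytes,
                        mime: str = "image/jpeg") -> dict:
    """Return a user message pairing a question with an inline image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }

msg = build_image_message("How do I lower this bike seat?", b"\xff\xd8fake-jpeg")
print(msg["content"][0]["text"])  # -> How do I lower this bike seat?
```

Sending such a message to a vision-capable model would yield a plain text reply, just like a text-only turn.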
The voice interaction and image recognition features introduced by ChatGPT bring unprecedented utility to chatbots, moving them from simple text-processing tools to something much closer to everyday life. They also signal the direction of future AI systems: not only understanding the abstract world of text, but perceiving complex voice and image information, and even the physical world, to achieve genuinely natural human-machine interaction.