Foreword
2023 has been one of the most exciting years for AI technology, and Generative AI in particular, with the surging popularity of ChatGPT (Generative Pre-trained Transformer) and Large Language Models (LLMs). This is thanks to their impressive ability to comprehend human language and make decisions that remarkably mimic human intelligence.
ChatGPT reached an unprecedented milestone of 1 million users within five days. Since then, Big Tech giants have quickly entered the race, releasing dozens of LLMs, both open source and proprietary, such as LaMDA (Google AI), Megatron-Turing NLG (NVIDIA), PaLM (Google AI), Llama 2 (Meta AI), Bloom (Hugging Face), Wu Dao 2.0 (Beijing Academy of Artificial Intelligence), Jurassic-1 Jumbo (AI21 Labs), and Bard (Google AI).
Alongside this race among Big Tech giants, adoption of ChatGPT and LLMs in business is growing rapidly. According to the Master of Code Global report “Statistics of ChatGPT & Generative AI in business: 2023 Report”, 49% of companies presently use ChatGPT, while 30% intend to use it in the future. Another report by Forbes suggests that 70% of organizations are currently exploring generative AI, which includes LLMs. This indicates that LLMs are gaining traction in the enterprise world and that more and more companies see the potential of this technology to revolutionize their businesses.
Our Chief AI Scientist, Dr. Dao Huu Hung, offers insights into AI’s exciting future and its impact on businesses and society.
1. Multimodal Generative AI
Although ChatGPT and most other LLMs have demonstrated superior performance in understanding human language (in text form), text is just one of the modalities human beings perceive every day. Multimodal data is ubiquitous in the real world, as humans communicate and interact through all types of information, including images, audio, and video. Multimodal data also poses significant challenges for artificial intelligence (AI) systems, such as data heterogeneity, alignment, fusion, and representation, as well as model complexity, computational cost, and evaluation metrics. The AI community, therefore, opted to address unimodal data successfully before tackling the more challenging multimodal setting.
Inspired by the tremendous success of LLMs, the AI community has been creating Large Multimodal Models (LMMs) that aim for similar levels of generality and expressiveness in the multimodal domain. LMMs can leverage massive amounts of multimodal data and perform diverse tasks with minimal supervision. Incorporating other modalities into LLMs yields LMMs that solve many challenging tasks involving text, images, audio, and video, such as captioning images, answering questions about images, and editing images with natural language commands.
OpenAI has been pioneering this direction with GPT-4V, the multimodal upgrade of the GPT-4 model that accepts image inputs alongside text. GPT-4V can perform various tasks, such as describing images, answering questions about their content, and extracting information from charts, documents, and screenshots using natural language instructions.
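For a sense of how such a model is consumed in practice, the snippet below sends a text question together with an image to a vision-capable chat model. This is a minimal sketch assuming the OpenAI Python SDK (v1+); the model name and the image URL are placeholders, and availability of the vision endpoint depends on your account.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# One user message mixing two modalities: a text question and an image URL.
response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # placeholder name for a vision-capable model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What defects, if any, do you see on this product?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/product.jpg"}},
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)
```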
Open-source alternatives have also emerged:
- LLaVA-1.5: a model that can understand information from both text and images, performing tasks such as answering questions about images and generating image captions.
- Alpaca-LoRA: a model that follows natural language instructions or prompts to perform various natural language tasks.
Adept, on the other hand, has been pursuing an even bigger ambition: building an AI model that can interact with everything on your computer. “Adept is building an entirely new way to get things done. It takes your goals, in plain language, and turns them into actions on the software you use every day.” They believe that AI models that read and write text are valuable, but models that use computers the way humans do are even more valuable to enterprise businesses.
These ambitions are driving the race among Big Tech companies to deliver Large Multimodal Models, though it will likely take a few years for LMMs to reach the level of maturity that LLMs enjoy today.
2. Generating vs. Leveraging Large Foundation Models
Producing AI applications for diverse tasks has never been easier or more efficient. Just a few years ago, building a sentiment analysis application, for example, could take a few months to implement a proof of concept (POC) with both in-house and public datasets, plus a few more months to deploy the resulting models into a production system. Now, LLMs make it possible to develop such an application in a few days, simply by formulating a prompt that asks the model to evaluate a text as positive, neutral, or negative.
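To make this concrete, here is a minimal sketch of a prompt-based sentiment classifier, assuming the OpenAI Python SDK and an API key in the environment; the model name and the `classify_sentiment` helper are illustrative placeholders, and any comparable LLM API would do.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def classify_sentiment(text: str) -> str:
    """Ask an LLM to label a text as positive, neutral, or negative."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder; any capable chat model works
        messages=[
            {"role": "system",
             "content": "Classify the sentiment of the user's text. "
                        "Answer with exactly one word: positive, neutral, or negative."},
            {"role": "user", "content": text},
        ],
        temperature=0,  # deterministic labels
    )
    return response.choices[0].message.content.strip().lower()

print(classify_sentiment("The new release fixed every issue I reported. Great job!"))
# expected: positive
```

Swapping the system prompt is all it takes to turn the same skeleton into a topic classifier, a summarizer, or an information extractor.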
In the field of computer vision, visual prompting techniques, introduced by Landing AI, similarly leverage the power of Large Vision Models (LVMs) to solve a variety of vision tasks, such as object detection, object recognition, and semantic segmentation. Visual prompting uses visual cues, such as example images, icons, or patterns, to repurpose a pretrained Large Vision Model for a new downstream task. It can reduce the need for extensive data labeling and model training, enabling faster and easier deployment of computer vision applications.
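To give a flavor of what prompting a large vision model looks like in code, the sketch below uses Meta's Segment Anything Model (SAM) with a single point prompt as an illustrative stand-in; it is not Landing AI's Visual Prompting tool, and the checkpoint path, image file, and click coordinates are placeholders.

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a pretrained SAM checkpoint (variant and path are placeholders).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

# Hand the image to the model once; prompts can then be applied cheaply.
image = cv2.cvtColor(cv2.imread("factory_part.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# The "visual prompt": one foreground click on the object of interest.
point_coords = np.array([[320, 240]])
point_labels = np.array([1])  # 1 = foreground, 0 = background

masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,
)
print("best mask score:", float(scores.max()))
```

No labeled dataset and no training run are involved; the prompt alone adapts the pretrained model to the new object.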
Generating pretrained Large Foundation Models (LFMs), including LLMs and LVMs, requires not only AI expertise but also a huge investment in infrastructure, i.e., data lakes and computing servers. Hence, the race among Big Tech companies to create pretrained LFMs will continue in 2024 and in the years to come. Some of these models are proprietary, but many others are open source, giving enterprises diverse alternatives. Meanwhile, Small and Medium Enterprises (SMEs) and AI start-ups will be the main forces realizing the commercial potential of LFMs; they will primarily focus on building applications on top of them.
3. Agent Concept in Generative AI
The agent concept is a new trend in Generative AI that has the potential to revolutionize the way we interact with computers. Agents are software modules that can autonomously or semi-autonomously spin up sessions (in this case, language models and other workflow-related sessions) as needed to pursue a goal. One of the key benefits of using agents is that they can automate many of the tasks that are currently performed by humans. This can free up humans to focus on more strategic and creative tasks. Agents can be designed to be more user-friendly and easier to use than traditional Generative AI tools, making Generative AI more accessible to a wider range of users.
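To make the idea concrete, here is a minimal sketch of such a goal-driven loop, assuming the OpenAI Python SDK; the `search_docs` tool, the JSON action protocol, and the model name are illustrative placeholders rather than any particular framework's API.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A stub tool; a real agent would wire this to search, databases, or internal APIs.
def search_docs(query: str) -> str:
    return f"(stub) top results for: {query}"

TOOLS = {"search_docs": search_docs}

def run_agent(goal: str, max_steps: int = 5) -> str:
    """A toy plan-act loop: at each step the LLM either calls a tool or finishes."""
    history = [
        {"role": "system",
         "content": "Pursue the user's goal step by step. Reply with JSON only: "
                    '{"action": "search_docs", "input": "..."} to use a tool, or '
                    '{"action": "finish", "answer": "..."} when done.'},
        {"role": "user", "content": goal},
    ]
    for _ in range(max_steps):
        reply = client.chat.completions.create(
            model="gpt-3.5-turbo", messages=history, temperature=0
        )
        content = reply.choices[0].message.content
        step = json.loads(content)  # sketch only: a real agent validates this
        if step["action"] == "finish":
            return step["answer"]
        observation = TOOLS[step["action"]](step["input"])
        history.append({"role": "assistant", "content": content})
        history.append({"role": "user", "content": f"Observation: {observation}"})
    return "Stopped after max_steps without finishing."

print(run_agent("Summarize our refund policy for customers."))
```

Frameworks such as Auto-GPT and BabyAGI elaborate this same loop with memory, task queues, and richer tool sets.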
Here are some emerging trends around the agent concept in Generative AI:
- Increased use of agents to automate tasks: As Generative AI becomes more powerful and sophisticated, we can expect greater use of agents to automate tasks currently performed by humans. For example, agents can be used to automate the process of creating and deploying AI models.
- Increased use of agents to make Generative AI more accessible: As agents become more user-friendly and easier to use, Generative AI will reach a wider range of users. This could lead to a new wave of innovation as more people are able to use Generative AI to create new products and services.
- Development of new agent-based Generative AI tools and platforms: As the agent concept gains popularity, we can expect new agent-based tools and platforms that make it easier for developers to create and deploy agent-based Generative AI applications.
Here are some specific examples of how the agent concept is being used in Generative AI today:
- Agent-based Generative AI tools: A number of agent-based tools are already available. Auto-GPT and BabyAGI, for example, allow users to create and deploy agent-based Generative AI applications.
- Agent-based Generative AI platforms: Several platforms support hosting such agents. For example, Google’s AI Platform and Amazon Web Services’ SageMaker both allow users to deploy and manage agent-based Generative AI applications.
- Agent-based Generative AI applications: Agent-based applications are already in use to create new products and services, automate tasks, and make Generative AI accessible to a wider range of users.
Overall, the agent concept is a new and promising trend in Generative AI. It is being used to develop new tools, platforms, and applications that are having a significant impact on a variety of industries.
4. AI at the Edge
‘At-the-edge’ AI is a fast-growing and competitive field that involves deploying AI models on devices such as laptops, smartphones, cameras, drones, robots, and sensors. As AI applications continue to evolve, the trend of moving AI processing closer to the data source has gained significant momentum. Big Tech companies and chipmakers are competing to run AI applications on the cost-effective devices we use every day, without relying on cloud servers, which can improve speed, privacy, security, and energy efficiency.
NVIDIA has been a pioneer in edge AI with its powerful and versatile Jetson platform. Thanks to its heavy investment in high-performance GPU technology in the early days of deep learning, NVIDIA has strong relationships with enterprises and cloud providers such as Amazon Web Services, Microsoft Azure, and Google Cloud Platform. More importantly, NVIDIA provides broad software ecosystems and tools, such as TensorRT and DeepStream, which help developers build and accelerate AI models efficiently. Although NVIDIA GPUs often cost more than competing hardware, they remain the mainstream choice in the AI community.
A number of competitors provide cheaper, and in some cases faster, alternatives to Jetson:
- The Google Edge TPU is a custom-designed ASIC optimized for running TensorFlow Lite models at the edge.
- The Intel Movidius Myriad X is a vision processing unit (VPU) designed for running AI applications at the edge.
- The Xilinx Zynq UltraScale+ MPSoC is a versatile system-on-chip (SoC) that combines an FPGA with an ARM processor.
- The NXP i.MX 8M Plus is an SoC that pairs an ARM processor with a neural processing unit (NPU).
- The Qualcomm Snapdragon 865 is a mobile SoC with an integrated NPU.
These vendors have been investing in both hardware design and the software ecosystems and tools that let developers use their hardware efficiently. There will be steep competition in the years to come.
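As a small illustration of what on-device inference looks like in code, the snippet below runs a converted TensorFlow Lite model with the standard `tf.lite.Interpreter`; the model file name and the dummy input are placeholders, and on an Edge TPU the same flow would typically load a compiled model through the Edge TPU delegate.

```python
import numpy as np
import tensorflow as tf

# Load a converted .tflite model (file name is a placeholder).
interpreter = tf.lite.Interpreter(model_path="mobilenet_v2.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy input matching the model's expected shape, e.g. one 224x224 RGB image.
input_shape = input_details[0]["shape"]
dummy_image = np.random.rand(*input_shape).astype(np.float32)

# Inference runs entirely on-device: no network call, no cloud server.
interpreter.set_tensor(input_details[0]["index"], dummy_image)
interpreter.invoke()
scores = interpreter.get_tensor(output_details[0]["index"])

print("top class id:", int(np.argmax(scores)))
```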
Apple has also jumped into this field, designing chips for its own products, including laptops and mobile devices. The M1 chip has a 16-core Neural Engine that can perform up to 11 trillion operations per second, and the M2 chip's Neural Engine is roughly 35% faster than the M1's. This makes these chips well suited to running AI models for tasks such as image recognition, natural language processing, and other on-device machine learning. Apple's A16 Bionic chip, used in the iPhone 14 and iPhone 14 Pro, is even more powerful, with a 16-core Neural Engine that can perform up to 17 trillion operations per second. The A17 Pro chip in the iPhone 15 Pro goes further still, delivering about 20% more GPU performance with a 6-core GPU.
Qualcomm is expected to release the Snapdragon Elite Gen 3, built on a 4 nm process, in early 2024. Its AI engine is twice as fast as the previous generation's and can reach up to 15 trillion operations per second (TOPS). It can run a wide range of AI models, including image recognition, natural language processing, and other machine learning models, and it can run multiple AI models simultaneously. Both Qualcomm's and Apple's chips execute AI models at low power consumption, so we can expect increasing competition in the field of edge AI devices in 2024 and beyond.