GPT-4 explained in 5 minutes!
GPT-4 is the most sophisticated LLM I have come across so far. OpenAI released GPT-4 on 14 March 2023, at around 10 pm IST.
I. Introduction:
GPT-4 is a large-scale, multimodal model that accepts image and text inputs and produces text outputs.
The main focus has been on improving the model's ability to understand and generate natural language text, particularly in more complex and nuanced scenarios. On a suite of traditional NLP benchmarks, GPT-4 outperforms previous large language models and most state-of-the-art systems (which often rely on benchmark-specific training or hand-engineering).
II. Model:
- It is a Transformer-style model pre-trained to predict the next token in a document, using both publicly available data (such as internet data) and data licensed from third-party providers. (A minimal sketch of the next-token objective follows this list.)
- The training data is a web-scale corpus that includes correct and incorrect solutions to math problems, weak and strong reasoning, self-contradictory and consistent statements, and a great variety of ideologies and ideas. The cut-off date for this data is September 2021.
- The model was fine-tuned using Reinforcement Learning from Human Feedback (RLHF).
- How are the images handled by the model?
OpenAI hasn’t disclosed these details publicly yet.
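To make the pre-training objective concrete, here is a minimal, illustrative sketch of next-token prediction in PyTorch. The toy model, shapes, and data are invented for illustration only; this is not GPT-4's architecture or training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy next-token prediction: the model sees tokens [t0..t_{n-1}]
# and is trained so that each position predicts the following token.
vocab_size, d_model = 100, 32
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))

tokens = torch.randint(0, vocab_size, (1, 16))   # a fake "document" of 16 token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # shift by one position

logits = model(inputs)                           # (batch, seq_len - 1, vocab_size)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                  # gradients for one pre-training step
print(f"next-token loss: {loss.item():.3f}")
```

A real Transformer replaces the toy two-layer model with attention blocks, but the loss is the same idea applied at web scale, followed by RLHF fine-tuning.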
III. Capabilities of GPT-4:
- GPT-4 was evaluated by simulating exams that were originally designed by humans for humans, without any exam-specific training.
- GPT-4 exhibits human-level performance on the majority of these professional and academic exams. Notably, it passes a simulated version of the Uniform Bar Examination with a score in the top 10% of test takers. (Image 1)
- GPT-4 considerably outperforms existing language models and previously state-of-the-art (SOTA) systems, which often have benchmark-specific crafting or additional training protocols. (Image 2)
- GPT-4 substantially improves over previous models' ability to follow user intent. On a dataset of 5,214 prompts submitted to ChatGPT and the OpenAI API, the responses generated by GPT-4 were preferred over the responses generated by GPT-3.5 on 70.2% of prompts.
- GPT-4 outperforms the English-language performance of GPT-3.5 and existing language models (Chinchilla and PaLM) for most of the languages tested, including low-resource languages such as Latvian, Welsh, and Swahili. (Image 3)
- GPT-4 accepts prompts consisting of both images and text, which, in parallel to the text-only setting, lets the user specify any vision or language task. Specifically, the model generates text outputs given inputs consisting of arbitrarily interlaced text and images. The standard test-time techniques developed for language models (e.g., few-shot prompting, chain-of-thought, etc.) are similarly effective when using both images and text. (Image 4)
- Visual understanding is one of the key features of GPT-4, enabling visual Q&A, captioning, and image-to-text capabilities.
IV. Predictable Scaling:
- For very large training runs like GPT-4, it is not feasible to do extensive model-specific tuning. To address this, they developed infrastructure and optimization methods that have very predictable behavior across multiple scales.
- These improvements allowed them to reliably predict some aspects of GPT-4's performance from smaller models trained using 1,000x to 10,000x less compute.
- The final loss of properly-trained large language models is thought to be well approximated by power laws in the amount of compute used to train the model.
- They predicted GPT-4's final loss on an internal codebase (not part of the training set) by fitting a scaling law with an irreducible loss term, L(C) = aC^b + c, to models trained with the same methodology but using at most 10,000x less compute than GPT-4. This prediction was made shortly after the run started, without the use of any partial results. The fitted scaling law predicted GPT-4's final loss with high accuracy.
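As a rough illustration of how such a fit works, here is a sketch using scipy's curve_fit on invented (compute, loss) points; the numbers are made up and are not OpenAI's data.

```python
import numpy as np
from scipy.optimize import curve_fit

# Scaling law with an irreducible loss term: L(C) = a * C**b + c
def scaling_law(C, a, b, c):
    return a * np.power(C, b) + c

# Hypothetical (normalized compute, observed loss) pairs from small training runs.
compute = np.array([1e-5, 1e-4, 1e-3, 1e-2, 1e-1])  # fractions of the target run's compute
loss    = np.array([4.10, 3.20, 2.55, 2.10, 1.80])   # made-up losses

(a, b, c), _ = curve_fit(scaling_law, compute, loss, p0=(1.0, -0.2, 1.0), maxfev=10000)
print(f"fit: a={a:.3f}, b={b:.3f}, c={c:.3f}  (c is the irreducible loss)")

# Extrapolate to the full run (C = 1.0, i.e., 10,000x the smallest run above).
print(f"predicted final loss at full compute: {scaling_law(1.0, a, b, c):.3f}")
```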
V. Differences between GPT-3.5 and GPT-4:
- Visual input: GPT-3.5 is text-to-text only, unlike GPT-4, which is an image/text-to-text model.
- Larger context: by default, GPT-4 supports twice the context length of GPT-3.5, i.e., 8,192 tokens instead of 4,096. A much larger variant with an 8x context length (32,768 tokens) has also been released with limited access. (A small token-counting sketch follows this list.)
- Improved performance on multiple tasks, for example summarization, multilingual translation, editing code based on errors, creative writing, etc.
- GPT-4 outperforms previous state-of-the-art models on standardized exams such as the GRE, SAT, Uniform Bar Exam, and AP exams (Image 1, 5), as well as on research benchmarks such as MMLU, HellaSwag, and TextQA. (Image 2, 3)
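To see what those context limits mean in practice, here is a small sketch that counts tokens with OpenAI's tiktoken library, using the cl100k_base encoding that the GPT-4 models use; the prompt text is made up.

```python
import tiktoken

# cl100k_base is the tokenizer used by the gpt-4 (and gpt-3.5-turbo) models.
enc = tiktoken.get_encoding("cl100k_base")

prompt = "Summarize the attached 30-page contract and list every termination clause."
n_tokens = len(enc.encode(prompt))

print(f"prompt tokens: {n_tokens}")
print(f"fits in the gpt-4 (8,192-token) window:      {n_tokens <= 8192}")
print(f"fits in the gpt-4-32k (32,768-token) window: {n_tokens <= 32768}")
```

Note that the prompt and the completion share the same window, so long documents plus long answers are what the 32k variant is for.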
VI. API and Pricing:
- ChatGPT Plus subscribers will get GPT-4 access on chat.openai.com with a usage cap.
- To get access to the GPT-4 API (which uses the same ChatCompletions API as gpt-3.5-turbo), you need to sign up for OpenAI's waitlist.
- You can make text-only requests to the gpt-4 model (image inputs are still in limited alpha) by calling gpt-4-0314, which will be supported until June 14. Pricing is $0.03 per 1k prompt tokens and $0.06 per 1k completion tokens. Default rate limits are 40k tokens per minute and 200 requests per minute.
- There is limited access to a 32,768-token context (about 50 pages of text) version, gpt-4-32k, which will also be updated automatically over time (current version gpt-4-32k-0314, also supported until June 14). Pricing is $0.06 per 1k prompt tokens and $0.12 per 1k completion tokens.
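For reference, here is a minimal text-only call with the openai Python package as it works at the time of writing (the v0.27-style ChatCompletion interface), with a made-up prompt and a rough cost estimate based on the prices above:

```python
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# Text-only request to GPT-4 via the ChatCompletions API.
response = openai.ChatCompletion.create(
    model="gpt-4",  # or the pinned snapshot "gpt-4-0314"
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain GPT-4's context window in one sentence."},
    ],
    max_tokens=200,
)

print(response["choices"][0]["message"]["content"])

# Rough cost estimate using the gpt-4 (8k) prices quoted above.
usage = response["usage"]
cost = usage["prompt_tokens"] / 1000 * 0.03 + usage["completion_tokens"] / 1000 * 0.06
print(f"approximate cost of this call: ${cost:.4f}")
```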
VII. OpenAI Evals:
- OpenAI has open-sourced OpenAI Evals, their software framework for creating and running benchmarks for evaluating models like GPT-4, while inspecting their performance sample by sample.
- Evals supports writing new classes to implement custom evaluation logic. Many benchmarks follow one of a few "templates", so OpenAI has also included the templates that have been most useful internally (including a template for "model-graded evals"); GPT-4 is surprisingly capable of checking its own work.
- Evals will be an integral part of the process for using and building on top of their models, and they welcome direct contributions, questions, and feedback. (A rough sketch of a custom eval follows.)
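For a sense of what a custom eval looks like, here is a rough sketch modelled on the repo's custom-eval guide at the time of writing. Treat the class and helper names (evals.Eval, completion_fn, eval_all_samples, record_and_check_match, get_accuracy) as assumptions taken from that guide, the samples path as hypothetical, and check the current repo before relying on any of it.

```python
import random

import evals
import evals.metrics


class ArithmeticEval(evals.Eval):
    """Toy eval: ask the model simple sums and check for an exact match."""

    def eval_sample(self, sample, rng: random.Random):
        # Each sample comes from a JSONL file, e.g. {"problem": "2+2", "answer": "4"}.
        prompt = f"Answer with just the number: {sample['problem']} ="
        result = self.completion_fn(prompt=prompt, max_tokens=5)
        sampled = result.get_completions()[0].strip()

        # Records a "match" event that the accuracy metric aggregates in run().
        evals.record_and_check_match(
            prompt=prompt, sampled=sampled, expected=sample["answer"]
        )

    def run(self, recorder):
        samples = evals.get_jsonl("arithmetic/samples.jsonl")  # hypothetical path
        self.eval_all_samples(recorder, samples)
        return {"accuracy": evals.metrics.get_accuracy(recorder.get_events("match"))}
```

After registering the eval in the YAML registry, it can be run with the oaieval command, which reports per-sample records plus the aggregate accuracy.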
VIII. Limitations:
- GPT-4 has similar limitations to earlier GPT models. Most importantly, it is still not fully reliable (it "hallucinates" facts and makes reasoning errors). However, it significantly reduces hallucinations relative to previous GPT-3.5 models (which have themselves been improving with continued iteration).
- GPT-4 makes progress on public benchmarks like TruthfulQA, which tests the model's ability to separate fact from an adversarially selected set of incorrect statements. These questions are paired with factually incorrect answers that are statistically appealing. The GPT-4 base model is only slightly better at this task than GPT-3.5; however, after RLHF post-training, there are large improvements over GPT-3.5. (Image 6)
- GPT-4 resists selecting common sayings (e.g., "you can't teach an old dog new tricks"); however, it can still miss subtle details (Elvis Presley was not the son of an actor, so "Perkins" is the correct answer). (Image 6)
- GPT-4 generally lacks knowledge of events that have occurred after the vast majority of its pre-training data cuts off in September 2021.
- It can fail at hard problems the same way humans do, such as introducing security vulnerabilities into the code it produces.
- GPT-4 can also be confidently wrong in its predictions, not taking care to double-check work when it’s likely to make a mistake.
IX. Use cases:
There are several exciting use cases for this kind of breakthrough technology. One of the most exciting things shown by Greg Brockman, President & Co-Founder of OpenAI, during the developer livestream demo was that, from just a sketch mockup of a website, GPT-4 generated the actual website!
X. References & Links:
Join API waitlist: https://openai.com/waitlist/gpt-4
Official Blog: https://openai.com/research/gpt-4
Technical Report: https://cdn.openai.com/papers/gpt-4.pdf
Evals: https://github.com/openai/evals
Demo: https://www.youtube.com/live/outcGtbnMuQ?feature=share