ML | JURASSIC-1 – Language Model
Jurassic-1 is the latest large language model, developed by Israel's AI21 Labs. The name covers a pair of auto-regressive Natural Language Processing (NLP) models, J1-Jumbo and J1-Large, built to compete with OpenAI's GPT-3. The model breaks multiple records, not only in Jumbo's size of 178 billion parameters but also in its reach and usability: it is billed as the first language model of this scale to be made openly available to developers and researchers.
Introduced with the idea of machines serving as humans' thought partners, this model promises to handle a wide range of language and operational tasks. Beyond that, it lets users build their own applications and services. Some of its most notable features are described below.
- Text summarization or simplification: Jurassic-1 can condense text of any length into a shorter version containing only the relevant information. This feature can be used to draft meeting minutes, extract the gist of long emails or documents, decide whether a review or piece of feedback was positive or negative, and so on.
- Classification: The model can classify text into labels or categories, and is not limited to binary classification. A prominent use case is sentiment analysis.
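With an auto-regressive model like Jurassic-1, classification is typically done by writing a few labelled examples into the prompt and letting the model continue the pattern. The sketch below only builds such a few-shot prompt; the label names and example reviews are illustrative, not taken from AI21's documentation.

```python
# Build a few-shot sentiment-classification prompt for an autoregressive LM.
# The examples and labels here are made up for illustration.

EXAMPLES = [
    ("The delivery was quick and the product works great.", "Positive"),
    ("Terrible support; my issue was never resolved.", "Negative"),
]

def build_prompt(review: str) -> str:
    """Concatenate labelled examples, then the unlabelled review.

    The model is expected to complete the final 'Sentiment:' line
    with one of the labels seen in the examples.
    """
    lines = []
    for text, label in EXAMPLES:
        lines.append(f"Review: {text}\nSentiment: {label}\n")
    lines.append(f"Review: {review}\nSentiment:")
    return "\n".join(lines)

prompt = build_prompt("I love this phone, the battery lasts all day.")
print(prompt)
```

The resulting string would be sent as the `prompt` of a completion request, with the completion cut off at the first newline so only the predicted label comes back.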
- World knowledge and creativity: The model has been trained on huge amounts of data, which makes it proficient at answering questions, giving suggestions, and resolving doubts. It is also creative enough to write articles on its own, and can even be humorous, something AI generally finds hard to pull off. This combination of knowledge and creativity has applications in copywriting, ideation, marketing, and interactive chatbots.
Its other features include translating code from one programming language to another, generating code from purely textual instructions, extracting information, and formatting text. It can write song or rap lyrics, join in a game of charades, and even play chess against you.
Storing about 178 billion parameters in half precision requires a little more than 356 GB of memory. Because even the best GPU tops out at roughly 80 GB, the model was trained across multiple nodes. It was trained on 300 billion tokens (a token is a small unit of text produced by splitting larger text so that the model can process it) drawn from publicly available sources. In other words, the model has ingested much of the publicly available text out there, which is what makes it such a know-it-all.
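The memory figure above is easy to check: each parameter in half precision (float16) occupies 2 bytes.

```python
# Back-of-the-envelope check of the memory figure quoted above:
# 178 billion parameters stored in half precision (2 bytes each).
params = 178e9
bytes_per_param = 2                 # float16
gb = params * bytes_per_param / 1e9  # decimal gigabytes
print(f"{gb:.0f} GB")                # 356 GB, far beyond a single ~80 GB GPU
```

That 356 GB is only the weights; activations, optimizer state, and gradients push the training footprint far higher, which is why multiple nodes were needed.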
This model differs from its predecessor GPT-3 in several ways. GPT-3 has 175 billion parameters, which now makes it the second-largest language model. Jurassic-1 was trained with a vocabulary of about 250,000 unique tokens, where a token can represent a word or a word piece, whereas GPT-3 uses only about 50,000. This makes Jurassic-1's processing more efficient: its tokens-per-byte (TPB) ratio is smaller, meaning the same text can be represented with fewer tokens than in GPT-3. If both models are assumed to have the same architecture, this alone speeds up Jurassic-1's query processing by a factor of 1.4.

The catch is that Jurassic-1's architecture is in fact different: the depth-to-width ratio of its neural net varies, as the comparison in table 1 shows. Taking both the different architecture and the larger training vocabulary into account, Jurassic-1 processes queries a full 1.8 times faster. Because of this increased computational efficiency, Jurassic-1 can also fit more examples than GPT-3 into a few-shot prompt.

Another special feature of Jurassic-1 is that it lets users custom-train the model with very few examples (correctly labelled input-output pairs). The makers claim that about 50-100 examples are enough for fairly accurate results, though as always, the more examples fed in, the higher the accuracy. This, in contrast to GPT-3, also lets users deploy it as a chatbot.
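The tokens-per-byte argument reduces to simple arithmetic: per-query cost scales roughly with the number of tokens, so encoding the same text into fewer tokens yields a proportional speedup. The TPB values below are made up purely for illustration; only their ratio matters, chosen here to reproduce the 1.4x same-architecture figure from the text.

```python
# Illustrative arithmetic for the tokens-per-byte (TPB) claim.
# Hypothetical TPB values; Jurassic-1's larger 250k vocabulary lets it
# encode the same bytes into fewer tokens than GPT-3's 50k vocabulary.
text_bytes = 1000
tpb_gpt3 = 0.28   # hypothetical tokens per byte for GPT-3
tpb_j1 = 0.20     # hypothetical, smaller thanks to the larger vocabulary

tokens_gpt3 = text_bytes * tpb_gpt3   # 280 tokens
tokens_j1 = text_bytes * tpb_j1       # 200 tokens

# Assuming identical per-token cost (i.e. the same architecture),
# speedup is just the ratio of token counts.
speedup = tokens_gpt3 / tokens_j1
print(f"{speedup:.1f}x")
```

The further jump from 1.4x to 1.8x comes from the architectural differences (the depth-to-width trade-off in table 1), which this token-count arithmetic alone does not capture.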
- nparams : Number of parameters in the model
- nlayers : Number of layers in the model
- dmodel : Number of units in each bottleneck layer
- dhead : Dimension of attention heads
- nhead : Number of attention heads
- nvocab : Number of unique tokens used in training
AI21's platform is currently in open beta, so anyone and everyone can experiment with Jurassic-1. Go experiment.
- Technical paper of Jurassic-1 and its accompanying blog posts
- Source of table 1: technical paper of Jurassic-1