deepseek-ai/deepseek-67b-base
About
DeepSeek LLM is an advanced language model comprising 67 billion parameters, trained from scratch on a dataset of 2 trillion tokens in both English and Chinese.

Example Output
Prompt:
"An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is"
Output:
computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
The simplest form of attention mechanism is the additive attention mechanism (Bahdanau et al., 2015), which uses the following equation:
Where the function softmax is applied to each row of the matrix QK to ensure that the weights add up to one. In this case, the attention vector a is computed as a weighted sum of the values V using the weights WQK.
The multiplicative attention mechanism (Luong et al., 2015) is another common form of attention mechanism. It uses the following equation:
In this case, the attention vector a is computed as a weighted sum of the values V using the weights WQKV.
Attention mechanisms are used in various natural language processing tasks such as machine translation, question answering, and text summarization. They allow the model to focus on different parts of the input sequence depending on the task at hand. For example, in machine translation, the attention mechanism allows the decoder to attend to different parts of the source sentence when generating the target sentence.
Attention Mechanism
Attention mechanism is a technique used in deep learning models to
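
The equations the model refers to in its output are not rendered in the capture above. As a rough illustration of the idea it is describing (the output computed as a weighted sum of the values, with weights from a softmax over query-key compatibility scores), here is a minimal NumPy sketch of scaled dot-product attention. It is not part of the model's output and is not the exact additive (Bahdanau) or multiplicative (Luong) formulation the generated text cites:

```python
# Minimal scaled dot-product attention sketch (illustrative only).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(Q, K, V):
    """Output is a weighted sum of V; weights come from softmax(Q K^T / sqrt(d))."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # compatibility of each query with each key
    weights = softmax(scores, axis=-1)  # each row sums to one
    return weights @ V                  # weighted sum of the values

# Toy example: 2 queries, 3 key-value pairs, dimension 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
print(dot_product_attention(Q, K, V).shape)  # (2, 4)
```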
Performance Metrics
- Prediction Time: 13.04s
- Total Time: 202.43s
All Input Parameters
{ "top_k": 50, "top_p": 0.8, "prompt": "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is", "temperature": 0.7, "max_new_tokens": 256, "repetition_penalty": 1.05 }
Input Parameters
- top_k: The number of highest-probability tokens to consider when generating output. If > 0, only the top k tokens with the highest probability are kept (top-k filtering).
- top_p: A probability threshold for generating the output. If < 1.0, only the smallest set of top tokens whose cumulative probability reaches top_p is kept (nucleus filtering, described in Holtzman et al., http://arxiv.org/abs/1904.09751). See the sketch after this list.
- prompt: Input prompt.
- temperature: The value used to modulate the next-token probabilities.
- max_new_tokens: The maximum number of tokens the model should generate as output.
- repetition_penalty: Penalty applied to tokens that already appear in the prompt or generated text; values > 1.0 discourage repetition, 1.0 disables the penalty.
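
To make the top_k and top_p descriptions above concrete, here is a rough NumPy sketch of how the two filters restrict the candidate token set before sampling. It is illustrative only and is not the sampling code this deployment (vLLM) actually runs:

```python
# Rough sketch of top-k and nucleus (top-p) filtering over next-token probabilities.
import numpy as np

def filter_top_k_top_p(probs, top_k=50, top_p=0.8):
    """Zero out tokens outside the top-k set and the top-p nucleus, then renormalize."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]  # token indices sorted by descending probability

    # Top-k filtering: keep only the k highest-probability tokens.
    keep = np.zeros_like(probs, dtype=bool)
    keep[order[:top_k]] = True

    # Nucleus filtering: keep the smallest prefix whose cumulative probability >= top_p.
    cumulative = np.cumsum(probs[order])
    nucleus_size = int(np.searchsorted(cumulative, top_p) + 1)
    nucleus = np.zeros_like(probs, dtype=bool)
    nucleus[order[:nucleus_size]] = True

    keep &= nucleus
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()

# Toy distribution over 5 tokens: only the first two survive with top_k=3, top_p=0.8.
print(filter_top_k_top_p([0.5, 0.3, 0.1, 0.06, 0.04], top_k=3, top_p=0.8))
```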
Output Schema
Output
Example Execution Logs
(RayWorkerVllm pid=2043) INFO 05-07 20:41:05 custom_all_reduce.py:240] Registering 6685 cuda graph addresses
(RayWorkerVllm pid=2043) INFO 05-07 20:41:05 model_runner.py:867] Graph capturing finished in 6 secs.
INFO 05-07 20:41:06 async_llm_engine.py:508] Received request 0: prompt: 'An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is', sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.05, temperature=0.7, top_p=0.8, top_k=50, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['<｜end▁of▁sentence｜>'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=256, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: None, lora_request: None.
INFO 05-07 20:41:06 metrics.py:218] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%
INFO 05-07 20:41:11 metrics.py:218] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.6%, CPU KV cache usage: 0.0%
INFO 05-07 20:41:16 metrics.py:218] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.9%, CPU KV cache usage: 0.0%
INFO 05-07 20:41:19 async_llm_engine.py:120] Finished request 0. generation took 13.033s
Version Details
- Version ID: 0f2469607b150ffd428298a6bb57874f3657ab04fc980f7b5aa8fdad7bd6b46b
- Version Created: May 7, 2024