deepseek-ai/deepseek-67b-base 🔢📝 → 📝

โ–ถ๏ธ 2.1K runs ๐Ÿ“… May 2024 โš™๏ธ Cog 0.9.6 ๐Ÿ”— GitHub โš–๏ธ License
code-generation text-generation text-translation

About

DeepSeek LLM is an advanced language model comprising 67 billion parameters, trained from scratch on a vast dataset of 2 trillion tokens in English and Chinese.
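For reference, here is a minimal sketch of invoking this model through the official Replicate Python client (pip install replicate, with REPLICATE_API_TOKEN set in the environment). The model slug, version ID, and input names are taken from this page; the return shape is assumed to follow the output schema below (an array of strings):

import replicate

# Version ID from the "Version Details" section of this page.
output = replicate.run(
    "deepseek-ai/deepseek-67b-base:0f2469607b150ffd428298a6bb57874f3657ab04fc980f7b5aa8fdad7bd6b46b",
    input={
        "prompt": "An attention function can be described as",
        "top_k": 50,
        "top_p": 0.8,
        "temperature": 0.7,
        "max_new_tokens": 256,
        "repetition_penalty": 1.05,
    },
)

# The output schema is an array of strings; join the chunks into one completion.
print("".join(output))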

Example Output

Prompt:

"An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is"

Output

computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
The simplest form of attention mechanism is the additive attention mechanism (Bahdanau et al., 2015), which uses the following equation:
Where the function softmax is applied to each row of the matrix QK to ensure that the weights add up to one. In this case, the attention vector a is computed as a weighted sum of the values V using the weights WQK.
The multiplicative attention mechanism (Luong et al., 2015) is another common form of attention mechanism. It uses the following equation:
In this case, the attention vector a is computed as a weighted sum of the values V using the weights WQKV.
Attention mechanisms are used in various natural language processing tasks such as machine translation, question answering, and text summarization. They allow the model to focus on different parts of the input sequence depending on the task at hand. For example, in machine translation, the attention mechanism allows the decoder to attend to different parts of the source sentence when generating the target sentence.
Attention Mechanism
Attention mechanism is a technique used in deep learning models to
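The text above is raw base-model output, cut off once max_new_tokens was reached, and the equations it refers to did not survive rendering. For orientation only, here is a minimal NumPy sketch of the scaled dot-product attention the prompt describes (a weighted sum of values, with weights from a softmax over query-key compatibility scores); all names here are illustrative and not part of the model:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Compatibility scores between each query and each key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)  # each row of weights sums to one
    return weights @ V                  # output: weighted sum of the values

Q = np.random.rand(2, 4)  # 2 queries of dimension 4
K = np.random.rand(3, 4)  # 3 keys of dimension 4
V = np.random.rand(3, 8)  # 3 values of dimension 8
print(attention(Q, K, V).shape)  # (2, 8)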

Performance Metrics

13.04s Prediction Time
202.43s Total Time
All Input Parameters
{
  "top_k": 50,
  "top_p": 0.8,
  "prompt": "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is",
  "temperature": 0.7,
  "max_new_tokens": 256,
  "repetition_penalty": 1.05
}
Input Parameters
top_k Type: integer • Default: 50
The number of highest-probability tokens to consider when generating the output. If > 0, only the top k tokens with the highest probability are kept (top-k filtering).
top_p Type: number • Default: 0.8
A probability threshold for generating the output. If < 1.0, only the smallest set of most-probable tokens whose cumulative probability is >= top_p is kept (nucleus filtering). Nucleus filtering is described in Holtzman et al. (http://arxiv.org/abs/1904.09751).
prompt Type: string • Default: An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is
Input prompt
temperature Type: number • Default: 0.7
The value used to modulate the next-token probabilities; lower values make sampling more deterministic, higher values more random.
max_new_tokens Type: integer • Default: 256 • Range: 1 - 4096
The maximum number of tokens the model should generate as output.
repetition_penalty Type: number • Default: 1.05
Penalty applied to tokens that have already appeared in the sequence; values > 1.0 discourage repetition.
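To make the sampling parameters above concrete, here is a minimal sketch of how temperature scaling, top-k filtering, top-p (nucleus) filtering, and a Hugging Face-style repetition penalty are commonly combined when picking the next token. This is a generic illustration of the technique, not this model's actual serving code:

import numpy as np

def apply_repetition_penalty(logits, generated_ids, penalty=1.05):
    # Hugging Face-style convention: divide positive logits (multiply negative
    # ones) for tokens already generated, making repeats less likely.
    for t in set(generated_ids):
        logits[t] = logits[t] / penalty if logits[t] > 0 else logits[t] * penalty
    return logits

def sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.8):
    logits = logits / temperature            # <1 sharpens, >1 flattens the distribution
    if top_k > 0:                            # top-k: keep only the k most likely tokens
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits < cutoff, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if top_p < 1.0:                          # nucleus: smallest set with cumulative mass >= top_p
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cum, top_p) + 1]
        masked = np.zeros_like(probs)
        masked[keep] = probs[keep]
        probs = masked / masked.sum()
    return int(np.random.choice(len(probs), p=probs))

With this page's defaults (temperature 0.7, top_k 50, top_p 0.8, repetition_penalty 1.05), sampling stays fairly conservative while still allowing some variation between runs.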
Output Schema

Output

Type: array • Items Type: string
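Concretely, a completed prediction returns a JSON array of string chunks that concatenate into the full completion, e.g. (chunk boundaries illustrative):

["computed as a weighted sum", " of the values, where the weight", " assigned to each value", ...]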

Example Execution Logs
(RayWorkerVllm pid=2043) INFO 05-07 20:41:05 custom_all_reduce.py:240] Registering 6685 cuda graph addresses
(RayWorkerVllm pid=2043) INFO 05-07 20:41:05 model_runner.py:867] Graph capturing finished in 6 secs.
INFO 05-07 20:41:06 async_llm_engine.py:508] Received request 0: prompt: 'An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is', sampling_params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.05, temperature=0.7, top_p=0.8, top_k=50, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=['<｜end▁of▁sentence｜>'], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=256, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True), prompt_token_ids: None, lora_request: None.
INFO 05-07 20:41:06 metrics.py:218] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%
INFO 05-07 20:41:11 metrics.py:218] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.6%, CPU KV cache usage: 0.0%
INFO 05-07 20:41:16 metrics.py:218] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 19.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.9%, CPU KV cache usage: 0.0%
INFO 05-07 20:41:19 async_llm_engine.py:120] Finished request 0.
generation took 13.033s
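As a sanity check, the logged numbers are self-consistent: 256 new tokens in the 13.033 s generation time works out to roughly 19.6 tokens/s, matching the ~19.7-19.8 tokens/s generation throughput reported by the vLLM metrics lines above.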
Version Details
Version ID
0f2469607b150ffd428298a6bb57874f3657ab04fc980f7b5aa8fdad7bd6b46b
Version Created
May 7, 2024
Run on Replicate →