kcaverly/nous-capybara-34b-gguf 📝🔢 → 📝

▶️ 1.9K runs 📅 Jan 2024 ⚙️ Cog 0.8.6 🔗 GitHub 📄 Paper
code-generation text-generation

About

A state-of-the-art Nous Research fine-tune of the 200K-context Yi-34B model, trained on the Capybara dataset.

Example Output

Prompt:

"Why does a GPU process matrix multiplication faster than a CPU?"

Output

  1. Parallel Processing: GPUs are designed specifically for parallel processing, which makes them ideal for tasks like matrix multiplication that can be split into smaller, independent parts. On the other hand, CPUs are optimized for sequential processing and handling complex calculations.

  2. Specialized Hardware: GPUs have specialized hardware called streaming multiprocessors that perform mathematical operations much faster than a CPU's cores. These streaming multiprocessors work together to process large amounts of data simultaneously, leading to significant performance improvements in tasks like matrix multiplication.

  3. Memory Bandwidth: GPUs have higher memory bandwidth compared to CPUs, which allows them to load and manipulate larger datasets more efficiently. This is crucial for matrix multiplication as the operation requires a significant amount of memory to store intermediate results.

  4. Compute Unified Device Architecture (CUDA) : NVIDIA's CUDA technology enables developers to write programs that can take advantage of GPUs' parallel processing capabilities. This allows CPU-intensive tasks like matrix multiplication to be offloaded to the GPU, resulting in faster computation times.

  5. Task Switching: GPUs excel at performing the same task repeatedly and can switch between tasks much faster than a CPU. When performing matrix multiplication, this means that the GPU can work on multiple matrices simultaneously without wasting time on context switching.
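The independence the model highlights in point 1 can be sketched in plain Python: each output row of a matrix product depends only on one row of the left matrix and all of the right matrix, so rows can be computed in any order, or in parallel (this is an illustrative sketch, not how a GPU kernel is actually written):

```python
from concurrent.futures import ThreadPoolExecutor

def matmul_row(a_row, b):
    # One output row: dot product of a_row with every column of b.
    cols = len(b[0])
    return [sum(a_row[k] * b[k][j] for k in range(len(b))) for j in range(cols)]

def parallel_matmul(a, b):
    # Rows are independent, so they can be dispatched to separate workers,
    # loosely mirroring how a GPU assigns independent output tiles to cores.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda row: matmul_row(row, b), a))

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(parallel_matmul(a, b))  # [[19, 22], [43, 50]]
```

Real GPU kernels exploit the same independence at a much finer grain (per output element or tile), with thousands of threads running simultaneously.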

Performance Metrics

18.63s Prediction Time
254.11s Total Time
All Input Parameters
{
  "prompt": "Why does a GPU process matrix multiplication faster than a CPU?",
  "temperature": 0.5,
  "system_prompt": "You are 'Nous-Capybara', an AI assistant and your purpose and drive is to assist the user with any request they have.",
  "max_new_tokens": -1,
  "repeat_penalty": 1.1,
  "prompt_template": "SYSTEM: {system_prompt}\nUSER: {prompt}\nASSISTANT: "
}
Input Parameters
prompt (required) Type: string
Instruction for model
temperature Type: number Default: 0.5
Controls the randomness of sampling. Higher values make the model more likely to pick lower-probability tokens, producing more diverse and creative output; lower values make output more focused and deterministic, staying closer to the most likely continuation.
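A minimal sketch of what temperature does during sampling (hypothetical logit values; the actual sampler lives inside llama.cpp): logits are divided by the temperature before the softmax, so low values sharpen the distribution toward the top token and high values flatten it:

```python
import math

def softmax_with_temperature(logits, temperature):
    # Dividing logits by the temperature before softmax sharpens (T < 1)
    # or flattens (T > 1) the resulting probability distribution.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical next-token logits
print(softmax_with_temperature(logits, 0.5))  # peaked: top token dominates
print(softmax_with_temperature(logits, 1.5))  # flatter: more diverse sampling
```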
system_prompt Type: string Default: You are 'Nous-Capybara', an AI assistant and your purpose and drive is to assist the user with any request they have.
System prompt for the model; helps guide model behaviour.
max_new_tokens Type: integer Default: -1
Maximum number of new tokens to generate (-1 for no limit).
repeat_penalty Type: number Default: 1.1
Penalizes tokens that have recently appeared in the output, discouraging repetition. Values above 1.0 reduce repetition (1.0 disables the penalty); higher values yield more varied output but can degrade coherence if set too high.
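A sketch of the llama.cpp-style repetition penalty (simplified; the real implementation also maintains a sliding window of recent tokens): logits of already-generated tokens are pushed toward "less likely" by dividing positive logits by the penalty and multiplying negative ones:

```python
def apply_repeat_penalty(logits, seen_token_ids, penalty):
    # For each already-generated token, make it less likely to be picked again:
    # positive logits are divided by the penalty, negative ones multiplied.
    out = list(logits)
    for t in seen_token_ids:
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out

logits = [3.0, -1.0, 0.5]  # hypothetical 3-token vocabulary
print(apply_repeat_penalty(logits, {0, 1}, 1.1))  # token 2 is untouched
```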
prompt_template Type: string Default: SYSTEM: {system_prompt} USER: {prompt} ASSISTANT:
Template to pass to model. Override if you are providing multi-turn instructions.
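The template is expanded by substituting `{system_prompt}` and `{prompt}` before the text reaches the model; a minimal sketch of that substitution using Python's `str.format` (the values below are the defaults from this page):

```python
template = "SYSTEM: {system_prompt}\nUSER: {prompt}\nASSISTANT: "

full_prompt = template.format(
    system_prompt="You are 'Nous-Capybara', an AI assistant and your purpose "
                  "and drive is to assist the user with any request they have.",
    prompt="Why does a GPU process matrix multiplication faster than a CPU?",
)
print(full_prompt)
```

For multi-turn conversations, override the default with a template that interleaves the prior USER/ASSISTANT turns in the same format.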
Output Schema

Output

Type: array (items: string)

Example Execution Logs
llama_print_timings:        load time =     773.84 ms
llama_print_timings:      sample time =      83.54 ms /   266 runs   (    0.31 ms per token,  3184.18 tokens per second)
llama_print_timings: prompt eval time =     770.88 ms /    56 tokens (   13.77 ms per token,    72.64 tokens per second)
llama_print_timings:        eval time =   16066.31 ms /   265 runs   (   60.63 ms per token,    16.49 tokens per second)
llama_print_timings:       total time =   18610.34 ms
Version Details
Version ID
6b9b7741e719899f26571567b892a8900d0b517bfadee3997f5d477897d10eef
Version Created
January 9, 2024
Run on Replicate →