hvision-nku/storydiffusion

77.0K runs | May 2024 | Cog v0.9.6+dev
comic-generation image-consistent-character-generation text-to-image

About

Consistent Self-Attention for Long-Range Image and Video Generation

Example Output

Output

Performance Metrics

43.84s Prediction Time
43.88s Total Time
All Input Parameters
{
  "num_ids": 3,
  "sd_model": "Unstable",
  "num_steps": 25,
  "style_name": "Japanese Anime",
  "comic_style": "Classic Comic Style",
  "image_width": 768,
  "image_height": 768,
  "sa32_setting": 0.5,
  "sa64_setting": 0.5,
  "output_format": "webp",
  "guidance_scale": 5,
  "output_quality": 80,
  "negative_prompt": "bad anatomy, bad hands, missing fingers, extra fingers, three hands, three legs, bad arms, missing legs, missing arms, poorly drawn face, bad face, fused face, cloned face, three crus, fused feet, fused thigh, extra crus, ugly fingers, horn, cartoon, cg, 3d, unreal, animate, amputation, disconnected limbs",
  "comic_description": "at home, read new paper #at home, The newspaper says there is a treasure house in the forest.\non the road, near the forest\n[NC] The car on the road, near the forest #He drives to the forest in search of treasure.\n[NC]A tiger appeared in the forest, at night \nvery frightened, open mouth, in the forest, at night\nrunning very fast, in the forest, at night\n[NC] A house in the forest, at night #Suddenly, he discovers the treasure house!\nin the house filled with  treasure, laughing, at night #He is overjoyed inside the house.",
  "style_strength_ratio": 20,
  "character_description": "a man, wearing black suit"
}
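Before submitting a payload like the one above, it can help to check it against the ranges listed under Input Parameters. The following is a minimal sketch, not part of the model's code: `RANGES` and `check_inputs` are hypothetical names, the bounds are assumed to be inclusive, and counting only non-blank lines as prompts is an assumption as well.

```python
# Ranges copied from the Input Parameters list; treating the bounds as
# inclusive is an assumption.
RANGES = {
    "num_steps": (20, 50),
    "sa32_setting": (0.0, 1.0),
    "sa64_setting": (0.0, 1.0),
    "guidance_scale": (0.1, 10.0),
    "output_quality": (0, 100),
    "style_strength_ratio": (15, 50),
}


def check_inputs(payload):
    """Return a list of problems found in payload; an empty list means none."""
    problems = []
    for name, (low, high) in RANGES.items():
        if name in payload and not (low <= payload[name] <= high):
            problems.append(f"{name}={payload[name]} is outside {low} - {high}")

    # num_ids should not exceed the number of line-separated prompts.
    # Counting only non-blank lines is an assumption of this sketch.
    description = payload.get("comic_description", "")
    prompts = [line for line in description.split("\n") if line.strip()]
    num_ids = payload.get("num_ids", 3)  # 3 is the documented default
    if num_ids > len(prompts):
        problems.append(f"num_ids={num_ids} exceeds {len(prompts)} prompts")
    return problems


example = {
    "num_ids": 3,
    "num_steps": 25,
    "sa32_setting": 0.5,
    "sa64_setting": 0.5,
    "guidance_scale": 5,
    "output_quality": 80,
    "style_strength_ratio": 20,
    "comic_description": "at home, read new paper\non the road, near the forest\n"
                         "[NC] The car on the road, near the forest",
}
print(check_inputs(example))
```

Returning a list of messages rather than raising on the first problem makes it possible to report every out-of-range value in one pass.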
Input Parameters
seed Type: integer
Random seed. Leave blank to randomize the seed
num_ids Type: integer, Default: 3
Number of ID images among the total images. This should not exceed the total number of line-separated prompts
sd_model Default: Unstable
Choose a model
num_steps Type: integer, Default: 25, Range: 20 - 50
Number of sampling steps
ref_image Type: string
Reference image for the character
style_name Default: Japanese Anime
Style template
comic_style Default: Classic Comic Style
Select the comic style for the combined comic
image_width Default: 768
Width of output image
image_height Default: 768
Height of output image
sa32_setting Type: number, Default: 0.5, Range: 0 - 1
The degree of Paired Attention at 32 x 32 self-attention layers
sa64_setting Type: number, Default: 0.5, Range: 0 - 1
The degree of Paired Attention at 64 x 64 self-attention layers
output_format Default: webp
Format of the output images
guidance_scale Type: number, Default: 5, Range: 0.1 - 10
Scale for classifier-free guidance
output_quality Type: integer, Default: 80, Range: 0 - 100
Quality of the output images, from 0 to 100. 100 is best quality, 0 is lowest quality
negative_prompt Type: string, Default: bad anatomy, bad hands, missing fingers, extra fingers, three hands, three legs, bad arms, missing legs, missing arms, poorly drawn face, bad face, fused face, cloned face, three crus, fused feet, fused thigh, extra crus, ugly fingers, horn, cartoon, cg, 3d, unreal, animate, amputation, disconnected limbs
Describe things you do not want to see in the output
comic_description Type: string, Default (one frame per line):
at home, read new paper #at home, The newspaper says there is a treasure house in the forest.
on the road, near the forest
[NC] The car on the road, near the forest #He drives to the forest in search of treasure.
[NC]A tiger appeared in the forest, at night
very frightened, open mouth, in the forest, at night
running very fast, in the forest, at night
[NC] A house in the forest, at night #Suddenly, he discovers the treasure house!
in the house filled with treasure, laughing, at night #He is overjoyed inside the house.
Comic description, one frame per line. Only the first 10 prompts are used, to keep the demo fast. When ref_image is not used, comic_description supports the following:
(1) Typesetting style and captions. By default, the prompt is used as the caption for each image. To set a different caption, append '#' followed by the caption text to the line; only the part after the '#' is added to the image as a caption.
(2) Scenes without characters. The [NC] flag indicates that no character should appear in the generated scene image. To use it, prepend '[NC]' to the line.
style_strength_ratio Type: integer, Default: 20, Range: 15 - 50
Style strength of Ref Image (%), only used if ref_image is provided
character_description Type: string, Default: a man, wearing black suit
General description of the character. If ref_image above is provided, make sure the class word you want to customize is followed by the trigger word 'img', such as: 'man img' or 'woman img' or 'girl img'
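The comic_description rules can be illustrated with a small parser. This is a sketch under stated assumptions rather than the model's own code: `parse_comic_description` is a hypothetical helper, the comma-joined prompt format is inferred from the example execution logs, and using the bare scene text as the default caption is a guess about what "the prompt" means there.

```python
def parse_comic_description(comic_description, character_description, max_frames=10):
    """Split a comic description into per-frame prompts and captions.

    Hypothetical helper: the joining format ("<character>,<scene>") is
    inferred from the example logs, not taken from the model source.
    """
    frames = []
    # One frame per line; skipping blank lines is an assumption.
    lines = [line for line in comic_description.split("\n") if line.strip()]
    for line in lines[:max_frames]:  # only the first 10 prompts are used
        # Everything after the first '#' is the caption, if present.
        text, sep, caption = line.partition("#")
        no_character = text.lstrip().startswith("[NC]")
        if no_character:
            # [NC] frames are generated without the character description.
            scene = text.replace("[NC]", "", 1).strip()
            prompt = scene
        else:
            scene = text.strip()
            prompt = f"{character_description},{scene}"
        frames.append({
            "prompt": prompt,
            # Default caption: the scene text (assumption of this sketch).
            "caption": caption.strip() if sep else scene,
            "has_character": not no_character,
        })
    return frames


frames = parse_comic_description(
    "at home, read new paper #at home, The newspaper says there is a "
    "treasure house in the forest.\n"
    "[NC]A tiger appeared in the forest, at night \n"
    "running very fast, in the forest, at night",
    "a man, wearing black suit",
)
for frame in frames:
    print(frame["prompt"], "|", frame["caption"])
```

For these three lines, the resulting prompts match the corresponding entries in the third list printed in the example execution logs.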
Output Schema
Example Execution Logs
['at home, read new paper #at home, The newspaper says there is a treasure house in the forest.', 'on the road, near the forest', '[NC] The car on the road, near the forest #He drives to the forest in search of treasure.', '[NC]A tiger appeared in the forest, at night ', 'very frightened, open mouth, in the forest, at night', 'running very fast, in the forest, at night', '[NC] A house in the forest, at night #Suddenly, he discovers the treasure house!', 'in the house filled with  treasure, laughing, at night #He is overjoyed inside the house.']
['a man, wearing black suit,at home, read new paper #at home, The newspaper says there is a treasure house in the forest.', 'a man, wearing black suit,on the road, near the forest', ' The car on the road, near the forest #He drives to the forest in search of treasure.', 'A tiger appeared in the forest, at night ', 'a man, wearing black suit,very frightened, open mouth, in the forest, at night', 'a man, wearing black suit,running very fast, in the forest, at night', ' A house in the forest, at night #Suddenly, he discovers the treasure house!', 'a man, wearing black suit,in the house filled with  treasure, laughing, at night #He is overjoyed inside the house.']
['a man, wearing black suit,at home, read new paper', 'a man, wearing black suit,on the road, near the forest', 'The car on the road, near the forest', 'A tiger appeared in the forest, at night', 'a man, wearing black suit,very frightened, open mouth, in the forest, at night', 'a man, wearing black suit,running very fast, in the forest, at night', 'A house in the forest, at night', 'a man, wearing black suit,in the house filled with  treasure, laughing, at night']
Using seed: 58753
Successfully load paired self-attention
Number of the processor : 36
100%|██████████| 25/25 [00:11<00:00,  2.15it/s]
100%|██████████| 25/25 [00:05<00:00,  5.00it/s]
100%|██████████| 25/25 [00:05<00:00,  4.97it/s]
100%|██████████| 25/25 [00:05<00:00,  4.94it/s]
100%|██████████| 25/25 [00:04<00:00,  5.03it/s]
100%|██████████| 25/25 [00:04<00:00,  5.02it/s]
4 [[<PIL.Image.Image image mode=RGB size=788x788 at 0x7F85B45EB850>, <PIL.Image.Image image mode=RGB size=788x788 at 0x7F85B436F5D0>, <PIL.Image.Image image mode=RGB size=788x788 at 0x7F85A5916450>, <PIL.Image.Image image mode=RGB size=788x788 at 0x7F85A5916C10>]]
0 [[<PIL.Image.Image image mode=RGB size=788x788 at 0x7F85B45EB850>, <PIL.Image.Image image mode=RGB size=788x788 at 0x7F85B436F5D0>, <PIL.Image.Image image mode=RGB size=788x788 at 0x7F85A5916450>, <PIL.Image.Image image mode=RGB size=788x788 at 0x7F85A5916C10>], [<PIL.Image.Image image mode=RGB size=788x788 at 0x7F85A7D82950>, <PIL.Image.Image image mode=RGB size=788x788 at 0x7F85A575A6D0>, <PIL.Image.Image image mode=RGB size=788x788 at 0x7F85A7C18790>, <PIL.Image.Image image mode=RGB size=788x788 at 0x7F85A7C18BD0>]]
[[<PIL.Image.Image image mode=RGB size=788x788 at 0x7F85B45EB850>, <PIL.Image.Image image mode=RGB size=788x788 at 0x7F85B436F5D0>, <PIL.Image.Image image mode=RGB size=788x788 at 0x7F85A5916450>, <PIL.Image.Image image mode=RGB size=788x788 at 0x7F85A5916C10>], [<PIL.Image.Image image mode=RGB size=788x788 at 0x7F85A7D82950>, <PIL.Image.Image image mode=RGB size=788x788 at 0x7F85A575A6D0>, <PIL.Image.Image image mode=RGB size=788x788 at 0x7F85A7C18790>, <PIL.Image.Image image mode=RGB size=788x788 at 0x7F85A7C18BD0>]]
1 (7, 650)
0 (124, 721)
1 (56, 636)
0 (89, 717)
Version Details
Version ID
39c85f153f00e4e9328cb3035b94559a8ec66170eb4c0618c07b16528bf20ac2
Version Created
May 4, 2024
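To reproduce a run against this exact version, the version ID can be pinned in the model reference. The sketch below assumes the third-party `replicate` Python client is installed and that a `REPLICATE_API_TOKEN` environment variable is set; `model_ref` and `run_story` are hypothetical helper names, and the shape of the returned output is not documented on this page.

```python
MODEL = "hvision-nku/storydiffusion"
VERSION = "39c85f153f00e4e9328cb3035b94559a8ec66170eb4c0618c07b16528bf20ac2"


def model_ref(model=MODEL, version=VERSION):
    """Build an 'owner/name:version' reference that pins the version."""
    return f"{model}:{version}"


def run_story(inputs):
    """Submit a prediction (requires network access and an API token)."""
    # Imported lazily so the rest of this sketch works without the client.
    import replicate

    # The output format is not documented here, so it is returned unchanged.
    return replicate.run(model_ref(), input=inputs)


print(model_ref())
```

Pinning the version keeps results comparable over time, since a newer version of the model may change defaults or behavior.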