Crafting Personalized Stories with AI: A Look into the Multimodal Storybook Maker

Richard Shu
7 min readMar 13, 2024


In this article we’ll explore how a new tool from aiTransformer, the Storybook Maker, combines large language model (LLM), image generation, and text-to-speech (TTS) technology for automated video storybook creation.

The Storybook Maker Tool
The Storybook Maker Tool, Image by author

The Concept of Storybook Maker

The art of storytelling has evolved throughout history, from oral traditions to written narratives, and now into the digital age. The Storybook Maker is a tool seen as the next step in this progression. It was developed with the aim of automating the creation of personalized stories, incorporating multimodal learning to enhance the storytelling experience.

The concept originated from a recognition that traditional story creation methods were time-consuming and demanded substantial expertise. Storybook Maker seeks to democratize the storytelling process by making it accessible to individuals of all backgrounds and skill levels, even with zero skill in writing/drawing. Storybook Maker takes the experience a step further by incorporating text-to-speech and lip-syncing technology to create personalized presentations of each story and result in a video storybook. This feature allows users to listen to or watch their stories being told aloud, enhancing accessibility and catering to various learning styles or preferences.

Story Generation with LLMs

The story generation is centered around a robust language model capable of converting basic inputs such as prompts, URLs, or documents into compelling narratives. This module utilizes prompt engineering to establish predefined story genres, offering users versatility while prioritizing ease of use.

Story Generation Method Selection
Story Generation Method Selection, Image by author
Story Genre Selection
Story Genre Selection, Image by author

The language model is trained on vast amounts of data, allowing it to understand context and generate coherent narratives that cater to various themes and story lines. Unless instructed to use custom story (the “Generate voice with supplied text” method), the model doesn’t simply regurgitate existing stories; instead, it creates new ones based on user inputs. For example, the following is part of a story generated using the URL of the Mario Day wiki page with story genre “adventure”:

Once upon a time, in the bustling world of Mushroom Kingdom, March 10 was known as Mario Day. This special day, named after the iconic plumber himself, was celebrated by fans worldwide with great enthusiasm. Mario, the beloved hero, had saved Princess Peach countless times from the clutches of Bowser and his minions.

Storybook Generated Using The Mario Day URL, Video by author

As another example, the following is part of a story generated using a simple prompt “E.T.” with story genre “science fiction”:

In the quiet of the cosmos, a strange signal pierced through the silence. An ET message, it read: “We come in peace, bearing advanced technology. Join us or perish.” As our scientists deciphered the code, a spaceship materialized. The fate of humanity hung in the balance.

Storybook Generated Using Prompt “E.T.”, Video by author

Image Generation with Stable Diffusion

Storybook Maker’s image generation capability is powered by the latest open source text-to-image model Stable Diffusion, which uses deep learning techniques to analyze data and generate high-quality images based on user inputs. It’s capable of creating various types of images, including characters, landscapes, objects, and more. This flexibility allows users to create truly unique storybooks that reflect their creativity and imagination.

To easily create pictures in certain style, for example, the impressionist art style, an anime or game theme, prompt engineering is used to create over 100 predefined picture styles. Following pictures show some samples using the prompt “Santa is coming to town” with different styles.

Sample generated pictures using the prompt “Santa is coming to town” with different styles

Final Assembly — Video Creation

After pictures are generated in alignment with the storylines, Storybook Maker’s speech synthesizer technology animates the story by employing a personalized storyteller. This narrator verbalizes each tale while presenting the associated pictures, creating a cohesive multimedia experience for users.

Storybooks Generated Using Various Prompts and Picture Styles, Video by author

The speech synthesizer employs advanced TTS technology to convert text into natural-sounding speech, and uses lip-syncing algorithm to synchronize the lips of a picture of the storyteller to speak the text, ensuring an engaging and immersive storytelling experience. Users can choose their preferred voice or picture for the storyteller, even clone their own voice and use their own picture to tell the story. This level of customization adds a personal touch to each storybook created using Storybook Maker.

Storyteller Properties
Storyteller Properties, Image by author

User-Friendly Features and Customization

Storybook Maker offers simplicity with predefined story genres and picture styles, making it accessible to beginners who can easily produce a wide variety of storybooks. The tool also offers a range of templates that cater to different storytelling genres, making it easy for users to get started without having to worry about the technical aspects of creating their own stories from scratch.

Customizing Story and Accompanying Pictures

Advanced users have the option to utilize custom prompts for greater control over their stories’ narratives and pictures. When selecting the story generation method “Generate voice with supplied text and pictures with custom prompts”, user can enter custom story lines to generate voice and prompts enclosed with [] to generate pictures.

Custom Generation Method
Custom Generation Method, Image by author

A picture will be generated per sentence, by default the sentence will be used as the prompt, but if a custom prompt (enclosed with []) is provided, it’ll be used to generate the picture. Prompt within [] will be used for generating the picture only, won’t be in the story telling — this allows more control to the accompanying picture without affecting the story itself.

When generating a story using non-custom method with prompt “Albert Einstein”, pronouns can be used all over the place, for example,

Amidst the clutter of Princeton’s library, Einstein pondered relativity. His brow furrowed beneath wild hair as he scribbled equations on a worn notebook. A librarian sneezed, startling him. He looked up, smiled, and returned to his thoughts, oblivious to the world around him.

Storybook Generated Using Non-Custom Method with Prompt “Albert Einstein”, Video by author

Since a picture is generated per sentence, for the sentence using pronoun the subject is unknown to the image generator so it can generate any person, but if you include a custom prompt mentioning Einstein the correct person may be generated. For example, use custom method with the following prompt you’re more likely to get coherent pictures for the story.

Amidst the clutter of Princeton’s library, Einstein pondered relativity. His brow furrowed beneath wild hair as he scribbled equations on a worn notebook[Einstein’s brow furrowed beneath wild hair as he scribbled equations on a worn notebook]. A librarian sneezed, startling him. He looked up, smiled, and returned to his thoughts, oblivious to the world around him[Einstein looked up and smiled].

Storybook Generated Using Custom Method for “Albert Einstein” Story, Video by author

Another good use of the custom method is to generate storybooks of different picture styles for the same story. When using a non-custom method, the LLM can generate a different story every time even with the same input. So if you’d like to see different picture styles for the same story, you can download a generated story or use your own one and enter it using the custom method, then select different picture styles to generate.

Customizing Picture Style

The picture style can also be customized. The tool provides over 100 predefined picture styles with sample images, but there exists so many different styles and it’s impossible to include them all in the list, for instance, styles of specific artists are not included because numerous artists are recognizable by the image generator. To use a custom picture style, select the custom style option and enter the style description following a simple template: Just put the prompt to be stylized with a placeholder [prompt], e.g. magnificent [prompt] by Greg Rutkowski. If no placeholder is present, the style text will be appended to the prompt.

Custom Picture Style
Custom Picture Style, Image by author

Potential Applications

While Storybook Maker is still a new tool, its potential applications are vast and varied. In education, it can be used as an engaging way for students to learn complex concepts by creating their own storybooks based on the material. Content creators can use it to produce immersive multimedia stories that captivate audiences and stand out from traditional text-based content. Furthermore, Storybook Maker has the potential to revolutionize personal storytelling by making it more accessible and engaging than ever before. The possibilities are endless, and as users continue to explore the tool’s capabilities, new applications will undoubtedly emerge. You can find some sample videos made with this tool in aiTransformer’s YouTube channel, including a country series based on the country wiki pages and a fairy tale series based on popular fairy tale stories.


The Storybook Maker is an innovative tool that combines advanced language models, image generation, and text-to-speech technology for automated storybook creation. Its user-friendly interface, customization options, and powerful features make it accessible to both beginners and advanced users.

As we’ve explored in this article, Storybook Maker offers a unique approach to storytelling that goes beyond traditional methods by adding visuals and narration to create immersive multimedia experiences. We encourage you to explore the tool for yourself and unleash your creativity in storytelling. The future of multimodal learning holds endless possibilities, and Storybook Maker is at the forefront of this exciting new frontier!



Richard Shu

Full-stack software developer & AI practitioner with 20+ years experience in design, implementation and maintenance of commercial software systems.