
Evaluating AI User-Story Generators: What UX Designers Should Know

Envisioning New Horizons

A critical look at UX design practice

A column by Silvia Podesta
August 5, 2024

AI-powered user-story generators present a very interesting use case for large language models (LLMs). These applications can assist in the creation of user stories, an essential task in agile development and product management.

I tested a few of them: Stories on Board, Cookup.AI, Easy-Peasy.AI, and IBM Assistant, a tool that IBM Consulting has developed. I assessed the quality of their output in two very different scenarios, as follows:

  • Scenario 1—An online application helps strategists create value-chain diagrams for client companies. A value chain is an artifact that strategists use to visualize activities that are strategically relevant to a business and the links between them. This is a useful strategic-management tool that helps give a fuller picture of an organization’s cost drivers and sources of differentiation.
  • Scenario 2—A mobile app supports and encourages bike usage within a small metropolitan area. I looked at this scenario further by considering a variant: an app that supports and encourages bike usage within the small French city of Montpellier.

I evaluated the AI outputs of these tools from three different angles, as follows:

  1. Their quality in terms of accuracy and relevance to the field of knowledge in question. In the first scenario the context was strategy and value-chain analysis; in the second, urban mobility and, more specifically, urban mobility in France.
  2. Their quality in terms of adherence to the criteria that are fundamental to a good user story, as follows:
    1. Goal centric—A user story must be written from the user’s perspective.
    2. Scoped down—A user story should have a small enough scope to complete the design within a single or, at most, a few design iterations.
    3. Valuable—A user story must encapsulate a promise of value and the enhancement of a user’s life.
    4. Easy to estimate—The level of effort that would be necessary to implement a user story must be calculable.
    5. Independent—Each user story must stand alone and be independent of the other user stories.
  3. A heuristic evaluation of the user interface and other UX-related factors that would impact users’ interactions with a tool.

Next, I’ll present my observations on the potential usefulness of LLMs in user-story generation and highlight some tips that can help UX professionals maximize their results.

Generating User Stories for a Strategy Tool: Challenges of Domain-Specific Knowledge

I deliberately aimed the first scenario at testing generative AI (Gen AI) on a domain that is very challenging for a typical LLM. While most LLMs are trained on large amounts of general data, they may lack the specialized knowledge that is necessary to understand strategy, value-chain analysis, and the needs of strategists. This was quite evident in an initial test run with Cookup.AI, which generated slightly off-target and even redundant outputs, including the following user stories:

  • User Story 1: Create a New Value-Chain Diagram
  • User Story 2: Edit an Existing Value-Chain Diagram
  • User Story 3: Share a Value-Chain Diagram with Clients
  • User Story 4: Collaborate on a Value-Chain Diagram
  • User Story 5: Generate a Report from a Value-Chain Diagram *
  • User Story 6: Access Historical Data from Value-Chain Diagrams

* User story 5 is partially redundant with user story 3.

While these user stories are not inherently wrong, they don’t feel tailored to the specific needs of strategists, who typically create such diagrams to perform a value-chain analysis, which comprises the following essential steps:

  1. Identifying value-chain activities—The strategist identifies and lists all of the primary and secondary activities that go into the creation of a product or service.
  2. Determining the value and cost of these activities—The strategist reflects on the contribution of each of the primary and secondary activities to the overall value of the product, either in terms of increasing customers’ willingness to pay for it or lowering the costs for the firm. They also factor in the cost of each of these activities.
  3. Identifying opportunities for competitive advantage—With a clearer picture of the company’s activities in mind, the strategist can assess ways of improving the firm’s performance—for instance, by lowering costs or achieving better product differentiation.

Let’s see how, by simply tweaking an initial prompt—that is, the instruction to the AI as an input from which to generate an output—we can elicit a more accurate output at the necessary level of specificity.

I changed this prompt: “Create user stories for an application that helps strategists create a value-chain diagram of a client company.”

To this prompt: “Create user stories for an application that helps strategists perform a value-chain analysis of a client company.”
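The difference between the two prompts is a single task phrase. A minimal sketch in Python, purely illustrative because none of these tools exposes a templating API, shows how such a prompt might be parameterized so you can swap the task phrase without rewriting the whole instruction:

```python
def user_story_prompt(task: str) -> str:
    """Build the instruction sent to a user-story generator for a
    given product task. The wording mirrors the prompts used in
    this experiment; the function itself is hypothetical."""
    return (
        "Create user stories for an application that helps "
        f"strategists {task} of a client company."
    )

# The original, tool-centric prompt:
before = user_story_prompt("create a value-chain diagram")

# The tweaked, analysis-centric prompt that produced richer stories:
after = user_story_prompt("perform a value-chain analysis")

print(before)
print(after)
```

Framing the task as the analysis that users perform, rather than the artifact they produce, is what nudged the model toward domain-relevant stories.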

This enabled Cookup.AI to generate fifteen user stories that revealed a deeper knowledge of the subject, including the distinction between primary and secondary activities and the need to calculate costs, identify value drivers, and benchmark a specific firm against competitors in the same industry. Let’s consider some of these user stories.

  • User Story 6: Identify Support Activities: “As a user, I want to identify and categorize my client company’s support activities—such as firm infrastructure, human-resource management, technology development, and procurement—so that I can understand how these functions support the primary activities.”
  • User Story 7: Conduct a Cost/Benefit Analysis: “As a strategist, I want to input and analyze the costs that are associated with each activity so that I can identify cost-saving opportunities and inefficiencies within the client’s value chain.”
  • User Story 9: Conduct Competitor Benchmarking: “As a user, I want to benchmark my client’s value-chain activities against those of industry competitors so that I can provide comparative insights and strategic recommendations for improvement.”

For almost all of the tests that I ran with Cookup.AI, the user stories were written from the perspective of the same user persona.

Easy-Peasy.AI, in contrast, produced a more nuanced breakdown of user stories by taking the needs of different user groups into consideration—for example, administrators and novice strategists. While, in theory, this might seem to be good behavior, in practice it could result in the creation of user personas that are out of scope for the strategist’s app, which would be wasteful. Tweaking the prompt to focus only on the target users would help yield more pertinent user stories, while making resource usage more efficient.

The costs of running an AI are usually linked to the number of tokens being processed. A token is the basic unit of measurement for text processing using an AI and might be equivalent to a word, part of a word, or a punctuation mark. It is, therefore, advisable to narrow down the scope of the user stories by providing a more specific prompt.
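As a back-of-the-envelope illustration, assuming the commonly cited rule of thumb of roughly four characters per token for English text and a placeholder price that is not any provider’s actual rate, a rough cost estimate might look like this:

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token is a
    common rule of thumb for English text. Real counts depend
    on the specific model's tokenizer."""
    return max(1, len(text) // 4)

def estimate_cost(prompt: str, completion: str,
                  usd_per_1k_tokens: float = 0.01) -> float:
    """Illustrative cost estimate. The default rate is a
    placeholder, not any provider's published pricing."""
    tokens = estimate_tokens(prompt) + estimate_tokens(completion)
    return tokens / 1000 * usd_per_1k_tokens
```

A broad prompt that invites stories for out-of-scope personas inflates the completion length, and therefore the token count, which is why narrowing the prompt saves money as well as review time.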

Both Cookup.AI and Easy-Peasy.AI use GPT-4, from OpenAI. When I ran the same prompts on the IBM Assistant, which used the Meta Llama 3 Instruct 70B model, the quality of the answers in terms of relevance to the knowledge field of strategy and value-chain analysis was higher without my needing to tweak the prompt further. In particular, the IBM Assistant was able to generate stories with acceptance criteria and organize stories into epics, providing better readability overall. Let’s consider an example:

Epic 2: Data Integration: User Story 2: “As a strategist, I want to be able to link data to specific activities in the value-chain diagram so that I can analyze and visualize the relationships between them.”

Acceptance Criteria:

  • The application lets users link data to specific activities in the value-chain diagram.
  • The application provides a range of data-visualization options—for example, charts, graphs, and tables—to help users analyze the data.
  • The application updates the value-chain diagram in real time as users link new data.
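A story structured like this one, with an epic, a role, a goal, a benefit, and acceptance criteria, maps naturally onto a small data structure. The following sketch is hypothetical; it is not the schema of any of the tools I tested, just one way a team might capture generated stories for downstream backlog tooling:

```python
from dataclasses import dataclass, field

@dataclass
class UserStory:
    """Minimal representation of a generated user story.
    Field names are illustrative, not any tool's actual schema."""
    epic: str
    role: str
    goal: str
    benefit: str
    acceptance_criteria: list[str] = field(default_factory=list)

    def as_text(self) -> str:
        """Render the story in the canonical as-a/I-want/so-that form."""
        return f"As a {self.role}, I want {self.goal} so that {self.benefit}."

# The IBM Assistant story above, captured in this structure:
story = UserStory(
    epic="Data Integration",
    role="strategist",
    goal="to be able to link data to specific activities in the value-chain diagram",
    benefit="I can analyze and visualize the relationships between them",
    acceptance_criteria=[
        "The application lets users link data to specific activities.",
        "The application provides charts, graphs, and tables for analysis.",
        "The diagram updates in real time as users link new data.",
    ],
)

print(story.as_text())
```

Keeping the acceptance criteria attached to the story, rather than in free-form chat output, is what makes the epic-and-criteria format more readable and easier to feed into a backlog.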

Of all of the tools I examined, only Stories on Board could generate user stories that address a specific strategist’s need: evaluating the competitiveness of a company’s value chain and identifying opportunities for business advantage. However, these stories also felt overly simplistic and didn’t take the strategists’ criteria for analysis into consideration. Figure 1 shows a user story that Stories on Board generated to address the strategists’ need to identify areas of improvement in the value chain, although it’s stated in terms that are too generic.

Figure 1—Overly generic user story generated by Stories on Board

A much more useful user story would have been: “As a strategist, I need to identify opportunities for achieving cost leadership and product differentiation, starting with the current value chain.”

Or: “As a strategist, I need to identify parts of the value chain that don’t create significant value and could be outsourced or eliminated.”

By and large, this experiment suggested that the user stories the AI models generated for the business-strategy use case would require additional input and validation by a subject-matter expert (SME) to make them relevant to their intended users.

Generating User Stories for a Cyclist App

For the cyclist-app scenario, all of the user-story generators produced more encouraging results, demonstrating strong relevance and alignment with the topic. The user stories effectively captured a wide array of user needs that are inherently linked to urban cycling. In the variant of Scenario 2, I asked the AI generators to create user stories for a bike app that is tailored to the picturesque city of Montpellier.

However, only the IBM Assistant and Stories on Board made occasional contextual references to this French city, enhancing the specificity and relevance of the user stories according to the unique characteristics of the use case. For instance, consider the following example:

“As a cyclist, I want to be able to find and reserve a bike from a bike-sharing system—for example, Vélomagg, the official bike-sharing system in Montpellier.”

However, none of these tools was capable of considering the distinctive features of the city, such as its proximity to the sea, the internationally renowned Camargue nature reserve, or the steep inclines of the city center, which would have enabled them to further tailor these stories to users’ needs.

Nevertheless, prompt engineering emerged as an effective solution. When I specifically asked the applications to “be specific to the city of Montpellier,” the models generated user stories that included references to the city’s most popular landmarks and the likely needs of both tourists and residents seeking convenient routes that connect key locations.

Evaluation of User Stories in Relation to UX Criteria

The applications that I evaluated in this test generated user stories that were notably clear, concise, and centered on the user’s perspective. While some applications automatically produced acceptance criteria, others required specific prompts to do so.

For instance, when using Stories on Board, users must provide detailed information about the intended product, its purpose, and key features for the AI to generate relevant acceptance criteria. However, including explicit criteria for what constitutes a complete user story could help eliminate ambiguity. Incorporating these criteria as a default feature would significantly enhance the usability of these tools, particularly for novice users.

The lack of prioritization is a significant shortcoming of the outputs of these tools—for both user stories and acceptance criteria. In agile development, effective prioritization is essential because it enables teams to address the most critical user stories in the backlog first, leading to more efficient resource allocation throughout a project. The absence of a prioritization feature suggests that simple tools built on large language models may not possess sufficient contextual information to provide effective guidance to teams.

Finally, the user stories were easy to estimate, albeit not always valuable from a business perspective.

Considerations for the User Interfaces of User-Story Generators

Cookup.AI, Easy-Peasy.AI, and the IBM Assistant are conversational tools that deliver their outputs in a chatbot-like format. This user-interface style is becoming increasingly common across Web-search experiences, as users are growing accustomed to posing questions in complete sentences rather than relying on fragmented keywords.

While this approach offers an immediate and easy-to-use way of obtaining information, it doesn’t provide a comprehensive understanding of the future solution. Specifically, it is challenging to identify gaps and areas in product development that might not have been adequately addressed.

In contrast, Stories on Board, which combines characteristics of a generative user-stories tool with those of task-management software, displays stories as cards in a visual story map, helping teams organize and prioritize them effectively. This layout promotes enhanced clarity and fosters more productive discussions and better-informed decision-making. Figure 2 shows Stories on Board’s user interface, which presents its outputs visually, letting users match user stories with relevant areas of product development.

Figure 2—Stories on Board’s visual user interface

Figure 3 shows the conversational experience of Cookup.AI. While the conversational user interface feels quick and immediate, it comes at the expense of giving the user a system-wide grasp of the product’s features.

Figure 3—Cookup.AI’s conversational experience

Conclusion

Wrapping up, this short experiment supports the argument that, while Gen-AI applications can provide a useful starting point for writing user stories, they also lack the specific contextual knowledge that would enable them to produce highly relevant outputs. This seems to be particularly true for large, generalized AI models.

If businesses were to turn to more compact, specialized systems, the so-called small language models (SLMs), it would be possible to generate domain-specific user stories for applications similar to those that I’ve evaluated here. Such systems would take advantage of a narrower, higher-quality knowledge base. Whether companies would be willing to invest in specialized user-story generators with a domain scope as limited as finance or strategy remains to be seen. Nevertheless, tools running on smaller models offer an additional benefit: they’re less risky because they’re less prone to hallucinations than their larger counterparts.

For now, while the development of a state-of-the-art user-story generator warrants further consideration, we should embrace these tools with a healthy dose of skepticism to avoid an overreliance on AI results. UX professionals should always check the outputs of these tools to identify gaps that the AI does not cover.

While conversational user interfaces are becoming increasingly popular for content-generation tasks, they make it harder to grasp the entirety of the users’ needs at each stage of interaction with the future product or service. Visual story-mapping features and other task-management techniques would enhance the user experience of AI-based user-story generators, making it easier for their users to organize stories according to epics or product-development cycles.

References

Michael Porter. Competitive Advantage. New York: Simon & Schuster, 1985.

Armand Ruiz. “The Emergence of Small Language Models.” Nocode.ai, November 26, 2023. Retrieved August 1, 2024.

Tim Stobierski. “What Is a Value Chain Analysis? 3 Steps.” Harvard Business School Online, December 3, 2020. Retrieved August 1, 2024.

Innovation Designer at IBM

Copenhagen, Denmark

Silvia Podesta

As a strategic designer and UX specialist at IBM, Silvia helps enterprises pursue human-centered innovation by leveraging new technologies and creating compelling user experiences. She facilitates research, synthesizes product insights, and designs minimum-viable products (MVPs) that capture the potential of IBM’s technologies in addressing both user and business needs. Silvia is a passionate, independent UX researcher who focuses on digital humanism, change management, and service design.
