
# Testing Superpowers Skills

Summary

This guide documents procedures for testing Superpowers skills, with a primary focus on integration testing for complex skills that involve subagents, multi-step workflows, and cross-agent interactions. It covers test structure, execution steps, validation criteria, token usage analysis, troubleshooting, and best practices for authoring new integration tests.

Overview

Testing skills that involve subagents, workflows, and complex interactions requires running actual Claude Code sessions in headless mode and verifying their behavior through session transcripts.

Test Structure

Superpowers skill tests follow this directory structure:

tests/
├── claude-code/
│   ├── test-helpers.sh                    # Shared test utilities
│   ├── test-subagent-driven-development-integration.sh
│   ├── analyze-token-usage.py             # Token analysis tool
│   └── run-skill-tests.sh                 # Test runner (if exists)

Running Tests

Integration Tests

Integration tests execute real Claude Code sessions with actual skills:

# Run the subagent-driven-development integration test
cd tests/claude-code
./test-subagent-driven-development-integration.sh

Note: Integration tests can take 10-30 minutes as they execute real implementation plans with multiple subagents.

Requirements

Must run from the superpowers plugin directory (not from temp directories)
Claude Code must be installed and available on your PATH as the `claude` command
Local dev marketplace must be enabled: "superpowers@superpowers-dev": true in ~/.claude/settings.json
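The last requirement can be checked before starting a long test run. The sketch below assumes the flag lives under an `enabledPlugins` object in `~/.claude/settings.json`, as the troubleshooting section of this guide indicates; the helper names `load_settings` and `plugin_enabled` are hypothetical and not part of the test suite.

```python
import json
from pathlib import Path

PLUGIN_KEY = "superpowers@superpowers-dev"


def load_settings(path: Path) -> dict:
    """Read a settings file, returning {} if it is missing or not a JSON object."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return {}
    return data if isinstance(data, dict) else {}


def plugin_enabled(settings: dict, plugin: str = PLUGIN_KEY) -> bool:
    """True only when enabledPlugins maps the plugin name to the literal true."""
    enabled = settings.get("enabledPlugins", {})
    return isinstance(enabled, dict) and enabled.get(plugin) is True


# Assumed location of the user-level settings file, taken from this guide.
settings = load_settings(Path.home() / ".claude" / "settings.json")
print("dev marketplace enabled" if plugin_enabled(settings) else "dev marketplace NOT enabled")
```

Keeping the check on a parsed dictionary rather than on the file makes it easy to exercise without touching a real settings file.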

Integration Test: `subagent-driven-development`

What It Tests

The integration test verifies that the `subagent-driven-development` skill handles each of the following correctly:

Plan Loading: Reads the plan once at the beginning
Full Task Text: Provides complete task descriptions to subagents (doesn’t make them read files)
Self-Review: Ensures subagents perform self-review before reporting
Review Order: Runs spec compliance review before code quality review
Review Loops: Uses review loops when issues are found
Independent Verification: Spec reviewer reads code independently, doesn’t trust implementer reports

How It Works

Setup: Creates a temporary Node.js project with a minimal implementation plan
Execution: Runs Claude Code in headless mode with the skill
Verification: Parses the session transcript (.jsonl file) to verify:
- Skill tool was invoked
- Subagents were dispatched (Task tool)
- TodoWrite was used for tracking
- Implementation files were created
- Tests pass
- Git commits show proper workflow
Token Analysis: Shows token usage breakdown by subagent
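The verification step can be sketched as a small transcript scan. This is a minimal illustration, not the logic in the actual test script: it assumes each `.jsonl` line is a JSON object whose `message.content` is a list of blocks, and that tool calls appear as blocks with `"type": "tool_use"` and a `name` field. The sample lines are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def count_tool_uses(lines: Iterable[str]) -> Counter:
    """Count tool_use blocks by tool name across transcript lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or corrupt lines
        message = entry.get("message") if isinstance(entry, dict) else None
        content = message.get("content") if isinstance(message, dict) else None
        if not isinstance(content, list):
            continue  # plain-text messages carry no tool calls
        for block in content:
            if isinstance(block, dict) and block.get("type") == "tool_use":
                counts[block.get("name", "unknown")] += 1
    return counts


def tool_block(name: str) -> str:
    """Build one hypothetical assistant transcript line calling `name`."""
    return json.dumps({"type": "assistant", "message": {
        "content": [{"type": "tool_use", "name": name, "input": {}}]}})


# Hypothetical transcript: one Skill call, two Task dispatches, one TodoWrite.
sample = [tool_block("Skill"), tool_block("Task"), tool_block("Task"),
          tool_block("TodoWrite"),
          json.dumps({"type": "user", "message": {"content": "hello"}})]

counts = count_tool_uses(sample)
for tool in ("Skill", "Task", "TodoWrite"):
    status = "PASS" if counts[tool] > 0 else "FAIL"
    print(f"[{status}] {tool} used {counts[tool]} time(s)")
```

To scan a real session, pass an open file handle instead of the sample list, since iterating a file yields one line at a time.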

Test Output

========================================
 Integration Test: subagent-driven-development
========================================

Test project: /tmp/tmp.xyz123

=== Verification Tests ===

Test 1: Skill tool invoked...
  [PASS] subagent-driven-development skill was invoked

Test 2: Subagents dispatched...
  [PASS] 7 subagents dispatched

Test 3: Task tracking...
  [PASS] TodoWrite used 5 time(s)

Test 6: Implementation verification...
  [PASS] src/math.js created
  [PASS] add function exists
  [PASS] multiply function exists
  [PASS] test/math.test.js created
  [PASS] Tests pass

Test 7: Git commit history...
  [PASS] Multiple commits created (3 total)

Test 8: No extra features added...
  [PASS] No extra features added

=========================================
 Token Usage Analysis
=========================================

Usage Breakdown:
----------------------------------------------------------------------------------------------------
Agent           Description                          Msgs      Input     Output      Cache     Cost
----------------------------------------------------------------------------------------------------
main            Main session (coordinator)             34         27      3,996  1,213,703 $   4.09
3380c209        implementing Task 1: Create Add Function     1          2        787     24,989 $   0.09
34b00fde        implementing Task 2: Create Multiply Function     1          4        644     25,114 $   0.09
3801a732        reviewing whether an implementation matches...   1          5        703     25,742 $   0.09
4c142934        doing a final code review...                    1          6        854     25,319 $   0.09
5f017a42        a code reviewer. Review Task 2...               1          6        504     22,949 $   0.08
a6b7fbe4        a code reviewer. Review Task 1...               1          6        515     22,534 $   0.08
f15837c0        reviewing whether an implementation matches...   1          6        416     22,485 $   0.07
----------------------------------------------------------------------------------------------------

TOTALS:
  Total messages:         41
  Input tokens:           62
  Output tokens:          8,419
  Cache creation tokens:  132,742
  Cache read tokens:      1,382,835

  Total input (incl cache): 1,515,639
  Total tokens:             1,524,058

  Estimated cost: $4.67
  (at $3/$15 per M tokens for input/output)

========================================
 Test Summary
========================================

STATUS: PASSED

Token Analysis Tool

Usage

Analyze token usage from any Claude Code session:

python3 tests/claude-code/analyze-token-usage.py ~/.claude/projects/<project-dir>/<session-id>.jsonl

Finding Session Files

Session transcripts are stored in ~/.claude/projects/ with the working directory path encoded:

# Example for /Users/jesse/Documents/GitHub/superpowers/superpowers
SESSION_DIR="$HOME/.claude/projects/-Users-jesse-Documents-GitHub-superpowers-superpowers"
 
# Find recent sessions
ls -lt "$SESSION_DIR"/*.jsonl | head -5
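The path encoding in the example above can be reproduced with a one-line transformation. The sketch assumes the only rule is that each `/` becomes `-`, which is all the example shows; other characters may be handled differently in practice. The function name `session_dir_for` is hypothetical.

```python
from pathlib import Path
from typing import Optional


def session_dir_for(workdir: str, projects_root: Optional[Path] = None) -> Path:
    """Map a working directory to its assumed transcript directory."""
    root = projects_root if projects_root is not None else Path.home() / ".claude" / "projects"
    # Assumption: only path separators are rewritten.
    return root / workdir.replace("/", "-")


example = session_dir_for("/Users/jesse/Documents/GitHub/superpowers/superpowers")
print(example.name)  # -Users-jesse-Documents-GitHub-superpowers-superpowers
```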

What It Shows

Main session usage: Token usage by the coordinator (user or main Claude instance)
Per-subagent breakdown: Each Task invocation with:
- Agent ID
- Description (extracted from prompt)
- Message count
- Input/output tokens
- Cache usage
- Estimated cost
Totals: Overall token usage and cost estimate

Understanding the Output

High cache reads: Good - means prompt caching is working
High input tokens on main (mostly cached context): Expected - coordinator has full context
Similar costs per subagent: Expected - each gets similar task complexity
Cost per task: Typical range is $0.05-$0.15 per subagent depending on task
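The estimated cost in the sample output can be reproduced from its totals. The sketch below assumes that all input tokens, including cache creation and cache reads, are priced at the $3 per million input rate and output tokens at $15 per million. That assumption matches the sample's total, but the actual tool may compute it differently.

```python
INPUT_RATE_PER_M = 3.00    # dollars per million input tokens (from the sample output)
OUTPUT_RATE_PER_M = 15.00  # dollars per million output tokens (from the sample output)


def estimate_cost(input_tokens: int, cache_creation: int,
                  cache_read: int, output_tokens: int) -> float:
    """Estimate cost in dollars, treating cached tokens as ordinary input."""
    total_input = input_tokens + cache_creation + cache_read
    return (total_input * INPUT_RATE_PER_M + output_tokens * OUTPUT_RATE_PER_M) / 1_000_000


# Totals from the sample run shown earlier in this guide.
total_input = 62 + 132_742 + 1_382_835
cost = estimate_cost(62, 132_742, 1_382_835, 8_419)
print(f"Total input (incl cache): {total_input:,}")  # 1,515,639
print(f"Estimated cost: ${cost:.2f}")                # $4.67
```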

Troubleshooting

Skills Not Loading

Problem: Skill not found when running headless tests

Solutions:

Ensure you’re running FROM the superpowers directory: cd /path/to/superpowers && tests/...
Check ~/.claude/settings.json has "superpowers@superpowers-dev": true in enabledPlugins
Verify skill exists in skills/ directory

Permission Errors

Problem: Claude blocked from writing files or accessing directories

Solutions:

Use --permission-mode bypassPermissions flag
Use --add-dir /path/to/temp/dir to grant access to test directories
Check file permissions on test directories
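For test drivers written in Python, the two flags above can be assembled into an argument list. This is a sketch under assumptions: it uses `-p` for headless (print) mode, and the prompt text and `build_headless_command` helper are hypothetical. The temp directory is the one from the sample output.

```python
import shlex
from typing import List


def build_headless_command(prompt: str, test_dir: str) -> List[str]:
    """Assemble a headless Claude Code invocation with relaxed permissions."""
    return [
        "claude",
        "-p", prompt,                              # headless (print) mode
        "--permission-mode", "bypassPermissions",  # skip interactive permission prompts
        "--add-dir", test_dir,                     # grant access to the test directory
    ]


# Hypothetical prompt; the directory matches the sample test output.
cmd = build_headless_command("Execute the implementation plan", "/tmp/tmp.xyz123")
print(shlex.join(cmd))
```

Passing the command as a list avoids shell quoting problems when the prompt contains spaces or quotes.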

Test Timeouts

Problem: Test takes too long and times out

Solutions:

Increase timeout: timeout 1800 claude ... (30 minutes)
Check for infinite loops in skill logic
Review subagent task complexity
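The same 30-minute limit can be enforced from a Python driver. The sketch below wraps `subprocess.run` with a timeout and returns `None` when the limit is hit. The `run_with_timeout` helper is hypothetical, and the demo runs short Python commands as stand-ins for a real `claude` invocation.

```python
import subprocess
import sys
from typing import List, Optional


def run_with_timeout(cmd: List[str], seconds: float) -> Optional[int]:
    """Run cmd; return its exit code, or None if it exceeded the time limit."""
    try:
        completed = subprocess.run(cmd, capture_output=True, text=True, timeout=seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return None
    return completed.returncode


# Stand-in commands so the sketch runs without Claude Code installed.
quick = run_with_timeout([sys.executable, "-c", "print('ok')"], 30)
slow = run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"], 0.5)
print("quick:", quick)
print("slow:", "timed out" if slow is None else slow)
```

For a real run, pass the `claude` argument list and `1800` as the limit.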

Session File Not Found

Problem: Can’t find session transcript after test run

Solutions:

Check the correct project directory in ~/.claude/projects/
Use find ~/.claude/projects -name "*.jsonl" -mmin -60 to find recent sessions
Verify test actually ran (check for errors in test output)
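A Python equivalent of the `find ... -mmin -60` search is sketched below. It lists `.jsonl` files modified within a time window, newest first. The `recent_sessions` helper is hypothetical, and the demo uses a throwaway directory so it runs anywhere.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import List


def recent_sessions(root: Path, max_age_minutes: float = 60) -> List[Path]:
    """Return .jsonl files under root modified within the window, newest first."""
    if not root.is_dir():
        return []
    cutoff = time.time() - max_age_minutes * 60
    dated = [(p.stat().st_mtime, p) for p in root.rglob("*.jsonl")]
    return [p for mtime, p in sorted(dated, reverse=True) if mtime >= cutoff]


# Demo: one fresh transcript and one backdated by two hours.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "-hypothetical-project"
    project.mkdir()
    fresh = project / "fresh.jsonl"
    stale = project / "stale.jsonl"
    fresh.write_text("{}\n")
    stale.write_text("{}\n")
    two_hours_ago = time.time() - 2 * 3600
    os.utime(stale, (two_hours_ago, two_hours_ago))
    demo_names = [p.name for p in recent_sessions(Path(tmp))]

print(demo_names)  # ['fresh.jsonl']
```

To search real transcripts, call `recent_sessions(Path.home() / ".claude" / "projects")`.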

Writing New Integration Tests

Template

#!/usr/bin/env bash
set -euo pipefail
 
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
source "$S
