Claude Code is pretty amazing. It’s let me build prototypes and improve apps faster than I ever thought possible. But I was wondering – what else can it do?
I’ve been using Emby for years, but recently I started to wonder what happened to Jellyfin, the project that forked from Emby years ago. So I decided to ask Claude Code to help me by setting up a debate and moderating the answer.
Setting up the debate
Here’s the prompt I used:
Today's date is 1st October 2025. I want to compare emby and jellyfin. create two agents, one each, and require them to make the case to each other about why their product is better. make sure their info is up to date by e.g. letting them searching the web. Give them two or three attempts to make their case. They talk to each other via a shared, written .md file. At the end, you summarise the discussion and make a recommendation. Keep all the working files. Let them use subfolders of this folder to create their own private note.
Pretty basic stuff. It got to work, setting up a task list:
☐ Set up project structure (create folders and initial debate file)
☐ Launch Emby advocate agent for initial research and first argument
☐ Launch Jellyfin advocate agent for initial research and first argument
☐ Facilitate round 2 of debate
☐ Facilitate round 3 of debate
☐ Summarize debate and provide recommendation
The folder structure was quite basic:
❯ tree
.
├── debate.md
├── emby-notes
│   └── research.md
└── jellyfin-notes
    └── research.md
3 directories, 3 files
debate.md has the following structure:
# Emby vs Jellyfin Debate
**Date:** October 1st, 2025
## Debate Format
Each advocate will present their case in rounds, responding to their opponent's arguments and highlighting their platform's strengths.
---
## Round 1: Opening Arguments
### Emby Advocate's Opening Statement
*[To be added during first round]*
### Jellyfin Advocate's Opening Statement
*[To be added during first round]*
---
## Round 2: Rebuttals and Deeper Analysis
### Emby's Rebuttal
*[To be added during second round]*
### Jellyfin's Rebuttal
*[To be added during second round]*
---
## Round 3: Final Arguments
### Jellyfin's Closing Statement
*[To be added during final round]*
### Emby's Closing Statement
*[To be added during final round]*
---
## Summary and Recommendation
*[To be added after all rounds complete]*
Three rounds down
It then fired up the agents. Each wrote a long (150-200 line) summary of its findings. The first round was pretty basic stuff – project backgrounds, features, etc. Then they wrote their arguments into the document, and there were some zingers in there:
- “When Jellyfin advocates celebrate reviving clients “dormant for 5 years,” Emby users simply enjoy apps that never went dormant because sustainable funding ensures continuous maintenance.”
- “Let’s address the elephant in the room: hardware transcoding performance” – in subsequent rounds the Emby bot went on to list hardware acceleration bugs from Jellyfin’s GitHub issues while talking about its own “streamlined GPU acceleration that simply works”.
- “version 10.11.0 is on track for release at the end of April 2025” – in round 2 the Emby bot noted “it’s now October 2025, and Jellyfin is still stuck in release candidate phase … following a critical library-breaking bug”
The debate went on for 3 rounds – about 30 minutes – with 25 web searches, producing around 4,000 words. At the end, the moderator produced a summary:
This debate revealed fundamental philosophical differences between two capable media server solutions, each with distinct strengths and trade-offs.
Emby’s Core Arguments:
- Professional development model ensures consistent updates and accountability ($119 lifetime, half of Plex’s $249.99)
- Superior hardware transcoding that “just works” out of the box (20+ simultaneous streams)
- Native platform support across devices without complex configuration
- Sustainable funding model that guarantees continuous maintenance
- Time value proposition: paying modest fee saves hundreds of hours of troubleshooting
Jellyfin’s Core Arguments:
- Completely free and open-source with no artificial feature restrictions
- Strong community support (ranked #1 on Slant vs Emby at #5)
- Complete privacy and transparency with no tracking or telemetry
- Rapid development with significant performance improvements (100x in some areas)
- Philosophical commitment to software as a public good
Key Contested Points:
- Reliability: Emby highlighted Jellyfin’s 6-month delayed 10.11.0 release with ongoing RC issues. Jellyfin countered that RC testing protects production users while Emby shipped breaking updates to paying customers.
- Hardware Transcoding: Emby emphasized simplicity while Jellyfin stressed capability and transparency of documentation.
- Sustainability: Emby argued professional funding ensures consistency. Jellyfin claimed community support was so strong they halted donations.
- Support: Emby offers professional support channels. Jellyfin relies on community forums and volunteers.
And the most cop-out recommendation possible:
After analyzing the debate, the choice depends entirely on your priorities and technical expertise.
It explained its rationale before concluding:
Final thought: Start with Jellyfin. If you encounter insurmountable issues or find the setup too complex, Emby’s $119 lifetime license remains available as a fallback option.
I mean – what is that!?!
Analysing the debate
Claude Code produces .jsonl files for each agent invocation, containing a line-by-line record of every model call. They’re stored in the ~/.claude/projects folder:
❯ ls -alh ~/.claude/projects/-Users-rob-Downloads-jellyfin-vs-emby
total 2056
drwx------@ 3 rob staff 96B 1 Oct 22:56 .
drwxr-xr-x 13 rob staff 416B 1 Oct 21:42 ..
-rw-r--r--@ 1 rob staff 991K 1 Oct 21:40 91fccf65-7baf-444f-8a3f-5a04251d2329.jsonl
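If you want to poke around in one of these transcripts yourself, a few lines of Python are enough to tally the turns and tool calls. This is a minimal sketch – the JSONL schema isn’t formally documented, so the field names here (type, message.content, tool_use) are best-effort guesses that may change between Claude Code versions:

```python
import json
from collections import Counter
from pathlib import Path

# Path to the session transcript shown above
transcript = (Path.home() / ".claude/projects/-Users-rob-Downloads-jellyfin-vs-emby"
              / "91fccf65-7baf-444f-8a3f-5a04251d2329.jsonl")

turn_types = Counter()   # user / assistant / etc.
tool_calls = Counter()   # WebSearch, Write, Task, ...

with transcript.open() as f:
    for line in f:
        entry = json.loads(line)
        turn_types[entry.get("type", "unknown")] += 1
        # Assistant turns carry the model output; tool calls show up as tool_use blocks
        msg = entry.get("message")
        content = msg.get("content", []) if isinstance(msg, dict) else []
        if isinstance(content, list):
            for block in content:
                if isinstance(block, dict) and block.get("type") == "tool_use":
                    tool_calls[block.get("name", "unknown")] += 1

print(turn_types)
print(tool_calls)
```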
So, I did what any curious mind would do – I asked Claude Code to analyse them and pull out any interesting facts. It produced a bunch of charts and diagrams – word counts and the like.
It also produced some written analysis, such as this “evidence score calculation”:
The evidence score (0-1 scale) is calculated based on:
- Statistical evidence (20% per instance): Numbers, percentages, dollar amounts
- Specific claims (15% per instance): Version numbers, release dates
- Technical details (10% per instance): GPU, CPU, codecs, transcoding mentions
- Base content score (30%): Added if content > 500 characters
I’m not sure I agreed with the basis – surely it should have validated the claims – but it basically showed that both agents scored poorly in round 1, maximised in round 2, and stayed high through round 3. Fascinating.
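Here’s roughly what that heuristic could look like in code. To be clear, this is a sketch based on the description above, not the script Claude actually generated – the weights are the ones it quoted, but the pattern-matching is my guess at how it spotted each category:

```python
import re

def evidence_score(text: str) -> float:
    """Reconstruction of the 'evidence score' heuristic described above (0-1 scale)."""
    score = 0.0
    # Statistical evidence: numbers, percentages, dollar amounts (20% per instance)
    stats = re.findall(r"\$\d[\d,.]*|\d+(?:\.\d+)?%|\b\d{3,}\b", text)
    score += 0.20 * len(stats)
    # Specific claims: version numbers and release dates (15% per instance)
    claims = re.findall(r"\b\d+\.\d+\.\d+\b|\b(?:January|February|March|April|May|June|"
                        r"July|August|September|October|November|December)\s+\d{4}\b", text)
    score += 0.15 * len(claims)
    # Technical details: GPU, CPU, codec, transcoding mentions (10% per instance)
    tech = re.findall(r"\b(?:GPU|CPU|codec|transcod\w*|HEVC|HDR|4K)\b", text, re.IGNORECASE)
    score += 0.10 * len(tech)
    # Base content score: +30% if the argument runs past 500 characters
    if len(text) > 500:
        score += 0.30
    return min(score, 1.0)  # clamp to the 0-1 scale
```

Scoring 20% per instance means a couple of dollar figures and one version number already pin the score at 1.0, which probably explains why both agents maxed out from round 2 onwards.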
It then performed a topic distribution analysis:
- Cost/Price: $, price, cost, free, subscription, lifetime, pay, fee, expensive, affordable, 119, 249
- Performance: transcode, hardware, GPU, CPU, stream, performance, speed, 4K, HDR, HEVC, codec, acceleration
- Philosophy: freedom, open, source, community, proprietary, transparent, privacy, control, ideology, lock-in
- Support: support, help, documentation, fix, bug, issue, troubleshoot, forum, professional, volunteer
- Features: feature, capability, function, update, version, release, client, platform, compatibility, device
with the following summary:
- Stark Philosophical Divide:
  - Jellyfin: 40% of mentions are philosophy-related (freedom, open-source, community)
  - Emby: Only 14.7% philosophy, focuses on practical matters
- Technical vs Ideological:
  - Emby’s top focus: Performance (27.3%) and Support (25.9%)
  - Jellyfin’s top focus: Philosophy (40.0%) dominates everything else
- Balanced on Cost:
  - Both agents mention cost/price equally (23 mentions each)
  - Shows this was a key battleground topic
- Support Strategy Difference:
  - Emby emphasizes professional support (37 mentions, 25.9%)
  - Jellyfin downplays support (19 mentions, 13.1%)
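The topic distribution is just keyword counting over those buckets, so it’s easy to reproduce. Again, this is a sketch of my own rather than the autogenerated script – it uses plain substring matches, so expect slightly different numbers to the ones above:

```python
from collections import Counter

# The keyword buckets Claude used for its topic distribution
TOPICS = {
    "Cost/Price":  ["$", "price", "cost", "free", "subscription", "lifetime",
                    "pay", "fee", "expensive", "affordable", "119", "249"],
    "Performance": ["transcode", "hardware", "gpu", "cpu", "stream", "performance",
                    "speed", "4k", "hdr", "hevc", "codec", "acceleration"],
    "Philosophy":  ["freedom", "open", "source", "community", "proprietary",
                    "transparent", "privacy", "control", "ideology", "lock-in"],
    "Support":     ["support", "help", "documentation", "fix", "bug", "issue",
                    "troubleshoot", "forum", "professional", "volunteer"],
    "Features":    ["feature", "capability", "function", "update", "version",
                    "release", "client", "platform", "compatibility", "device"],
}

def topic_distribution(text: str) -> dict[str, float]:
    """Return each topic's share (%) of all keyword hits in the text."""
    text = text.lower()
    counts = Counter({topic: sum(text.count(kw) for kw in keywords)
                      for topic, keywords in TOPICS.items()})
    total = sum(counts.values()) or 1
    return {topic: round(100 * n / total, 1) for topic, n in counts.items()}

# Usage: split debate.md into each advocate's sections and compare the shares
# print(topic_distribution(open("debate.md").read()))
```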

I’ve uploaded the entire debate to GitHub, including the transcript and this human-readable version. I didn’t bother adding the 5,000 lines of autogenerated scripts that produced these charts. Further analysis of the transcript is left as an exercise for the reader.
What’s next?
This was just a bit of fun, but I did it to test how useful LLMs can be in helping make technical or product decisions – and the answer is: not really, or at least not yet, and not with a trivial prompt.
Yes, they can compile a succinct list of the differentiating talking points better than most humans, but ultimately (at least in this case) they waffle on and end with a diplomatic but unhelpful non-answer.
This sort of approach could help with lots of complex decisions – choosing a database, picking a framework for a new project, narrowing down a list of restaurant recommendations. In all of these cases, the process benefits from a strong reward function – what criteria are most important to us in this decision?
Perhaps I should spend some time tweaking the prompt to force a more meaningful end result. Until then, I’ll just base all my decisions “entirely on my priorities and technical expertise”.