

AI Interview Question - JP Morgan

AI-powered chatbots have rapidly transformed how banks and financial institutions like JP Morgan serve both their employees and clients. Retrieval-Augmented Generation (RAG) models, which combine large language models (LLMs) with enterprise document retrieval, promise to answer complex questions from vast corpora of internal documents. However, when scaling from a proof-of-concept (POC) to production, many teams notice a sharp drop in accuracy. Why does a chatbot that worked flawlessly on 100 documents fail when exposed to 50,000 PDFs, PowerPoints, and policy files? This article explores a real-world AI interview question posed at JP Morgan about troubleshooting RAG chatbot accuracy issues at scale—offering a beginner-friendly, step-by-step breakdown of the root causes and best-practice solutions.

AI Interview Question - JP Morgan: Understanding the Problem


The Interview Scenario

Question: Our RAG chatbot worked great in the POC with 100 documents. But when we scaled to 50,000 docs (PDFs, PPTs, policies) – accuracy dropped to 60%. Users got incomplete answers. What went wrong and how do you fix it?

Why Is This a Common RAG Failure?

RAG (Retrieval-Augmented Generation) chatbots depend on two main pillars: robust retrieval of relevant content and accurate generation of answers. At small scale, both retrieval and generation work well. But as the corpus grows, naïve approaches to document ingestion, chunking, and search struggle:

  • Chunks lose context: Important information gets split across arbitrary chunks.
  • Heterogeneous documents: Different formats (PDFs, slides, tables) are not handled uniformly.
  • Redundant/near-duplicate content: Multiple versions confuse the retriever and LLM.

Let’s break down the root causes and build an actionable, step-by-step guide to scaling RAG chatbots for enterprises like JP Morgan.

1. The Foundation: How RAG Chatbots Work


What Is Retrieval-Augmented Generation (RAG)?

RAG chatbots enhance LLMs by allowing them to retrieve facts from external sources (your company’s documents, policies, and FAQs). Here’s a simplified workflow:

  1. User asks a question (e.g., “What is the premium savings account rate?”).
  2. System retrieves top-k relevant chunks of text from the document corpus using similarity search.
  3. LLM reads those chunks and generates an answer.

This hybrid approach ensures answers are grounded in your company’s actual data—not just the LLM’s training distribution.
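
In code, this loop is just retrieve-then-generate. Below is a minimal, illustrative sketch: vector_search and llm_answer are placeholder helpers standing in for whatever search index and LLM client you actually use.

# Example: minimal retrieve-then-generate loop (placeholder helpers)
def answer_question(question, top_k=5):
    # 1. Retrieve the most relevant chunks from the document index
    chunks = vector_search(question, top_k=top_k)          # placeholder retriever
    context = "\n\n".join(chunk['text'] for chunk in chunks)
    # 2. Ground the LLM in the retrieved context
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm_answer(prompt)                               # placeholder LLM call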

Why Is Scaling So Hard?

When you scale from a few hundred to tens of thousands of documents, several technical and organizational challenges emerge:

  • Document heterogeneity: PDFs, scanned images, PowerPoints, tables, etc.
  • Out-of-date or duplicate content: old policies living alongside new ones.
  • Fragile chunking: simple fixed-size strategies break down, causing loss of context.
  • Noisier retrieval: search quality suffers as the index fills with more, and more varied, content.

Example: The Chunking Problem

Imagine a banking policy document where:

  • Page 1 says: Savings Account rate: 3.5%
  • Page 29 says: Premium savings rate: 4.2%

If you chunk every 512 tokens, these related facts can easily land in separate chunks, and the retriever may surface the standard rate when the user asked about the premium one.
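
To make the failure concrete, here is a hedged sketch of that naive splitter; word count stands in for tokens so the example stays dependency-free.

# Example: naive fixed-size chunking (words stand in for tokens)
def naive_chunk(text, chunk_size=512):
    # Splits purely on a fixed budget; headings, tables, and related facts
    # that straddle a boundary end up in different chunks.
    words = text.split()
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]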

2. Diagnosing the Real Problem: Why Did Accuracy Drop?


Why Naive Data Pipelines Fail at Scale

The most common culprit is a data pipeline that isn’t “RAG-ready.” Let’s explore the major pitfalls:

Naive Chunking: Losing Context

Most teams start by splitting documents into fixed-size chunks (e.g., every 512 tokens). This is fast, but it ignores logical and semantic boundaries. The result:

  • Paragraphs, sentences, or even tables are split mid-way.
  • Context necessary to answer a user’s question is lost.
  • Chunks retrieved may be incomplete or misleading.

Heterogeneous Document Formats

Enterprise content comes in all shapes and sizes:

  • PDFs (scanned, digital-native, mixed content)
  • PowerPoint slides (bulleted, visual)
  • Word documents
  • Tables and structured data
  • Legal and compliance policies

Naive pipelines treat all formats equally, leading to poor extraction and retrieval.

Redundant and Near-Duplicate Content

Over time, companies accumulate multiple versions of the same policy. If not deduplicated, the retriever may return outdated or conflicting information, causing confusion for both the LLM and the user.

| Issue | Impact on RAG Chatbot |
|---|---|
| Naive chunking | Breaks context, incomplete answers |
| Format heterogeneity | Extraction errors, missing data |
| Redundant content | Confusing, conflicting answers |

What’s the Solution? A Two-Step Fix

To address these failures, industry leaders recommend a two-step approach:

  1. Smart Pre-processing: Make your data RAG-ready before it enters the index.
  2. Dual Retrieval: Combine keyword and semantic search to maximize relevant chunk recall.

3. Step 1 – Smart Pre-Processing: Making Data RAG-Ready


Why Pre-Processing Matters

Pre-processing transforms raw, messy documents into structured, context-rich chunks that are easy for both the retriever and LLM to use.

Context-Aware Chunking

Instead of splitting every N tokens, use logic that respects document structure. This can be achieved through:

  • Table of contents parsing
  • Heading detection (e.g., “Savings Account Policy”, “Interest Rates”)
  • LLM-assisted segmentation: Use an LLM to identify logical sections

Example: Treat the entire “Savings Account Policy” section as one semantic chunk, even if it’s 1500 tokens. Downstream, you can break it into smaller but context-preserving sub-chunks if needed.


# Example: LLM-assisted chunking pseudo-code
def context_aware_chunk(document_text):
    sections = llm_extract_sections(document_text)
    chunks = []
    for section in sections:
        heading, section_text = section
        # Further split if over max_tokens, but preserve full paragraphs
        sub_chunks = split_on_paragraphs(section_text, max_tokens=512)
        for chunk in sub_chunks:
            chunks.append({'heading': heading, 'text': chunk})
    return chunks

Search Metadata Generation

For each chunk, extract:

  • Key topics (e.g., “savings account interest rate”)
  • Likely user questions (e.g., “What’s the current rate?”)
  • Document source, version, and update date

Store this metadata alongside each chunk in your retrieval index. It improves both recall and precision during search.


# Example: Metadata extraction for a chunk
chunk = "Section: Savings Account Policy. The current rate is 3.5% as of Jan 2024."
metadata = {
    'topics': ['savings account', 'interest rate'],
    'sample_questions': ['What is the savings account interest rate?'],
    'source': 'savings_policy_v3.pdf',
    'date': '2024-01-15'
}

Converting Complex Docs to Q&A

Legal and compliance documents are notoriously dense. The best practice is to convert procedures and rules into FAQ-style Q&A pairs, either via LLM or with a human-in-the-loop for accuracy.

  • Human-in-the-loop: Compliance officers verify LLM-generated Q&A pairs.
  • LLM automation: Auto-generate FAQs for each section.

# Example: LLM converting policy to FAQ
prompt = f"Convert the following policy to a list of FAQs:\n{policy_section}"
faqs = llm_generate_faqs(prompt)

Deduplication and Canonicalization

Deduplicate redundant content and canonicalize versions by the steps below (a minimal sketch follows the list):

  • Hashing chunk contents
  • Comparing metadata fields
  • Retaining only the latest version
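
A minimal sketch of that dedup step, assuming each chunk is a dict carrying 'text' and an ISO-format 'date' field (the field names are illustrative):

# Example: hash-based deduplication, keeping the newest copy
import hashlib

def deduplicate_chunks(chunks):
    canonical = {}
    for chunk in chunks:
        # Normalize lightly so trivially different copies hash the same
        key = hashlib.sha256(chunk['text'].strip().lower().encode()).hexdigest()
        # ISO dates compare correctly as strings
        if key not in canonical or chunk['date'] > canonical[key]['date']:
            canonical[key] = chunk
    return list(canonical.values())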

Handling Tables and Slides

Convert tables into structured data or text summaries; extract slide bullets and images as text. LLMs can help summarize complex visuals into retrievable text.

Summary Table: RAG-Ready Pre-Processing Steps

| Step | Purpose | Impact |
|---|---|---|
| Context-aware chunking | Preserve logical sections, avoid splitting context | Full, accurate answers |
| Metadata generation | Enhance retrievability and filtering | Higher precision in search |
| Q&A conversion | Make dense docs user-friendly | Improved clarity for LLM |
| Deduplication | Remove outdated/conflicting info | Consistent, up-to-date answers |

4. Step 2 – Dual Retrieval: Maximizing Recall and Precision


Why Not Just Use One Search Method?

RAG systems typically rely on vector search (e.g., ANN/HNSW) to find semantically similar chunks. But this can miss exact terms (e.g., policy IDs, legal terms). Classic keyword search (BM25) excels at catching these, but struggles with paraphrases or synonyms.

Dual Retrieval: The Best of Both Worlds

Combine both approaches:

  • BM25 (keyword): Finds chunks with exact term matches.
  • Vector search (e.g., HNSW): Finds semantically similar or paraphrased content.

Retrieve top-k results from both, then merge, re-rank, or cross-check as needed.
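
One common way to merge the two ranked lists is reciprocal rank fusion (RRF). The sketch below is illustrative and assumes every result carries a stable 'id' field.

# Example: merging BM25 and vector results with reciprocal rank fusion
def reciprocal_rank_fusion(bm25_results, vector_results, k=60):
    scores, by_id = {}, {}
    for results in (bm25_results, vector_results):
        for rank, chunk in enumerate(results, start=1):
            by_id[chunk['id']] = chunk
            # A chunk found by both retrievers accumulates score from each list
            scores[chunk['id']] = scores.get(chunk['id'], 0.0) + 1.0 / (k + rank)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [by_id[chunk_id] for chunk_id in ranked]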

Query Normalization

User queries are often informal (“What's the rate on my savings?”). Normalize these into canonical banking terms using either a rules-based approach or an LLM.


# Example: Query normalization
user_query = "What's the rate on my savings?"
normalized_query = llm_normalize_query(user_query)
# Output: 'savings account interest rate policy'

Parallel Retrieval Agents

  • Agent 1: Uses vector search to generate a draft answer from top-k semantic chunks.
  • Agent 2: Refines the answer and cross-checks against top BM25 (keyword) results to fill in gaps or correct errors.

This “checks and balances” approach catches both paraphrased and literal information.

The Retrieval Pipeline in Practice


# Dual retrieval pipeline pseudo-code
def dual_retrieve_and_generate(user_query):
    normalized_query = llm_normalize_query(user_query)
    bm25_chunks = bm25_search(normalized_query, top_k=5)
    vector_chunks = vector_search(normalized_query, top_k=5)
    retrieved_chunks = merge_and_rerank(bm25_chunks, vector_chunks)
    draft_answer = llm_generate_answer(retrieved_chunks)
    refined_answer = llm_refine_with_bm25(draft_answer, bm25_chunks)
    return refined_answer

Equations: Measuring Retrieval Quality

Let’s define retrieval precision and recall:

$$ \text{Precision} = \frac{\text{Relevant Items Retrieved}}{\text{Total Items Retrieved}} $$

$$ \text{Recall} = \frac{\text{Relevant Items Retrieved}}{\text{Total Relevant Items in Corpus}} $$

Dual retrieval aims to maximize both metrics by combining methods.
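
As a quick sanity check, here are the same metrics in code, given the IDs of retrieved chunks and the (human-labelled) relevant chunks:

# Example: computing precision and recall for one query
def precision_recall(retrieved_ids, relevant_ids):
    hits = len(retrieved_ids & relevant_ids)
    precision = hits / len(retrieved_ids) if retrieved_ids else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# 5 chunks retrieved, 3 of them relevant, 4 relevant chunks exist in total
print(precision_recall({'c1', 'c2', 'c3', 'c4', 'c5'}, {'c1', 'c2', 'c3', 'c9'}))
# -> (0.6, 0.75)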

Summary Table: Dual Retrieval Components

| Component | Role | Benefits |
|---|---|---|
| BM25 keyword search | Finds exact terms, IDs, policies | High precision for literal queries |
| HNSW/ANN vector search | Finds semantic matches, paraphrases | Better recall, natural language support |
| Query normalization | Maps user language to domain terms | Improves both recall and precision |
| Agent refinement | Cross-checks, fills in gaps | More complete, accurate answers |

 

5. Real-World Implementation: Building a Scalable RAG Pipeline for JP Morgan


Putting It All Together: End-to-End Workflow

Let’s walk through how a global bank like JP Morgan can implement a scalable, accurate RAG chatbot using the best practices described above. This section will connect the theory to practical steps, so your AI system can move from POC to production without losing accuracy.

Step-by-Step Workflow Overview

  1. Document Ingestion: Collect all documents (PDFs, slides, tables, policies) from internal repositories.
  2. Pre-Processing: Apply context-aware chunking, metadata extraction, and Q&A generation for every document.
  3. Deduplication: Remove outdated or duplicate content, keeping only the canonical versions.
  4. Indexing: Store processed chunks and metadata in a hybrid search index (BM25 + vector DB).
  5. Query Handling: Normalize user queries to banking terminology using an LLM or custom mappings.
  6. Dual Retrieval: Fetch top results from both keyword and semantic search.
  7. Answer Generation/Refinement: Let LLMs generate draft answers and then refine them by cross-referencing all retrieved chunks.
  8. Human-in-the-Loop Feedback (optional): For mission-critical domains like compliance, add a review workflow for high-risk queries.

Sample Architecture Diagram

While we can’t use images here, below is a textual representation of a robust RAG architecture:


[Document Sources] 
      |
      v
[Pre-Processing Pipeline: Chunking, Metadata, Q&A, Deduplication]
      |
      v
[Hybrid Search Index: BM25 + Vector DB (e.g., Elasticsearch + Pinecone)]
      |
      v
[Query Normalization]
      |
      v
[Dual Retrieval (BM25 & Vector)]
      |
      v
[LLM Draft Answer Generation]
      |
      v
[LLM Refinement with Cross-Checks]
      |
      v
[User or Human-in-the-Loop Reviewer]
      |
      v
[Final Answer]

Choosing Tools & Technologies

  • Pre-processing: Python (NLTK, spaCy), OpenAI/GPT APIs for chunking and Q&A, PyPDF2, python-docx, etc.
  • Deduplication: hashlib, MinHash, custom logic for versioning.
  • Vector DB: Pinecone, Weaviate, FAISS, Qdrant
  • Keyword Search: Elasticsearch, OpenSearch, Solr
  • LLM Generation: OpenAI GPT-4, Azure OpenAI, Cohere, etc.

Example: Context-Aware Chunking for Banking Policy Documents

Suppose you have a 100-page PDF “Deposit Accounts Policy.” Here’s how you can chunk it semantically:


import re
def extract_sections(text):
    # Simple regex to find headings, e.g., "1.1 Savings Accounts"
    pattern = r'(\d+\.\d+\s+[A-Za-z ]+)'
    sections = re.split(pattern, text)
    # Pair headings with their section bodies
    paired = []
    for i in range(1, len(sections), 2):
        heading = sections[i].strip()
        body = sections[i+1].strip()
        paired.append((heading, body))
    return paired

# Usage (assumes the PDF text has already been extracted to plain text)
with open('deposit_policy.txt') as f:
    doc_text = f.read()
sections = extract_sections(doc_text)
for heading, section_text in sections:
    # Further processing, metadata extraction, etc.
    print(f"Section: {heading}\n{section_text}\n---")

Example: Metadata Enrichment for a Chunk


chunk = "Section 1.1 Savings Accounts: The interest rate is 3.5% per annum as of Jan 2024."
metadata = {
    'topics': ['savings accounts', 'interest rate'],
    'sample_questions': ['What is the interest rate for savings accounts?', 'Savings account annual rate?'],
    'source': 'deposit_policy_2024.pdf',
    'date': '2024-01-05'
}
# Add to index: (chunk, metadata)

Example: Dual Retrieval Implementation


def retrieve_chunks(query):
    normalized = llm_normalize_query(query)
    bm25_results = bm25_search(normalized, k=5)
    vector_results = vector_search(normalized, k=5)
    # Merge and deduplicate
    all_results = {chunk['id']: chunk for chunk in bm25_results + vector_results}.values()
    # Optionally rerank or prioritize BM25 for policy/legal queries
    return list(all_results)

Handling Multimodal Documents

Financial documents often include tables, charts, and even images. To make these RAG-ready (see the extraction sketch after this list):

  • Tables: Use table extraction libraries (e.g., tabula-py, camelot) to convert tables into structured text or CSVs. LLMs can then summarize or rephrase these into FAQ pairs.
  • Slides: Extract text from PPTs using python-pptx, then chunk by slide or topic. For images, use OCR (e.g., Tesseract) and summarize the content.
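
The sketch below shows one way to do this with camelot (PDF tables) and python-pptx (slides); the library choices and file handling are illustrative, and OCR for embedded images is left as a separate step.

# Example: extracting tables and slide text for downstream chunking
import camelot                      # PDF table extraction
from pptx import Presentation       # PowerPoint text extraction

def extract_pdf_tables(path):
    # Each table becomes a CSV string an LLM can later summarize or turn into FAQs
    tables = camelot.read_pdf(path, pages='all')
    return [table.df.to_csv(index=False) for table in tables]

def extract_slide_text(path):
    # One text chunk per slide; embedded images would additionally need OCR
    prs = Presentation(path)
    return [
        '\n'.join(shape.text_frame.text for shape in slide.shapes if shape.has_text_frame)
        for slide in prs.slides
    ]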

Version Control and Canonicalization

To avoid confusion from outdated or duplicate documents (a retrieval-time sketch follows this list):

  • Add version/date metadata to each chunk.
  • During retrieval or answer generation, prefer the most recent version.
  • For conflicting answers, let the LLM cite the latest policy or escalate for human review.
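
A retrieval-time sketch of that preference, assuming each chunk's metadata carries a 'doc_id' and an ISO 'date' (field names are illustrative):

# Example: keep only the newest version of each retrieved document
def prefer_latest(retrieved_chunks):
    newest = {}
    for chunk in retrieved_chunks:
        doc = chunk['doc_id']
        if doc not in newest or chunk['date'] > newest[doc]['date']:
            newest[doc] = chunk
    return list(newest.values())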

End-to-End Example: User Query to Final Answer

  1. User asks: “What’s the current premium savings rate?”
  2. Query normalization: “premium savings account interest rate policy”
  3. BM25 retrieves chunks mentioning “premium savings rate”
  4. Vector search finds semantically similar chunks (e.g., “interest on premium accounts”)
  5. LLM drafts answer using all retrieved information: “As of Jan 2024, the premium savings account rate is 4.2%.”
  6. LLM refines answer, cites document source and date

Measuring Success: KPIs for RAG Chatbots

| Metric | Description | Target |
|---|---|---|
| Retrieval Precision | Fraction of retrieved chunks that are relevant | >90% for critical policies |
| Retrieval Recall | Fraction of all relevant chunks that are retrieved | >95% |
| Answer Accuracy | Fraction of user queries answered correctly (human-judged) | >90% |
| Latency | Time from query to answer | <2 seconds for most queries |

6. Advanced Enhancements for Enterprise-Grade RAG


Human-in-the-Loop (HITL) Review

For highly sensitive domains like compliance or legal, integrate a workflow where certain answers are flagged for human review before being shown to the end user. This can be based on confidence scores, query type, or policy triggers.
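
A minimal routing sketch: the 0.8 threshold, the query-type labels, and send_to_review_queue are all illustrative assumptions, not a prescribed implementation.

# Example: flagging answers for human review
HIGH_RISK_TYPES = {'compliance', 'legal', 'regulatory'}

def route_answer(answer, confidence, query_type):
    # Low-confidence or high-risk answers go to a reviewer before release
    if confidence < 0.8 or query_type in HIGH_RISK_TYPES:
        return send_to_review_queue(answer)   # placeholder review workflow
    return answer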

Feedback Loops and Continuous Learning

  • Log user queries and answers, especially those marked as incomplete or incorrect.
  • Periodically retrain or fine-tune your LLM or retriever based on real feedback.
  • Update chunking and metadata extraction models as document templates evolve.

Personalization and Access Control

Not every user should see every document. Use metadata to implement access controls, ensuring users only retrieve content they’re authorized to view (e.g., retail vs. corporate banking policies).
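
In practice this can be a simple metadata filter applied before (or inside) retrieval; the 'allowed_groups' field name below is an assumption.

# Example: entitlement filtering on chunk metadata
def filter_by_entitlements(chunks, user_groups):
    # Keep only chunks whose allowed groups overlap the user's entitlements
    return [c for c in chunks if set(c.get('allowed_groups', [])) & user_groups]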

Multilingual Support

For global banks, support queries and content in multiple languages. Use LLMs or translation APIs for both chunking and retrieval, and store language metadata for each chunk.

Explainability and Source Attribution

Always cite the source document, section, and date in generated answers. This builds trust and allows users to verify information independently.


# Example: Adding citations to answers
answer = "As of Jan 2024, the premium savings account interest rate is 4.2%."
citation = "(Source: premium_policy_v2.pdf, Section 3.1, 2024-01-05)"
full_answer = f"{answer}\n{citation}"

Monitoring and Alerting

Implement dashboards to monitor retrieval accuracy, answer quality, and system latency. Set up alerts for sudden drops in accuracy or spikes in incomplete answers, enabling rapid diagnosis and mitigation.
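
As one illustration, an alert rule might compare today's measured answer accuracy against a trailing baseline; the window and threshold below are arbitrary examples.

# Example: simple accuracy-drop alert
def accuracy_dropped(daily_accuracy, threshold=0.05):
    # Compare the latest day against the trailing 7-day average
    if len(daily_accuracy) < 8:
        return False
    baseline = sum(daily_accuracy[-8:-1]) / 7
    return baseline - daily_accuracy[-1] > threshold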

7. Common Pitfalls and How to Avoid Them


Summary of “What Went Wrong” in Scaling RAG

  • Chunking by fixed tokens splits logical sections—use context-aware chunking!
  • Ignoring document heterogeneity—custom parse and chunk each format type.
  • No deduplication—conflicting answers cause user frustration.
  • Relying only on vector search—misses literal policy terms.
  • No query normalization—user language doesn’t match enterprise terms.

Best Practices Checklist

  • Always pre-process documents for structure and metadata.
  • Use hybrid (BM25 + vector) retrieval, not just one or the other.
  • Continuously monitor and evaluate both system and end-user performance.
  • Involve subject matter experts in Q&A generation for compliance and legal docs.
  • Scale human review for high-risk answers.

8. Conclusion: Future-Proofing AI Chatbots at JP Morgan and Beyond


Scaling RAG chatbots from POC to production is a multi-stage, multidisciplinary challenge. The drop from near-perfect accuracy on 100 documents to 60% on 50,000 documents is not a failure of AI, but a signal that your data pipeline and retrieval strategy need to mature. By investing in context-aware pre-processing, robust metadata extraction, dual retrieval, and continuous monitoring, you can build enterprise-grade AI assistants that delight users and withstand the complexity of real-world banking documentation.

Whether you’re preparing for an AI interview at JP Morgan, architecting the next-gen customer support bot, or troubleshooting an underperforming RAG system, remember: the secret to accuracy at scale lies in the details of your data pipeline and retrieval logic. Make your chatbot truly RAG-ready, and you’ll unlock the full power of AI for your organization.

Key Takeaways

  • Context matters: Never split your documents blindly—preserve logical sections and context.
  • Hybrid retrieval: Combine keyword and semantic search for best results.
  • Metadata is gold: Use it for filtering, ranking, and access control.
  • Iterate and monitor: RAG performance improves with real-world feedback and continuous refinement.


Sample Interview Answer (Summary)

“The drop in accuracy is likely due to a naive data pipeline that wasn’t RAG-ready. Simple token-based chunking splits context, while diverse document formats and redundant content further confuse retrieval. To fix this at scale, I’d implement context-aware chunking (potentially with LLMs), extract rich metadata for each chunk, and convert dense docs to Q&A format. On the retrieval side, I’d use dual retrieval—combining BM25 keyword search and vector/ANN semantic search—along with query normalization and LLM refinement to maximize answer completeness and precision. Continuous monitoring and human-in-the-loop review further ensure enterprise-grade quality.”

By understanding and applying these principles, you’ll be well-equipped to design, troubleshoot, and scale AI-powered RAG chatbots for complex organizations like JP Morgan.
